GPU-accelerated BERT deployment on AWS

Boole Room

Large-scale language models have significantly advanced the state of the art in NLP tasks such as question answering, sentence classification, named entity recognition, and sentiment analysis. Deploying these models in production can benefit from both hardware acceleration and software optimization of the network. Nvidia GPUs paired with TensorRT optimizations are well suited to this kind of deployment. BERT, one of the most popular language models, has been optimized by Nvidia with TensorRT so that all of its operations are supported and low-latency production requirements can be met. In this talk we will demonstrate how to take a pre-trained language model that has been fine-tuned for a specific task, optimize it with TensorRT, and deploy it on a GPU-enabled AWS instance.