Retrieval-Augmented Language Model and Its Application for Question-Answering and Image Captioning
, Principal Research Scientist, NVIDIA
Language models (LMs) can be substantially improved by retrieving from a large-scale text corpus. In particular, augmenting a generative language model with a retrieval module at the pre-training stage (e.g., RETRO) can significantly reduce perplexity on held-out data. However, beyond lower perplexity, it remains unknown whether a model like RETRO achieves similar gains in downstream task accuracy and text-generation quality. I'll present a comprehensive study of RETRO compared with the standard GPT model. Specifically, we pre-train RETRO models ranging from 148 million to 9.5 billion parameters, retrieving from a corpus of over 330 billion tokens. Extensive experimental results show that RETRO, or our proposed variants, outperforms standard GPT on (1) open-ended text generation, with higher factual accuracy, lower toxicity, and less repetition; (2) the LM Evaluation Harness benchmark under both zero-shot and fine-tuning settings; and (3) open-domain question-answering benchmarks. Furthermore, we augment the RETRO model for image-to-text generation. The resulting Retrieval-Augmented Visual Language Model (Re-ViLM) achieves state-of-the-art zero- and few-shot image captioning results.
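For readers unfamiliar with the mechanism referenced above, the following is a minimal, hypothetical Python sketch of chunk-wise retrieval augmentation in the spirit of RETRO: the input is split into fixed-length chunks, each chunk queries a text corpus for nearest neighbors, and the generator conditions on the neighbors retrieved for the preceding chunk. All names here (embed, retrieve_neighbors, CHUNK_SIZE, NUM_NEIGHBORS) are illustrative assumptions, not the actual implementation discussed in the talk.

```python
import numpy as np

CHUNK_SIZE = 64       # fixed-length input chunks, as in RETRO-style models
NUM_NEIGHBORS = 2     # corpus chunks retrieved per query chunk

def embed(tokens: list) -> np.ndarray:
    """Stand-in for a frozen retriever encoder (e.g., BERT-style embeddings)."""
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(128)

def retrieve_neighbors(chunk: list, corpus: list) -> list:
    """Return the NUM_NEIGHBORS corpus chunks most similar to the query chunk."""
    query = embed(chunk)
    scored = sorted(corpus, key=lambda c: -float(np.dot(embed(c), query)))
    return scored[:NUM_NEIGHBORS]

def retro_forward(tokens: list, corpus: list) -> list:
    """Pair each chunk with the neighbors retrieved for the *previous* chunk,
    preserving causality; the real model applies cross-attention over encoded
    neighbors inside the decoder."""
    chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]
    augmented, prev_neighbors = [], []
    for chunk in chunks:
        augmented.append((chunk, prev_neighbors))
        prev_neighbors = retrieve_neighbors(chunk, corpus)
    return augmented
```

This sketch only illustrates the data flow; in the pre-trained model the retriever index covers hundreds of billions of tokens and the neighbor conditioning happens through learned cross-attention rather than simple concatenation.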