SynGatorTron: A Large Clinical Natural Language Generation Model for Synthetic Data Generation and Zero-shot Tasks
, Senior Data Scientist, NVIDIA
, Director of Natural Language Processing, UF CTSI/OneFlorida, University of Florida
We propose to develop SynGatorTron, a GPT-3-style model implemented in the NVIDIA Megatron framework and trained on the HiPerGator AI cluster at the University of Florida (an NVIDIA DGX SuperPOD with 140 DGX A100 nodes), to generate naturally de-identified, pre-training-scale synthetic clinical text as a surrogate for training large clinical transformers. Synthetic clinical text generation offers a route to building large, naturally de-identified clinical corpora at a scale that is practically impossible to reach through manual labeling, de-identification, and other privacy-preserving methods. Such a model could preserve the knowledge encoded in medical language while mitigating the risks posed by the sensitive nature of clinical text, and could provide few- and zero-shot capabilities for encoder tasks without the need for extensive labeled datasets or structured clinical ontologies.
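As a minimal sketch of the intended generation workflow, the snippet below samples a synthetic note from a decoder-only (GPT-style) language model via the Hugging Face transformers API. The checkpoint identifier is hypothetical; a real SynGatorTron model would first be trained in Megatron and exported to this format, and the prompt and sampling parameters are illustrative assumptions rather than settings from the proposal.

```python
# Sketch: sampling synthetic clinical text from a decoder-only language model.
# The checkpoint id below is hypothetical, used only for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "uf-health/syngatortron"  # hypothetical checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Seed the model with a section header so the completion reads like a note.
prompt = "HISTORY OF PRESENT ILLNESS:"
inputs = tokenizer(prompt, return_tensors="pt")

# Nucleus sampling encourages varied, natural-sounding synthetic text rather
# than verbatim memorized sequences.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Notes generated this way contain no real patient identifiers by construction, which is what makes the resulting corpus a candidate surrogate for pre-training downstream clinical encoders.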