Accelerating Linguistic Diversity: GPU-Powered Corpus Curation With NVIDIA NeMo Curator for Spanish, French, and Other Non-English Languages
, Sr. Deep Learning Solution Architect, NVIDIA
, Sr. Deep Learning Data Scientist, NVIDIA
, Sr. Deep Learning Data Scientist, NVIDIA
Developing LLMs for non-English languages often faces the challenge of limited or imbalanced datasets. We present a GPU-accelerated approach to curating high-quality text corpora for languages like Spanish, French, and beyond. We'll delve into the methodologies for constructing a comprehensive corpus, progressing from basic heuristics for text selection and cleaning to more sophisticated techniques such as semantic deduplication. Finally, we'll explore strategies for synthetic data generation, a crucial step in augmenting the dataset where real-world examples are scarce.
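As a rough illustration of the first two stages named above (heuristic filtering, then similarity-based deduplication), here is a minimal CPU-only sketch in plain Python. It does not use NeMo Curator or any of its APIs. The function names, thresholds, and sample documents are all hypothetical, and the bag-of-words vector is only a toy stand-in for the learned embeddings that semantic deduplication would normally rely on.

```python
import math
import re
from collections import Counter


def passes_heuristics(text, min_words=5, min_alpha_ratio=0.7):
    """Keep a document only if it is long enough and mostly letters/spaces.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def curate(docs, sim_threshold=0.8):
    """Apply the heuristic filter, then drop documents too similar to kept ones."""
    kept, vectors = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        vec = embed(doc)
        if any(cosine(vec, other) >= sim_threshold for other in vectors):
            continue  # near-duplicate of a document we already kept
        kept.append(doc)
        vectors.append(vec)
    return kept


docs = [
    "El gato duerme en el sofá durante toda la tarde.",
    "El gato duerme en el sofá durante toda la tarde!",  # near-duplicate
    "12345 ### $$$ 67890 @@@ !!! 000",                   # mostly symbols
    "Le chat dort sur le canapé pendant tout l'après-midi.",
    "Hola",                                              # too short
]
curated = curate(docs)
for doc in curated:
    print(doc)
```

The pairwise comparison here is quadratic in the number of kept documents, so it only suits small samples. At corpus scale, that cost is one reason to cluster embeddings and run the comparison on GPUs.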
We'll show how NVIDIA's NeMo Curator can perform these tasks with remarkable speed and efficiency. You'll leave with practical insights into corpus curation workflows that can be applied to a variety of non-English languages, paving the way for more inclusive and representative language technologies. Prerequisite(s):