      Accelerating Linguistic Diversity: GPU-Powered Corpus Curation With NVIDIA NeMo Curator for Spanish, French, and Other Non-English Languages

      Sr. Deep Learning Solution Architect, NVIDIA
      Sr. Deep Learning Data Scientist, NVIDIA
      Sr. Deep Learning Data Scientist, NVIDIA
      LLM development for non-English languages is often held back by limited or imbalanced datasets. We present a GPU-accelerated approach to curating high-quality text corpora for languages such as Spanish and French. We'll walk through the methodology for building a comprehensive corpus, from basic heuristics for text selection and cleaning to more sophisticated techniques such as semantic deduplication. Finally, we'll explore strategies for synthetic data generation, a key step in augmenting the dataset where real-world examples are scarce.

      We'll show how NVIDIA NeMo Curator performs these tasks quickly and efficiently on GPUs. You'll leave with practical insights into corpus curation workflows that apply to a variety of non-English languages, supporting more inclusive and representative language technologies.
      Prerequisite(s):

      A solid programming background.
      Event: GTC 25
      Date: March 2025
      Industry: All Industries
      Level: General
      Topic: Generative AI - Text Generation
      NVIDIA Technology: NeMo
      Language: English
      Location: