{"id":29822556,"url":"https://github.com/NVIDIA/NeMo-Curator","last_synced_at":"2025-07-29T01:03:52.333Z","repository":{"id":227968333,"uuid":"772255271","full_name":"NVIDIA-NeMo/Curator","owner":"NVIDIA-NeMo","description":"Scalable data pre processing and curation toolkit for LLMs","archived":false,"fork":false,"pushed_at":"2025-07-24T01:31:23.000Z","size":11408,"stargazers_count":1031,"open_issues_count":122,"forks_count":151,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-07-24T11:37:26.810Z","etag":null,"topics":["data","data-curation","data-prep","data-preparation","data-processing","data-processing-pipelines","data-quality","datacuration","datarecipes","deduplication","fast-data-processing","fine-tuning","large-language-models","large-scale-data-processing","llm","llm-data-quality","llmapps","python","semantic-deduplication"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA-NeMo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-14T20:41:51.000Z","updated_at":"2025-07-24T10:50:01.000Z","dependencies_parsed_at":"2024-05-21T23:28:40.873Z","dependency_job_id":"38698fa3-e369-413e-a338-faecd83a5d64","html_url":"https://github.com/NVIDIA-NeMo/Curator","commit_stats":null,"previous_names":["nvidia/nemo-curator","nvidia-nemo/curator"],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/NVIDIA-NeMo/Curator","repository_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/NVIDIA-NeMo%2FCurator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FCurator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FCurator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FCurator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA-NeMo","download_url":"https://codeload.github.com/NVIDIA-NeMo/Curator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FCurator/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267414364,"owners_count":24083599,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-curation","data-prep","data-preparation","data-processing","data-processing-pipelines","data-quality","datacuration","datarecipes","deduplication","fast-data-processing","fine-tuning","large-language-models","large-scale-data-processing","llm","llm-data-quality","llmapps","python","semantic-deduplication"],"created_at":"2025-07-29T01:02:00.877Z","updated_at":"2025-07-29T01:03:52.324Z","avatar_url":"https://github.com/NVIDIA-NeMo.png","language":"Python","funding_links":[],"categories":["Data Annotation and 
Synthesis","5. **Data Generation, Processing and Management**","NVIDIA","LLM Data Preprocessing"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n  \u003ca href=\"https://github.com/NVIDIA-NeMo/Curator/blob/main/LICENSE\"\u003e![https://pypi.org/project/nemo-curator](https://img.shields.io/github/license/NVIDIA-NeMo/Curator)\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/nemo-curator/\"\u003e![https://pypi.org/project/nemo-curator/](https://img.shields.io/pypi/pyversions/nemo-curator.svg)\u003c/a\u003e\n  \u003ca href=\"https://github.com/NVIDIA-NeMo/Curator/graphs/contributors\"\u003e![NVIDIA-NeMo/Curator](https://img.shields.io/github/contributors/NVIDIA-NeMo/Curator)\u003c/a\u003e\n  \u003ca href=\"https://github.com/NVIDIA-NeMo/Curator/releases\"\u003e![https://github.com/NVIDIA-NeMo/Curator/releases](https://img.shields.io/github/release/NVIDIA-NeMo/Curator)\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/nemo-curator/\"\u003e![https://github.com/Naereen/badges/](https://badgen.net/badge/open%20source/❤/blue?icon=github)\u003c/a\u003e\n\n\u003c/div\u003e\n\n# Accelerate Data Processing and Streamline Synthetic Data Generation with NVIDIA NeMo Curator\n\nNeMo Curator is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).\n\nIt greatly accelerates data processing and curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. 
The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.\n\nNeMo Curator also provides pre-built pipelines for synthetic data generation for customization and evaluation of generative AI systems. You can use any OpenAI API compatible model and plug it in NeMo Curator's synthetic data generation pipelines to process and curate high-quality synthetic data for various use cases.\n\n## Getting Started\n\nNew to NeMo Curator? Start with our quickstart guides for hands-on experience:\n\n- **[Text Curation Quickstart](https://docs.nvidia.com/nemo/curator/latest/get-started/text.html)** - Set up your environment and run your first text curation pipeline in under 30 minutes\n- **[Image Curation Quickstart](https://docs.nvidia.com/nemo/curator/latest/get-started/image.html)** - Learn to curate large-scale image-text datasets for generative model training\n\nFor production deployments and advanced configurations, see our [Setup \u0026 Deployment documentation](https://docs.nvidia.com/nemo/curator/latest/admin/index.html).\n\n---\n\n## Key Features\n\nWith NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.\n\n### Text Curation\nAll of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data curation pipelines. Text curation follows a three-stage workflow: **Load** → **Process** → **Generate**. 
A typical pipeline starts by downloading raw data from public resources, then applies cleaning and filtering steps, and optionally generates synthetic data for training enhancement.\n\n#### Load Data\n- **[Download and Extraction](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html)** - Default implementations for Common Crawl, Wikipedia, and ArXiv sources with easy customization for other sources\n\n#### Process Data  \n- **Quality Assessment \u0026 Filtering**\n  - [Heuristic Filtering](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) - 30+ heuristic filters for punctuation density, length, and repetition analysis\n  - [fastText Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/classifier.html) - Fast language and quality classification\n  - [GPU-Accelerated Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) - Domain, Quality, Safety, Educational Content, Content Type, and Prompt Task/Complexity Classification\n\n- **Deduplication**\n  - [Exact Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/gpudedup.html) - Remove identical documents efficiently\n  - [Fuzzy Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/gpudedup.html) - MinHash Locality Sensitive Hashing with optional False Positive Check\n  - [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) - GPU-accelerated semantic deduplication using RAPIDS cuML, cuDF, and PyTorch\n\n- **Content Processing \u0026 Cleaning**\n  - [Text Cleaning](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) - Remove improperly decoded Unicode characters, inconsistent line spacing, and excessive 
URLs\n  - [PII Redaction](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/pii.html) - Identify and remove personally identifiable information from training datasets\n\n- **Specialized Processing**\n  - [Language Identification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/language-management/index.html) - Accurate language detection using fastText\n  - [Task Decontamination](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/specialized-processing/task-decontamination.html) - Remove potential evaluation data leakage from training datasets\n\n#### Generate Data\n- **[Synthetic Data Pipelines](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/index.html)** - Pre-built pipelines for generating high-quality synthetic training data:\n  - [Open Q\u0026A Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/open-qa.html) - Create question-answer pairs for instruction tuning\n  - [Math Problem Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/math.html) - Generate mathematical problems for educational content\n  - [Coding Tasks](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/python.html) - Create programming challenges and code examples\n  - [Writing Prompts](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/writing-task.html) - Generate creative writing and content creation tasks\n  - [Dialogue Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/dialogue.html) - Create conversational data for chat models\n  - [Nemotron Pipelines](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/wikipedia.html) - Wikipedia-style rewriting and knowledge distillation\n\n---\n\n### Image Curation\n\nNeMo Curator provides powerful image curation features to curate 
high-quality image data for training generative AI models such as LLMs, VLMs, and WFMs. Image curation follows a **Load** → **Process** workflow: download datasets in WebDataset format, create embeddings, apply quality filters (NSFW and Aesthetic), and remove duplicates using semantic deduplication.\n\n#### Load Data\n- **[WebDataset Loading](https://docs.nvidia.com/nemo/curator/latest/curate-images/load-data/index.html)** - Load large-scale image-text datasets in WebDataset format\n\n#### Process Data\n- **Embeddings \u0026 Feature Extraction**\n  - [Image Embedding Creation](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/embeddings/index.html) - Generate CLIP embeddings for image analysis\n\n- **Quality Assessment \u0026 Filtering**\n  - [Aesthetic Classification](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/classifiers/index.html) - Filter images based on aesthetic quality\n  - [NSFW Classification](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/classifiers/index.html) - Remove inappropriate content from datasets\n\n- **Deduplication**\n  - [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) - Remove visually similar images using embedding-based clustering\n\n---\n\n## Module Ablation and Compute Performance\n\nThe modules within NeMo Curator were primarily designed to process and curate high-quality documents at scale.  To evaluate the quality of the data, we curated Common Crawl documents and conducted a series of ablation experiments. 
In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator.\n\nThe following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./docs/user-guide/assets/readme/chart.png\" alt=\"drawing\" width=\"700\"/\u003e\n\u003c/p\u003e\n\nNeMo Curator leverages NVIDIA RAPIDS™ libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, developers can achieve 16X faster processing for text. Refer to the chart below for more details.\n\nNeMo Curator scales near-linearly, which means that developers can accelerate their data processing by adding more compute. Deduplicating the 1.96-trillion-token subset of the RedPajama V2 dataset took NeMo Curator 0.5 hours with 32 NVIDIA H100 GPUs. Refer to the scaling chart below to learn more.\n\n## Contribute to NeMo Curator\n\nWe welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for the process.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2FNeMo-Curator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA%2FNeMo-Curator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2FNeMo-Curator/lists"}