{"id":27974237,"url":"https://github.com/danish-foundation-models/dfm-processing","last_synced_at":"2025-07-02T14:09:04.257Z","repository":{"id":276180214,"uuid":"925378756","full_name":"danish-foundation-models/dfm-processing","owner":"danish-foundation-models","description":"Toolkit for processing data in the danish foundation models project.","archived":false,"fork":false,"pushed_at":"2025-03-28T16:07:32.000Z","size":585,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-07-02T14:09:02.326Z","etag":null,"topics":["data","text-processing"],"latest_commit_sha":null,"homepage":"https://www.foundationmodels.dk/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danish-foundation-models.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-31T19:06:32.000Z","updated_at":"2025-03-28T16:07:34.000Z","dependencies_parsed_at":"2025-03-07T12:26:57.213Z","dependency_job_id":"78009d68-af11-48ed-b950-cc36284446cb","html_url":"https://github.com/danish-foundation-models/dfm-processing","commit_stats":null,"previous_names":["danish-foundation-models/dfm-processing"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/danish-foundation-models/dfm-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danish-foundation-models%2Fdfm-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danish-foundation-models%2Fdfm-processing/tags","releases_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danish-foundation-models%2Fdfm-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danish-foundation-models%2Fdfm-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danish-foundation-models","download_url":"https://codeload.github.com/danish-foundation-models/dfm-processing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danish-foundation-models%2Fdfm-processing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263154351,"owners_count":23422009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","text-processing"],"created_at":"2025-05-08T00:13:18.346Z","updated_at":"2025-07-02T14:09:04.227Z","avatar_url":"https://github.com/danish-foundation-models.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv id=\"top\"\u003e\n\n\u003c!-- HEADER STYLE: CLASSIC --\u003e\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"static/logo.png\" width=\"100%\" style=\"position: relative; top: 0; right: 0;\" alt=\"Project Logo\"/\u003e\n\n\u003cbr\u003e\n\n# DFM-PROCESSING\n\n\u003cem\u003eEffortlessly Deduplicate and Process Data at Scale\u003c/em\u003e\n\n\u003c!-- BADGES --\u003e\n\u003cimg src=\"https://img.shields.io/github/license/danish-foundation-models/dfm-processing?style=default\u0026logo=opensourceinitiative\u0026logoColor=white\u0026color=0080ff\" alt=\"license\"\u003e\n\u003cimg 
src=\"https://img.shields.io/github/last-commit/danish-foundation-models/dfm-processing?style=default\u0026logo=git\u0026logoColor=white\u0026color=0080ff\" alt=\"last-commit\"\u003e\n\u003cimg src=\"https://img.shields.io/github/languages/top/danish-foundation-models/dfm-processing?style=default\u0026color=0080ff\" alt=\"repo-top-language\"\u003e\n\u003cimg src=\"https://img.shields.io/github/languages/count/danish-foundation-models/dfm-processing?style=default\u0026color=0080ff\" alt=\"repo-language-count\"\u003e\n\n\u003c!-- default option, no dependency badges. --\u003e\n\n\n\u003c!-- default option, no dependency badges. --\u003e\n\n\u003c/div\u003e\n\u003cbr\u003e\n\n---\n\n## Table of Contents\n\n- [Table of Contents](#table-of-contents)\n- [Overview](#overview)\n- [Project Structure](#project-structure)\n- [Getting Started](#getting-started)\n    - [Prerequisites](#prerequisites)\n    - [Installation](#installation)\n    - [Usage](#cli-usage)\n- [More information](#more-information)\n- [Wish to contribute?](#wish-to-contribute)\n\n---\n\n## Overview\n\nDanish Foundation Models is a collaborative project for training foundational Danish language model. 
It seeks to:\n\n- Develop and maintain **state-of-the-art models** for Danish\n- Ensure these models are **well-validated** across a wide range of tasks\n- **Ensure good documentation**, which allows users to critically assess the model for their use case\n- **Open-source** both model and source code\n\n*Note*: This repository is intended for the data processing of DFM.\n\n\n---\n\n## Project Structure\n\n```sh\n└── dfm-processing/\n    ├── .github\n    │   └── workflows\n    ├── LICENSE\n    ├── README.md\n    ├── config\n    │   └── example.yaml\n    ├── pyproject.toml\n    ├── src\n    │   └── dfm_processing\n    ├── tests\n    │   ├── cli\n    │   ├── data_pipeline\n    │   └── document_processing\n    └── uv.lock\n```\n\n---\n\n## Getting Started\n\n### Prerequisites\n\nThis project requires the following dependencies:\n\n- **Programming Language:** Python\n- **Package Manager:** uv\n\n### Installation\n\nBuild dfm-processing from source and install the dependencies:\n\n1. **Clone the repository:**\n\n    ```sh\n    ❯ git clone https://github.com/danish-foundation-models/dfm-processing\n    ```\n\n2. **Navigate to the project directory:**\n\n    ```sh\n    ❯ cd dfm-processing\n    ```\n\n3. **Install the dependencies:**\n\t\u003c!-- SHIELDS BADGE CURRENTLY DISABLED --\u003e\n\t\u003c!-- [![uv][uv-shield]][uv-link] --\u003e\n\t\u003c!-- REFERENCE LINKS --\u003e\n\t\u003c!-- [uv-shield]: https://img.shields.io/badge/uv-DE5FE9.svg?style=for-the-badge\u0026logo=uv\u0026logoColor=white --\u003e\n\t\u003c!-- [uv-link]: https://docs.astral.sh/uv/ --\u003e\n\t**Using [uv](https://docs.astral.sh/uv/):**\n\n\t```sh\n\t❯ uv sync --all-extras\n\t```\n\n### CLI Usage\n\nThe CLI is divided into two sections, \"document\" and \"pipeline\". Each section contains specific commands for different tasks.\n\n#### Document Processing (`document`)\n\n1. 
**Process Directory:**\n   - **Purpose:** Extract text data from various file types in a directory.\n   - **Usage:**\n     ```bash\n     uv run dfm-processing document process-directory path_to_dir output_dir dataset_name\n     ```\n   - **Example:**\n     ```bash\n     uv run dfm-processing document process-directory ./data ./output my_dataset\n     ```\n\n2. **Process Web Crawl:**\n   - **Purpose:** Extract text data from a web crawl.\n   - **Usage:**\n     ```bash\n     uv run dfm-processing document process-web-crawl crawl_log output_dir crawled_data dataset_name\n     ```\n   - **Example:**\n     ```bash\n     uv run dfm-processing document process-web-crawl example.com.log ./output ./crawled_data/ example.com\n     ```\n\n#### Data Pipeline (`pipeline`)\n\n1. **Filter:**\n   - **Purpose:** Run a filtering pipeline on a dataset to filter out \"poor\" quality data.\n   - **Usage:**\n     ```bash\n     uv run dfm-processing pipeline filter yaml_config\n     ```\n   - **Example:**\n     ```bash\n     uv run dfm-processing pipeline filter ./config/example.yaml\n     ```\n\n2. **Sentence Deduplication (`sent_dedup`):**\n   - **Purpose:** Perform sentence deduplication on a given dataset.\n   - **Usage:**\n     ```bash\n     uv run dfm-processing pipeline sent_dedup yaml_config\n     ```\n   - **Example:**\n     ```bash\n     uv run dfm-processing pipeline sent_dedup ./config/example.yaml\n     ```\n\n3. 
**MinHash Deduplication (`minhash-dedup`):**\n   - **Purpose:** Perform MinHash deduplication on a given dataset.\n   - **Usage:**\n     ```bash\n     uv run dfm-processing pipeline minhash-dedup yaml_config\n     ```\n   - **Example:**\n     ```bash\n     uv run dfm-processing pipeline minhash-dedup ./config/example.yaml\n     ```\n\n---\n\n## More information\nFor more information, check out the following links:\n\n|                                                                                                         |                                                                                                         |\n| ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |\n| 📑 [**About**](https://foundationmodels.dk/)              | An overview of the DFM project                                                                          |\n| [**Research Paper**](https://arxiv.org/abs/2311.07264)                                                  | A paper introducing DFM and its rationale                                                               |\n| 🚀 [**Models**](https://www.foundationmodels.dk/models/) | An overview of the models currently available through the DFM project                                   |\n| 💽 [**Datasets**](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword)       | Datasheets for the datasets, covering preprocessing, rationale for construction, and more.              |\n\n\n\n## Wish to contribute?\nDFM is a collaborative project for training and maintaining Danish language models. 
If you wish to contribute, don't hesitate to reach out through one of the following channels:\n\n|                                                                                                                      |                                                               |\n| -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |\n| 🗣 [**DDSC Slack**](https://join.slack.com/t/danskdatascie-o8m9638/shared_invite/zt-1jh2dwmj4-D_mjywfXERvVP75n9O0ykg) | Join the discussion in the \"danish-foundation-models\"-channel |\n| 💬 [**GitHub Discussion**](https://github.com/danish-foundation-models/dfm-processing/discussions)   | Ask questions or start a discussion                           |\n| 🚨 [**GitHub Issues**](https://github.com/danish-foundation-models/dfm-processing/issues)            | Noticed a bug in the code? Please create an issue             |\n\nYou can contribute in several ways:\n\n- Developer time, the lifeblood of any open-source project\n- Pre-training datasets you wish to include in the model training\n- Validation tasks, which can even be private benchmarks where you only wish to share the performance metrics\n- And probably in many other ways\n\n\u003cdiv align=\"right\"\u003e\n\n[![][back-to-top]](#top)\n\n\u003c/div\u003e\n\n\n[back-to-top]: https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square\n\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanish-foundation-models%2Fdfm-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanish-foundation-models%2Fdfm-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanish-foundation-models%2Fdfm-processing/lists"}