https://github.com/danish-foundation-models/dfm-processing

Toolkit for processing data in the danish foundation models project.

Host: GitHub
URL: https://github.com/danish-foundation-models/dfm-processing
Owner: danish-foundation-models
License: mit
Created: 2025-01-31T19:06:32.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-28T16:07:32.000Z (about 1 year ago)
Last Synced: 2025-07-02T14:09:02.326Z (12 months ago)
Topics: data, text-processing
Language: Python
Homepage: https://www.foundationmodels.dk/
Size: 571 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# DFM-PROCESSING

Effortlessly Deduplicate and Process Data at Scale


---

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Project Structure](#project-structure)
- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [CLI Usage](#cli-usage)
- [More information](#more-information)
- [Wish to contribute?](#wish-to-contribute)

---

## Overview

Danish Foundation Models (DFM) is a collaborative project for training foundational Danish language models. The project seeks to:

- Develop and maintain **state-of-the-art models** for Danish.
- Ensure the models are **well-validated** across a wide range of tasks.
- **Ensure good documentation**, so that users can critically assess the models for their use case.
- Keep both models and source code **open-source**.

*Note*: This repository is intended for the data processing of DFM.

---

## Project Structure

```sh
└── dfm-processing/
    ├── .github
    │   └── workflows
    ├── LICENSE
    ├── README.md
    ├── config
    │   └── example.yaml
    ├── pyproject.toml
    ├── src
    │   └── dfm_processing
    ├── tests
    │   ├── cli
    │   ├── data_pipeline
    │   └── document_processing
    └── uv.lock
```

---

## Getting Started

### Prerequisites

This project requires the following dependencies:

- **Programming Language:** Python
- **Package Manager:** [uv](https://docs.astral.sh/uv/)

### Installation

Build dfm-processing from source and install its dependencies:

1. **Clone the repository:**

```sh
❯ git clone https://github.com/danish-foundation-models/dfm-processing
```

2. **Navigate to the project directory:**

```sh
❯ cd dfm-processing
```

3. **Install the dependencies:**

**Using [uv](https://docs.astral.sh/uv/):**

```sh
❯ uv sync --all-extras
```

### CLI Usage

The CLI is divided into two sections, "document" and "pipeline". Each section contains specific commands for different tasks.

#### Document Processing (`document`)

1. **Process Directory:**
   - **Purpose:** Extract text data from various file types in a directory.
   - **Usage:**
     ```bash
     uv run dfm-processing document process-directory path_to_dir output_dir dataset_name
     ```
   - **Example:**
     ```bash
     uv run dfm-processing document process-directory ./data ./output my_dataset
     ```

2. **Process Web Crawl:**
   - **Purpose:** Extract text data from a web crawl.
   - **Usage:**
     ```bash
     uv run dfm-processing document process-web-crawl crawl_log output_dir crawled_data dataset_name
     ```
   - **Example:**
     ```bash
     uv run dfm-processing document process-web-crawl example.com.log ./output ./crawled_data/ example.com
     ```
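When several directories need processing, the `process-directory` command above can be scripted. The following is a minimal sketch, not part of dfm-processing: the helper names (`process_directory_cmd`, `batch_commands`, `run_all`) are hypothetical, and the only thing taken from this README is the argument order `path_to_dir output_dir dataset_name`. Naming each dataset after its subdirectory is an assumption made for illustration.

```python
import shlex
import subprocess
from pathlib import Path


def process_directory_cmd(input_dir: str, output_dir: str, dataset_name: str) -> list[str]:
    """Build the argv for `document process-directory` (argument order as in the usage line)."""
    return [
        "uv", "run", "dfm-processing", "document", "process-directory",
        input_dir, output_dir, dataset_name,
    ]


def batch_commands(root: str, output_dir: str) -> list[list[str]]:
    """One command per immediate subdirectory of `root`.

    Assumption: each subdirectory is its own dataset, named after the directory.
    """
    subdirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return [process_directory_cmd(str(p), output_dir, p.name) for p in subdirs]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # Requires uv and the synced project environment.
            subprocess.run(cmd, check=True)
```

Calling `run_all(batch_commands("./data", "./output"))` prints the commands without running them, which makes it easy to check dataset names before starting a long extraction job.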

#### Data Pipeline (`pipeline`)

1. **Filter:**
   - **Purpose:** Run a filtering pipeline on a dataset to filter out "poor" quality data.
   - **Usage:**
     ```bash
     uv run dfm-processing pipeline filter yaml_config
     ```
   - **Example:**
     ```bash
     uv run dfm-processing pipeline filter ./config/example.yaml
     ```

2. **Sentence Deduplication (`sent_dedup`):**
   - **Purpose:** Perform sentence deduplication on a given dataset.
   - **Usage:**
     ```bash
     uv run dfm-processing pipeline sent_dedup yaml_config
     ```
   - **Example:**
     ```bash
     uv run dfm-processing pipeline sent_dedup ./config/example.yaml
     ```

3. **MinHash Deduplication (`minhash-dedup`):**
   - **Purpose:** Perform MinHash Deduplication on a given dataset.
   - **Usage:**
     ```bash
     uv run dfm-processing pipeline minhash-dedup yaml_config
     ```
   - **Example:**
     ```bash
     uv run dfm-processing pipeline minhash-dedup ./config/example.yaml
     ```
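For readers unfamiliar with the two deduplication stages, the sketch below illustrates the general ideas using only the Python standard library. It is not the dfm-processing implementation, whose internals and configuration keys are not described in this README. All function names are hypothetical, and the regex sentence splitter, the word 3-gram shingles and the 64 hash functions are assumptions chosen for brevity.

```python
import hashlib
import re


def dedup_sentences(docs: list[str]) -> list[str]:
    """Drop every sentence that already appeared earlier in the corpus."""
    seen: set[str] = set()
    result = []
    for doc in docs:
        kept = []
        # Naive splitter: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            key = sentence.casefold()
            if sentence and key not in seen:
                seen.add(key)
                kept.append(sentence)
        result.append(" ".join(kept))
    return result


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of word n-grams used as the document's features."""
    words = text.casefold().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    feats = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in feats
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Sentence deduplication removes exact repeats such as boilerplate footers, while MinHash targets near-duplicate documents: two documents whose estimated Jaccard similarity exceeds a chosen threshold would be treated as duplicates. Pipelines that run at scale usually add locality-sensitive hashing so that not every pair of signatures has to be compared.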

---

## More information

For more information, please check out the following links:

| | |
| --- | --- |
| 📑 [**About**](https://foundationmodels.dk/) | An overview of the DFM project |
| [**Research Paper**](https://arxiv.org/abs/2311.07264) | A paper introducing DFM and its rationale |
| 🚀 [**Models**](https://www.foundationmodels.dk/models/) | An overview of the models currently available through the DFM project |
| 💽 [**Datasets**](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword) | Datasheets for the datasets, covering preprocessing, the reasons for their construction, and more |

## Wish to contribute?
DFM is a collaborative project for training and maintaining Danish language models. If you wish to contribute, don't hesitate to reach out through one of the following channels:

| | |
| -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| 🗣 [**DDSC Slack**](https://join.slack.com/t/danskdatascie-o8m9638/shared_invite/zt-1jh2dwmj4-D_mjywfXERvVP75n9O0ykg) | Join the discussion in the "danish-foundation-models"-channel |
| 💬 [**GitHub Discussion**](https://github.com/danish-foundation-models/dfm-processing/discussions) | Ask questions or start a discussion |
| 🚨 [**GitHub Issues**](https://github.com/danish-foundation-models/dfm-processing/issues) | Noticed a bug in the code? Please create an issue |

You can contribute in several ways:

- Developer time, the lifeblood of any open-source project
- Pre-training datasets you wish to include in the model training
- Validation tasks, which can even be private benchmarks where you only share the performance metrics
- And probably in many other ways
