https://github.com/EleutherAI/dps

Data processing system for polyglot

Host: GitHub
URL: https://github.com/EleutherAI/dps
Owner: EleutherAI
License: apache-2.0
Created: 2022-04-05T14:42:35.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2023-09-05T07:26:33.000Z (almost 3 years ago)
Last Synced: 2025-04-24T18:48:46.222Z (about 1 year ago)
Language: Python
Homepage:
Size: 7.67 MB
Stars: 91
Watchers: 6
Forks: 28
Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-production-llm - dps

README

# DPS (Data Processing System)

**Note**: DPS provides two frameworks for running Spark-based processing jobs:

  * An RDD-based framework, which is described in this README

  * A DataFrame-based framework, described in [a separate document](doc/dataframe.md)

## Requirements

- Python 3.8

## How to run DPS?

```bash
python setup.py install
python bin/sparkapp.py {job_name} {params}

# Example
# python bin/sparkapp.py sample_job --config_path=./configs/sample_job.yaml
```

## DPS job list

job | description | param options
-- | -- | --
`sample_job` | Sample jsonl data from text files in directories | `yaml configs`
`dedup_job` | De-duplicate jsonl data using the MinHash method | `yaml configs`
`korean_job` | Refine jsonl data in the Korean language | `yaml configs`
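
To give a feel for what MinHash-based de-duplication does, here is a minimal standard-library sketch. It is not the code used by `dedup_job`; the function names, the 3-word shingle size, and the 64 hash functions are all assumptions chosen for illustration. Each document is turned into a set of word n-grams, a signature is built from the minimum hash value under several seeded hash functions, and the share of matching signature slots estimates the Jaccard similarity of two documents.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum 64-bit hash of `items` under each of `num_perm` salts."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def similarity(text_a, text_b):
    """Convenience wrapper: shingle both texts and compare their signatures."""
    return estimate_jaccard(
        minhash_signature(shingles(text_a)),
        minhash_signature(shingles(text_b)),
    )
```

A de-duplication pass would then drop one document from every pair whose estimated similarity exceeds a chosen threshold. The threshold itself is a tuning decision and is not specified here.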

## Development guides

### Test Run

This is a test run of the `sample_job` job.

#### 1. Set up the `dps` package

```bash
python setup.py install
```

#### 2. Check the config file and dataset

```bash
cat configs/sample_job.yaml
ls datasets/test_sample_jsonl_data
```

#### 3. Run the `sample_job` job with `bin/sparkapp.py`

```bash
python bin/sparkapp.py sample_job --config_path=./configs/sample_job.yaml
```

#### 4. Check the output file

```bash
cat datasets/test_output_data/part-00000
```
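
Beyond eyeballing the file with `cat`, a quick check can confirm that each output line parses as JSON. The sketch below assumes the output is in JSON Lines form (one JSON value per line), which the job descriptions suggest but this README does not state explicitly; `parse_jsonl` is a hypothetical helper, not part of `dps`.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings.

    Returns (records, bad_line_numbers); blank lines are skipped and
    line numbers are 1-based.
    """
    records, bad_line_numbers = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_line_numbers.append(lineno)
    return records, bad_line_numbers
```

To use it on the test output, open `datasets/test_output_data/part-00000` with `open(..., encoding="utf-8")` and pass the file object to `parse_jsonl`; an empty list of bad line numbers means every non-blank line parsed.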

### Add your own job

#### Implement your job function

0. Open an issue on the `EleutherAI/dps` repository.

    - Describe your job first.

    - Define its inputs and outputs, with examples.

1. Go to `dps/spark/jobs` and create a Python script file, `your_own_job.py`.

2. Write a function that runs your job. Here is a template to start from.

    ```python
    from pyspark import SparkContext
    from pyspark.rdd import RDD

    from dps.spark.spark_session import spark_session
    from dps.spark.utils.io_utils import read_line, to_json


    def your_own_job(input_path, output_path):
        with spark_session('your own job') as spark:
            # The SparkContext is the entry point for running RDD operations
            sc: SparkContext = spark.sparkContext

            # Read all lines from the input file or directory
            proc_rdd: RDD = sc.textFile(input_path) \
                .repartition(10) \
                .flatMap(read_line)

            # Write the data that you processed
            proc_rdd \
                .repartition(1) \
                .flatMap(to_json) \
                .saveAsTextFile(output_path)
    ```

3. Register your job in `dps/spark/run.py`.

    ```python
    import fire

    # Keep the existing job imports (such as sample_job) and add your own
    from .jobs.your_own_job import your_own_job


    def run():
        fire.Fire({'sample_job': sample_job,
                   'your_own_job': your_own_job
                   })
    ```

4. Test-run your job.

    ```bash
    python bin/sparkapp.py your_own_job --input_path='{input_your_data_dir_or_file}' \
                                        --output_path='{output_path}'
    ```
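
The template above reads and writes records with `flatMap` rather than `map`. One plausible reading is that the helpers yield zero or one item per input, so a malformed line simply disappears instead of failing the job. The sketch below imitates that pattern locally without Spark. Everything in it is an assumption for illustration: `read_line_sketch` and `to_json_sketch` are hypothetical stand-ins that may behave differently from the real helpers in `dps.spark.utils.io_utils`, `flat_map` is a local substitute for `RDD.flatMap`, and the `text` field and `min_chars` filter are an invented example of a processing step.

```python
import json
from itertools import chain


def read_line_sketch(line):
    """Yield the parsed record, or nothing if the line is not valid JSON."""
    try:
        yield json.loads(line)
    except json.JSONDecodeError:
        return


def to_json_sketch(record):
    """Yield the record serialized as a single JSON line."""
    yield json.dumps(record, ensure_ascii=False)


def flat_map(func, items):
    """Local stand-in for RDD.flatMap: apply `func` and flatten the results."""
    return list(chain.from_iterable(func(item) for item in items))


def process(lines, min_chars=5):
    """Read, filter, and re-serialize records, mirroring the template's three stages."""
    records = flat_map(read_line_sketch, lines)
    # Hypothetical processing step: keep dict records whose "text" is long enough
    kept = [
        r for r in records
        if isinstance(r, dict) and len(r.get("text", "")) >= min_chars
    ]
    return flat_map(to_json_sketch, kept)
```

Prototyping the per-record logic this way, on a handful of lines, can be quicker than launching a Spark session; the same functions can then be passed to `flatMap` and `filter` on the RDD.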
