https://github.com/EleutherAI/dps
Data processing system for polyglot
- Host: GitHub
- URL: https://github.com/EleutherAI/dps
- Owner: EleutherAI
- License: apache-2.0
- Created: 2022-04-05T14:42:35.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2023-09-05T07:26:33.000Z (almost 3 years ago)
- Last Synced: 2025-04-24T18:48:46.222Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 7.67 MB
- Stars: 91
- Watchers: 6
- Forks: 28
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
# DPS (Data Processing System)
**Note**: There are two frameworks for running Spark-based processing jobs in DPS:
* An RDD-based framework, which is described in this README
* A DataFrame-based framework, described in [a separate document](doc/dataframe.md)
## Requirements
- Python 3.8
## How to run DPS?
```bash
python setup.py install
python bin/sparkapp.py {job_name} {params}
# Example
# python bin/sparkapp.py sample_job --config_path=./configs/sample_job.yaml
```
## DPS job list
Job | Description | Parameter options
-- | -- | --
`sample_job` | Sample jsonl data from text files in directories | `yaml configs`
`dedup_job` | De-duplicate jsonl data using MinHash method | `yaml configs`
`korean_job` | Refine jsonl data in Korean language | `yaml configs`
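
The `dedup_job` row mentions MinHash. As a rough illustration of the idea only (this is not DPS's implementation), the sketch below estimates the Jaccard similarity of two texts from their MinHash signatures using just the standard library. The shingle size, the number of hash functions, and all function names are assumptions chosen for this example.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of character n-grams of text (the whole text if it is short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash over items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
similarity = estimated_jaccard(
    minhash_signature(shingles(doc_a)),
    minhash_signature(shingles(doc_b)),
)
print(f"estimated similarity: {similarity:.2f}")
```

A de-duplication pass would typically treat pairs whose estimated similarity exceeds some threshold as duplicates; the threshold that `dedup_job` uses, if any, is set in its yaml config and is not assumed here.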
## Development guides
### Test Run
This is a test run of the `sample_job` job.
#### 1. Set up the `dps` package
```bash
python setup.py install
```
#### 2. Check config file and dataset
```bash
cat configs/sample_job.yaml
ls datasets/test_sample_jsonl_data
```
#### 3. Run the `sample_job` job with `bin/sparkapp.py`
```bash
python bin/sparkapp.py sample_job --config_path=./configs/sample_job.yaml
```
#### 4. Check output file
```bash
cat datasets/test_output_data/part-00000
```
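
Spark's `saveAsTextFile` writes a directory of `part-*` files, so `part-00000` may be only one of several partitions. The sketch below loads every partition and fails loudly on a malformed line. It assumes each output line is a single JSON object, which the jsonl wording in the job list suggests but which is not verified here; `load_jsonl_parts` is a hypothetical helper name, not part of DPS.

```python
import json
from pathlib import Path


def load_jsonl_parts(output_dir):
    """Parse every non-empty line of every part-* file under output_dir as JSON."""
    records = []
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError as err:
                    # Report which file and line failed to parse
                    raise ValueError(f"{part.name}:{line_no}: invalid JSON ({err})") from err
    return records
```

For the test run above, `len(load_jsonl_parts("datasets/test_output_data"))` would give the number of sampled records.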
### Add your own job
#### Implement your job function
0. Open an issue on the `EleutherAI/dps` repository
   - Describe your job first
   - Define its inputs and outputs, with examples
1. Go to `dps/spark/jobs` and create a Python script file named `your_own_job.py`.
2. Write a function that runs your job. Here is a template to start from:
```python
from pyspark import SparkContext
from pyspark.rdd import RDD

from dps.spark.spark_session import spark_session
from dps.spark.utils.io_utils import read_line, to_json


def your_own_job(input_path, output_path):
    with spark_session('your own job') as spark:
        # The Spark context is the entry point for RDD operations
        sc: SparkContext = spark.sparkContext

        # Read every line from the input file or directory
        proc_rdd: RDD = sc.textFile(input_path) \
            .repartition(10) \
            .flatMap(read_line)

        # Write the processed data
        proc_rdd \
            .repartition(1) \
            .flatMap(to_json) \
            .saveAsTextFile(output_path)
```
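3. Register your job in `dps/spark/run.py`
```python
import fire

# Assumed module path for the existing job, mirroring the import below
from .jobs.sample_job import sample_job
from .jobs.your_own_job import your_own_job


def run():
    # Map each CLI job name to the function that implements it
    fire.Fire({'sample_job': sample_job,
               'your_own_job': your_own_job
               })
```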
4. Test run your job
```bash
python bin/sparkapp.py your_own_job --input_path='{input_your_data_dir_or_file}' \
--output_path='{output_path}'
```
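
In the job template, each record passes through `flatMap`, which calls a function on every element and flattens the returned iterables, so a function can emit zero, one, or many outputs per input. The sketch below emulates that in plain Python so that per-record logic can be tried without starting Spark. `flat_map`, `parse_line`, `keep_long_text`, and `dump_record` are hypothetical stand-ins written for this example; they are not the real `read_line` and `to_json` from `dps.spark.utils.io_utils`, whose behavior may differ.

```python
import json
from itertools import chain


def flat_map(func, items):
    """Local stand-in for RDD.flatMap: apply func to each item and flatten the results."""
    return list(chain.from_iterable(func(item) for item in items))


def parse_line(line):
    """Hypothetical parser: [record] for a valid JSON object line, [] otherwise."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return []
    return [record] if isinstance(record, dict) else []


def keep_long_text(record, min_chars=5):
    """Hypothetical processing step: drop records whose 'text' field is too short."""
    return [record] if len(record.get("text", "")) >= min_chars else []


def dump_record(record):
    """Hypothetical serializer: one JSON string per record, keeping non-ASCII readable."""
    return [json.dumps(record, ensure_ascii=False)]


lines = ['{"text": "hello world"}', 'not json', '{"text": "hi"}']
records = flat_map(parse_line, lines)      # the malformed line is dropped
kept = flat_map(keep_long_text, records)   # the short record is dropped
output = flat_map(dump_record, kept)
print(output)
```

Because each step returns a list, the same functions could be passed directly to `rdd.flatMap(...)` inside a job once they behave as expected locally.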