{"id":39778719,"url":"https://github.com/EleutherAI/dps","last_synced_at":"2026-01-26T22:00:57.803Z","repository":{"id":37859951,"uuid":"478171836","full_name":"EleutherAI/dps","owner":"EleutherAI","description":"Data processing system for polyglot","archived":false,"fork":false,"pushed_at":"2023-09-05T07:26:33.000Z","size":8047,"stargazers_count":91,"open_issues_count":13,"forks_count":28,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-24T18:48:46.222Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EleutherAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-04-05T14:42:35.000Z","updated_at":"2025-02-26T15:13:19.000Z","dependencies_parsed_at":"2025-04-24T18:48:24.461Z","dependency_job_id":null,"html_url":"https://github.com/EleutherAI/dps","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EleutherAI/dps","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fdps","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fdps/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fdps/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fdps/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EleutherAI","download_url":"https://codeload.github.com/EleutherAI/dps/tar.gz
/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fdps/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28789720,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-26T21:49:50.245Z","status":"ssl_error","status_checked_at":"2026-01-26T21:48:29.455Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-18T12:00:35.801Z","updated_at":"2026-01-26T22:00:57.792Z","avatar_url":"https://github.com/EleutherAI.png","language":"Python","funding_links":[],"categories":["LLM Data Preprocessing"],"sub_categories":[],"readme":"# DPS (Data Processing System)\n\n**Note**: there are two frameworks for running Spark-based processing jobs in DPS\n  * An RDD-based framework, which is described in this README\n  * A DataFrame-based framework, described in [a separate document](doc/dataframe.md)\n\n\n## Requirements\n\n- python 3.8\n\n## How to run DPS?\n\n```bash\npython setup.py install\npython bin/sparkapp.py {job_name} {params}\n\n# Example\n# python bin/sparkapp.py sample_job --config_path=./configs/sample_job.yaml\n```\n\n## DPS job list\n\n job | describe | param options\n  -- | -- | --\n  `sample_job` | Sample jsonl data from text files in directories | `yaml configs`\n  `dedup_job` | De-duplicate jsonl data using MinHash method | `yaml configs`\n  `korean_job` | 
Refine jsonl data in Korean language | `yaml configs`\n\n## Development guides\n\n### Test Run\n\nThis is a test run of the `sample_job` job.\n\n#### 1. Set up the `dps` package\n\n```bash\npython setup.py install\n```\n\n#### 2. Check config file and dataset\n\n```bash\ncat configs/sample_job.yaml\nls datasets/test_sample_jsonl_data\n```\n\n#### 3. Run the `sample_job` job with `bin/sparkapp.py`\n\n```bash\npython bin/sparkapp.py sample_job --config_path=./configs/sample_job.yaml\n```\n\n#### 4. Check output file\n\n```bash\ncat datasets/test_output_data/part-00000\n```\n\n### Add your own job\n\n#### Implement your job function\n\n0. Open an issue on the `EleutherAI/dps` repository\n    - Describe your job first\n    - Define the inputs and outputs, with examples\n1. Go to `dps/spark/jobs` and create a Python script file, `your_own_job.py`.\n2. Write a function that runs your job. Here is a template to start from.\n    ```python\n    from pyspark import SparkContext\n    from pyspark.rdd import RDD\n\n    from dps.spark.spark_session import spark_session\n    from dps.spark.utils.io_utils import read_line, to_json\n\n\n    def your_own_job(input_path, output_path):\n\n        with spark_session('your own job') as spark:\n            sc: SparkContext = spark.sparkContext  # The Spark context runs your Spark application\n\n            # Read all files in your directory or file\n            proc_rdd: RDD = sc.textFile(input_path) \\\n                .repartition(10) \\\n                .flatMap(read_line)\n\n            # Write data that you processed\n            proc_rdd \\\n                .repartition(1) \\\n                .flatMap(to_json) \\\n                .saveAsTextFile(output_path)\n    ```\n3. Register your job in `dps/spark/run.py`\n    ```python\n    from .jobs.your_own_job import your_own_job\n\n    def run():\n        fire.Fire({'sample_job': sample_job,\n                   'your_own_job': your_own_job\n                   })\n    ```\n\n4. 
Test run your job \n    ```bash\n    python bin/sparkapp.py your_own_job --input_path='{input_your_data_dir_or_file}' \\\n                                        --output_path='{output_path}'\n    ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEleutherAI%2Fdps","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FEleutherAI%2Fdps","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEleutherAI%2Fdps/lists"}