{"id":37670224,"url":"https://github.com/rxn4chemistry/rxn-reaction-preprocessing","last_synced_at":"2026-01-16T12:02:50.533Z","repository":{"id":111416798,"uuid":"543667279","full_name":"rxn4chemistry/rxn-reaction-preprocessing","owner":"rxn4chemistry","description":"Preprocessing of datasets of chemical reactions: standardization, filtering, augmentation, tokenization, etc.","archived":false,"fork":false,"pushed_at":"2025-09-10T06:30:28.000Z","size":7125,"stargazers_count":13,"open_issues_count":4,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-10-27T10:57:59.774Z","etag":null,"topics":["chemistry","data","processing","reactions","retrosynthesis","rxn"],"latest_commit_sha":null,"homepage":"https://rxn4chemistry.github.io/rxn-reaction-preprocessing/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rxn4chemistry.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-09-30T15:31:22.000Z","updated_at":"2025-09-10T06:30:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"ce0e8f3f-615b-41bc-9afe-5413f2f39e6f","html_url":"https://github.com/rxn4chemistry/rxn-reaction-preprocessing","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/rxn4chemistry/rxn-reaction-preprocessing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rxn4chemistry%2Frxn-reaction-preprocessing","tags_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rxn4chemistry%2Frxn-reaction-preprocessing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rxn4chemistry%2Frxn-reaction-preprocessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rxn4chemistry%2Frxn-reaction-preprocessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rxn4chemistry","download_url":"https://codeload.github.com/rxn4chemistry/rxn-reaction-preprocessing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rxn4chemistry%2Frxn-reaction-preprocessing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chemistry","data","processing","reactions","retrosynthesis","rxn"],"created_at":"2026-01-16T12:02:50.438Z","updated_at":"2026-01-16T12:02:50.522Z","avatar_url":"https://github.com/rxn4chemistry.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RXN reaction preprocessing\n\n[![Actions 
tests](https://github.com/rxn4chemistry/rxn-reaction-preprocessing/actions/workflows/tests.yaml/badge.svg)](https://github.com/rxn4chemistry/rxn-reaction-preprocessing/actions)\n\nThis repository is devoted to preprocessing chemical reactions: standardization, filtering, etc. \nIt also includes code for stable train/test/validation splits and data augmentation.\n\nLinks:\n* [GitHub repository](https://github.com/rxn4chemistry/rxn-reaction-preprocessing)\n* [Documentation](https://rxn4chemistry.github.io/rxn-reaction-preprocessing/)\n* [PyPI package](https://pypi.org/project/rxn-reaction-preprocessing/)\n\n## System Requirements\n\nThis package is supported on all operating systems.\nIt has been tested on the following systems:\n* macOS: Big Sur (11.1)\n* Linux: Ubuntu 18.04.4\n\nPython 3.7 or later is recommended.\n\n## Installation guide\n\nThe package can be installed from PyPI:\n```bash\npip install rxn-reaction-preprocessing[rdkit]\n```\nYou can leave out `[rdkit]` if you prefer to install `rdkit` manually (via Conda or PyPI).\n\nFor local development, the package can be installed with:\n```bash\npip install -e \".[dev]\"\n```\n\n## Usage\nThe following command-line scripts are installed with the package.\n\n### rxn-data-pipeline\nWrapper for all other scripts. Allows constructing flexible data pipelines. 
Entrypoint for Hydra structured configuration.\n\nFor an overview of all available configuration parameters and default values, run: `rxn-data-pipeline --cfg job`.\n\nConfiguration using YAML (see the file `config.py` for more options and their meaning):\n```yaml\ndefaults:\n  - base_config\n\ndata:\n  path: /tmp/inference/input.csv\n  proc_dir: /tmp/rxn-preproc/exp\ncommon:\n  sequence:\n    # Define which steps and in which order to execute:\n    - IMPORT\n    - STANDARDIZE\n    - PREPROCESS\n    - SPLIT\n    - TOKENIZE\n  fragment_bond: TILDE\npreprocess:\n  min_products: 0\nsplit:\n  split_ratio: 0.05\ntokenize:\n  input_output_pairs:\n    - inp: ${data.proc_dir}/${data.name}.processed.train.csv\n      out: ${data.proc_dir}/${data.name}.processed.train\n    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv\n      out: ${data.proc_dir}/${data.name}.processed.validation\n    - inp: ${data.proc_dir}/${data.name}.processed.test.csv\n      out: ${data.proc_dir}/${data.name}.processed.test\n```\n```bash\nrxn-data-pipeline --config-dir . 
--config-name example_config\n```\n\nConfiguration using command-line arguments (example):\n```bash\nrxn-data-pipeline \\\n  data.path=/path/to/data/rxns-small.csv \\\n  data.proc_dir=/path/to/proc/dir \\\n  common.fragment_bond=TILDE \\\n  rxn_import.data_format=TXT \\\n  tokenize.input_output_pairs.0.out=train.txt \\\n  tokenize.input_output_pairs.1.out=validation.txt \\\n  tokenize.input_output_pairs.2.out=test.txt\n```\n\n## Note about reading CSV files\nPandas does not always seem able to re-read a CSV file it has written if the file contains Windows carriage returns.\nFor the scripts to work despite this, every `pd.read_csv` call should include the argument `lineterminator='\\n'`.\n\n## Examples\n\n### A pipeline supporting augmentation\n\nA config supporting augmentation of the training split, called `train-augmentation-config.yaml`:\n```yaml\ndefaults:\n  - base_config\n\ndata:\n  name: pipeline-with-augmentation\n  path: /tmp/file-with-reactions.txt\n  proc_dir: /tmp/rxn-preprocessing/experiment\ncommon:\n  sequence:\n    # Define which steps and in which order to execute:\n    - IMPORT\n    - STANDARDIZE\n    - PREPROCESS\n    - SPLIT\n    - AUGMENT\n    - TOKENIZE\n  fragment_bond: TILDE\nrxn_import:\n  data_format: TXT\npreprocess:\n  min_products: 1\nsplit:\n  input_file_path: ${preprocess.output_file_path}\n  split_ratio: 0.05\naugment:\n  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv\n  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv\n  permutations: 10\n  tokenize: false\n  random_type: rotated\ntokenize:\n  input_output_pairs:\n    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv\n      out: ${data.proc_dir}/${data.name}.augmented.train\n      reaction_column_name: rxn_rotated\n    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv\n      out: ${data.proc_dir}/${data.name}.processed.validation\n    - inp: ${data.proc_dir}/${data.name}.processed.test.csv\n      out: 
${data.proc_dir}/${data.name}.processed.test\n```\n```bash\nrxn-data-pipeline --config-dir . --config-name train-augmentation-config\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frxn4chemistry%2Frxn-reaction-preprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frxn4chemistry%2Frxn-reaction-preprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frxn4chemistry%2Frxn-reaction-preprocessing/lists"}