{"id":21721429,"url":"https://github.com/informaticsmatters/pipelines","last_synced_at":"2026-03-01T09:32:12.166Z","repository":{"id":18263807,"uuid":"76369410","full_name":"InformaticsMatters/pipelines","owner":"InformaticsMatters","description":"Containerised components for cheminformatics and computational chemistry","archived":false,"fork":false,"pushed_at":"2023-05-23T00:00:49.000Z","size":63592,"stargazers_count":36,"open_issues_count":18,"forks_count":19,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-12T21:36:38.806Z","etag":null,"topics":["docker","dockerfile","python","rdkit","squonk","squonk-pipelines"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InformaticsMatters.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-12-13T15:06:20.000Z","updated_at":"2024-12-29T11:16:05.000Z","dependencies_parsed_at":"2022-09-04T00:40:52.258Z","dependency_job_id":"8e1367a3-0ce1-4d8d-9c1b-157cf5ec43f6","html_url":"https://github.com/InformaticsMatters/pipelines","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/InformaticsMatters/pipelines","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fpipelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fpipelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fpipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Informati
csMatters%2Fpipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InformaticsMatters","download_url":"https://codeload.github.com/InformaticsMatters/pipelines/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fpipelines/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29965593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T06:55:38.174Z","status":"ssl_error","status_checked_at":"2026-03-01T06:53:04.810Z","response_time":124,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","dockerfile","python","rdkit","squonk","squonk-pipelines"],"created_at":"2024-11-26T02:16:45.359Z","updated_at":"2026-03-01T09:32:12.119Z","avatar_url":"https://github.com/InformaticsMatters.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pipelines\n\n![build](https://github.com/InformaticsMatters/pipelines/workflows/build/badge.svg)\n![build latest](https://github.com/InformaticsMatters/pipelines/workflows/build%20latest/badge.svg)\n![build tag](https://github.com/InformaticsMatters/pipelines/workflows/build%20tag/badge.svg)\n![build stable](https://github.com/InformaticsMatters/pipelines/workflows/build%20stable/badge.svg)\n\n![GitHub release (latest 
SemVer including pre-releases)](https://img.shields.io/github/v/release/informaticsmatters/pipelines?include_prereleases)\n\nThe project experiments with ways to generate data processing pipelines.\nThe aim is to generate some re-usable building blocks that can be piped\ntogether into more functional pipelines. Their prime initial use is as executors\nfor the Squonk Computational Notebook (http://squonk.it) though it is expected\nthat they will have uses in other environments.\n\nAs well as being executable directly they can also be executed in Docker\ncontainers (separately or as a single pipeline). Additionally they can be\nexecuted using Nextflow (http://nextflow.io) to allow running large jobs\non HPC-like environments.\n\nCurrently it has some Python scripts using RDKit (http://rdkit.org) to provide\nbasic cheminformatics and computational chemistry functionality, though other tools will\nbe coming soon, including some from the Java ecosystem.\n\n* See [here](src/python/pipelines/rdkit/README.md) for more info on the RDKit components.\n* See [here](src/nextflow/README.md) for more info on running these in Nextflow.\n\nNote: this is experimental, subject to change, and there are no guarantees that things work as expected!\nThat said, it's already proved to be highly useful in the Squonk Computational Notebook, and if you are interested let us know, and join the fun.\n\nThe code is licensed under the Apache 2.0 license.\n\n## Pipeline Utils\n\nIn Jan 2018 some of the core functionality from this repository was broken out into the [pipeline-utils](https://github.com/InformaticsMatters/pipeline-utils) repository. This included utility Python modules, as well as creation of a test framework that makes it easier to create and test new modules. This change also makes it easier to create additional pipeline-like projects. 
See the [Readme](https://github.com/InformaticsMatters/pipelines-utils/blob/master/README.md) in the pipeline-utils repo for more details.\n\n## General principles\n\n### Modularity\n\nEach component should be small but useful. Try to split complex tasks into\nreusable steps. Think how the same steps could be used in other workflows.\nAllow parts of one component to be used in another component where appropriate\nbut avoid overuse. For example, see the use of functions in rdkit/conformers.py\nto generate conformers in o3dAlign.py.\n\n### Consistency\n\nConsistent approach to how components function, regarding:\n\n1. Use as simple command line tools that can be piped together\n1. Inputs and outputs either as files or using STDIN and STDOUT\n1. Any info/logging written to STDERR to keep STDOUT free for output\n1. Consistent approach to command line arguments across components\n\nGenerally use consistent coding styles e.g. PEP8 for Python.\n\n## Input and output formats\n\nWe aim to provide consistent input and output formats to allow results to be\npassed between different implementations. Currently all implementations handle\nchemical structures so SD files would typically be used as the lowest common\ndenominator interchange format, but implementations should also try to support\nSquonk's JSON based Dataset formats, which potentially allow richer representations\nand can be used to describe data other than chemical structures.\nThe utils.py module provides helper methods to handle IO.\n\n### Thin output\n\nIn addition implementations are encouraged to support \"thin\" output formats\nwhere this is appropriate. A \"thin\" representation is a minimal representation\ncontaining only what is new or changed, and can significantly reduce the bandwidth\nused and avoid the need for the consumer to interpret values it does not\nneed to understand. It is not always appropriate to support thin format output\n(e.g. 
when the structure is changed by the process).\n\nIn the case of SDF, thin format involves using an empty molecule for the molecule\nblock and including all properties that were present in the input or were generated by the\nprocess (the empty molecule is used so that the SDF syntax remains valid).\n\nIn the case of Squonk JSON output the thin output would be of type BasicObject\n(i.e. containing no structure information) and include all properties that\nwere present in the input or were generated by the process.\n\nImplicit in this is that some identifier (usually an SD file property, or\nthe JSON UUID property) that is present in the input is included in the output so\nthat the full results can be \"reassembled\" by the consumer of the output.\nThe input would typically only contain additional information that is required\nfor execution of the process e.g. the structure.\n\nFor consistency implementations should try to honor these command line\nswitches relating to input and output:\n\n-i and --input: For specifying the location of the single input. If not specified\nthen STDIN should be used. File names ending with .gz should be interpreted as\ngzipped files. Input on STDIN should not be gzipped.\n\n-if and --informat: For specifying the input format where it cannot be inferred\nfrom the file name (e.g. when using STDIN). Values would be sdf or json.\n\n-o and --output: For specifying the base name of the outputs (there could be multiple\noutput files each using the same base name but with a different file extension).\nIf not specified then STDOUT should be used. Output file names ending with\n.gz should be compressed using gzip. Output on STDOUT would not be gzipped.\n\n-of and --outformat: For specifying the output format where it cannot be inferred\nfrom the file name (e.g. when using STDOUT). Values would be sdf or json.\n\n--meta: Write additional metadata and metrics (mostly relevant to Squonk's\nJSON format - see below). 
Default is not to write.\n\n--thin: Write output in thin format (only present where this makes sense).\nDefault is not to use thin format.\n\n### UUIDs\n\nThe JSON format for input and output makes heavy use of UUIDs that uniquely\nidentify each structure. Generally speaking, if the structure is not changed\n(e.g. properties are just being added to input structures) then the existing\nUUID should be retained so that UUIDs in the output match those from the input.\nHowever if new structures are being generated (e.g. in reaction enumeration\nor conformer generation) then new UUIDs MUST be generated as there is no longer\na direct relationship between the input and output structures. Instead you\nprobably want to store the UUID of the source structure(s) as one or more fields in\nthe output to allow correlation of the outputs to the inputs (e.g. for conformer\ngeneration, output the source molecule UUID as a field so that each conformer\nidentifies which source molecule it was derived from).\n\nWhen not using JSON format the need to handle UUIDs does not necessarily apply\n(though if there is a field named 'uuid' in the input it will be respected accordingly).\nTo accommodate this you are recommended to ALSO specify the input molecule number\n(1 based index) as an output field independent of whether UUIDs are being handled\nas a \"poor man's\" approach to correlating the outputs to the inputs.\n\n### Filtering\n\nWhen a service filters molecules special attention is needed to ensure\nthat the molecules are output in the same order as the input (obviously skipping\nstructures that are filtered out). Also the service descriptor (.dsd.json) file needs special care. 
For\ninstance take a look at the \"thinDescriptors\" section of src/pipelines/rdkit/screen.dsd.json.\n\nWhen using multi-threaded execution this is especially important as results\nwill usually not come back in exactly the same order as the input.\n\n### Metrics\n\nTo provide information about what happened you are strongly recommended to generate\na metrics output file (e.g. output_metrics.txt). The contents of this file are fairly simple,\neach line having a\n\n`key=value`\n\nsyntax. Keys beginning and ending with __ (2 underscores) have magical meaning.\nAll other keys are treated as metrics that are recorded against that execution.\nThe current magical keys that are recognised (written with the surrounding\nunderscores, e.g. `__InputCount__`) are:\n\n* InputCount: The total count of records (structures) that are processed\n* OutputCount: The count of output records\n* ErrorCount: The number of errors encountered\n\nHere is a typical metrics file:\n\n```\n__InputCount__=360\n__OutputCount__=22\nPLI=360\n\n```\n\nIt defines the input and output counts and specifies that 360 PLI 'units'\nshould be recorded as being consumed during execution.\n\nThe purpose of the metrics is primarily to be able to charge for utilisation, but\neven if not charging (which is often the case) then it is still good practice\nto record the utilisation.\n\n### Metadata\n\nSquonk's JSON format requires additional metadata to allow proper handling\nof the JSON. Writing detailed metadata is optional, but recommended. If\nnot present then Squonk will use a minimal representation of metadata, but\nit's recommended to provide this directly so that additional information can\nbe added.\n\nAt the very minimum Squonk needs to know the type of dataset (e.g. MoleculeObject\nor BasicObject), but this should be handled for you automatically if you use\nthe utils.default_open_output* methods. Better though to also specify metadata for\nthe field types when you do this. See e.g. 
conformers.py for an example of\nhow to do this.\n\n## Deployment to Squonk\n\nThe service descriptors need to be POSTed to the Squonk coreservices REST API.\n\n### Docker\n\nA shell script can be used to deploy the pipelines to a running\ncontainerised Squonk deployment:\n\n    $ ./post-service-descriptors.sh\n\n### OpenShift/OKD\n\nThe pipelines and service-descriptor container images are built using Gradle\nin this project. They are deployed from the Squonk project using Ansible\nplaybooks.\n\n\u003e   A discussion about the deployment of pipelines can be found in the\n    `Posting Squonk pipelines` section of Squonk's OpenShift Ansible\n    [README](https://github.com/InformaticsMatters/squonk/blob/master/openshift/ansible/README.md).\n\n## Running tests\n\nThe test runner is in the pipelines-utils repo and tests are run from there.\nFor full details consult that repo.\n\nBut as a quick start you should be able to run the tests in a conda environment like this:\n\nCreate a conda environment containing RDKit:\n```\nconda env create -f environment-rdkit-utils.yml\n```\nNow activate that environment:\n```\nconda activate pipelines-utils\n```\n\nNote: this environment includes pipeline-utils and pipeline-utils-rdkit from PyPI.\nIf you need to use changes from these repos you will need to create a conda environment that does not contain these and\ninstead set your `PYTHONPATH` environment variable to include the `pipelines-utils` and `pipelines-utils-rdkit` sources\n(adjusting `/path/to/` to whatever is needed):\n```\nexport PYTHONPATH=/path/to/pipelines-utils/src/python:/path/to/pipelines-utils-rdkit/src/python\n```\n\nMove into the `pipelines-utils` repo (this should be alongside `pipelines` and `pipelines-utils-rdkit`):\n```\ncd ../pipelines-utils\n```\n\nRun tests:\n```\n./gradlew runPipelineTester -Pptargs=-opipelines\n```\n\n## Contact\n\nFor any questions contact:\n\nTim Dudgeon\ntdudgeon@informaticsmatters.com\n\nAlan 
Christie\nachristie@informaticsmatters.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finformaticsmatters%2Fpipelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finformaticsmatters%2Fpipelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finformaticsmatters%2Fpipelines/lists"}