{"id":34108115,"url":"https://github.com/druths/xp","last_synced_at":"2026-03-27T04:35:04.431Z","repository":{"id":31416740,"uuid":"34980120","full_name":"druths/xp","owner":"druths","description":"A framework (comand line tool + libraries) for creating flexible compute pipelines","archived":false,"fork":false,"pushed_at":"2021-01-29T05:00:00.000Z","size":250,"stargazers_count":55,"open_issues_count":7,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-03-27T04:08:23.906Z","etag":null,"topics":["data-science","notebook","pipeline","research-tool","workflow"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/druths.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-05-03T09:10:58.000Z","updated_at":"2026-03-09T19:50:48.000Z","dependencies_parsed_at":"2022-09-02T11:51:13.379Z","dependency_job_id":null,"html_url":"https://github.com/druths/xp","commit_stats":null,"previous_names":["druths/flex"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/druths/xp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/druths%2Fxp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/druths%2Fxp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/druths%2Fxp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/druths%2Fxp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/druths","download_url":"https://codeload.github.com/druths/xp/tar.gz/refs/heads/main","sbom_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/druths%2Fxp/sbom","scorecard":{"id":357200,"data":{"date":"2025-08-11","repo":{"name":"github.com/druths/xp","commit":"a4f66ae3551fc0b5c66ece816a8276bd7b7e3ccf"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.1,"checks":[{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":1,"reason":"Found 3/27 approved changesets -- score normalized to 1","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and 
uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no 
fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 6 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-18T09:54:04.784Z","repository_id":31416740,"created_at":"2025-08-18T09:54:04.784Z","updated_at":"2025-08-18T09:54:04.784Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31020060,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-27T03:51:26.850Z","status":"ssl_error","status_checked_at":"2026-03-27T03:51:09.693Z","response_time":164,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","notebook","pipeline","research-tool","workflow"],"created_at":"2025-12-14T18:13:45.606Z","updated_at":"2026-03-27T04:35:04.425Z","avatar_url":"https://github.com/druths.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# xp [![Build Status](https://travis-ci.org/druths/xp.svg?branch=master)](https://travis-ci.org/druths/xp) [![Doc Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://xpds.readthedocs.org/en/latest/) #\n\n*Expressive pipelines for data science*\n\nData science projects get disorganized quickly. Every test involves a new\nscript, each script requires a panoply of arguments and produces one or more\ndata files.  Keeping track of all this implied structure is a pain - what is\nthis script for?  What does it depend on?  
What created this data file? Which\nparameters updated this table in the database?\n\nEnter xp - a utility that allows you to express and run all the computational\ntasks in a project. Crucially, it captures the specific parameters used for\neach task, the data files produced, and any dependencies that task has on other\ntasks. All this is captured in files called *pipelines* (which can even be\nconnected to one another).  Toss in some helpful comments, and you have\nexecutable documentation for your project.\n\nThis may sound a lot like scientific notebook environments (e.g., Jupyter and\nMathematica), but there are some key differences. Notebooks only allow linear\ndependencies between computational tasks - which is a tremendous simplification\nof even moderate-sized projects.\n\nTo this end, it has three primary goals:\n\n  1. Capture the task-level logic and structure of a data science project in a\n     language-agnostic way\n\n  1. Make it possible to trace data back to the specific tasks (and commands)\n     that produced it\n\n  1. Connect documentation of tasks to the task logic itself\n\nIt aims to achieve these three things without introducing any overhead.  You\nwon't have to write more code than you currently are doing, write or maintain\nany extra documentation, or use fancy data management solutions. Whatever you\nare already doing, xp is compatible with it.\n\nxp is a command-line tool for building data-science pipelines, particularly\nin the research context. By *pipeline*, we mean writing tasks that depend on\none another.  
Imagine *make*, except with a lot of intelligence built into it\nthat is relevant to data science work.\n\nMoreover, xp makes it easy to create and update pipelines while always\nretaining a connection to the data that the pipeline produced, and the\npipelines are self-documenting at the same time.\n\n*Detailed documentation is available on [readthedocs](http://xpds.readthedocs.org/en/latest/).*\n\n# Installation #\n\nInstall xp from [PyPI](https://pypi.python.org/pypi?name=xp\u0026:action=display) using\n\n\tpip install xp\n\nor install from source by downloading from GitHub and running\n\n\tpython setup.py install\n\nor\n\n\tsudo python setup.py install\n\ndepending on permission settings for your python site-packages.\n\n# Writing a pipeline #\n\nA *pipeline* is a sequence of steps (called *tasks*) that manipulates data.  On\na very practical level, each task takes some data as input and produces\nother data as output. In the example below, there are four tasks, each\ninvolved in a different part of the workflow.\n\n```\n# Pipeline: cluster_data\n\nDATA_URL=http://foobar:8080/data.tsv\nNAME_COLUMN=1\nCOUNT_COLUMN=2\nALPHA=2\n\ndownload_data:\n\tcode.sh:\n\t\tcurl $DATA_URL \u003e data.tsv\n\nextract_columns: download_data\n\tcode.sh:\n\t\tcut -f $NAME_COLUMN,$COUNT_COLUMN data.tsv | tail -n +2 \u003e xdata.tsv\n\tcode.py:\n\t\tfrom csv import reader\n\t\tfout = open('xdata2.tsv','w')\n\t\tfor data in reader(open('xdata.tsv','r'),delimiter='\\t'):\n\t\t\tfout.write('%s\\t%s\\n' % (data[0],data[1][1:-1]))\n\t\ncluster_rows.sh: extract_columns\n\t./cluster.sh --alpha $ALPHA xdata2.tsv \u003e clusters.tsv\n\nplot_clusters: cluster_rows\n\tcode.gnuplot:\n\t\tplot \"clusters.tsv\" using 1:2 title 'Cluster quality'\n\n```\n\nTasks can depend on other tasks (e.g., `extract_columns` depends on\n`download_data`) - either in the same pipeline or in other pipelines.  
By\nmaking tasks depend on other pipelines, it's possible to treat pipelines as\nmodular parts of much larger workflows.\n\nOnce a task completes without error, it is *marked* - which flags it as not\nneeding to be run again.  In order to re-run a task, one can simply unmark it\nand run it again.\n\nIf you choose to run a task which has unmarked dependencies, these will be run\nbefore the task itself is run - in this way, an entire workflow can be run\nusing a single command.\n\nA task contains one or more *blocks* which describe the actual computational\nsteps being taken. As is seen in the example above, blocks can contain code for\nvarious different languages - making it possible to stitch together workflows\nthat involve different languages. A single task can even contain multiple\nblocks for the same or different languages.\n\nCurrently, xp supports four block types:\n\n  - *export* (`export`) - this allows environment variables to be set and \n  \tunset within the context of a specific task\n\n  - *python* (`code.py`)\n\n  - *shell* (`code.sh`)\n\n  * *gnuplot* (`code.gnuplot`)\n\nThese, of course, require that the appropriate executables are present on the\nsystem. To customize the executable used, environment variables can be set\n(`PYTHON_EXEC` and `GNUPLOT_EXEC`, respectively).\n\nFuture releases will support additional languages natively and also provide a\nplugin mechanism for adding new block types. \n\nNotice in the example above that the task `cluster_rows` places a language\nsuffix right after the task name.  This is called a simple task: it consists of\nexactly one block, written in the language of the language suffix, which\nfollows the task definition line directly. 
This is a useful shortcut\nfor tasks which contain only one block.\n\nOnce a pipeline has been written, it can be run using the xp command-line tool.\n\n\txp run pipeline_file\n\nThe command-line tool also allows easy marking (`mark`), unmarking (`unmark`),\nand querying task info (`tasks`) for a pipeline.\n\n## Pipeline-specific Data ##\n\nA common activity that creates a lot of data management issues is running\nthe same or similar pipelines using different parameter settings:\nfiles can get overwritten and, more generally, the user typically loses track\nof exactly which files came from which setting.\n\nIn xp, files produced by a pipeline can be easily bound to their pipeline,\neliminating this confusion.\n\n```\nDATA_URL=http://foobar:8080/data.tsv\nALPHA=2\n...\n\ndownload_data:\n\tcode.sh:\n\t\tcurl $DATA_URL \u003e $PLN(data.tsv)\n\n...\n```\n\nIn the excerpt above, the file `data.tsv` is being bound to this pipeline using\nthe `$PLN(.)` function. In effect, the file is made pipeline-specific (either by\nprefixing its name or by placing it in a pipeline-specific directory).  Future references to this file via\n`$PLN(data.tsv)` will access only this pipeline's version of the file - even if\nmany pipelines are downloading the files at various times.\n\n(Note that `$` is treated as a special character by xp and so identifiers such\nas `$PATH` or `$1` will be parsed as xp variables. To avoid this, escape the\n`$` using `\\`, e.g. `\\$PATH` or `\\$1`.)\n\n## Extending Pipelines ##\n\nIn some cases, one will want to run exactly the same pipeline over and over\nwith different parameter settings. To support this, xp allows *extending*\npipelines.  Much like subclassing, extending a pipeline brings all the content\nof one pipeline into another one.  Assume we are clustering some data using the\nprocess below (in pipeline `cluster_pln`).  
The process is parameterized by the\nalpha value.\n\n```\n# Pipeline: cluster_pln\n\nCLUSTERS_FNAME=clusters.tsv\n\ncluster_rows: extract_columns\n\tcode.sh:\n\t\t./cluster.sh --alpha $ALPHA xdata2.tsv \u003e $PLN($CLUSTERS_FNAME)\n\nplot_clusters: cluster_rows\n\tcode.gnuplot:\n\t\tplot \"$PLN($CLUSTERS_FNAME)\" using 1:2 title 'Cluster quality at alpha=$ALPHA'\n```\n\nWe can extend this pipeline to retain the same workflow, but use different values:\n\n```\n# Pipeline: cluster_a2\nextend cluster_pln\n\nALPHA=2\n```\n\nand again for a different value\n\n```\n# Pipeline: cluster_a3\nextend cluster_pln\n\nALPHA=3\n```\n\nNote that in each case the cluster data will be stored to `$PLN(clusters.tsv)`,\nso that each pipeline will have its own separate stored data.\n\n## Connecting Pipelines Together ##\n\nIt's quite reasonable to expect that one pipeline could feed into another\npipeline. xp supports this - pipelines can depend on the tasks in other\npipelines - and in doing so, create even larger workflows that retain their\nnice modular organization.\n\nConsider that the earlier pipeline given above, `cluster_a2`, could actually be\nassembling the data for a classifier. 
Let's break this classifier portion of\nthe project into its own workflow.\n\n```\n# Pipeline: lda_classifier\n\nuse cluster_a2 as cdata\n\nNEWS_ARTICLES=articles/*.gz\n\nbuild_lda: cdata.cluster_rows\n\texport:\n\t\tLDA_CLASSIFIER=lda_runner\n\n\tcode.sh:\n\t\t${LDA_CLASSIFIER} -input $PLN(cdata, ${cdata.CLUSTERS_FNAME}) -output $PLN(lda_model.json)\n\nlabel_articles: build_lda\n\texport:\n\t\tLDA_LABELER=/opt/bin/lda_labeler\n\tcode.sh:\t\n\t\t${LDA_LABELER} -model $PLN(lda_model.json) -data \"$NEWS_ARTICLES\" \u003e $PLN(news.labels)\n```\n\nIn the example above, notice how the task `build_lda` both depends on a task\nfrom the `cluster_a2` pipeline and *also* uses data from that pipeline's\nnamespace, `$PLN(cdata, ${cdata.CLUSTERS_FNAME})`, where `${cdata.CLUSTERS_FNAME}`\nreferences the `CLUSTERS_FNAME` variable inherited by `cluster_a2` from\n`cluster_pln`.\n\nOf course, we might want to try multiple classifiers on the same source data,\nso we can create other pipelines that use `cluster_a2`, shown next.\n\n```\n# Pipeline: crf_classifier\n\nuse cluster_a2 as cdata\n\nNEWS_ARTICLES=articles/*.gz\n\nbuild_crf_model: cdata.cluster_rows\n\tcode.sh:\n\t\t/opt/bin/build_crf -data $PLN(cdata, ${cdata.CLUSTERS_FNAME}) -output $PLN(crf_model.json)\n\nlabel_articles.py: build_crf_model\n\t\timport crf_model\n\n\t\tmodel = crf_model.load_model('$PLN(crf_model.json)')\n\t\tmodel.label_documents(docs='$NEWS_ARTICLES',out_file='$PLN(news.labels)')\n```\n\n## Examples ##\n\nSee the `examples/` directory in the xp root directory for some real\npipelines that demonstrate the core features of the tool.\n\n# Command-line usage #\n\nThe `xp` command provides several core capabilities:\n\n  - `xp tasks \u003cpipeline\u003e` will output info about one or more tasks in the pipeline, including whether they are marked\n\n  - `xp run \u003cpipeline\u003e` will run a pipeline (or a task within a pipeline)\n\n  - `xp mark \u003cpipeline\u003e` will mark specific tasks or an entire 
pipeline\n\n  - `xp unmark \u003cpipeline\u003e` will unmark specific tasks or an entire pipeline\n\nAll of these commands provide help messages that explain their correct use.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdruths%2Fxp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdruths%2Fxp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdruths%2Fxp/lists"}