{"id":13412512,"url":"https://github.com/capillariesio/capillaries","last_synced_at":"2025-08-01T00:02:38.116Z","repository":{"id":63450301,"uuid":"540939783","full_name":"capillariesio/capillaries","owner":"capillariesio","description":"Distributed batch data processing framework","archived":false,"fork":false,"pushed_at":"2024-08-09T17:14:47.000Z","size":5605,"stargazers_count":59,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-08-09T18:54:07.028Z","etag":null,"topics":["batch-processing","cassandra","dag","distributed-computing","distributed-systems","go","golang","rabbitmq","relational-algebra","workflow-engine","workflows"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/capillariesio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-24T19:11:20.000Z","updated_at":"2024-08-09T17:14:50.000Z","dependencies_parsed_at":"2023-11-17T22:57:18.769Z","dependency_job_id":"d45078b3-0cbf-4d17-ba18-ae1e49891a64","html_url":"https://github.com/capillariesio/capillaries","commit_stats":{"total_commits":75,"total_committers":2,"mean_commits":37.5,"dds":0.1466666666666666,"last_synced_commit":"83379c1fbb1913c5372e415e7ed79a2fedaa6c22"},"previous_names":[],"tags_count":26,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capillariesio%2Fcapillaries","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capillariesio%2Fcapillaries/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capillariesio%2Fcapillaries/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capillariesio%2Fcapillaries/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/capillariesio","download_url":"https://codeload.github.com/capillariesio/capillaries/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233352207,"owners_count":18663254,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-processing","cassandra","dag","distributed-computing","distributed-systems","go","golang","rabbitmq","relational-algebra","workflow-engine","workflows"],"created_at":"2024-07-30T20:01:25.547Z","updated_at":"2025-08-01T00:02:38.024Z","avatar_url":"https://github.com/capillariesio.png","language":"Go","funding_links":[],"categories":["Distributed Systems","分布式系统"],"sub_categories":["Search and Analytic Databases","检索及分析资料库"],"readme":"# \u003cimg src=\"doc/logo.svg\" alt=\"logo\" width=\"60\"/\u003e \u003ca href=\"https://capillaries.io\"\u003eCapillaries\u003c/a\u003e \u003cdiv style=\"float:right;\"\u003e [![coveralls](https://coveralls.io/repos/github/capillariesio/capillaries/badge.svg?branch=main)](https://coveralls.io/github/capillariesio/capillaries?branch=main) [![goreport](https://goreportcard.com/badge/github.com/capillariesio/capillaries)](https://goreportcard.com/report/github.com/capillariesio/capillaries) [![Go 
Reference](https://pkg.go.dev/badge/github.com/capillariesio/capillaries.svg)](https://pkg.go.dev/github.com/capillariesio/capillaries)\u003c/div\u003e\r\n\r\n\r\nCapillaries is a data processing framework that:\r\n- addresses scalability issues and manages intermediate data storage, enabling users to focus on data transforms and quality control;\r\n- bridges the gap between distributed, scalable data processing/integration solutions and the necessity to produce enriched, customer-ready, production-quality, human-curated data within SLA time limits.\r\n\r\nThis is a GitHub readme. More details and a blog at https://capillaries.io.\r\n\r\n## Why Capillaries?\r\n![Capillaries: before and after](doc/beforeafter.png)\r\n\r\n\r\n|             | BEFORE | AFTER |\r\n| ----------- | ------ |------ |\r\n| Cloud-friendly | Depends | Can be deployed to the cloud within minutes; Docker-ready |\r\n| Data aggregation | SQL joins | Capillaries [lookups](doc/glossary.md#lookup) in Cassandra + [Go expressions](doc/glossary.md#go-expressions) (scalability, parallel execution) |\r\n| Data filtering | SQL queries, custom code | [Go expressions](doc/glossary.md#go-expressions) (scalability, maintainability) |\r\n| Data transform | SQL expressions, custom code | [Go expressions](doc/glossary.md#go-expressions), Python [formulas](doc/glossary.md#py_calc-processor) (parallel execution, maintainability) |\r\n| Intermediate data storage | Files, relational databases | on-the-fly-created Cassandra [keyspaces](doc/glossary.md#keyspace) and [tables](doc/glossary.md#table) (scalability, maintainability) |\r\n| Workflow execution | Shell scripts, custom code, workflow frameworks | RabbitMQ as scheduler, workflow status stored in Cassandra (parallel execution, fault tolerance, incremental computing) |\r\n| Workflow monitoring and interaction | Custom solutions | Capillaries [UI](ui/README.md), [Toolbelt](doc/glossary.md#toolbelt) utility, [API](doc/api.md), [Web API](doc/glossary.md#webapi) 
(transparency, operator validation support) |\r\n| Workflow management | Shell scripts, custom code | Capillaries configuration: [script file](doc/glossary.md#script) with [DAG](doc/glossary.md#dag), Python [formulas](doc/glossary.md#py_calc-processor) |\r\n\r\n## Getting started\r\n\r\nOn Mac, WSL or Linux, run in bash shell:\r\n\r\n```\r\ngit clone https://github.com/capillariesio/capillaries.git\r\ncd capillaries\r\n./copy_demo_data.sh\r\ndocker compose -p \"test_capillaries_containers\" up\r\n```\r\n\r\nNavigate to `http://localhost:8080` to see [Capillaries UI](./doc/glossary.md#capillaries-ui), and wait until it shows the `Keyspaces` screen with no errors. It may take a while - all Docker containers must start and Cassandra must be fully initialized. Now Capillaries is ready to process sample demo input data according to the sample demo scripts (all copied by copy_demo_data.sh above).\r\n\r\nStart a new Capillaries [data processing run](./doc/glossary.md#run) by clicking \"New run\" and providing the following parameters (no tabs or spaces allowed in textboxes):\r\n\r\n| Field | Value |\r\n|- | - |\r\n| Keyspace | portfolio_quicktest |\r\n| Script URI | /tmp/capi_cfg/portfolio_quicktest/script_quick.json |\r\n| Script parameters URI | /tmp/capi_cfg/portfolio_quicktest/script_params_quick_fs_one.json |\r\n| Start nodes |\t1_read_accounts,1_read_txns,1_read_period_holdings |\r\n\r\nAlternatively, you can start a new [run](./doc/glossary.md#run) using Capillaries [toolbelt](./doc/glossary.md#toolbelt) by executing the following command from the Docker host machine; it should have the same effect as starting a run from the UI:\r\n\r\n```\r\ndocker exec -it capillaries_webapi /usr/local/bin/capitoolbelt start_run -script_file=/tmp/capi_cfg/portfolio_quicktest/script_quick.json -params_file=/tmp/capi_cfg/portfolio_quicktest/script_params_quick_fs_one.json -keyspace=portfolio_quicktest -start_nodes=1_read_accounts,1_read_txns,1_read_period_holdings\r\n```\r\n\r\nWatch 
the progress in Capillaries UI. A new keyspace `portfolio_quicktest` will appear in the keyspace list. Click on it and watch the run complete - nodes `7_file_account_period_sector_perf` and `7_file_account_year_perf` should produce result files:\r\n\r\n```\r\ncat /tmp/capi_out/portfolio_quicktest/account_period_sector_perf.csv\r\ncat /tmp/capi_out/portfolio_quicktest/account_year_perf.csv\r\n```\r\n\r\n## Monitoring your test runs\r\n\r\nBesides [Capillaries UI](./doc/glossary.md#capillaries-ui) at `http://localhost:8080`, you may want to check out the stats provided by other tools.\r\n\r\nLog messages generated by:\r\n- Capillaries [Daemon](./doc/glossary.md#daemon)\r\n- Capillaries [WebAPI](./doc/glossary.md#webapi)\r\n- Capillaries [UI](./doc/glossary.md#capillaries-ui)\r\n- RabbitMQ\r\n- Cassandra with Prometheus jmx-exporter\r\n- Prometheus\r\nare collected by fluentd and saved in /tmp/capi_log.\r\n\r\nTo see Cassandra cluster status, run this command (reset JVM_OPTS so jmx-exporter doesn't try to attach to the nodetool JVM process):\r\n```\r\ndocker exec -e JVM_OPTS= capillaries_cassandra1 nodetool status\r\n```\r\n\r\nCassandra read/write statistics and some Daemon/Webapi metrics collected by Prometheus are available 
at:\r\n\r\n`http://localhost:9090/query?g0.expr=sum%28irate%28cassandra_clientrequest_localrequests_count%7Bclientrequest%3D%22Write%22%7D%5B1m%5D%29%29\u0026g0.show_tree=0\u0026g0.tab=graph\u0026g0.range_input=15m\u0026g0.res_type=auto\u0026g0.res_density=medium\u0026g0.display_mode=lines\u0026g0.show_exemplars=1\u0026g1.expr=sum%28irate%28cassandra_clientrequest_localrequests_count%7Bclientrequest%3D%22Read%22%7D%5B1m%5D%29%29\u0026g1.show_tree=0\u0026g1.tab=graph\u0026g1.range_input=15m\u0026g1.res_type=auto\u0026g1.res_density=medium\u0026g1.display_mode=lines\u0026g1.show_exemplars=0\u0026g2.expr=irate%28capi_script_def_cache_hit_count%5B1m%5D%29\u0026g2.show_tree=0\u0026g2.tab=graph\u0026g2.range_input=15m\u0026g2.res_type=auto\u0026g2.res_density=medium\u0026g2.display_mode=lines\u0026g2.show_exemplars=0\u0026g3.expr=irate%28capi_script_def_cache_miss_count%5B1m%5D%29\u0026g3.show_tree=0\u0026g3.tab=graph\u0026g3.range_input=15m\u0026g3.res_type=auto\u0026g3.res_density=medium\u0026g3.display_mode=lines\u0026g3.show_exemplars=0`\r\n\r\n## Further steps\r\n\r\n### Blog at \u003ca href=\"https://capillaries.io/blog\"\u003ecapillaries.io\u003c/a\u003e\r\nFor more details about this particular demo, see Capillaries blog: [Use Capillaries to calculate ARK portfolio performance](https://capillaries.io/blog/2023-04-08-portfolio/index.html). To learn how this demo runs on a bigger dataset with 14 million transactions, see [Capillaries: ARK portfolio performance calculation at scale](https://capillaries.io/blog/2023-11-15-portfolio-scale/index.html).\r\n\r\n### Further introduction\r\nFor more details about getting started, see [Getting started](doc/started.md).\r\n\r\n### Deploy Capillaries at scale\r\n\r\n#### Container-based deployments\r\n\r\nCapillaries binaries are intended to be container-friendly. 
Check out the `docker-compose.yml` and the [Kubernetes deployment POC](./deploy/k8s/README.md); these test projects may be a good starting point for creating your full-scale container-based deployment.\r\n\r\n#### VM-based deployment\r\n\r\nSee the [Terraform script](./deploy/tf/cassandra_cluster/README.md) that creates a Capillaries deployment in AWS.\r\n\r\n## Capillaries in depth\r\n\r\n### [What it is and what it is not](doc/what.md) (use case discussion and diagrams)\r\n### [Getting started](doc/started.md) (run a quick Docker-based demo without compiling a single line of code)\r\n### [Testing](doc/testing.md)\r\n### [Toolbelt, Daemon, and Webapi configuration](doc/binconfig.md)\r\n### [Script configuration](doc/scriptconfig.md)\r\n### [Capillaries UI](ui/README.md)\r\n### [Capillaries API](doc/api.md)\r\n### [Glossary](doc/glossary.md)\r\n### [Q \u0026 A](doc/qna.md)\r\n### [Capillaries blog](https://capillaries.io/blog/index.html)\r\n### [MIT License](LICENSE)\r\n\r\n(C) 2022-2025 KH (kleines.hertz[at]protonmail.com)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapillariesio%2Fcapillaries","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcapillariesio%2Fcapillaries","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapillariesio%2Fcapillaries/lists"}