{"id":26130615,"url":"https://github.com/emeraldpay/dshackle-archive","last_synced_at":"2025-04-13T19:34:17.931Z","repository":{"id":175395992,"uuid":"461364978","full_name":"emeraldpay/dshackle-archive","owner":"emeraldpay","description":"ETL for Bitcoin and Ethereum data","archived":false,"fork":false,"pushed_at":"2025-03-21T08:22:53.000Z","size":70287,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-21T09:28:05.480Z","etag":null,"topics":["analysis","bigdata","bitcoin","ethereum","etl"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/emeraldpay.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-20T02:31:12.000Z","updated_at":"2025-03-21T08:22:58.000Z","dependencies_parsed_at":"2023-07-25T14:31:17.888Z","dependency_job_id":"08101209-3419-49d0-864c-47f81662e5b1","html_url":"https://github.com/emeraldpay/dshackle-archive","commit_stats":null,"previous_names":["emeraldpay/dshackle-archive"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emeraldpay%2Fdshackle-archive","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emeraldpay%2Fdshackle-archive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emeraldpay%2Fdshackle-archive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emeraldpay%2Fdshackle-archive/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/emeraldpay","download_url":"https://codeload.github.com/emeraldpay/dshackle-archive/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248768341,"owners_count":21158618,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","bigdata","bitcoin","ethereum","etl"],"created_at":"2025-03-10T20:52:02.749Z","updated_at":"2025-04-13T19:34:17.908Z","avatar_url":"https://github.com/emeraldpay.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"= Dshackle Archive\n:version: 0.2.0\n:version-short: 0.2\n\nimage:https://github.com/emeraldpay/dshackle-archive/workflows/Tests/badge.svg[\"Unit Tests\"]\nimage:https://codecov.io/gh/emeraldpay/dshackle-archive/branch/master/graph/badge.svg[\"Coverage\",link=\"https://codecov.io/gh/emeraldpay/dshackle-archive\"]\nimage:https://img.shields.io/docker/pulls/emeraldpay/dshackle-archive?style=flat-square[\"Docker\",link=\"https://hub.docker.com/r/emeraldpay/dshackle-archive\"]\nimage:https://img.shields.io/github/license/emeraldpay/dshackle-archive.svg?style=flat-square\u0026maxAge=2592000[\"License\",link=\"https://github.com/emeraldpay/dshackle-archive/blob/master/LICENSE\"]\nimage:https://img.shields.io/discord/1107840420240707704?style=flat-square[Discord,link=\"https://discord.gg/k9HpF9Jqee\"]\n\nDshackle Archive is a tool that efficiently extracts blockchain data in JSON format and archives it into Avro files for scalable analysis using traditional Big Data tools. 
It supports Bitcoin and Ethereum-compatible blockchains and can operate in both batch mode for historical data and streaming mode for real-time data extraction.\n\nDshackle Archive copies JSON data from a blockchain to a plain-file archive\n(i.e., it's the _Extraction_ of data, as in ETL).\nThe archive consists of Avro files which contain blocks and transaction details fetched from blockchain nodes via their API.\nThe Avro files contain the JSON responses _as is_.\n\n\nFeatures:\n\n- Extracts data from *Bitcoin and Ethereum* compatible blockchains.\n- Produces data in *Avro format*, where a file holds either a _range_ of blocks or a single _separate_ block\n- Archival can be scaled across multiple blockchain nodes by using https://github.com/emeraldpay/dshackle[Dshackle Load Balancer].\n- Runs in a batch archive mode, when _historical data_ is archived, or in a _streaming mode_, when only fresh data is added to the archive\n- Archives to a Filesystem, Google Storage or S3\n- Produces notifications via Google PubSub or Apache Pulsar\n\nThe general idea is to use Dshackle Archive to copy data from a blockchain to plain files, keeping the data structures as is, and then use traditional Big Data tools (Spark, Beam, etc.) to analyse the data in a scalable way.\n\nNOTE: It takes several weeks to archive the whole blockchain. It's a very IO intensive operation, but can be scaled by using multiple nodes.\n\nNOTE: It uses the https://github.com/emeraldpay/dshackle[Dshackle] protocol to connect to the blockchain, not plain HTTP/JSON RPC. 
So a Dshackle instance is required to be running.\n\n== Usage\n\n=== Command Line Options\n\n----\nusage: dshackle-archive [options] \u003ccommand\u003e\nCopy blockchain data into files for further analysis\n    --auth.aws.accessKey \u003carg\u003e    AWS / S3 Access Key\n    --auth.aws.secretKey \u003carg\u003e    AWS / S3 Secret Key\n    --auth.gcp \u003carg\u003e              Path to GCP Authentication JSON\n    --aws.endpoint \u003carg\u003e          AWS / S3 endpoint url instead of the\n                                  default one\n    --aws.region \u003carg\u003e            AWS / S3 region ID to use for requests\n    --aws.s3.pathStyle            Enable S3 Path Style access (default is\n                                  false)\n    --aws.trustTls                Trust any TLS certificate for AWS / S3\n                                  (default is false)\n -b,--blockchain \u003carg\u003e            Blockchain\n    --back                        Apply the blocks range (--range option)\n                                  back from to the current blockchain\n                                  height, i.e. process N..M not starting\n                                  not from zero by from the current height\n -c,--connection \u003carg\u003e            Connection (host:port)\n    --compact.forks               Should accept all blocks including forks\n                                  in stream compaction\n    --compact.ranges              Process range files also in stream\n                                  compaction\n    --connection.notls            Disable TLS\n    --connection.timeout \u003carg\u003e    Timeout (in seconds) to get data from\n                                  blockchain before retrying. 
Default: 60\n                                  seconds\n    --continue                    Continue from the last file if set\n -d,--dir \u003carg\u003e                   Target directory\n    --deduplicate                 Deduplicate transactions and blocks\n                                  (could increase memory footprint)\n    --dirBlocks \u003carg\u003e             How many blocks keep per subdirectory\n    --dryRun                      Do not modify the storage\n -h,--help                        Show Help\n -i,--inputs \u003carg\u003e                Input File(s). Accepts a glob pattern\n                                  for filename\n    --include \u003carg\u003e               Include optional details in JSON (trace,\n                                  stateDiff)\n    --notify.dir \u003carg\u003e            Write notifications as JSON line to the\n                                  specified dir in a file\n                                  \u003cdshackle-archive-%STARTTIME.jsonl\u003e\n    --notify.pubsub \u003carg\u003e         Send notifications as JSON to the\n                                  specified Google Pubsub topic\n    --notify.pulsar.topic \u003carg\u003e   Send notifications as JSON to the Pulsar\n                                  to the specified topic\n                                  (notify.pulsar.url must be specified)\n    --notify.pulsar.url \u003carg\u003e     Send notifications as JSON to the Pulsar\n                                  with specified URL (notify.pulsar.topic\n                                  must be specified)\n    --parallel \u003carg\u003e              How many blocks to request in parallel.\n                                  Range: 1..512. 
Default: 8\n    --prefix \u003carg\u003e                File prefix\n -r,--range \u003carg\u003e                 Blocks Range (N...M)\n    --rangeChunk \u003carg\u003e            Range chunk size (default 1000)\n    --tail \u003carg\u003e                  Last T block to ensure are archived in\n                                  streaming mode\nAvailable commands:\n archive - the main operation, copies data from a blockchain to archive\n stream  - append fresh blocks one by one to the archive\n compact - merge individual block files into larger range files\n copy    - copy/recover from existing archive by copying into a new one\n report  - show summary on what is in archive for the specified range\n fix     - fix archive by making new archives for missing chunks\n verify  - verify that archive files contains required data and delete incomplete files\n----\n\n=== Commands\n\n==== Archive\n\nThe main operation; it copies historical data from a blockchain to the archive.\n\nThe data is copied in ranges, and with the default range of 1000 blocks it produces two files per range:\none for the blocks in that range, and another one with _all_ transactions from all blocks in that range.\n\nSee \u003c\u003carchive-format\u003e\u003e.\n\n==== Stream\n\nContinuously append fresh blocks one by one to the archive.\nIn addition to copying, Dshackle Archive can be configured to notify an external system about new blocks in the archive.\n\nNote that in streaming mode the archives are written on a per-block basis.\nI.e., each block comes as a separate pair of files:\none for the block itself, and another one for all transactions in that block.\nTo merge the individual files into larger ranges use the `compact` command.\n\nTo notify an external system, there are three options:\n\n- `--notify.dir` - write notifications as JSON line to the specified dir in a file `\u003cdshackle-archive-%STARTTIME.jsonl\u003e`\n- `--notify.pubsub` - send notifications as JSON to the specified Google 
Pubsub topic\n- `--notify.pulsar.url` + `--notify.pulsar.topic` - send notifications as JSON to the specified Apache Pulsar topic\n\nSee \u003c\u003cnotification-format\u003e\u003e.\n\n==== Compact\n\nMerge individual block files into larger range files.\n\n==== Copy\n\nCopy from one archive to another.\n\nTechnically, you can copy the files as is, but the command is useful because it lets you change the range size for the target archive.\nIt can also be used to recover from a corrupted archive, because it makes additional checks and skips the corrupted data.\n\n==== Report\n\nShow a summary of what is in the archive for the specified range.\n\n==== Fix\n\nFixes the archive by checking for missing blocks and, if any are found, creating new archives for them.\n\n==== Verify\n\nVerify that archive files contain the required data and delete incomplete/corrupted files.\nThe `fix` command is then supposed to be run to download the missing blocks.\n\nWARNING: This command is destructive: it deletes files from the archive.\n\n=== Archive Size\n\nBecause Dshackle Archive copies and stores data as JSON responses from blockchain nodes, the resulting archive is much larger than the node database, which keeps data in a compact format.\nIt uses Snappy compression for Avro files, which gives a good compression ratio, but the resulting archive is still large.\n\nAverage size of a 1000 blocks range (w/o expensive JSON such as `stateDiff` and `trace`):\n\n- ~300MB for Ethereum\n- ~400MB for Bitcoin\n\nAnd the whole archive (w/o expensive JSON such as `stateDiff` and `trace`):\n\n- ~2.5TB for Ethereum\n- ~1.9TB for Bitcoin\n\n=== Related projects:\n\n- Avro structure and Java stubs: https://github.com/emeraldpay/dshackle-archive-avro\n- Dshackle load balancer: https://github.com/emeraldpay/dshackle\n\n=== Project Roadmap\n\n- [x] support AWS S3 as a storage\n- [x] support Pulsar as a notification system\n- [ ] support Kafka as a notification system\n- [ ] archive to 
Cassandra\n\n=== FAQ\n\n==== How to organize the data gathering process?\n\n- First you need to archive the historical data, which may take several weeks depending on how many nodes you have and how fast they are.\n- After finishing the initial archive, you run it in the Streaming mode, which appends new blocks to the archive as they are mined.\n- Periodically (e.g., once a day) you run Compaction to merge individual block files into larger range files.\n- Also, periodically (e.g., once a day) you run a pair of Verify and Fix commands to ensure the integrity of the archive.\n\n==== What are supported blockchains?\n\nDshackle requires compatibility only at the JSON RPC level, so technically it can work with any blockchain that uses a similar API.\nI.e., it's compatible with all major blockchains, including Bitcoin, Ethereum, Binance Smart Chain, Polygon, etc.\n\n==== What blockchain API it uses?\n\nIt uses the https://github.com/emeraldpay/dshackle[Dshackle] protocol to connect to the blockchain, not plain HTTP/JSON RPC.\nSo a Dshackle instance is required to be running.\n\nDshackle is a Load Balancer for Blockchain APIs, and it can route requests to multiple nodes, which scales up the archival throughput.\n\n==== How does Dshackle Archive ensure the integrity and accuracy?\n\nDshackle Archive provides two commands to ensure the integrity of the archive:\n\n- first you run the `verify` command, which checks the archive and deletes incomplete or corrupted files\n- then you run the `fix` command, which copies the data again for the blocks deleted in the previous step\n\nYou can schedule the execution of these commands to run periodically, e.g. once a day.\nTo avoid scanning the whole archive every time, you can specify a range to check, e.g. 
`--back --range 100...1100`.\nThe options above specify that it should verify/fix only the last 1000 blocks, starting from 100 blocks behind the current height.\nI.e., it goes backward from the current head block.\n\n\n[[archive-format]]\n=== Archive Format\n\nFor a complete description, the schema, and libraries to access the Avro files, please refer to https://github.com/emeraldpay/dshackle-archive-avro\n\n==== Block\n\n.Fields common between different blockchains\n- `blockchainType` - _type of blockchain_, as a definition of which fields to expect.\nOne of `ETHEREUM` or `BITCOIN`\n- `blockchainId` - actual blockchain id (`ETH`, `BTC`, etc)\n- `archiveTimestamp` - when the archive record was created.\nMilliseconds since epoch\n- `height` - block height\n- `blockId` - block hash\n- `timestamp` - block timestamp.\nMilliseconds since epoch\n- `parentId` - parent block hash\n- `json` - JSON response for that block\n\n.Ethereum specific fields\n- `unclesCount` - number of uncles for the current block\n- `uncle0Json` - JSON for the first uncle (`eth_getUncleByBlockHashAndIndex(0)`)\n- `uncle1Json` - JSON for the second uncle (`eth_getUncleByBlockHashAndIndex(1)`)\n\n.Bitcoin specific fields\n- none\n\n==== Transaction\n\n.Fields common between different blockchains\n- `blockchainType` - _type of blockchain_, as a definition of which fields to expect. One of `ETHEREUM` or `BITCOIN`\n- `blockchainId` - blockchain id (`ETH`, `BTC`, etc)\n- `archiveTimestamp` - when the archive record was created. Milliseconds since epoch\n- `height` - block height\n- `blockId` - block hash\n- `timestamp` - block timestamp. 
Milliseconds since epoch\n- `index` - index of the transaction in block\n- `txid` - hash or transaction id of the transaction\n- `json` - JSON response for that transaction\n- `raw` - raw bytes of the transaction\n\n.Ethereum specific fields\n- `from` - from address\n- `to` - to address\n- `receiptJson` - JSON response for `eth_getTransactionReceipt`\n- `traceJson` - JSON response for `trace_replayTransaction(trace)`\n- `stateDiffJson` - JSON response for `trace_replayTransaction(stateDiff)`\n\n.Bitcoin specific fields\n- none\n\n[[notification-format]]\n=== Notification format\n\n[source, json]\n----\n{\n  \"version\":\"https://schema.emrld.io/dshackle-archive/notify\",\n  \"ts\":\"2022-05-20T23:14:24.481327Z\",\n  \"blockchain\":\"ETH\",\n  \"type\":\"transactions\",\n  \"run\":\"stream\",\n  \"heightStart\":14813875,\n  \"heightEnd\":14813875,\n  \"location\":\"gs://my-bucket/blockchain-archive/eth/014000000/014813000/014813875.txes.avro\"\n}\n----\n\n.Where\n- `version` id of the current JSON format\n- `ts` timestamp of the archive event\n- `blockchain` blockchain\n- `type` type of file (`transactions` or `blocks`)\n- `run` mode in which the Dshackle Archive is run (`archive`, `stream`, `copy` or `compact`)\n- `heightStart` and `heightEnd` range of blocks in the archived files\n- `location` a URL to the archived file\n\n== Community\n\n=== Development Chat\n\nJoin our Discord chat to discuss development and ask questions:\n\nimage:https://img.shields.io/discord/1107840420240707704?style=flat-square[Discord,link=\"https://discord.gg/k9HpF9Jqee\"]\n\n\n== Commercial Support\n\nWant to support the project, prioritize a specific feature, or get commercial help with using Dshackle in your project?\nPlease contact splix@emerald.cash to discuss the possibility.\n\n== License\n\nCopyright 2023 EmeraldPay, Inc\n\nLicensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License.\nYou may obtain a copy of 
the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and limitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femeraldpay%2Fdshackle-archive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femeraldpay%2Fdshackle-archive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femeraldpay%2Fdshackle-archive/lists"}