Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/paradigmxyz/cryo
cryo is the easiest way to extract blockchain data to parquet, csv, json, or python dataframes
- Host: GitHub
- URL: https://github.com/paradigmxyz/cryo
- Owner: paradigmxyz
- License: apache-2.0
- Created: 2023-06-27T00:11:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-08T07:31:22.000Z (25 days ago)
- Last Synced: 2025-01-25T18:01:03.067Z (7 days ago)
- Topics: crypto, ethereum, evm, parquet, rust
- Language: Rust
- Homepage:
- Size: 1.13 MB
- Stars: 1,262
- Watchers: 10
- Forks: 127
- Open Issues: 46
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE
Awesome Lists containing this project
- awesome-evm-data-tools - Cryo - CLI tool for extracting blockchain data to various formats (parquet, csv, json, python dataframes) (Data Query)
README
# ❄️🧊 cryo 🧊❄️
[![Rust](https://github.com/paradigmxyz/cryo/actions/workflows/build_and_test.yml/badge.svg)](https://github.com/paradigmxyz/cryo/actions/workflows/build_and_test.yml) [![Telegram Chat](https://img.shields.io/badge/Telegram-join_chat-blue.svg)](https://t.me/paradigm_data)
`cryo` is the easiest way to extract blockchain data to parquet, csv, json, or a python dataframe.
`cryo` is also extremely flexible, with [many different options](#cryo-help) to control how data is extracted + filtered + formatted
*`cryo` is an early WIP, please report bugs + feedback to the issue tracker*
*note that `cryo`'s default settings will slam a node too hard for use with 3rd party RPC providers. Instead, `--requests-per-second` and `--max-concurrent-requests` should be used to impose rate limits. Such settings will be handled automatically in a future release*.
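for example, a rate-limited invocation against a hosted provider might look like the sketch below (the endpoint URL and the specific limit values are placeholders, not recommendations):

```bash
# placeholder endpoint and example limits; tune to your provider's allowances
cryo blocks txs -b 18M:18.1M \
    --rpc https://your-provider.example/rpc \
    --requests-per-second 20 \
    --max-concurrent-requests 5
```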
to discuss cryo, check out [the telegram group](https://t.me/paradigm_data)
## Contents
1. [Example Usage](#example-usage)
2. [Installation](#installation)
3. [Data Schema](#data-schemas)
4. [Code Guide](#code-guide)
5. [Documentation](#documentation)
    1. [Basics](#cryo-help)
    2. [Syntax](#cryo-syntax)
    3. [Datasets](#cryo-datasets)

## Example Usage
use as `cryo [OPTIONS]`
| Example | Command |
| :- | :- |
| Extract all logs from block 16,000,000 to block 17,000,000 | `cryo logs -b 16M:17M` |
| Extract blocks, txs, or traces missing from current directory | `cryo blocks txs traces` |
| Extract to csv instead of parquet | `cryo blocks txs traces --csv` |
| Extract only certain columns | `cryo blocks --include number timestamp` |
| Dry run to view output schemas or expected work | `cryo storage_diffs --dry` |
| Extract all USDC events | `cryo logs --contract 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48` |

For a more complex example, see the [Uniswap Example](./examples/uniswap.sh).
`cryo` uses the `ETH_RPC_URL` env var as the data source unless `--rpc` is given.
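for example, either of the following works (the local node URL is a placeholder):

```bash
# point cryo at a node via the environment variable
export ETH_RPC_URL=http://localhost:8545
cryo logs -b 16M:17M

# or pass the endpoint explicitly for a single run
cryo logs -b 16M:17M --rpc http://localhost:8545
```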
## Installation
The simplest way to use `cryo` is as a cli tool:
#### Method 1: install from source
```bash
git clone https://github.com/paradigmxyz/cryo
cd cryo
cargo install --path ./crates/cli
```

This method requires having rust installed. See [rustup](https://rustup.rs/) for instructions.
#### Method 2: install from crates.io
```bash
cargo install cryo_cli
```

This method requires having rust installed. See [rustup](https://rustup.rs/) for instructions.
Make sure that `~/.cargo/bin` is on your `PATH`. One way to do this is by adding the line `export PATH="$HOME/.cargo/bin:$PATH"` to your `~/.bashrc` or `~/.profile`.
### Python Installation
`cryo` can also be installed as a python package:
#### Installing `cryo` python from pypi
(make sure rust is installed first, see [rustup](https://www.rust-lang.org/tools/install))
```bash
pip install maturin
pip install cryo
```

#### Installing `cryo` python from source
```bash
pip install maturin
git clone https://github.com/paradigmxyz/cryo
cd cryo/crates/python
maturin build --release
pip install --force-reinstall <path_to_built_wheel>.whl
```

## Data Schemas
Many `cryo` cli options will affect output schemas by adding/removing columns or changing column datatypes.
`cryo` will always print out data schemas before collecting any data. To view these schemas without collecting data, use `--dry` to perform a dry run.
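for example, the sketches below preview schemas without collecting anything (the column names follow the example usage table above):

```bash
# print the default logs schema and expected work, collecting no data
cryo logs -b 16M:17M --dry

# preview a reduced blocks schema that keeps only two columns
cryo blocks --columns number timestamp --dry
```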
#### Schema Design Guide
An attempt is made to ensure that the dataset schemas conform to a common set of design guidelines:
- By default, rows should contain enough information in their columns to be order-able (unless the rows do not have an intrinsic order).
- Columns should usually be named by their JSON-RPC or ethers.rs defaults, except in cases where a much more explicit name is available.
- To make joins across tables easier, a given piece of information should use the same datatype and column name across tables when possible.
- Large ints such as `u256` should allow multiple conversions. A `value` column of type `u256` should allow: `value_binary`, `value_string`, `value_f32`, `value_f64`, `value_u32`, `value_u64`, and `value_d128`. These types can be specified at runtime using the `--u256-types` argument (see the example after the type list below).
- Columns related to non-identifying cryptographic signatures are omitted by default. For example, `state_root` of a block or `v`/`r`/`s` of a transaction.
- Integer values that can never be negative should be stored as unsigned integers.
- Every table should allow a `chain_id` column so that data from multiple chains can be easily stored in the same table.

Standard types across tables:
- `block_number`: `u32`
- `transaction_index`: `u32`
- `nonce`: `u32`
- `gas_used`: `u64`
- `gas_limit`: `u64`
- `chain_id`: `u64`
- `timestamp`: `u32`
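a minimal sketch of the `--u256-types` option mentioned above (the dataset and block range are arbitrary):

```bash
# keep only binary and f64 representations of u256 columns such as `value`
cryo txs -b 18M:18.01M --u256-types binary f64
```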
#### JSON-RPC

`cryo` currently obtains all of its data using the [JSON-RPC](https://ethereum.org/en/developers/docs/apis/json-rpc/) protocol standard.
|dataset|blocks per request|results per block|method|
|-|-|-|-|
|Blocks|1|1|`eth_getBlockByNumber`|
|Transactions|1|multiple|`eth_getBlockByNumber`, `eth_getBlockReceipts`, `eth_getTransactionReceipt`|
|Logs|multiple|multiple|`eth_getLogs`|
|Contracts|1|multiple|`trace_block`|
|Traces|1|multiple|`trace_block`|
|State Diffs|1|multiple|`trace_replayBlockTransactions`|
|Vm Traces|1|multiple|`trace_replayBlockTransactions`|

`cryo` uses [ethers.rs](https://github.com/gakonst/ethers-rs) to perform JSON-RPC requests, so it can be used with any chain that ethers-rs is compatible with. This includes Ethereum, Optimism, Arbitrum, Polygon, BNB, and Avalanche.
A future version of `cryo` will be able to bypass JSON-RPC and query node data directly.
## Code Guide
- Code is arranged into the following crates:
- `cryo_cli`: convert textual data into cryo function calls
- `cryo_freeze`: core cryo code
- `cryo_python`: cryo python adapter
- `cryo_to_df`: procedural macro for generating dataset definitions
- Do not use panics (including `panic!`, `todo!`, `unwrap()`, and `expect()`) except in the following circumstances: tests, build scripts, lazy static blocks, and procedural macros
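assuming the crate names above match the cargo package names (a sketch, not an official workflow), individual crates can be built and tested from the workspace root:

```bash
# build the cli crate and test the core crate individually
cargo build -p cryo_cli
cargo test -p cryo_freeze

# or exercise the whole workspace
cargo test --workspace
```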
## Documentation

1. [cryo help](#cryo-help)
2. [cryo syntax](#cryo-syntax)
3. [cryo datasets](#cryo-datasets)

#### cryo help
(output of `cryo help`)
```
cryo extracts blockchain data to parquet, csv, or json

Usage: cryo [OPTIONS] [DATATYPE]...
Arguments:
[DATATYPE]... datatype(s) to collect, use cryo datasets to see all available

Options:
--remember Remember current command for future use
-v, --verbose Extra verbosity
--no-verbose Run quietly without printing information to stdout
-h, --help Print help
-V, --version Print version

Content Options:
-b, --blocks ... Block numbers, see syntax below
--timestamps ... Timestamp numbers in unix, overridden by blocks
-t, --txs ... Transaction hashes, see syntax below
-a, --align Align chunk boundaries to regular intervals,
e.g. (1000 2000 3000), not (1106 2106 3106)
--reorg-buffer Reorg buffer, save blocks only when this old,
can be a number of blocks [default: 0]
-i, --include-columns [...] Columns to include alongside the defaults,
use `all` to include all available columns
-e, --exclude-columns [...] Columns to exclude from the defaults
--columns [...] Columns to use instead of the defaults,
use `all` to use all available columns
--u256-types ... Set output datatype(s) of U256 integers
[default: binary, string, f64]
--hex Use hex string encoding for binary columns
-s, --sort [...] Columns(s) to sort by, `none` for unordered
--exclude-failed Exclude items from failed transactions

Source Options:
-r, --rpc RPC url [default: ETH_RPC_URL env var]
--network-name Network name [default: name of eth_chainId]

Acquisition Options:
-l, --requests-per-second Ratelimit on requests per second
--max-retries Max retries for provider errors [default: 5]
--initial-backoff Initial retry backoff time (ms) [default: 500]
--max-concurrent-requests Global number of concurrent requests
--max-concurrent-chunks Number of chunks processed concurrently
--chunk-order Chunk collection order (normal, reverse, or random)
-d, --dry Dry run, collect no data

Output Options:
-c, --chunk-size Number of blocks per file [default: 1000]
--n-chunks Number of files (alternative to --chunk-size)
--partition-by Dimensions to partition by
-o, --output-dir Directory for output files [default: .]
--subdirs ... Subdirectories for output files
can be `datatype`, `network`, or custom string
--label Label to add to each filename
--overwrite Overwrite existing files instead of skipping
--csv Save as csv instead of parquet
--json Save as json instead of parquet
--row-group-size Number of rows per row group in parquet file
--n-row-groups Number of row groups in parquet file
--no-stats Do not write statistics to parquet files
--compression ... Compression algorithm and level [default: lz4]
--report-dir Directory to save summary report
[default: {output_dir}/.cryo/reports]
--no-report Avoid saving a summary report

Dataset-specific Options:
--address ... Address(es)
--to-address ... To Address(es)
--from-address ... From Address(es)
--call-data ... Call data(s) to use for eth_calls
--function ... Function(s) to use for eth_calls
--inputs ... Input(s) to use for eth_calls
--slot ... Slot(s)
--contract ... Contract address(es)
--topic0 ... Topic0(s) [aliases: event]
--topic1 ... Topic1(s)
--topic2 ... Topic2(s)
--topic3 ... Topic3(s)
--event-signature ... Event signature for log decoding
--inner-request-size Blocks per request (eth_getLogs) [default: 1]
--js-tracer Event signature for log decoding

Optional Subcommands:
cryo help display help message
cryo help syntax display block + tx specification syntax
cryo help datasets display list of all datasets
cryo help <DATASET> display info about a dataset
```
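a sketch combining a few of the output options above (all values are illustrative):

```bash
# write csv files of 10,000 blocks each into ./data, grouped into per-datatype subdirectories
cryo blocks txs -b 17M:17.1M --csv --chunk-size 10000 --output-dir ./data --subdirs datatype
```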
#### cryo syntax

(output of `cryo help syntax`)
```
Block specification syntax
- can use numbers --blocks 5000 6000 7000
- can use ranges --blocks 12M:13M 15M:16M
- can use a parquet file --blocks ./path/to/file.parquet[:COLUMN_NAME]
- can use multiple parquet files --blocks ./path/to/files/*.parquet[:COLUMN_NAME]
- numbers can contain { _ . K M B } 5_000 5K 15M 15.5M
- omitting range end means latest 15.5M: == 15.5M:latest
- omitting range start means 0 :700 == 0:700
- minus on start means minus end -1000:7000 == 6001:7001
- plus sign on end means plus start 15M:+1000 == 15M:15.001M
- can use every nth value 2000:5000:1000 == 2000 3000 4000
- can use n values total 100:200/5 == 100 124 149 174 199

Timestamp specification syntax
- can use numbers --timestamp 5000 6000 7000
- can use ranges --timestamp 12M:13M 15M:16M
- can use a parquet file --timestamp ./path/to/file.parquet[:COLUMN_NAME]
- can use multiple parquet files --timestamp ./path/to/files/*.parquet[:COLUMN_NAME]
- can contain { _ . m h d w M y } 31_536_000 525600m 8760h 365d 52.143w 12.17M 1y
- omitting range end means latest 15.5M: == 15.5M:latest
- omitting range start means 0 :700 == 0:700
- minus on start means minus end -1000:7000 == 6001:7001
- plus sign on end means plus start 15M:+1000 == 15M:15.001M
- can use n values total 100:200/5 == 100 124 149 174 199

Transaction specification syntax
- can use transaction hashes --txs TX_HASH1 TX_HASH2 TX_HASH3
- can use a parquet file --txs ./path/to/file.parquet[:COLUMN_NAME]
(default column name is transaction_hash)
- can use multiple parquet files --txs ./path/to/ethereum__logs*.parquet
```
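a few concrete uses of the block syntax above (block numbers, file path, and column name are placeholders):

```bash
# every 1000th block between 12M and 13M
cryo blocks -b 12M:13M:1000

# from block 15.5M to the latest block
cryo blocks -b 15.5M:

# block numbers read from a column of a parquet file
cryo blocks -b ./my_blocks.parquet:block_number
```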
#### cryo datasets

(output of `cryo help datasets`)
```
cryo datasets
─────────────
- address_appearances
- balance_diffs
- balance_reads
- balances
- blocks
- code_diffs
- code_reads
- codes
- contracts
- erc20_balances
- erc20_metadata
- erc20_supplies
- erc20_transfers
- erc20_approvals
- erc721_metadata
- erc721_transfers
- eth_calls
- four_byte_counts (alias = 4byte_counts)
- geth_calls
- geth_code_diffs
- geth_balance_diffs
- geth_storage_diffs
- geth_nonce_diffs
- geth_opcodes
- javascript_traces (alias = js_traces)
- logs (alias = events)
- native_transfers
- nonce_diffs
- nonce_reads
- nonces
- slots (alias = storages)
- storage_diffs (alias = slot_diffs)
- storage_reads (alias = slot_reads)
- traces
- trace_calls
- transactions (alias = txs)
- vm_traces (alias = opcode_traces)

dataset group names
───────────────────
- blocks_and_transactions: blocks, transactions
- call_trace_derivatives: contracts, native_transfers, traces
- geth_state_diffs: geth_balance_diffs, geth_code_diffs, geth_nonce_diffs, geth_storage_diffs
- state_diffs: balance_diffs, code_diffs, nonce_diffs, storage_diffs
- state_reads: balance_reads, code_reads, nonce_reads, storage_reads

use cryo help <DATASET> to print info about a specific dataset
```
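dataset group names can be passed wherever a dataset name is accepted, for example (block range is arbitrary):

```bash
# collect blocks and transactions together via the group name
cryo blocks_and_transactions -b 18M:18.01M

# collect all four state_diffs datasets in one run
cryo state_diffs -b 18M:18.01M
```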