Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/paradigmxyz/cryo
cryo is the easiest way to extract blockchain data to parquet, csv, json, or python dataframes
- Host: GitHub
- URL: https://github.com/paradigmxyz/cryo
- Owner: paradigmxyz
- License: apache-2.0
- Created: 2023-06-27T00:11:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-08T07:31:22.000Z (25 days ago)
- Last Synced: 2025-01-25T18:01:03.067Z (7 days ago)
- Topics: crypto, ethereum, evm, parquet, rust
- Language: Rust
- Homepage:
- Size: 1.13 MB
- Stars: 1,262
- Watchers: 10
- Forks: 127
- Open Issues: 46
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE
Awesome Lists containing this project
- awesome-evm-data-tools - Cryo - CLI tool for extracting blockchain data to various formats (parquet, csv, json, python dataframes) (Data Query)
README
# ❄️🧊 cryo 🧊❄️
[![Rust](https://github.com/paradigmxyz/cryo/actions/workflows/build_and_test.yml/badge.svg)](https://github.com/paradigmxyz/cryo/actions/workflows/build_and_test.yml) [![Telegram Chat](https://img.shields.io/badge/Telegram-join_chat-blue.svg)](https://t.me/paradigm_data)
`cryo` is the easiest way to extract blockchain data to parquet, csv, json, or a python dataframe.
`cryo` is also extremely flexible, with [many different options](#cryo-help) to control how data is extracted + filtered + formatted
*`cryo` is an early WIP, please report bugs + feedback to the issue tracker*
*note that `cryo`'s default settings will slam a node too hard for use with 3rd party RPC providers. Instead, `--requests-per-second` and `--max-concurrent-requests` should be used to impose rate limits. Such settings will be handled automatically in a future release*.
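for example, a rate-limited invocation against a hosted provider might look like the sketch below (the endpoint URL and the specific limit values are placeholders, not recommendations):

```bash
# placeholder endpoint and example limits; tune to your provider's allowances
cryo blocks txs -b 18M:18.1M \
    --rpc https://your-provider.example/rpc \
    --requests-per-second 20 \
    --max-concurrent-requests 5
```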
to discuss cryo, check out [the telegram group](https://t.me/paradigm_data)
## Contents
1. [Example Usage](#example-usage)
2. [Installation](#installation)
3. [Data Schema](#data-schemas)
4. [Code Guide](#code-guide)
5. [Documentation](#documentation)
    1. [Basics](#cryo-help)
    2. [Syntax](#cryo-syntax)
    3. [Datasets](#cryo-datasets)

## Example Usage
use as `cryo [OPTIONS]`
| Example | Command |
| :- | :- |
| Extract all logs from block 16,000,000 to block 17,000,000 | `cryo logs -b 16M:17M` |
| Extract blocks, txs, or traces missing from current directory | `cryo blocks txs traces` |
| Extract to csv instead of parquet | `cryo blocks txs traces --csv` |
| Extract only certain columns | `cryo blocks --include number timestamp` |
| Dry run to view output schemas or expected work | `cryo storage_diffs --dry` |
| Extract all USDC events | `cryo logs --contract 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48` |

For a more complex example, see the [Uniswap Example](./examples/uniswap.sh).
`cryo` uses the `ETH_RPC_URL` env var as the data source unless `--rpc` is given.
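for example, either of the following works (the local node URL is a placeholder):

```bash
# point cryo at a node via the environment variable
export ETH_RPC_URL=http://localhost:8545
cryo logs -b 16M:17M

# or pass the endpoint explicitly for a single run
cryo logs -b 16M:17M --rpc http://localhost:8545
```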
## Installation
The simplest way to use `cryo` is as a cli tool:
#### Method 1: install from source
```bash
git clone https://github.com/paradigmxyz/cryo
cd cryo
cargo install --path ./crates/cli
```

This method requires having rust installed. See [rustup](https://rustup.rs/) for instructions.
#### Method 2: install from crates.io
```bash
cargo install cryo_cli
```

This method requires having rust installed. See [rustup](https://rustup.rs/) for instructions.
Make sure that `~/.cargo/bin` is on your `PATH`. One way to do this is by adding the line `export PATH="$HOME/.cargo/bin:$PATH"` to your `~/.bashrc` or `~/.profile`.
### Python Installation
`cryo` can also be installed as a python package:
#### Installing `cryo` python from pypi
(make sure rust is installed first, see [rustup](https://www.rust-lang.org/tools/install))
```bash
pip install maturin
pip install cryo
```

#### Installing `cryo` python from source
```bash
pip install maturin
git clone https://github.com/paradigmxyz/cryo
cd cryo/crates/python
maturin build --release
pip install --force-reinstall <path_to_built_wheel>.whl
```

## Data Schemas
Many `cryo` cli options will affect output schemas by adding/removing columns or changing column datatypes.
`cryo` will always print out data schemas before collecting any data. To view these schemas without collecting data, use `--dry` to perform a dry run.
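for example, the sketches below preview schemas without collecting anything (the column names follow the example usage table above):

```bash
# print the default logs schema and expected work, collecting no data
cryo logs -b 16M:17M --dry

# preview a reduced blocks schema that keeps only two columns
cryo blocks --columns number timestamp --dry
```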
#### Schema Design Guide
An attempt is made to ensure that the dataset schemas conform to a common set of design guidelines:
- By default, rows should contain enough information in their columns to be order-able (unless the rows do not have an intrinsic order).
- Columns should usually be named by their JSON-RPC or ethers.rs defaults, except in cases where a much more explicit name is available.
- To make joins across tables easier, a given piece of information should use the same datatype and column name across tables when possible.
- Large ints such as `u256` should allow multiple conversions. A `value` column of type `u256` should allow: `value_binary`, `value_string`, `value_f32`, `value_f64`, `value_u32`, `value_u64`, and `value_d128`. These types can be specified at runtime using the `--u256-types` argument (see the example after the type list below).
- Columns related to non-identifying cryptographic signatures are omitted by default. For example, `state_root` of a block or `v`/`r`/`s` of a transaction.
- Integer values that can never be negative should be stored as unsigned integers.
- Every table should allow a `chain_id` column so that data from multiple chains can be easily stored in the same table.

Standard types across tables:
- `block_number`: `u32`
- `transaction_index`: `u32`
- `nonce`: `u32`
- `gas_used`: `u64`
- `gas_limit`: `u64`
- `chain_id`: `u64`
- `timestamp`: `u32`
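a minimal sketch of the `--u256-types` option mentioned above (the dataset and block range are arbitrary):

```bash
# keep only binary and f64 representations of u256 columns such as `value`
cryo txs -b 18M:18.01M --u256-types binary f64
```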
#### JSON-RPC

`cryo` currently obtains all of its data using the [JSON-RPC](https://ethereum.org/en/developers/docs/apis/json-rpc/) protocol standard.
|dataset|blocks per request|results per block|method|
|-|-|-|-|
|Blocks|1|1|`eth_getBlockByNumber`|
|Transactions|1|multiple|`eth_getBlockByNumber`, `eth_getBlockReceipts`, `eth_getTransactionReceipt`|
|Logs|multiple|multiple|`eth_getLogs`|
|Contracts|1|multiple|`trace_block`|
|Traces|1|multiple|`trace_block`|
|State Diffs|1|multiple|`trace_replayBlockTransactions`|
|Vm Traces|1|multiple|`trace_replayBlockTransactions`|

`cryo` uses [ethers.rs](https://github.com/gakonst/ethers-rs) to perform JSON-RPC requests, so it can be used with any chain that ethers-rs is compatible with. This includes Ethereum, Optimism, Arbitrum, Polygon, BNB, and Avalanche.
A future version of `cryo` will be able to bypass JSON-RPC and query node data directly.
## Code Guide
- Code is arranged into the following crates:
- `cryo_cli`: convert textual data into cryo function calls
- `cryo_freeze`: core cryo code
- `cryo_python`: cryo python adapter
- `cryo_to_df`: procedural macro for generating dataset definitions
- Do not use panics (including `panic!`, `todo!`, `unwrap()`, and `expect()`) except in the following circumstances: tests, build scripts, lazy static blocks, and procedural macros
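assuming the crate names above match the cargo package names (a sketch, not an official workflow), individual crates can be built and tested from the workspace root:

```bash
# build the cli crate and test the core crate individually
cargo build -p cryo_cli
cargo test -p cryo_freeze

# or exercise the whole workspace
cargo test --workspace
```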
## Documentation

1. [cryo help](#cryo-help)
2. [cryo syntax](#cryo-syntax)
3. [cryo datasets](#cryo-datasets)

#### cryo help
(output of `cryo help`)
```
cryo extracts blockchain data to parquet, csv, or json

Usage: cryo [OPTIONS] [DATATYPE]...
Arguments:
[DATATYPE]... datatype(s) to collect, use cryo datasets to see all available

Options:
--remember Remember current command for future use
-v, --verbose Extra verbosity
--no-verbose Run quietly without printing information to stdout
-h, --help Print help
-V, --version Print version

Content Options:
-b, --blocks ... Block numbers, see syntax below
--timestamps ... Timestamp numbers in unix, overridden by blocks
-t, --txs ... Transaction hashes, see syntax below
-a, --align Align chunk boundaries to regular intervals,
e.g. (1000 2000 3000), not (1106 2106 3106)
--reorg-buffer Reorg buffer, save blocks only when this old,
can be a number of blocks [default: 0]
-i, --include-columns [...] Columns to include alongside the defaults,
use `all` to include all available columns
-e, --exclude-columns [...] Columns to exclude from the defaults
--columns [...] Columns to use instead of the defaults,
use `all` to use all available columns
--u256-types ... Set output datatype(s) of U256 integers
[default: binary, string, f64]
--hex Use hex string encoding for binary columns
-s, --sort [...] Columns(s) to sort by, `none` for unordered
--exclude-failed Exclude items from failed transactions

Source Options:
-r, --rpc RPC url [default: ETH_RPC_URL env var]
--network-name Network name [default: name of eth_chainId]

Acquisition Options:
-l, --requests-per-second Ratelimit on requests per second
--max-retries Max retries for provider errors [default: 5]
--initial-backoff Initial retry backoff time (ms) [default: 500]
--max-concurrent-requests Global number of concurrent requests
--max-concurrent-chunks Number of chunks processed concurrently
--chunk-order Chunk collection order (normal, reverse, or random)
-d, --dry Dry run, collect no data

Output Options:
-c, --chunk-size Number of blocks per file [default: 1000]
--n-chunks Number of files (alternative to --chunk-size)
--partition-by Dimensions to partition by
-o, --output-dir Directory for output files [default: .]
--subdirs ... Subdirectories for output files
can be `datatype`, `network`, or custom string
--label Label to add to each filename
--overwrite Overwrite existing files instead of skipping
--csv Save as csv instead of parquet
--json Save as json instead of parquet
--row-group-size Number of rows per row group in parquet file
--n-row-groups Number of row groups in parquet file
--no-stats Do not write statistics to parquet files
--compression ... Compression algorithm and level [default: lz4]
--report-dir Directory to save summary report
[default: {output_dir}/.cryo/reports]
--no-report Avoid saving a summary report

Dataset-specific Options:
--address ... Address(es)
--to-address ... To Address(es)
--from-address ... From Address(es)
--call-data ... Call data(s) to use for eth_calls
--function ... Function(s) to use for eth_calls
--inputs ... Input(s) to use for eth_calls
--slot ... Slot(s)
--contract ... Contract address(es)
--topic0 ... Topic0(s) [aliases: event]
--topic1 ... Topic1(s)
--topic2 ... Topic2(s)
--topic3 ... Topic3(s)
--event-signature ... Event signature for log decoding
--inner-request-size Blocks per request (eth_getLogs) [default: 1]
--js-tracer Event signature for log decoding

Optional Subcommands:
cryo help display help message
cryo help syntax display block + tx specification syntax
cryo help datasets display list of all datasets
cryo help <DATASET> display info about a dataset
```
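a sketch combining a few of the output options above (all values are illustrative):

```bash
# write csv files of 10,000 blocks each into ./data, grouped into per-datatype subdirectories
cryo blocks txs -b 17M:17.1M --csv --chunk-size 10000 --output-dir ./data --subdirs datatype
```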
#### cryo syntax

(output of `cryo help syntax`)
```
Block specification syntax
- can use numbers --blocks 5000 6000 7000
- can use ranges --blocks 12M:13M 15M:16M
- can use a parquet file --blocks ./path/to/file.parquet[:COLUMN_NAME]
- can use multiple parquet files --blocks ./path/to/files/*.parquet[:COLUMN_NAME]
- numbers can contain { _ . K M B } 5_000 5K 15M 15.5M
- omitting range end means latest 15.5M: == 15.5M:latest
- omitting range start means 0 :700 == 0:700
- minus on start means minus end -1000:7000 == 6001:7001
- plus sign on end means plus start 15M:+1000 == 15M:15.001M
- can use every nth value 2000:5000:1000 == 2000 3000 4000
- can use n values total 100:200/5 == 100 124 149 174 199

Timestamp specification syntax
- can use numbers --timestamp 5000 6000 7000
- can use ranges --timestamp 12M:13M 15M:16M
- can use a parquet file --timestamp ./path/to/file.parquet[:COLUMN_NAME]
- can use multiple parquet files --timestamp ./path/to/files/*.parquet[:COLUMN_NAME]
- can contain { _ . m h d w M y } 31_536_000 525600m 8760h 365d 52.143w 12.17M 1y
- omitting range end means latest 15.5M: == 15.5M:latest
- omitting range start means 0 :700 == 0:700
- minus on start means minus end -1000:7000 == 6001:7001
- plus sign on end means plus start 15M:+1000 == 15M:15.001M
- can use n values total 100:200/5 == 100 124 149 174 199

Transaction specification syntax
- can use transaction hashes --txs TX_HASH1 TX_HASH2 TX_HASH3
- can use a parquet file --txs ./path/to/file.parquet[:COLUMN_NAME]
(default column name is transaction_hash)
- can use multiple parquet files --txs ./path/to/ethereum__logs*.parquet
```
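a few concrete uses of the block syntax above (block numbers, file path, and column name are placeholders):

```bash
# every 1000th block between 12M and 13M
cryo blocks -b 12M:13M:1000

# from block 15.5M to the latest block
cryo blocks -b 15.5M:

# block numbers read from a column of a parquet file
cryo blocks -b ./my_blocks.parquet:block_number
```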
#### cryo datasets

(output of `cryo help datasets`)
```
cryo datasets
─────────────
- address_appearances
- balance_diffs
- balance_reads
- balances
- blocks
- code_diffs
- code_reads
- codes
- contracts
- erc20_balances
- erc20_metadata
- erc20_supplies
- erc20_transfers
- erc20_approvals
- erc721_metadata
- erc721_transfers
- eth_calls
- four_byte_counts (alias = 4byte_counts)
- geth_calls
- geth_code_diffs
- geth_balance_diffs
- geth_storage_diffs
- geth_nonce_diffs
- geth_opcodes
- javascript_traces (alias = js_traces)
- logs (alias = events)
- native_transfers
- nonce_diffs
- nonce_reads
- nonces
- slots (alias = storages)
- storage_diffs (alias = slot_diffs)
- storage_reads (alias = slot_reads)
- traces
- trace_calls
- transactions (alias = txs)
- vm_traces (alias = opcode_traces)

dataset group names
───────────────────
- blocks_and_transactions: blocks, transactions
- call_trace_derivatives: contracts, native_transfers, traces
- geth_state_diffs: geth_balance_diffs, geth_code_diffs, geth_nonce_diffs, geth_storage_diffs
- state_diffs: balance_diffs, code_diffs, nonce_diffs, storage_diffs
- state_reads: balance_reads, code_reads, nonce_reads, storage_reads

use cryo help <DATASET> to print info about a specific dataset
```
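dataset group names can be passed wherever a dataset name is accepted, for example (block range is arbitrary):

```bash
# collect blocks and transactions together via the group name
cryo blocks_and_transactions -b 18M:18.01M

# collect all four state_diffs datasets in one run
cryo state_diffs -b 18M:18.01M
```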