https://github.com/blockchain-data-analytics/cardano_mainchain_parquet

Prepare Parquet tables from Db-sync snapshots and query them quickly using "duckdb" or "spark"

Host: GitHub
URL: https://github.com/blockchain-data-analytics/cardano_mainchain_parquet
Owner: Blockchain-Data-Analytics
License: apache-2.0
Created: 2023-04-12T15:44:34.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-04-12T16:01:02.000Z (about 3 years ago)
Last Synced: 2025-03-04T13:46:56.814Z (over 1 year ago)
Size: 6.84 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Extract Cardano on-chain data to Parquet files

[Db-sync](https://github.com/input-output-hk/cardano-db-sync) maps on-chain data for the Cardano blockchain to PostgreSQL.

For extended use cases, instead of querying the database directly we would like to use a computer cluster to run such queries in parallel at MaxSpeed™.

We already provided an export to [BigQuery](https://github.com/input-output-hk/data-analytics-bigquery) on Google's cloud, but that comes at a cost.

Instead, this project lets you export Db-sync's tables to Parquet files, which can be queried locally on your machine or across many computers in your local network.

We use [duckdb](https://duckdb.org/) to output Parquet files which can be queried using _duckdb_ or [Spark SQL](https://spark.apache.org/sql/).

## snapshot

get a db-sync snapshot from: https://update-cardano-mainnet.iohk.io/cardano-db-sync/index.html

restore it into a PostgreSQL database

## PostgreSQL setup

create a database named, e.g., "cardanomainnet13" and a user with the same name.

also add the views from `ext/data-analytics-bigquery.git/schema/views` to the schema `analytics`.

first, change the target users in the SQL files in that directory:

`for F in *.sql; do sed -i -e 's/db_sync_master/cardanomainnet13/g;s/db_sync_reader/cardanomainnet13/g' "$F"; done`
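As a minimal sketch of this substitution, the loop below rewrites each file through a temporary copy instead of using `sed -i`, because the `-i` flag takes different arguments in GNU and BSD sed. The sample file, its contents, and the `TARGET_USER` variable are assumptions made for illustration; the real view definitions live in the data-analytics-bigquery checkout.

```sh
# Hypothetical stand-in for one of the view files; the real ones come from the checkout.
WORKDIR=$(mktemp -d)
printf 'GRANT SELECT ON analytics.vw_bq_block TO db_sync_reader;\nALTER VIEW analytics.vw_bq_block OWNER TO db_sync_master;\n' > "${WORKDIR}/sample.sql"

TARGET_USER=cardanomainnet13   # assumed to match the database user created above

for F in "${WORKDIR}"/*.sql; do
  # write to a temporary file first, then replace the original
  sed -e "s/db_sync_master/${TARGET_USER}/g;s/db_sync_reader/${TARGET_USER}/g" "$F" > "$F.tmp" && mv "$F.tmp" "$F"
done

# count the lines that still mention one of the old roles (0 means all were replaced)
LEFTOVER=$(cat "${WORKDIR}"/*.sql | grep -c -e db_sync_master -e db_sync_reader || true)
echo "lines still naming the old roles: ${LEFTOVER}"
```

Running the same `grep` against the real directory after the substitution is a cheap way to confirm that no file was missed.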

## copy to Parquet files

then run the continuous export to Parquet files

(create a `.pgpass` file or set the password in the environment variable `PGPASSWORD`)

```sh
export PGHOST=host
export PGPORT=5432
export PGDATABASE=cardanomainnet13
export PGUSER=cardanomainnet13
export PGPASSFILE=.pgpass
```
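The `.pgpass` file mentioned above can be sketched as follows. Each line has the form `hostname:port:database:username:password`, and libpq ignores the file when group or others have access to it, hence the `chmod 600`. The password `changeme` is a placeholder, and the file is written to a temporary directory here so that the sketch does not overwrite an existing file.

```sh
# Connection settings as used for the export; the password is a placeholder.
PGHOST=host
PGPORT=5432
PGDATABASE=cardanomainnet13
PGUSER=cardanomainnet13
PASSWORD=changeme

# Written to a temporary directory for illustration; point PGPASSFILE at the real location.
PGPASSFILE="$(mktemp -d)/.pgpass"

# one line per connection: hostname:port:database:username:password
printf '%s:%s:%s:%s:%s\n' "$PGHOST" "$PGPORT" "$PGDATABASE" "$PGUSER" "$PASSWORD" > "$PGPASSFILE"

# restrict access to the owner, otherwise the file is ignored
chmod 600 "$PGPASSFILE"

export PGHOST PGPORT PGDATABASE PGUSER PGPASSFILE
cat "$PGPASSFILE"
```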

the target path:

```sh
PARQUET_PATH=/usr/local/spark/data
```

```sh
TBL=block  # for example
DBFILE=${TBL}.ddb
INCREMENT=20

# apply the table schema to the duckdb database file
./duckdb ${DBFILE} < schema/${TBL}.sql

# export INCREMENT epochs at a time as CSV and append them to the duckdb table
for E in $(seq 0 ${INCREMENT} 500); do
  echo "exporting ${TBL} from ${E} to $((E+${INCREMENT})) as CSV"
  psql -c "\\copy ( SELECT * FROM analytics.vw_bq_${TBL} WHERE epoch_no >= ${E} AND epoch_no < ${E} + ${INCREMENT} ) to '${TBL}_${E}.csv' csv;"
  ./duckdb -c "COPY ${TBL} FROM '${TBL}_${E}.csv';" ${DBFILE}
done

# make sure the target directory exists before writing
mkdir -p "${PARQUET_PATH}/cardano_mainchain"
./duckdb -c "COPY ${TBL} TO '${PARQUET_PATH}/cardano_mainchain/${TBL}.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);" ${DBFILE}
```
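To check which epoch ranges the export loop covers before starting a long run, a dry run along these lines can help. It only prints the half-open range and the CSV name for each step and does not call `psql` or `duckdb`. The upper bound `500` is taken from the loop above and may need raising as the chain grows; `LAST_EPOCH` and `RANGES` are names introduced here.

```sh
TBL=block
INCREMENT=20
LAST_EPOCH=500   # upper bound passed to seq in the export loop

# collect one line per export step without touching the database
RANGES=$(for E in $(seq 0 "${INCREMENT}" "${LAST_EPOCH}"); do
  echo "${TBL}_${E}.csv: epoch_no >= ${E} AND epoch_no < $((E + INCREMENT))"
done)

echo "${RANGES}" | head -n 2
echo "..."
echo "${RANGES}" | tail -n 1
echo "steps: $(echo "${RANGES}" | wc -l | tr -d ' ')"
```

With these values the final step covers epochs 500 to 519, so any later epoch needs a higher bound.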

table tx_in_out
---------------

this table is very large and thus we will split it into pieces.

first, generate the CSV files with the above loop, using `TBL=tx_in_out`.

then, load groups of CSV files into Parquet files.

```sh
TBL=tx_in_out
DBFILE=${TBL}.ddb
INCREMENT=20

# make sure the target directory exists before writing
mkdir -p "${PARQUET_PATH}/cardano_mainchain/${TBL}"

for E in $(seq 0 ${INCREMENT} 500); do
    # start each piece from an empty duckdb database
    if [ -e $DBFILE ]; then rm -v $DBFILE; fi
    ./duckdb ${DBFILE} < schema/${TBL}.sql
    # load every CSV file that falls into this group of epochs
    for F in $(seq $E 1 $((E+INCREMENT-1))); do
        if [ -e ${TBL}_${F}.csv ]; then
            ./duckdb -c "COPY ${TBL} FROM '${TBL}_${F}.csv';" ${DBFILE}
        fi
    done
    ./duckdb -c "COPY ${TBL} TO '${PARQUET_PATH}/cardano_mainchain/${TBL}/${TBL}_${E}.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);" ${DBFILE}
done
```

querying those files:

`duckdb -c "SELECT epoch_no, COUNT(*) FROM read_parquet('${PARQUET_PATH}/cardano_mainchain/tx_in_out/tx_in_out*.parquet') GROUP BY epoch_no ORDER BY epoch_no;"`
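Because each piece is named after the first epoch of its group, the file that holds a given epoch can be computed instead of scanning all pieces. The sketch below assumes the pieces were written with `INCREMENT=20` and that the CSV files were aligned to that increment; the `EPOCH`, `START`, and `PIECE` variables and the `PARQUET_PATH` default are illustrative.

```sh
PARQUET_PATH=${PARQUET_PATH:-/usr/local/spark/data}
TBL=tx_in_out
INCREMENT=20
EPOCH=365   # example epoch to look up

# integer division rounds down to the first epoch of the piece
START=$(( EPOCH / INCREMENT * INCREMENT ))
PIECE="${PARQUET_PATH}/cardano_mainchain/${TBL}/${TBL}_${START}.parquet"

echo "epoch ${EPOCH} is stored in ${PIECE}"
# a query can then read just that file, for example:
echo "SELECT COUNT(*) FROM read_parquet('${PIECE}') WHERE epoch_no = ${EPOCH};"
```

The printed statement can be passed to `duckdb -c` in the same way as the query over all pieces.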
