{"id":20433880,"url":"https://github.com/flashbots/mempool-dumpster","last_synced_at":"2025-04-12T19:46:36.439Z","repository":{"id":185886272,"uuid":"673777685","full_name":"flashbots/mempool-dumpster","owner":"flashbots","description":"Dump all the mempool transactions 🗑️ ♻️ (in Parquet + CSV)","archived":false,"fork":false,"pushed_at":"2025-04-02T11:31:43.000Z","size":683,"stargazers_count":227,"open_issues_count":7,"forks_count":33,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-04-11T20:48:02.233Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://mempool-dumpster.flashbots.net","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/flashbots.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-02T11:57:37.000Z","updated_at":"2025-04-08T06:07:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"3e236b10-1664-4ead-a042-2d800656a547","html_url":"https://github.com/flashbots/mempool-dumpster","commit_stats":null,"previous_names":["flashbots/mempool-archiver","flashbots/mempool-dumpster"],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flashbots%2Fmempool-dumpster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flashbots%2Fmempool-dumpster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flashbots%2Fmempool-dumpster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flashbots%2Fmempool-dumpster/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/flashbots","download_url":"https://codeload.github.com/flashbots/mempool-dumpster/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248625497,"owners_count":21135513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T08:22:09.602Z","updated_at":"2025-04-12T19:46:36.408Z","avatar_url":"https://github.com/flashbots.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Mempool Dumpster 🗑️♻️\n\n[![Goreport status](https://goreportcard.com/badge/github.com/flashbots/mempool-dumpster)](https://goreportcard.com/report/github.com/flashbots/mempool-dumpster)\n[![Test status](https://github.com/flashbots/mempool-dumpster/actions/workflows/checks.yml/badge.svg?branch=main)](https://github.com/flashbots/mempool-dumpster/actions?query=workflow%3A%22Checks%22)\n\nArchiving mempool transactions in [Parquet](https://github.com/apache/parquet-format) and CSV format.\n\n**The data is freely available at https://mempool-dumpster.flashbots.net**\n\nOverview:\n\n- Data is published under the [CC-0 public domain license](https://creativecommons.org/publicdomain/zero/1.0/)\n- Saving about 1M - 2M unique transactions per day\n- This project is under active development and the codebase might change significantly without notice. 
The functionality itself is pretty stable.\n- Related tooling: https://github.com/dvush/mempool-dumpster-rs\n- Introduction \u0026 guide: https://collective.flashbots.net/t/mempool-dumpster-a-free-mempool-transaction-archive/2401\n- Mempool dumpster data is also available through [BigQuery](https://console.cloud.google.com/bigquery/analytics-hub/exchanges/projects/1035301635708/locations/us/dataExchanges/ethereum_mempool_dumpster_by_flashbots_1953788a09b/listings/ethereum_mempool_dumpster_community_dataset_by_flashbots_19537a4ec38) and [Dune](https://www.dune.com) (`dune.flashbots.dataset_mempool_dumpster`).\n\n---\n\n## Available mempool transaction sources\n\n1. Generic EL nodes - go-ethereum, Infura, etc. (Websockets, using `newPendingTransactions`)\n2. Alchemy (Websockets, using [`alchemy_pendingTransactions`](https://docs.alchemy.com/reference/alchemy-pendingtransactions), warning - burns a lot of credits)\n3. [bloXroute](https://docs.bloxroute.com/streams/newtxs-and-pendingtxs) (Websockets and gRPC)\n4. [Chainbound Fiber](https://fiber.chainbound.io/docs/usage/getting-started/) (gRPC)\n5. [Eden](https://docs.edennetwork.io/eden-rpc/speed-rpc) (Websockets and gRPC)\n\nNote: Some sources send transactions that are already included on-chain. These are discarded (not added to the archive or summary).\n\n---\n\n## Output files\n\nDaily files uploaded by mempool-dumpster (e.g. for [September 2023](https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/index.html)):\n\n1. Parquet file with [transaction metadata and raw transactions](/common/types.go#L7) (~800MB/day, e.g. [`2023-09-08.parquet`](https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-08.parquet))\n1. CSV file with only the transaction metadata (~100MB/day zipped, e.g. [`2023-09-08.csv.zip`](https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-08.csv.zip))\n1. CSV file with details about when each transaction was received by any source (~100MB/day zipped, e.g. 
[`2023-09-08_sourcelog.csv.zip`](https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-08_sourcelog.csv.zip))\n1. Summary in text format (~2kB, i.e. [`2023-09-08_summary.txt`](https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-08_summary.txt))\n\n### Schema of output files\n\n**Parquet**\n\n```bash\n$ clickhouse local -q \"DESCRIBE TABLE 'transactions.parquet';\"\ntimestamp               Nullable(DateTime64(3))\nhash                    Nullable(String)\nchainId                 Nullable(String)\ntxType                  Nullable(Int64)\nfrom                    Nullable(String)\nto                      Nullable(String)\nvalue                   Nullable(String)\nnonce                   Nullable(String)\ngas                     Nullable(String)\ngasPrice                Nullable(String)\ngasTipCap               Nullable(String)\ngasFeeCap               Nullable(String)\ndataSize                Nullable(Int64)\ndata4Bytes              Nullable(String)\nsources                 Array(Nullable(String))\nincludedAtBlockHeight   Nullable(Int64)\nincludedBlockTimestamp  Nullable(DateTime64(3))\ninclusionDelayMs        Nullable(Int64)\nrawTx                   Nullable(String)\n```\n\n**CSV**\n\nSame as parquet, but without `rawTx`:\n\n```\ntimestamp_ms,hash,chain_id,from,to,value,nonce,gas,gas_price,gas_tip_cap,gas_fee_cap,data_size,data_4bytes,sources,included_at_block_height,included_block_timestamp_ms,inclusion_delay_ms,tx_type\n```\n\n---\n\n## FAQ\n\n- **_When is the data uploaded?_** ... The data for the previous day is uploaded daily between UTC 4am and 4:30am.\n- **_What about transactions that are already included on-chain?_** ... Some sources send transactions even after they have been included on-chain. 
When a transaction is received, mempool-dumpster checks if it has been included already, and if so discards it from the transaction files (note: it is still added to the sourcelog).\n- **_What is `inclusionDelayMs`, and why can it be negative?_**\n    - When a block is included on-chain, it includes a `block.timestamp` field.\n    - `inclusionDelayMs = (block.timestamp * 1000) - MempoolDumpster.receivedAtMs`\n    - Block builders set `block.timestamp`, typically to the beginning of the slot.\n    - A slot is 12 seconds. If mempool dumpster receives a transaction in the middle of the slot (i.e. `t=6`), it could get included in the current slot. In this case, the builder would set the timestamp to `t=0`, i.e. 6 seconds before MD has seen the transaction. This scenario would result in a negative `inclusionDelay` value (i.e. `inclusionDelayMs=-6000`).\n- **_What are exclusive transactions?_** ... a transaction that was seen from no other source (transaction only provided by a single source). These transactions might include recycled transactions (which were already seen long ago but not included, and resent by a transaction source).\n- **_What does \"XOF\" stand for?_** ... XOF stands for \"exclusive orderflow\" (i.e. exclusive transactions).\n- **_What is a-pool?_** ... A-Pool is a regular geth node with some optimized peering settings, subscribed to over the network.\n- **_gRPC vs Websockets?_** ... bloXroute and Chainbound are connected with gRPC, all other sources are connected with Websockets (note that gRPC has a lower latency than WebSockets).\n\n---\n\n# Working with Parquet\n\n[Apache Parquet](https://parquet.apache.org/) is a column-oriented data file format designed for efficient data storage and retrieval. 
It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk (more [here](https://www.databricks.com/glossary/what-is-parquet#:~:text=What%20is%20Parquet%3F,handle%20complex%20data%20in%20bulk.)).\n\nWe recommend using [ClickHouse local](https://clickhouse.com/docs/en/operations/utilities/clickhouse-local) (or [DuckDB](https://duckdb.org/)) to work with Parquet files. It makes it easy to run [queries](https://clickhouse.com/docs/en/sql-reference/statements) like:\n\n```bash\n# count rows\n$ clickhouse local -q \"SELECT count(*) FROM 'transactions.parquet' LIMIT 1;\"\n\n# count by transaction type\n$ clickhouse local -q \"SELECT txType, COUNT(txType) FROM 'transactions.parquet' GROUP BY txType;\"\n\n# show hash+rawTx from first entry\n$ clickhouse local -q \"SELECT hash,hex(rawTx) FROM 'transactions.parquet' LIMIT 1;\"\n\n# details of a particular hash\n$ clickhouse local -q \"SELECT timestamp,hash,from,to,hex(rawTx) FROM 'transactions.parquet' WHERE hash='0x152065ad73bcf63f68572f478e2dc6e826f1f434cb488b993e5956e6b7425eed';\"\n\n# all transactions seen from mempoolguru\n$ clickhouse local -q \"SELECT COUNT(*) FROM 'transactions.parquet' WHERE has(sources, 'mempoolguru');\"\n\n# all transactions that were seen by both mempoolguru and local\n$ clickhouse local -q \"SELECT COUNT(*) FROM 'transactions.parquet' WHERE hasAll(sources, ['mempoolguru', 'local']);\"\n\n# exclusive transactions from bloxroute\n$ clickhouse local -q \"SELECT COUNT(*) FROM 'transactions.parquet' WHERE length(sources) == 1 AND sources[1] == 'bloxroute';\"\n\n# count of landed vs not-landed exclusive transactions, by source\n$ clickhouse local -q \"WITH includedBlockTimestamp!=0 as included SELECT sources[1], included, count(included) FROM 'transactions.parquet' WHERE length(sources) == 1 GROUP BY sources[1], included;\"\n\n# uniswap v2 transactions\n$ clickhouse local -q \"SELECT COUNT(*) FROM 'transactions.parquet' 
WHERE to='0x7a250d5630b4cf539739df2c5dacb4c659f2488d';\"\n\n# uniswap v2 transactions and separate by included/not-included\n$ clickhouse local -q \"WITH includedBlockTimestamp!=0 as included SELECT included, COUNT(included) FROM 'transactions.parquet' WHERE to='0x7a250d5630b4cf539739df2c5dacb4c659f2488d' GROUP BY included;\"\n\n# inclusion delay for uniswap v2 transactions (time between receiving and being included on-chain)\n$ clickhouse local -q \"WITH inclusionDelayMs/1000 as incdelay SELECT quantiles(0.5, 0.9, 0.99)(incdelay), avg(incdelay) as avg FROM 'transactions.parquet' WHERE to='0x7a250d5630b4cf539739df2c5dacb4c659f2488d' AND includedBlockTimestamp!=0;\"\n\n# count uniswap v2 contract methods\n$ clickhouse local -q \"SELECT data4Bytes, COUNT(data4Bytes) FROM 'transactions.parquet' WHERE to='0x7a250d5630b4cf539739df2c5dacb4c659f2488d' GROUP BY data4Bytes;\"\n```\n\nSee this post for more details: https://collective.flashbots.net/t/mempool-dumpster-a-free-mempool-transaction-archive/2401\n\n---\n\n## Running the analyzer\n\nYou can easily run the included analyzer to create summaries like [2023-09-22_summary.txt](https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-22_summary.txt):\n\n1. First, download the parquet and sourcelog files from https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09\n2. 
Then run the analyzer:\n\n```bash\ngo run cmd/analyze/* \\\n    --out summary.txt \\\n    --input-parquet /mnt/data/mempool-dumpster/2023-09-22/2023-09-22.parquet \\\n    --input-sourcelog /mnt/data/mempool-dumpster/2023-09-22/2023-09-22_sourcelog.csv.zip\n```\n\nTo speed things up, you can use the `MAX` environment variable to set a maximum number of transactions to process:\n\n```bash\nMAX=10000 go run cmd/analyze/* \\\n    --out summary.txt \\\n    --input-parquet /mnt/data/mempool-dumpster/2023-09-22/2023-09-22.parquet \\\n    --input-sourcelog /mnt/data/mempool-dumpster/2023-09-22/2023-09-22_sourcelog.csv.zip\n```\n\n## Interesting analyses\n\n- Something interesting with `inclusionDelay`?\n- Trash transactions (invalid nonce, not enough sender funds)\n\nFeel free to continue the conversation in the [Flashbots Forum](https://collective.flashbots.net/t/mempool-dumpster-a-free-mempool-transaction-archive/2401)!\n\n---\n\n# System architecture\n\n1. [Collector](cmd/collect/main.go): Connects to EL nodes and writes new mempool transactions and sourcelog to hourly CSV files. Multiple collector instances can run without colliding.\n2. [Merger](cmd/merge/main.go): Takes collector CSV files as input, de-duplicates, checks transaction inclusion status, sorts by timestamp and writes output files (Parquet, CSV and Summary).\n3. [Analyzer](cmd/analyze/main.go): Analyzes sourcelog CSV files and produces summary report.\n4. [Website](cmd/website/main.go): Website dev-mode as well as build + upload.\n\n\n![system diagram (https://excalidraw.com/#json=Jj2VXHWIN9TZqNOOVJiAk,UgZ_ui_aLZlnYUy6nBH5mw)](docs/system-diag1.png)\n\n---\n\n# Getting started\n\n## Mempool Collector\n\n1. Subscribes to new pending transactions at various data sources\n1. Writes 3 files:\n    1. Transactions CSV: `timestamp_ms, hash, raw_tx` (one file per hour by default)\n    1. Sourcelog CSV: `timestamp_ms, hash, source` (one entry for every single transaction received by any source)\n    1. 
Trash CSV: `timestamp_ms, hash, source, reason, note` (trash transactions received by any source; these are not added to the transactions CSV. Currently a transaction is only trashed if it was already included in a previous block)\n1. Note: the collector can store the same transaction repeatedly; only the merger properly deduplicates them later\n\n**Default filenames:**\n\nTransactions\n- Schema: `\u003cout_dir\u003e/\u003cdate\u003e/transactions/txs_\u003cdate\u003e_\u003cuid\u003e.csv`\n- Example: `out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv`\n\nSourcelog\n- Schema: `\u003cout_dir\u003e/\u003cdate\u003e/sourcelog/src_\u003cdate\u003e_\u003cuid\u003e.csv`\n- Example: `out/2023-08-07/sourcelog/src_2023-08-07-10-00_collector1.csv`\n\nTrash\n- Schema: `\u003cout_dir\u003e/\u003cdate\u003e/trash/trash_\u003cdate\u003e_\u003cuid\u003e.csv`\n- Example: `out/2023-08-07/trash/trash_2023-08-07-10-00_collector1.csv`\n\n**Running the mempool collector:**\n\n```bash\n# print help\ngo run cmd/collect/main.go -help\n\n# Connect to ws://localhost:8546 and write CSVs into ./out\ngo run cmd/collect/main.go -out ./out\n\n# Connect to multiple nodes\ngo run cmd/collect/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546\n```\n\n## Merger\n\n- Iterates over collector output directory / CSV files\n- Deduplicates transactions, sorts them by timestamp\n\n```bash\n# print help\ngo run cmd/merge/* -h\n\n# deduplicate transactions\ngo run cmd/merge/* transactions --check-node ws://server1.com ./out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv\n```\n\n\n---\n\n# Architecture\n\n## General design goals\n\n- Keep it simple and stupid\n- Vendor-agnostic (main flow should work on any server, independent of a cloud provider)\n- Downtime-resilience to minimize any gaps in the archive\n- Multiple collector instances can run concurrently, without getting in each other's way\n- Merger produces the final archive (based on the input of multiple collector outputs)\n- The final 
archive:\n  - Includes (1) parquet file with transaction metadata, and (2) compressed file of raw transaction CSV files\n  - Compatible with [ClickHouse](https://clickhouse.com/docs/en/integrations/s3) and [S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) (Parquet using gzip compression)\n  - Easily distributable as torrent\n\n## Collector\n\n- `NodeConnection`\n    - One for each EL connection\n    - New pending transactions are sent to `TxProcessor` via a channel\n- `TxProcessor`\n    - Check if it already processed that tx\n    - Store it in the output directory\n\n## Merger\n\n- Uses https://github.com/xitongsys/parquet-go to write Parquet format\n\n## Transaction RLP format\n\n- encoding transactions in typed EIP-2718 envelopes:\n  - https://medium.com/@markodayansa/a-comprehensive-guide-to-rlp-encoding-in-ethereum-6bd75c126de0\n  - https://blog.mycrypto.com/new-transaction-types-on-ethereum\n  - https://eips.ethereum.org/EIPS/eip-2718\n\n## Stats libraries\n\n- currently using: https://github.com/HdrHistogram/hdrhistogram-go/\n- possibly more versatile: https://github.com/montanaflynn/stats\n- see also:\n    - https://github.com/guptarohit/asciigraph\n\n---\n\n# Contributing\n\nInstall dependencies\n\n```bash\ngo install mvdan.cc/gofumpt@latest\ngo install honnef.co/go/tools/cmd/staticcheck@latest\ngo install github.com/golangci/golangci-lint/cmd/golangci-lint@latest\ngo install github.com/daixiang0/gci@latest\n```\n\nLint, test, format\n\n```bash\nmake lint\nmake test\nmake fmt\n```\n\n---\n\n# See also\n\n- [Discussion about compression](https://github.com/flashbots/mempool-dumpster/issues/2) and [storage](https://github.com/flashbots/mempool-dumpster/issues/1)\n- Forum post: https://collective.flashbots.net/t/mempool-dumpster-a-free-mempool-transaction-archive/2401\n\n---\n\n# License\n\n- Code: [MIT](./LICENSE)\n- Data: [CC-0 public domain](https://creativecommons.org/publicdomain/zero/1.0/)\n\n---\n\n# 
Maintainers\n\n- [metachris](https://twitter.com/metachris)\n- [0x416e746f6e](https://github.com/0x416e746f6e)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflashbots%2Fmempool-dumpster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflashbots%2Fmempool-dumpster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflashbots%2Fmempool-dumpster/lists"}