https://github.com/blockscout/blockscout-req-opt-batch-size-finder
Script for finding optimal batch size and concurrency for block import for the given archive node endpoint
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/blockscout/blockscout-req-opt-batch-size-finder
- Owner: blockscout
- License: gpl-3.0
- Created: 2022-05-27T08:13:52.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-06-27T13:21:53.000Z (almost 4 years ago)
- Last Synced: 2025-05-21T00:37:17.054Z (about 1 year ago)
- Language: Rust
- Size: 954 KB
- Stars: 2
- Watchers: 4
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Script to find the optimal batch size and concurrency for block import
## Run
Pass the arguments on the command line:
```
RUST_LOG=info cargo run node_end_point block_num_total cnt
```
Where:
- `RUST_LOG=info` enables log output so you can track progress
- `node_end_point` is the node to test (e.g. *https://rpc.xdaichain.com/* or *rpc.xdaichain.com*)
- `block_num_total` is the number of generated blocks
- `cnt` is the number of runs (optional, 10 by default)
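The argument handling described above could be sketched roughly as follows. This is only an illustration, not the script's actual code: the `Args` struct and `parse_args` function are hypothetical names, and prefixing a bare host with `https://` is an assumption based on the two example forms of `node_end_point`.

```rust
use std::env;

/// Parsed command-line arguments (field names follow the README; the struct is hypothetical).
#[derive(Debug, PartialEq)]
struct Args {
    node_end_point: String,
    block_num_total: u64,
    cnt: u32,
}

/// Parse `node_end_point block_num_total [cnt]` from a list of arguments.
/// Assumption: a bare host such as `rpc.xdaichain.com` gets an `https://` prefix.
fn parse_args(args: &[String]) -> Result<Args, String> {
    let endpoint = args.first().ok_or("missing node_end_point")?;
    let node_end_point = if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
        endpoint.clone()
    } else {
        format!("https://{}", endpoint)
    };
    let block_num_total = args
        .get(1)
        .ok_or("missing block_num_total")?
        .parse::<u64>()
        .map_err(|e| format!("invalid block_num_total: {}", e))?;
    let cnt = match args.get(2) {
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|e| format!("invalid cnt: {}", e))?,
        None => 10, // default number of runs
    };
    Ok(Args { node_end_point, block_num_total, cnt })
}

fn main() {
    // `std::env::args` yields the program name first, so skip it.
    let raw: Vec<String> = env::args().skip(1).collect();
    match parse_args(&raw) {
        Ok(args) => println!("{:?}", args),
        Err(message) => eprintln!("error: {}", message),
    }
}
```

Keeping the parsing in a function that takes a slice makes the default for `cnt` easy to check without touching the real process arguments.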
## Tools
The project uses *Rust* (rustc and cargo 1.60.0) and Python 3. The relevant part of *Cargo.toml*:
```
[dependencies]
rand = "0.8.4"
reqwest = { version = "0.11", features = ["json", "blocking"] }
serde = { version = "1.0.137", features = ["derive"] }
serde_json = "1.0.81"
env_logger = "0.9.0"
log = "0.4.17"
csv = "1.1.6"
```
## Structure of concurrency
The picture shows how the block numbers are stored in memory and how concurrency is applied to them:
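One way to read this layout, as a minimal sketch: the block numbers are split into batches of `block_batch_size`, and up to `block_concurrency` batches are processed at the same time. The helper names (`make_batches`, `process_concurrently`, `fetch_batch`) are hypothetical, plain OS threads stand in for whatever concurrency primitive the script uses, and `fetch_batch` is a placeholder for the real JSON-RPC request.

```rust
use std::thread;

/// Split the block numbers into batches of `block_batch_size` (hypothetical helper).
/// `block_batch_size` must be greater than zero.
fn make_batches(block_numbers: &[u64], block_batch_size: usize) -> Vec<Vec<u64>> {
    block_numbers
        .chunks(block_batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Placeholder for the real request: just reports how many blocks were asked for.
fn fetch_batch(batch: &[u64]) -> usize {
    batch.len()
}

/// Process the batches in groups of at most `block_concurrency` threads.
/// `block_concurrency` must be greater than zero.
fn process_concurrently(batches: Vec<Vec<u64>>, block_concurrency: usize) -> Vec<usize> {
    let mut results = Vec::with_capacity(batches.len());
    for group in batches.chunks(block_concurrency) {
        // Start one worker per batch in the current group.
        let handles: Vec<_> = group
            .iter()
            .cloned()
            .map(|batch| thread::spawn(move || fetch_batch(&batch)))
            .collect();
        // Wait for the whole group before starting the next one.
        for handle in handles {
            results.push(handle.join().expect("worker thread panicked"));
        }
    }
    results
}

fn main() {
    let block_numbers: Vec<u64> = (1..=40).collect();
    let batches = make_batches(&block_numbers, 10);
    let fetched = process_concurrently(batches, 4);
    println!("{:?}", fetched);
}
```

With 40 block numbers, a batch size of 10 and a concurrency of 4, all four batches run in a single group.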

## Analysis
Let's look at the distribution of request times when iterating over `block_batch_size`:
* These plots show how time depends on concurrency. The vertical line marks the number of cores (8 on my machine).


As we can see, when the script creates more than 8 green threads, scheduler overhead has a noticeable effect on performance.
The same applies to *eth_getTransactionReceipt* requests:

* These graphs were plotted for the https://sokol.poa.network/ node.


Analyzing them, we can put forward a hypothesis about the best way to enumerate the variables.
One hypothesis is: *vary the variable `block_concurrency`, stepping through the divisors of `block_num_total`*.
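As a rough illustration of this enumeration, the sketch below lists every divisor of `block_num_total` as a candidate `block_concurrency`. Pairing each divisor with a batch size of `block_num_total / block_concurrency` is an assumption made for the example, and `divisor_pairs` is a hypothetical helper, not a function from the script.

```rust
/// List `(block_concurrency, block_batch_size)` candidates for a given total.
/// Assumption: each divisor is a concurrency value and the batch size is the quotient.
fn divisor_pairs(block_num_total: u64) -> Vec<(u64, u64)> {
    (1..=block_num_total)
        .filter(|concurrency| block_num_total % concurrency == 0)
        .map(|concurrency| (concurrency, block_num_total / concurrency))
        .collect()
}

fn main() {
    for (concurrency, batch_size) in divisor_pairs(40) {
        println!(
            "block_concurrency={}, block_batch_size={}",
            concurrency, batch_size
        );
    }
}
```

Restricting the search to divisors keeps every batch the same size, so the runs stay comparable.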
* Here is the graph for the https://rpc.xdaichain.com/ node.

For *eth_getBlockByNumber* requests we can see two other minima besides (10, 4): (7, 6) and (15, 3).
## Problems
* Under a large number of requests, the node sometimes responds with [429 Too Many Requests](https://developer.mozilla.org/ru/docs/Web/HTTP/Status/429). In this case the script keeps working and skips these requests.
* When the script runs for a long time (with `cnt` >= 40), a *TimedOut* error sometimes occurs. I'm currently trying to catch this error.
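A minimal sketch of how both failure modes could be skipped and counted instead of aborting the run. The `RequestError` enum and the `summarize` function are hypothetical, and treating *TimedOut* as `std::io::ErrorKind::TimedOut` is an assumption; the actual error type in the script may differ.

```rust
use std::io::{Error, ErrorKind};

/// Failure of a single request in this sketch (hypothetical type, not from the script).
#[derive(Debug)]
enum RequestError {
    /// HTTP 429 Too Many Requests.
    TooManyRequests,
    /// Transport-level failure, such as a timeout.
    Io(Error),
}

/// Keep successful responses and count skipped 429s and timeouts separately.
fn summarize<T>(results: Vec<Result<T, RequestError>>) -> (Vec<T>, usize, usize) {
    let mut ok = Vec::new();
    let (mut rate_limited, mut timed_out) = (0, 0);
    for result in results {
        match result {
            Ok(value) => ok.push(value),
            Err(RequestError::TooManyRequests) => rate_limited += 1,
            Err(RequestError::Io(e)) if e.kind() == ErrorKind::TimedOut => timed_out += 1,
            // Anything else is reported but does not stop the run.
            Err(RequestError::Io(e)) => eprintln!("unexpected error: {}", e),
        }
    }
    (ok, rate_limited, timed_out)
}

fn main() {
    let results: Vec<Result<u64, RequestError>> = vec![
        Ok(1),
        Err(RequestError::TooManyRequests),
        Err(RequestError::Io(Error::new(ErrorKind::TimedOut, "timed out"))),
        Ok(4),
    ];
    let (ok, rate_limited, timed_out) = summarize(results);
    println!("ok={:?} rate_limited={} timed_out={}", ok, rate_limited, timed_out);
}
```

Counting the skipped requests makes it visible when a measurement is based on fewer responses than requested.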
## Results
You can find the results in the [results](results) folder.
## Conclusion
Input variables are set in the script itself, but this can easily be changed.
Among them: `node_end_point`, `block_num_total`, `cnt` (number of runs), `block_range`.
Two versions of the script were written; they differ in how the number of runs is applied. In the first, the whole script, with its *eth_getBlockByNumber* and *eth_getTransactionReceipt* requests, is repeated `cnt` times. In the second, each request is repeated `cnt` times.
The second version seems to be more illustrative.
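The difference between the two versions comes down to loop order, which can be sketched as follows. The function names are hypothetical, and the functions only record the order in which requests would be issued rather than sending anything.

```rust
const REQUESTS: [&str; 2] = ["eth_getBlockByNumber", "eth_getTransactionReceipt"];

/// First version: the whole sequence of requests is repeated `cnt` times.
fn whole_script_repeated(cnt: usize) -> Vec<&'static str> {
    let mut order = Vec::new();
    for _ in 0..cnt {
        for request in REQUESTS.iter() {
            order.push(*request);
        }
    }
    order
}

/// Second version: each request is repeated `cnt` times before moving on.
fn each_request_repeated(cnt: usize) -> Vec<&'static str> {
    let mut order = Vec::new();
    for request in REQUESTS.iter() {
        for _ in 0..cnt {
            order.push(*request);
        }
    }
    order
}

fn main() {
    println!("{:?}", whole_script_repeated(2));
    println!("{:?}", each_request_repeated(2));
}
```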
I was surprised by the results of the script: the minimum differed between `node_end_point` values and between `block_num_total` values.