{"id":20388385,"url":"https://github.com/trudi-group/ipfs-crawler","last_synced_at":"2025-06-12T20:32:39.278Z","repository":{"id":37694813,"uuid":"241151836","full_name":"trudi-group/ipfs-crawler","owner":"trudi-group","description":"A crawler for the IPFS network, code for our paper (https://arxiv.org/abs/2002.07747). Also holds scripts to evaluate the obtained data and make similar plots as in the paper.","archived":false,"fork":false,"pushed_at":"2024-11-01T01:48:41.000Z","size":171814,"stargazers_count":69,"open_issues_count":3,"forks_count":16,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-19T19:08:20.814Z","etag":null,"topics":["crawler","ipfs","ipfs-network","kademlia-dht","libp2p"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trudi-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-17T16:15:08.000Z","updated_at":"2025-03-24T02:41:08.000Z","dependencies_parsed_at":"2022-09-11T23:40:57.904Z","dependency_job_id":"a8eff000-8d1c-41f3-94fb-1c5af92cc757","html_url":"https://github.com/trudi-group/ipfs-crawler","commit_stats":{"total_commits":165,"total_committers":8,"mean_commits":20.625,"dds":0.4484848484848485,"last_synced_commit":"fee3d056e58590515bd3ccdbb49c6b6384b394cc"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/trudi-group/ipfs-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trudi-group%2Fipfs-crawler","tags_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/trudi-group%2Fipfs-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trudi-group%2Fipfs-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trudi-group%2Fipfs-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trudi-group","download_url":"https://codeload.github.com/trudi-group/ipfs-crawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trudi-group%2Fipfs-crawler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259522455,"owners_count":22870469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","ipfs","ipfs-network","kademlia-dht","libp2p"],"created_at":"2024-11-15T03:09:34.919Z","updated_at":"2025-06-12T20:32:39.236Z","avatar_url":"https://github.com/trudi-group.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Libp2p-Crawler\n\nA crawler for the Kademlia-part of various libp2p networks.\n\n**For more details, see [our paper](https://arxiv.org/abs/2002.07747).**\n\nIf you use our work, please **cite our papers**:\n\nSebastian A. Henningsen, Martin Florian, Sebastian Rust, Björn Scheuermann:\n**Mapping the Interplanetary Filesystem.** *Networking 2020*: 289-297\\\n[[BibTex]](https://dblp.uni-trier.de/rec/conf/networking/HenningsenFR020.html?view=bibtex)\n\nSebastian A. 
Henningsen, Sebastian Rust, Martin Florian, Björn Scheuermann:\n**Crawling the IPFS Network.** *Networking 2020*: 679-680\\\n[[BibTex]](https://dblp.uni-trier.de/rec/conf/networking/HenningsenRF020.html?view=bibtex)\n\nFor a Live Version of the crawler results, check out our [Periodic Measurements of the IPFS Network](https://trudi.weizenbaum-institut.de/ipfs_analysis.html)\n\n## Building\n\nYou can build this in a containerized environment.\nThis will build on Debian Bullseye and extract the compiled binary to `out/`:\n```bash\n./build-in-docker.sh\n```\n\nThis is the preferred way of compilation.\nYou can also manually compile the crawler.\nThis will need an older version of Go installed, since the most recent version is usually not supported by the QUIC implementation.\n\n## Usage\n\nTo crawl the network once, execute the crawler with the corresponding config file:\n```bash\nexport LIBP2P_ALLOW_WEAK_RSA_KEYS=\"\" \u0026\u0026 export LIBP2P_SWARM_FD_LIMIT=\"10000\" \u0026\u0026 ./out/libp2p-crawler --config dist/config_ipfs.yaml\n```\n\nOne crawl will take 5-10 minutes, depending on your machine.\n\n### Docker\n\nThe image executes `dist/docker_entrypoint.sh` by default, which will set the environment variables and launch the crawler with all arguments provided to it.\nThis loads a config file located at `/libp2p-crawler/config.yaml` in the image.\nYou can thus override the executed config by mounting a different file to this location.\n\nYou'll need to mount the precomputed hashes as well as an output directory.\nThe working directory of the container is `/libp2p-crawler`.\nA typical invocation could look like this:\n\n```bash\ndocker run -it --rm \\\n  -v ./dist/config_ipfs.yaml:/libp2p-crawler/config.yaml \\\n  -v ./precomputed_hashes:/libp2p-crawler/precomputed_hashes \\\n  -v ./output_data_crawls:/libp2p-crawler/output_data_crawls \\\n  trudi-group/ipfs-crawler:latest\n```\n\nThe crawler runs as `root` within the container and, thus, also writes files as 
`uid` `0`.\nThis is somewhat annoying on the host, since files in the mapped output directory will also be owned by `root`.\n\n### Computing Preimages\n\n**Important note:** We ship the preimages necessary for a successful crawl, but you can compute them yourself with `make preimages`.\nNote that the preimages only have to be computed *once*, although computing them takes a few minutes.\n\n```bash\ngo build cmd/hash-precomputation/main.go\nmv main cmd/hash-precomputation/hash-precomputation\n./cmd/hash-precomputation/hash-precomputation\nmkdir -p precomputed_hashes\nmv preimages.csv precomputed_hashes/preimages.csv\n```\n\n## Configuration\n\nThe crawler is configured via a YAML configuration file.\nExample configurations with sane defaults are provided in [dist/](dist):\n- [dist/config_ipfs.yaml](dist/config_ipfs.yaml) contains a configuration to crawl the IPFS network.\n- [dist/config_filecoin_mainnet.yaml](dist/config_filecoin_mainnet.yaml) contains a configuration to crawl the Filecoin mainnet.\n\n### Bootstrap Peers\n\nThe crawler needs to know which peers to use to start a crawl.\nThese are configured via the configuration file.\nTo get the default bootstrap peers of an IPFS node, simply run ```./ipfs bootstrap list \u003e bootstrappeers.txt```.\n\n## In a Nutshell\n\nThis crawler is designed to enumerate all reachable nodes within the DHT/KAD-part of libp2p networks and return their neighborhood graph.\nFor each node, it saves:\n* The ID\n* All known multiaddresses that were found in the DHT\n* Whether a connection could be established\n* All peers in the routing table of the peer, if crawling succeeded\n* The agent version, if the identify protocol succeeded\n* Supported protocols, if the identify protocol succeeded\n* Plugin-extensible metadata\n\nThis is achieved by sending multiple `FindNode`-requests to each node in the network, targeted in such a way that each request extracts the contents of exactly one DHT bucket.\n\nThe crawler is optimized for 
speed, to generate as accurate snapshots as possible.\nIt starts from the (configurable) bootstrap nodes, polls their buckets and continues to connect to every peer it has not seen so far.\n\nFor an in-depth discussion of the crawler and the obtained results, you can watch @scriptkitty's talk at ProtocolLabs:\n\n[![Link to YouTube](https://img.youtube.com/vi/jQI37Y25jwk/1.jpg)](https://www.youtube.com/watch?v=jQI37Y25jwk)\n\n## Evaluation of Results\n\nAfter running a few crawls, the output directory should have some data in it.\nTo run the evaluation and generate the same plots/tables as in the paper (and more!), you have the option to run it via Docker or manually.\nWe've compiled the details [in the README](./eval/README.md).\n\n## Features\n\n### Plugins\n\nWe support implementing plugins that interact with peers discovered through a crawl.\nThese plugins are executed, in order, for all peers that are connectable.\nOutput of all plugins is collected and appended to each node's metadata.\n\nCurrently implemented plugins:\n- `bitswap-probe` probes nodes for content via Bitswap.\n  This correctly handles different Bitswap versions and capabilities of the peers.\n  See also [the README](./plugins/bsprobe/README.md).\n\n### Node Caching\n\nIf configured, the crawler will cache the nodes it has seen.\nThe next crawl will then not only start at the boot nodes but also add all previously reachable nodes to the crawl queue.\nThis can increase the crawl speed, and therefore the accuracy of the snapshots, significantly.\nDue to node churn, this setting is most reasonable when performing many consecutive crawls.\n\n## Output of a crawl\n\nA crawl writes two files to the output directory configured via the configuration file:\n* ```visitedPeers_\u003cstart_of_crawl_datetime\u003e.json```\n* ```peerGraph_\u003cstart_of_crawl_datetime\u003e.csv```\n\n### Format of ```visitedPeers```\n\n```visitedPeers``` contains a JSON structure with meta information about the crawl 
as well as each found node.\nEach node entry corresponds to exactly one node on the network and has the following fields:\n```json\n{\n  \"id\": \"\u003cmultihash of the node id\u003e\",\n  \"multiaddrs\": \u003clist of multiaddresses\u003e,\n  \"connection_error\": null | \"\u003chuman-readable error\u003e\",\n  \"result\": null (if connection_error != null) | {\n    \"agent_version\": \"\u003cagent version string, if known\u003e\",\n    \"supported_protocols\": \u003clist of supported protocols\u003e,\n    \"crawl_begin_ts\": \"\u003ctimestamp of when crawling was initiated\u003e\",\n    \"crawl_end_ts\": \"\u003ctimestamp of when crawling was finished\u003e\",\n    \"crawl_error\": null | \"\u003chuman-readable error\u003e\",\n    \"plugin_results\": null | {\n      \"\u003cplugin name\u003e\": {\n        \"begin_timestamp\": \"\u003ctimestamp of when the plugin was executed on the peer\u003e\",\n        \"end_timestamp\": \"\u003ctimestamp of when the plugin finished executing on the peer\u003e\",\n        \"error\": null | \"\u003chuman-readable error\u003e\",\n        \"result\": null (if error != null) | \u003creturn value of executing the plugin\u003e\n      }\n    }\n  }\n}\n```\n\nThe node's ID is a [multihash](https://github.com/multiformats/multihash); the addresses a peer advertises are [multiaddresses](https://github.com/multiformats/multiaddr).\n```crawlable``` is true/false and indicates whether the respective node could be reached by the crawler or not. 
Note that the crawler will try to connect to *all* multiaddresses that it found in the DHT for a given peer.\n```agent_version``` is simply the agent version string the peer provides when connecting to it.\n\nData example (somewhat anonymized):\n```json\n{\n  \"id\": \"12D3KooWDwu...\",\n  \"multiaddrs\": [\n    \"/ip6/::1/udp/4001/quic\",\n    \"/ip4/127.0.0.1/udp/4001/quic\",\n    \"/ip4/154.x.x.x/udp/4001/quic\",\n    \"...\"\n  ],\n  \"connection_error\": null,\n  \"result\": {\n    \"agent_version\": \"kubo/0.18.1/675f8bd/docker\",\n    \"supported_protocols\": [\n      \"/libp2p/circuit/relay/0.2.0/hop\",\n      \"/ipfs/ping/1.0.0\",\n      \"...\",\n      \"/ipfs/id/1.0.0\",\n      \"/ipfs/id/push/1.0.0\"\n    ],\n    \"crawl_begin_ts\": \"2023-04-27T15:57:11.782371723+02:00\",\n    \"crawl_end_ts\": \"2023-04-27T15:57:13.434195769+02:00\",\n    \"crawl_error\": null,\n    \"plugin_data\": {\n      \"bitswap-probe\": {\n        \"begin_timestamp\": \"2023-04-27T15:57:14.434195769+02:00\",\n        \"end_timestamp\": \"2023-04-27T15:57:15.434195769+02:00\",\n        \"error\": null,\n        \"result\": {\n          \"error\": null,\n          \"haves\": null,\n          \"dont_haves\": [\n            {\n              \"/\": \"QmY7Yh4UquoXHLPFo2XbhXkhBvFoPwmQUSa92pxnxjQuPU\"\n            }\n          ],\n          \"blocks\": null,\n          \"no_response\": null\n        }\n      }\n    }\n  }\n}\n```\n\n### Format of `peerGraph`\n\n`peerGraph` is an edgelist, where each line in the file corresponds to one edge. 
A line has the form\n\n```csv\nsource,target,target_crawlable,source_crawl_timestamp\n```\n\nTwo nodes are connected if the crawler found the peer `target` in the buckets of peer `source`.\nExample line (somewhat anonymized):\n\n```csv\n12D3KooWD9QV2...,12D3KooWCDx5k1...,true,2023-04-14T03:18:06+01:00\n```\n\nwhich says that the peer with ID `12D3KooWD9QV2...` had an entry for peer `12D3KooWCDx5k1...` in its buckets and that the latter was reachable by our crawler.\n\nIf `target_crawlable` is `false`, this indicates that the crawler was not able to connect to or enumerate all of `target`'s peers.\nSince some nodes reside behind NATs or are otherwise uncooperative, this is not uncommon to see.\n\n## Libp2p complains about key lengths\n\nLibp2p uses a minimum key length of [2048 bit](https://github.com/libp2p/go-libp2p-core/blob/master/crypto/rsa_common.go), whereas IPFS uses [512 bit](https://github.com/ipfs/infra/issues/378).\nTherefore, the crawler can only connect to one IPFS bootstrap node and refuses a connection with the others, due to this key length mismatch.\nLibp2p can be configured to ignore this mismatch via an environment variable:\n\n```bash\nexport LIBP2P_ALLOW_WEAK_RSA_KEYS=\"\"\n```\n\n## Socket limit\n\nipfs-crawler uses a lot of sockets.\nOn Linux, this can result in \"too many sockets\" errors during connections.\nPlease raise the maximum number of sockets on Linux via\n```bash\nulimit -n unlimited\n```\nor equivalent commands on different platforms.\n\n## License\n\nMIT, see [LICENSE](LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrudi-group%2Fipfs-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrudi-group%2Fipfs-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrudi-group%2Fipfs-crawler/lists"}