{"id":47817649,"url":"https://github.com/commoncrawl/cc-web-graph-neo4j","last_synced_at":"2026-04-03T18:49:41.128Z","repository":{"id":344790370,"uuid":"1126449119","full_name":"commoncrawl/cc-web-graph-neo4j","owner":"commoncrawl","description":"Instructions and code for using the Common Crawl Web Graph in Neo4j format","archived":false,"fork":false,"pushed_at":"2026-03-16T09:49:54.000Z","size":120,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-16T22:36:33.537Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-01T23:50:43.000Z","updated_at":"2026-03-16T09:49:51.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/commoncrawl/cc-web-graph-neo4j","commit_stats":null,"previous_names":["commoncrawl/cc-web-graph-neo4j"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-web-graph-neo4j","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-web-graph-neo4j","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-web-graph-neo4j/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-web-graph-neo4j/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-web-graph-neo4j/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-web-graph-neo4j/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-web-graph-neo4j/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31370218,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T17:53:18.093Z","status":"ssl_error","status_checked_at":"2026-04-03T17:53:17.617Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-03T18:49:40.622Z","updated_at":"2026-04-03T18:49:41.116Z","avatar_url":"https://github.com/commoncrawl.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cc-web-graph-neo4j\n\nThis repo contains documentation and code related to the Common Crawl Foundation's [Web Graphs](https://commoncrawl.org/web-graphs),\nstored in a [neo4j graph database](https://neo4j.com/).\nWe have been computing these web graphs since 2018, and currently every crawl has\na web graph covering the previous 3 crawls.\n\nThese graphs are computed by the [WebGraph Framework](https://webgraph.di.unimi.it/). 
Historically, CCF distributed these graphs only in a\nformat that is not commonly used.\nThis repo contains both instructions for using the graphs in neo4j form and code to convert from the WebGraph\nFramework format to neo4j.\n\n## Status\n\nThis project is in beta testing. Please give it a try with the one Web Graph we've converted, which we provide in both domain and host versions.\n\nThe host Web Graph has one node per hostname, with the links between hosts as edges.\nThe domain Web Graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix\nlist maintained on publicsuffix.org.\n\n\u003e [!TIP]\n\u003e We are collecting feedback on the instructions and the code, and will be making improvements based on your needs and suggestions.\n\u003e Eventually we will provide all of our web graphs in neo4j format.\n\n## Motivation\n\nThese papers give good examples of what web graphs are useful for:\n\n- Bharat, Krishna, et al. \"Who links to whom: Mining linkage between web sites.\" Proceedings 2001 IEEE International\n  Conference on Data Mining. IEEE, 2001.\n- Somboonviwat, Kulwadee, Masaru Kitsuregawa, and Takayuki Tamura. \"Simulation study of language specific web crawling.\"\n  21st International Conference on Data Engineering Workshops (ICDEW'05). IEEE, 2005.\n- Lehmberg, Oliver, Robert Meusel, and Christian Bizer. \"Graph structure in the web: aggregated by pay-level domain.\"\n  Proceedings of the 2014 ACM conference on Web science. 
2014.\n\n## Hardware Requirements\n\nWe recommend 2–4 CPU cores or more, 16–32 GB of memory, and ample storage (512 GB to 1 TB).\n\n## Docker container\n\nThese instructions run neo4j inside a Docker container.\nThe container is configured to accept exec operations as described in this README.\n\n```\nmkdir -p data/neo4j_db data/import data/export logs plugins\nPW=asdfasdf CONAME=web-graph-neo4j bash create.sh\ndocker stop web-graph-neo4j\n```\n\n\u003e [!IMPORTANT]\n\u003e Buglet: `logs/` and `data/neo4j_db` end up owned by user:group 7474:7474.\n\nThe proper way to fix the permissions is to create a user with that uid/gid on the host and chown the directories to\nthat user.\n\n```shell\nsudo groupadd -g 7474 neo4j\nsudo useradd -u 7474 -g 7474 neo4j\nsudo chown -R neo4j:neo4j data logs\n```\n\nYou could also add your own user to the neo4j group for simplified access.\n\nAt this point you have a container (with neo4j not running yet) that you can stop, start, and run commands in.\nFor example:\n\n```\ndocker start web-graph-neo4j\ndocker exec web-graph-neo4j ls /data\ndocker stop web-graph-neo4j\n```\n\nAlso, note that there are three special directories on the local disk: one for the neo4j database, one for incoming files,\nand one for files created by running commands in the container. 
These are:\n\n- data/neo4j_db\n- data/import\n- data/export\n\n## Download and use an existing neo4j web graph\n\nOur pre-made neo4j format web graphs are stored as neo4j dump files.\nTo use them, you have to download the dumps and then load them.\n\n\u003e [!TIP]\n\u003e Plan for up to 500-700 GB of disk space for the whole process (the dump files can be removed after loading).\n\nThe dump for the domain Web Graph is ~100 GB and the dump for the host Web Graph is 180 GB.\nHowever, the loaded database is about 2.5-3 times the dump size, and this may increase if you create more indexes.\n\n\u003e [!IMPORTANT]\n\u003e neo4j community edition supports only one database per instance, so we\n\u003e strongly recommend picking one dump to load and work with.\n\n### Download\n\n#### Domain Web Graph\n\n```\nwget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-system.dump\nwget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-neo4j.dump\n```\n\nor from inside AWS:\n\n```\ns3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-system.dump\ns3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-neo4j.dump\n```\n\n#### Host Web Graph\n\n```\nwget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-system.dump\nwget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-neo4j.dump\n```\n\nor from inside AWS:\n\n```\ns3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-system.dump\ns3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-neo4j.dump\n```\n\n### Load\n\nThis step turns the dump files into a neo4j database. 
Note that the database will be about 2.5 to 3 times the size of the dump.\n\nMove the dumps into the import directory (shown here for the domain Web Graph; use the `host` files instead if you chose the host graph):\n\n```shell\nmv cc-main-2025-oct-nov-dec-domain-system.dump data/import/system.dump\nmv cc-main-2025-oct-nov-dec-domain-neo4j.dump data/import/neo4j.dump\n```\n\n\u003e [!IMPORTANT]\n\u003e Load and dump operations should always be performed with neo4j in offline mode, or stopped.\n\u003e You can check using `docker exec web-graph-neo4j neo4j status`.\n\nLoad the system and neo4j databases:\n\n```shell\ndocker start web-graph-neo4j\ndocker exec web-graph-neo4j neo4j-admin database load --expand-commands system --from-path=/import --overwrite-destination=true\ndocker exec web-graph-neo4j neo4j-admin database load --expand-commands neo4j --from-path=/import --overwrite-destination=true\ndocker stop web-graph-neo4j\n```\n\nAt this point, you should see the unpacked database in `data/neo4j_db`. If you like, you can now remove the two dump files\nin `data/import/`.\n\n### Use\n\nThe container is configured to sleep indefinitely. After starting it, you can use `docker exec` to start neo4j:\n\n```shell\ndocker start web-graph-neo4j\ndocker exec web-graph-neo4j neo4j start\n```\n\nAfterwards, you can access it in a browser at http://localhost:7474/\n\nIf you want to run scripts against neo4j, write the output into `/export`.\n\nThe web dashboard looks like this:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/dashboard_eg.jpg\" alt=\"You might see a dashboard enabled to play with\" width=\"800\"/\u003e\n\u003c/p\u003e\n\n## Type of Nodes\n\nExample node details for the host-level or domain-level Web Graph (note: `num_hosts` is only provided in the domain-level graph):\n\n| Key        | Value                                           |\n|--------------|--------------------------------------------------|\n| `\u003cid\u003e`     | 4:5b402213-36e2-4fd4-af16-2f4de077133b:50869977 |\n| num_hosts  | 2365                                            |\n| host_parts | [\"com\", \"microsoft\"]                 
           |\n| id         | \"105638887\"                                     |\n\n## Credits\n\nOur data originates from the [Web Graph](https://commoncrawl.org/web-graphs), and the insights align\nwith [Web Graph Statistics](https://commoncrawl.github.io/cc-webgraph-statistics/); the project presents results\non [neo4j](https://github.com/neo4j/neo4j).\n\n## Contributing\n\nWe'd love to hear from you.\nFeedback and code contributions are most welcome!\n\nFor example, let us know whether the instructions ran end-to-end on your machine, and don't forget to note your OS, RAM, disk,\nand the Web Graph release you used.\nIf you have ideas for analyses or queries you would like to run on the Web Graph, please send them our way as well, and we\nwill be delighted to help.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-web-graph-neo4j","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-web-graph-neo4j","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-web-graph-neo4j/lists"}