{"id":28626182,"url":"https://github.com/commoncrawl/wac2025-webgraph-workshop","last_synced_at":"2025-06-12T08:41:09.543Z","repository":{"id":285809771,"uuid":"959421296","full_name":"commoncrawl/wac2025-webgraph-workshop","owner":"commoncrawl","description":"Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025","archived":false,"fork":false,"pushed_at":"2025-04-10T05:43:03.000Z","size":1486,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-10T06:36:45.278Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-02T18:55:33.000Z","updated_at":"2025-04-10T05:43:06.000Z","dependencies_parsed_at":"2025-04-02T20:19:50.632Z","dependency_job_id":"47ba1c50-8adb-4978-b226-8fc72af7bfd0","html_url":"https://github.com/commoncrawl/wac2025-webgraph-workshop","commit_stats":null,"previous_names":["commoncrawl/wac2025-webgraph-workshop"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/wac2025-webgraph-workshop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwac2025-webgraph-workshop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwac2025-webgraph-workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwac2025-webgraph-workshop/re
leases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwac2025-webgraph-workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/wac2025-webgraph-workshop/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwac2025-webgraph-workshop/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432314,"owners_count":22856724,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-12T08:41:08.085Z","updated_at":"2025-06-12T08:41:09.537Z","avatar_url":"https://github.com/commoncrawl.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"Introduction to Web Graphs\n==========================\n\nWorkshop at the [IIPC Web Archiving Conference 2025](https://netpreserve.org/ga2025/).\n\n- [Workshop Slides](./docs/web-graph-workshop-wac2025.pdf)\n- [Links and bibliographic references](./docs/web-graph-workshop-wac2025.bib)\n\n\n\n## Preparations Ahead of the Workshop\n\n\n### Installation of the Java JDK\n\nIf you are a Java developer and familiar with Java, Maven and Git, please go directly to the section \"[Set Up for Java Developers](#set-up-for-java-developers)\". In that case, it's assumed that the Java JDK and all development resources are already installed.\n\nPlease install the latest Java JDK from \u003chttps://www.oracle.com/java/technologies/downloads/\u003e or make sure it is already installed on your laptop. 
Please follow the installation instructions \u003chttps://docs.oracle.com/en/java/javase/21/install/overview-jdk-installation.html\u003e.\n\nNotes:\n- Java 11 or higher is required.\n- The Java Development Kit (JDK) is required – the Java runtime (JRE) is not sufficient, because we will use the [JShell](https://docs.oracle.com/en/java/javase/21/jshell/introduction-jshell.html) included only in the JDK.\n\n\n\n### Download the CC-Webgraph JAR\n\nPlease download the full cc-webgraph JAR (including all dependent libraries) from [here](https://github.com/commoncrawl/wac2025-webgraph-workshop/raw/refs/heads/main/data/large-files/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar).\n\nIn the following, we refer to this JAR file using the variable `$CC_WEBGRAPH_JAR`. If you know about environment variables, you should define `CC_WEBGRAPH_JAR` and point it to the absolute path of the downloaded JAR file. If you use a shell, you can define the variable as\n\n    CC_WEBGRAPH_JAR=\"$PWD\"/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar\n\nOptionally, you might want to set the environment variable `CLASSPATH`, used by Java and the JShell, and point it to the JAR file. On Unix systems this is done by:\n\n    CLASSPATH=\"$CC_WEBGRAPH_JAR\"\n    export CLASSPATH\n\n\n### Clone or Download the cc-webgraph Project Repository\n\nThis step is optional.\n\nThe `cc-webgraph` project repository bundles tools and scripts to construct, process and explore Common Crawl web graphs.\n\nYou can either clone the repository:\n\n    git clone https://github.com/commoncrawl/cc-webgraph.git\n\nor download it as a zip file:\n\n    wget --timestamping https://github.com/commoncrawl/cc-webgraph/archive/refs/heads/main.zip\n    unzip main.zip\n\nIn the following, we refer to the project directory (`cc-webgraph` or `cc-webgraph-main`) using the variable `$CC_WEBGRAPH`. 
If you use a shell, you can define the variable as\n\n    CC_WEBGRAPH=\"$PWD\"/cc-webgraph        # cloned repository\n\nor\n\n    CC_WEBGRAPH=\"$PWD\"/cc-webgraph-main   # zipped repository package\n\nWe will use scripts from it in the workshop, but you can also visit and download the scripts individually from GitHub.\n\n\n### Get Familiar With Java and the JShell\n\nPlease read about the [JShell](https://docs.oracle.com/en/java/javase/21/jshell/introduction-jshell.html).\nIdeally, you launch it in a terminal by typing `jshell`. You should get the prompt:\n\n    |  Welcome to JShell -- Version 21.0.6\n    |  For an introduction type: /help intro\n\n    jshell\u003e\n\n\n### Bash, cURL and Wget\n\nThis requirement is optional.\n\nWe provide scripts to download CCF's webgraphs and to build the required offset and vertex maps. If you want to run the scripts, a Bash shell (version 4 or higher) is required. In addition, the download script relies on cURL or Wget as the download tool.\n\nHowever, if you have no Bash, or neither cURL nor Wget installed, that is not a problem. You only need to copy-paste a few commands into your terminal.\n\n\n### Set Up for Java Developers\n\nIt is assumed that the Java JDK (Java 11 or higher), Maven and Git are installed.\n\n1. Clone the \"cc-webgraph\" repository:\n\n        git clone https://github.com/commoncrawl/cc-webgraph.git\n\n2. Run the Java build:\n\n        cd cc-webgraph/\n        mvn package\n\n3. Try the cc-webgraph package:\n\n        java -cp target/cc-webgraph-*-jar-with-dependencies.jar \u003cclassname\u003e \u003cargs\u003e...\n\n   For example:\n\n        java -cp target/cc-webgraph-*-jar-with-dependencies.jar it.unimi.dsi.webgraph.BVGraph --help\n\n4. 
Define two environment variables, pointing to the project directory and the JAR file:\n\n        CC_WEBGRAPH=\"$PWD\"\n        CC_WEBGRAPH_JAR=$(ls \"$PWD\"/target/cc-webgraph-*-jar-with-dependencies.jar)\n\n   Optionally, you might want to set the environment variable `CLASSPATH` which is used by Java and the JShell:\n\n        CLASSPATH=\"$CC_WEBGRAPH_JAR\"\n        export CLASSPATH\n\nThe project `cc-webgraph` provides a few Java classes and scripts to construct and process web graphs from Common Crawl data.\n\nThe assembly JAR file also includes the [WebGraph](https://webgraph.di.unimi.it/) and [LAW](https://law.di.unimi.it/software.php) packages required to compute [PageRank](https://en.wikipedia.org/wiki/PageRank) and [Harmonic Centrality](https://en.wikipedia.org/wiki/Centrality#Harmonic_centrality).\n\n\n### Download of Webgraphs\n\nPlease download at least one of the following webgraphs:\n\n|    Disk |    RAM | Nodes |  Arcs | Name                               |\n|--------:|-------:|------:|------:|:-----------------------------------|\n| 800 MiB |  2 GiB |  6.8M |  173M | enwiki-2024                        |\n|  13 GiB | 16 GiB |  135M | 2038M | cc-main-2025-jan-feb-mar-domain    |\n\nWhich one(s) you choose depends on your hardware. Working with the Common Crawl domain-level graph requires more disk space and RAM.\n\nPlease follow the download and setup instructions shared below together with some references about the two graphs.\n\n\n#### English Wikipedia 2024 (Provided By LAW, University of Milano)\n\nThe Laboratory of Web Algorithmics (LAW) at the University of Milano is the home of the [WebGraph](https://webgraph.di.unimi.it/) framework.\n\nThe list of LAW webgraph datasets is long and includes graphs derived from web crawls but also social network graphs. You can find them all at \u003chttps://law.di.unimi.it/datasets.php\u003e.\n\nThe webgraph of the English Wikipedia 2024 is listed as a social network graph. 
The nodes are Wikipedia articles and the arcs are links to other articles. Of course, you can treat this graph as a real webgraph, because every Wikipedia article is shown on a webpage and there is a mapping between article name and its URL.\n\nMore information about the `enwiki-2024` webgraph is found on \u003chttps://law.di.unimi.it/webdata/enwiki-2024/\u003e.\n\nThe LAW also provides ranks based on this webgraph. The Wikiranks site is worth a visit: \u003chttps://wikirank-2024.di.unimi.it/?search=\u0026type=all\u0026pageSize=20\u003e. It lets you see how three graph-based ranking algorithms (Harmonic Centrality, PageRank and Indegree Count) compare with each other and with the number of page views.\n\n\nIn order to explore the graph yourself using the WebGraph framework, please download the graph files. We need at least the six files listed in [enwiki-2024-download-list.txt](./data/enwiki-2024/enwiki-2024-download-list.txt). Please download the files into a separate folder, for example `data/enwiki-2024/`. If you have Wget installed, the download can be done with:\n\n    cd data/enwiki-2024/\n    wget --continue --timestamping --input-file enwiki-2024-download-list.txt\n\nAfter the files are downloaded, the following two commands are required to build the offset lists for both the graph and its transpose:\n\n    java -cp \"$CC_WEBGRAPH_JAR\" it.unimi.dsi.webgraph.BVGraph --offsets --list enwiki-2024\n    java -cp \"$CC_WEBGRAPH_JAR\" it.unimi.dsi.webgraph.BVGraph --offsets --list enwiki-2024-t\n\n\n\n#### Domain-Level Webgraph of Common Crawls 2025 Jan/Feb/Mar\n\nThe domain-level webgraph `cc-main-2025-jan-feb-mar-domain` was built using the hyperlinks extracted from three Common Crawl datasets crawled in January, February and March 2025. 
More information about this graph data set is found\n- on the Common Crawl blog \u003chttps://commoncrawl.org/blog/host--and-domain-level-web-graphs-january-february-and-march-2025\u003e\n- and the dataset download page \u003chttps://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/index.html\u003e\n\nPlease, download six of the domain-level files in a separate folder:\n\n- You can use the download script [`graph_explore_download_webgraph.sh`](https://github.com/commoncrawl/cc-webgraph/raw/refs/heads/main/src/script/webgraph_ranking/graph_explore_download_webgraph.sh) and run it:\n\n        cd data/cc-main-2025-jan-feb-mar-domain\n        bash graph_explore_download_webgraph.sh cc-main-2025-jan-feb-mar-domain\n\n  Or if you have a local copy of the cc-webgraph project:\n\n        bash \"$CC_WEBGRAPH\"/src/script/webgraph_ranking/graph_explore_download_webgraph.sh cc-main-2025-jan-feb-mar-domain\n\n- Alternatively, using the [download list](./data/cc-main-2025-jan-feb-mar-domain/cc-main-2025-jan-feb-mar-domain-download-list.txt):\n\n        cd data/cc-main-2025-jan-feb-mar-domain\n        wget --continue --timestamping --input-file cc-main-2025-jan-feb-mar-domain-download-list.txt\n\n- Or file by file:\n\n    | Size      | File |\n    | --------- | ---- |\n    | 941.1 MiB | [cc-main-2025-jan-feb-mar-domain-vertices.txt.gz](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/domain/cc-main-2025-jan-feb-mar-domain-vertices.txt.gz) |\n    | 4.6 GiB   | [cc-main-2025-jan-feb-mar-domain.graph](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/domain/cc-main-2025-jan-feb-mar-domain.graph) |\n    |   2 KiB   | [cc-main-2025-jan-feb-mar-domain.properties](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/domain/cc-main-2025-jan-feb-mar-domain.properties) |\n    | 4.6 GiB   | 
[cc-main-2025-jan-feb-mar-domain-t.graph](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/domain/cc-main-2025-jan-feb-mar-domain-t.graph) |\n    |   2 KiB   | [cc-main-2025-jan-feb-mar-domain-t.properties](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/domain/cc-main-2025-jan-feb-mar-domain-t.properties) |\n    |   1 KiB   | [cc-main-2025-jan-feb-mar-domain.stats](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/domain/cc-main-2025-jan-feb-mar-domain.stats) |\n\n\nAfter the files are downloaded, we need to build\n- the offset lists for both the graph and its transpose\n- a mapping between vertex labels and vertex IDs\n\nThis can be done by running the script [graph_explore_build_vertex_map.sh](https://raw.githubusercontent.com/commoncrawl/cc-webgraph/refs/heads/main/src/script/webgraph_ranking/graph_explore_build_vertex_map.sh):\n\n    bash \"$CC_WEBGRAPH\"/src/script/webgraph_ranking/graph_explore_build_vertex_map.sh cc-main-2025-jan-feb-mar-domain cc-main-2025-jan-feb-mar-domain-vertices.txt.gz\n\nOr, in case you cannot run the script, by building the offset lists with the commands:\n\n    java -cp \"$CC_WEBGRAPH_JAR\" it.unimi.dsi.webgraph.BVGraph --offsets --list cc-main-2025-jan-feb-mar-domain\n    java -cp \"$CC_WEBGRAPH_JAR\" it.unimi.dsi.webgraph.BVGraph --offsets --list cc-main-2025-jan-feb-mar-domain-t\n\nAnd downloading the vertex map from [here](https://github.com/commoncrawl/wac2025-webgraph-workshop/raw/refs/heads/main/data/large-files/cc-main-2025-jan-feb-mar-domain.iepm).\n\nWhen everything is done, you should see the following files in your 
directory:\n```\ncc-main-2025-jan-feb-mar-domain.graph\ncc-main-2025-jan-feb-mar-domain.properties\ncc-main-2025-jan-feb-mar-domain-t.graph\ncc-main-2025-jan-feb-mar-domain-t.properties\ncc-main-2025-jan-feb-mar-domain-vertices.txt.gz\ncc-main-2025-jan-feb-mar-domain.stats\ncc-main-2025-jan-feb-mar-domain.offsets\ncc-main-2025-jan-feb-mar-domain.obl\ncc-main-2025-jan-feb-mar-domain-t.offsets\ncc-main-2025-jan-feb-mar-domain-t.obl\ncc-main-2025-jan-feb-mar-domain.iepm\n```\n\n\n### More About Web Graphs and Next Steps\n\nYou are now prepared for the workshop.\n\nIf you have time, we recommend watching Paolo Boldi's talk from 2013 [A modern view of centrality measures](https://www.youtube.com/watch?v=cnGJtGP4gL4).\nIt's an excellent introduction to graph centrality measures and graph-based ranking – from a real expert on the topic and one of the two authors of the WebGraph framework (the other is Sebastiano Vigna).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fwac2025-webgraph-workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fwac2025-webgraph-workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fwac2025-webgraph-workshop/lists"}