{"id":28626129,"url":"https://github.com/commoncrawl/cc-crawl-statistics","last_synced_at":"2025-06-12T08:40:58.158Z","repository":{"id":43997923,"uuid":"63318430","full_name":"commoncrawl/cc-crawl-statistics","owner":"commoncrawl","description":"Statistics of Common Crawl monthly archives mined from URL index files","archived":false,"fork":false,"pushed_at":"2025-05-28T10:03:04.000Z","size":380421,"stargazers_count":180,"open_issues_count":0,"forks_count":11,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-05-28T11:22:02.692Z","etag":null,"topics":["common-crawl","commoncrawl","statistics"],"latest_commit_sha":null,"homepage":"https://commoncrawl.github.io/cc-crawl-statistics/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-07-14T08:38:12.000Z","updated_at":"2025-05-28T10:03:08.000Z","dependencies_parsed_at":"2023-01-28T20:45:58.644Z","dependency_job_id":"db2d31c3-d61a-44f1-aa97-c4c99d1fae70","html_url":"https://github.com/commoncrawl/cc-crawl-statistics","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-crawl-statistics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-crawl-statistics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-crawl-statistics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncr
awl%2Fcc-crawl-statistics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-crawl-statistics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-crawl-statistics/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-crawl-statistics/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432236,"owners_count":22856706,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["common-crawl","commoncrawl","statistics"],"created_at":"2025-06-12T08:40:56.994Z","updated_at":"2025-06-12T08:40:58.153Z","avatar_url":"https://github.com/commoncrawl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Basic Statistics of Common Crawl Monthly Archives\n=================================================\n\nAnalyze the [Common Crawl](https://commoncrawl.org/) data to get metrics about the monthly crawl archives:\n* size of the monthly crawls, number of\n  * fetched pages\n  * unique URLs\n  * unique documents (by content digest)\n  * number of different hosts, domains, top-level domains\n* distribution of pages/URLs on hosts, domains, top-level domains\n* and ...\n  * mime types\n  * protocols / schemes (http vs. 
https)\n  * content languages (since summer 2018)\n\nThis document describes how to generate the statistics from the Common Crawl URL index files.\n\nThe results are presented on https://commoncrawl.github.io/cc-crawl-statistics/.\n\n\nStep 1: Count Items\n-------------------\n\nThe items (URLs, hosts, domains, etc.) are counted using the Common Crawl index files\non AWS S3 `s3://commoncrawl/cc-index/collections/*/indexes/cdx-*.gz`.\n\n1. define a pattern of cdx files to process - usually from one monthly crawl (here: `CC-MAIN-2016-26`)\n   - either a smaller set of local files for testing\n   ```\n   INPUT=\"test/cdx/cdx-0000[0-3].gz\"\n   ```\n   - or one monthly crawl to be accessed via Hadoop on AWS S3:\n   ```\n   INPUT=\"s3a://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-*.gz\"\n   ```\n\n2. run `crawlstats.py --job=count` to process the cdx files and count the items:\n   ```\n   python3 crawlstats.py --job=count --no-exact-counts \\\n        --no-output --output-dir .../count/ $INPUT\n   ```\n\nHelp on command-line parameters (including [mrjob](https://pypi.org/project/mrjob/) options) is shown by\n`python3 crawlstats.py --help`.\nThe option `--no-exact-counts` is recommended (and is the default) to save storage space and computation time\nwhen counting URLs and content digests.\n\n\nStep 2: Aggregate Counts\n------------------------\n\nRun `crawlstats.py --job=stats` on the output of step 1:\n```\npython3 crawlstats.py --job=stats --max-top-hosts-domains=500 \\\n     --no-output --output-dir .../stats/ .../count/\n```\nThe max. 
number of most frequent hosts and domains contained in the output is set by the option\n`--max-top-hosts-domains=N`.\n\n\nStep 3: Download the Data\n-------------------------\n\nIn order to prepare the plots, the output of step 2 must be downloaded to local disk.\nThe simplest way is to fetch the data from the Common Crawl Public Data Set bucket on AWS S3:\n```sh\nwhile read crawl; do\n    aws s3 cp s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz ./stats/$crawl.gz\ndone \u003c\u003cEOF\nCC-MAIN-2008-2009\n...\nEOF\n```\n\nOne aggregated, gzip-compressed statistics file is about 1 MiB in size. So you could just run\n[get_stats.sh](get_stats.sh) to download the data files for all released monthly crawls.\n\nThe output of step 1 is also provided on `s3://commoncrawl/`. The counts for every crawl are held\nin 10 bzip2-compressed files, together 1 GiB per crawl on average. To download the counts for one crawl:\n- if you're on AWS and [AWS CLI]() is installed and configured\n  ```sh\n  CRAWL=CC-MAIN-2022-05\n  aws s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count stats/count/$CRAWL\n  ```\n- otherwise\n  ```sh\n  CRAWL=CC-MAIN-2022-05\n  mkdir -p stats/count/$CRAWL\n  for i in $(seq 0 9); do\n    curl https://data.commoncrawl.org/crawl-analysis/$CRAWL/count/part-0000$i.bz2 \\\n      \u003estats/count/$CRAWL/part-0000$i.bz2\n  done\n  ```\n\n\nStep 4: Plot the Data\n---------------------\n\nTo prepare the plots using the downloaded aggregated data:\n```\ngzip -dc stats/CC-MAIN-*.gz | python3 plot/crawl_size.py\n```\nThe full list of commands to prepare all plots is found in [plot.sh](plot.sh). Don't forget to install the Python\nmodules [required for plotting](requirements_plot.txt).\n\n\nStep 5: Local Site Preview\n--------------------------\n\nThe [crawl statistics site](https://commoncrawl.github.io/cc-crawl-statistics/) is hosted by [GitHub Pages](https://pages.github.com/). 
The site is updated as soon as plots or description texts are updated, committed and pushed to the GitHub repository.\n\nTo preview local changes, it's possible to serve the site locally:\n1. build the Docker image with Ruby, Jekyll and the content to be served\n   ```\n   docker build -f site.Dockerfile -t cc-crawl-statistics-site:latest .\n   ```\n2. run a Docker container to serve the site preview\n   ```\n   docker run --network=host --rm -ti cc-crawl-statistics-site:latest\n   ```\n   The site should be served on localhost, port 4000 (http://127.0.0.1:4000).\n   If not, the correct location is shown in the output of the `docker run` command.\n\n   If running this on a Mac, you may find that the loopback interface (127.0.0.1) within the container is not accessible, so you can change the line in the [Dockerfile](site.Dockerfile) to:\n\n   ```\n   CMD bundle exec jekyll serve --host 0.0.0.0\n   ```\n\n   ... and then the site will be served on http://0.0.0.0:4000 instead.  (You will of course need to rebuild the Docker image after updating the Dockerfile.)\n\nRelated Projects\n----------------\n\nThe [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)\nsimplifies counting and analytics a lot: it is easier to maintain, more transparent, reproducible and\nextensible than running two MapReduce jobs. See the list of example\n- [SQL queries](https://github.com/commoncrawl/cc-index-table#query-the-table-in-amazon-athena) and\n- [Jupyter notebooks](https://github.com/commoncrawl/cc-notebooks)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-crawl-statistics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-crawl-statistics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-crawl-statistics/lists"}