{"id":19316577,"url":"https://github.com/tokenmill/common-crawl-utils","last_synced_at":"2025-04-22T17:30:27.701Z","repository":{"id":62433451,"uuid":"189964382","full_name":"tokenmill/common-crawl-utils","owner":"tokenmill","description":"Various Common Crawl utilities in Clojure.","archived":false,"fork":false,"pushed_at":"2023-12-05T22:22:56.000Z","size":56,"stargazers_count":7,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-02T02:02:09.935Z","etag":null,"topics":["cdx-api","clojure","clojure-library","common-crawl","warc"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tokenmill.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-03T08:22:08.000Z","updated_at":"2025-02-16T14:08:08.000Z","dependencies_parsed_at":"2024-11-10T01:12:01.159Z","dependency_job_id":null,"html_url":"https://github.com/tokenmill/common-crawl-utils","commit_stats":{"total_commits":16,"total_committers":7,"mean_commits":"2.2857142857142856","dds":0.625,"last_synced_commit":"2c0b500948dcba8f2255aac98fac17c4e09bb21d"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcommon-crawl-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcommon-crawl-utils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcommon-crawl-utils/releases","
manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fcommon-crawl-utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tokenmill","download_url":"https://codeload.github.com/tokenmill/common-crawl-utils/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250287337,"owners_count":21405588,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cdx-api","clojure","clojure-library","common-crawl","warc"],"created_at":"2024-11-10T01:11:56.413Z","updated_at":"2025-04-22T17:30:27.331Z","avatar_url":"https://github.com/tokenmill.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"http://www.tokenmill.lt\"\u003e\n      \u003cimg src=\".github/tokenmill-logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\n\u003c/a\u003e\n\n# common-crawl-utils\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![pipeline status](https://gitlab.com/tokenmill/oss/common-crawl-utils/badges/master/pipeline.svg)](https://gitlab.com/tokenmill/oss/common-crawl-utils/pipelines/master/latest)\n[![Clojars Project](https://img.shields.io/clojars/v/lt.tokenmill/common-crawl-utils.svg)](https://clojars.org/lt.tokenmill/common-crawl-utils)\n\nVarious [Common Crawl](https://commoncrawl.org/) utilities in Clojure:\n\n- Fetcher of the selected content from the Common Crawl Archive\n- Wrapper on the [Common Crawl Index API](https://index.commoncrawl.org/)\n- Reader 
for the raw Common Crawl WARC archives\n\n## Selected Content Fetcher\n\n```clojure\n(common-crawl-utils.fetcher/fetch-content \n  {:cdx-api \"https://index.commoncrawl.org/CC-MAIN-2019-09-index\" \n   :url \"tokenmill.lt\" \n   :filter [\"status:200\"]})\n=\u003e\n({:offset \"272838009\",\n  :content {:warc \"WARC/1.0\\r\n                   WARC-Type: response\\r\n                   WARC-Date: 2019-02-17T12:51:41Z\\r\n                   WARC-Record-ID: \u003curn:uuid:114c36c3-b278-49bf-b4c0-0cf2a0eaac7c\u003e\\r\n                   Content-Length: 33486\\r\n                   Content-Type: application/http; msgtype=response\\r\n                   WARC-Warcinfo-ID: \u003curn:uuid:64ce1a6d-7cdc-44ec-a802-f4feac213f0c\u003e\\r\n                   WARC-Concurrent-To: \u003curn:uuid:3143fe9c-ec79-4f53-8665-86c668cb46d8\u003e\\r\n                   WARC-IP-Address: 79.98.31.5\\r\n                   WARC-Target-URI: http://tokenmill.lt/\\r\n                   WARC-Payload-Digest: sha1:U3FWVBI7XZ2KVBD72MRR7TCHHXSX2FJS\\r\n                   WARC-Block-Digest: sha1:MPJRHAU5WNOIDVX2AZTDOEP2YPXXLC66\\r\n                   WARC-Identified-Payload-Type: text/html\",\n            :header \"HTTP/1.1 200 OK\\r\n                     Date: Sun, 17 Feb 2019 12:51:41 GMT\\r\n                     Content-Type: text/html;charset=UTF-8\\r\n                     X-Crawler-Transfer-Encoding: chunked\\r\n                     Server: Jetty(7.x.y-SNAPSHOT)\",\n            :html \"THE ACTUAL HTML\"},\n  :digest \"U3FWVBI7XZ2KVBD72MRR7TCHHXSX2FJS\",\n  :mime \"text/html\",\n  :charset \"UTF-8\",\n  :mime-detected \"text/html\",\n  :filename \"crawl-data/CC-MAIN-2019-09/segments/1550247481992.39/warc/CC-MAIN-20190217111746-20190217133746-00381.warc.gz\",\n  :status \"200\",\n  :urlkey \"lt,tokenmill)/\",\n  :url \"http://tokenmill.lt/\",\n  :length \"8548\",\n  :languages \"eng\",\n  :timestamp \"20190217125141\"})\n```\n\nUses [Common Crawl Index API](https://index.commoncrawl.org/) to 
get\ncoordinates to content which is stored on\n[AWS](https://registry.opendata.aws/commoncrawl/).\n\n## Common Crawl Index API Wrapper\n\nThe Common Crawl Index API wrapper lets you query the Common Crawl Index API for\ncoordinates into the Common Crawl data.\n\n### Constructing Queries\n\nAll queries must contain a \"url\" key. To return all existing coordinates\nthat match a specified host, we append \"/*\" to it or set the \"matchType\" key\nto \"host\".\n\n```\n{:url \"tokenmill.lt/*\"}\n\n{:url \"tokenmill.lt\" :matchType \"host\"}\n```\n\nAdditionally, results can be filtered. Below is an example where\ncoordinates that have \"status\" containing \"200\" and \"mime\" containing\n\"html\" are returned.\n\n```\n{:url \"tokenmill.lt/*\" :filter [\"status:200\" \"mime:html\"]}\n```\n\nAll available fields: *urlkey*, *timestamp*, *url*, *mime*, *status*,\n*digest*, *length*, *offset*, *filename*.\n\nThe full reference can be found at\n[CDX Server API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API).\n\n### Coordinates\n\nCommon Crawl is updated on a monthly basis. Each crawl has a specific\nindex API, which we can query like this:\n\n```\n(common-crawl-utils.coordinates/fetch {:cdx-api \"https://index.commoncrawl.org/CC-MAIN-2019-09-index\" :url \"tokenmill.lt\" :filter [\"status:200\"]})\n=\u003e\n({:offset \"272838009\",\n :digest \"U3FWVBI7XZ2KVBD72MRR7TCHHXSX2FJS\",\n :mime \"text/html\",\n :charset \"UTF-8\",\n :mime-detected \"text/html\",\n :filename \"crawl-data/CC-MAIN-2019-09/segments/1550247481992.39/warc/CC-MAIN-20190217111746-20190217133746-00381.warc.gz\",\n :status \"200\",\n :urlkey \"lt,tokenmill)/\",\n :url \"http://tokenmill.lt/\",\n :length \"8548\",\n :languages \"eng\",\n :timestamp \"20190217125141\"})\n```\n\nWhen the \"cdx-api\" key is not specified, the most recent index is\nused. 
Currently available index collections can be accessed with\n\"*common-crawl-utils.utils/get-crawls*\" or can be found at:\nhttps://index.commoncrawl.org/collinfo.json\n\n## Reader\n\nThe reader can directly read Common Crawl .warc files containing content, as well\nas .cdx files containing coordinates.\n\n```\n(common-crawl-utils.reader/read-warc)\n\n(common-crawl-utils.reader/read-coordinates)\n```\n\nIf no arguments are specified, the latest crawl is read. Otherwise, we\ncan specify a crawl \"id\", which can be found at\nhttps://index.commoncrawl.org/collinfo.json.\n\n```\n(first (common-crawl-utils.reader/read-warc \"CC-MAIN-2019-09\"))\n=\u003e\n{:content-length 61992,\n  :content-type #object[org.jwat.common.ContentType 0x3eb133dd \"application/http; msgtype=response\"],\n  :date #inst\"2019-02-15T19:26:02.000-00:00\",\n  :filename nil,\n  :target-uri #object[org.jwat.common.Uri 0x3efffa13 \"http://0204mm.com/?PUT=a_show\u0026AID=68666\u0026FID=1361239\u0026R2=\u0026CHANNEL=\"],\n  :target-uri-str \"http://0204mm.com/?PUT=a_show\u0026AID=68666\u0026FID=1361239\u0026R2=\u0026CHANNEL=\",\n  :warc-type \"response\",\n  :payload-stream #object[org.jwat.common.ByteCountingPushBackInputStream\n                          0x1b9da111\n                          \"org.jwat.common.ByteCountingPushBackInputStream@1b9da111\"]}\n```\n\n## License\n\nCopyright \u0026copy; 2019 [TokenMill UAB](http://www.tokenmill.lt).\n\nDistributed under the Apache License, Version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokenmill%2Fcommon-crawl-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftokenmill%2Fcommon-crawl-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokenmill%2Fcommon-crawl-utils/lists"}