{"id":15724437,"url":"https://github.com/cldellow/cdx","last_synced_at":"2025-10-21T03:30:18.570Z","repository":{"id":140790464,"uuid":"174448286","full_name":"cldellow/cdx","owner":"cldellow","description":"Scala code to interact with the Common Crawl CDX index","archived":false,"fork":false,"pushed_at":"2019-04-28T11:26:13.000Z","size":26,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-06T03:41:52.222Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cldellow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-08T01:33:49.000Z","updated_at":"2019-04-28T11:26:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"eec1e53a-f242-4af8-8af6-2ca37b764e50","html_url":"https://github.com/cldellow/cdx","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fcdx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fcdx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fcdx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fcdx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cldellow","download_url":"https://codeload.github.com/cldellow/cdx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":237425230,"owners_count":19308050,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T22:16:39.648Z","updated_at":"2025-10-21T03:30:18.254Z","avatar_url":"https://github.com/cldellow.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cdx\n\nA subset of https://github.com/ikreymer/cdx-index-client\n\nDesigned to make it easier to create subsets of the Common\nCrawl, for manipulation in other programs.\n\n## Usage\n\n```bash\n# print one copy of the URL that was captured with a 200 OK status\n./fetch CC-MAIN-2018-51 https://kwknittersguild.ca/fair/\n```\n\n```bash\n# print one 200 OK copy of the URL, plus its first 10 internal links\n./one-hop CC-MAIN-2018-51 https://kwknittersguild.ca/fair/\n```\n\n```bash\n# filter the entries in the provided file (assumes the file was previously\n# created via warc-service)\n./filter-language eng \u003cfilename.zst\u003e\n```\n\n## Cleanup\n\nFiles are stored in `./cache/{cdx,warc,misc}` by default.\n\nTo store them somewhere other than `./cache`, set the `CDX_ROOT` environment variable.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fcdx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcldellow%2Fcdx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fcdx/lists"}