https://github.com/cldellow/cdx
Scala code to interact with the Common Crawl CDX index
- Host: GitHub
- URL: https://github.com/cldellow/cdx
- Owner: cldellow
- Created: 2019-03-08T01:33:49.000Z
- Default Branch: master
- Last Pushed: 2019-04-28T11:26:13.000Z
- Last Synced: 2025-02-06T03:41:52.222Z
- Language: Shell
- Size: 25.4 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# cdx
A subset of https://github.com/ikreymer/cdx-index-client
Designed to make it easier to create subsets of the Common
Crawl, for manipulation in other programs.
## Usage
```bash
# print one copy of the URL that was captured with a 200 OK response
./fetch CC-MAIN-2018-51 https://kwknittersguild.ca/fair/
```
```bash
# print one 200 OK copy of the URL and its first 10 internal links
./one-hop CC-MAIN-2018-51 https://kwknittersguild.ca/fair/
```
```bash
# filter the entries in the provided file (assumes the file was previously
# created via warc-service)
./filter-language eng
```
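To build a subset from several pages, the `./fetch` call shown above could be wrapped in a loop. The following is only a sketch: `fetch_all`, the URL list file, and the `FETCH_CMD` override are hypothetical names introduced here and are not part of this repository. It assumes `./fetch` takes a crawl ID and a URL, as in the usage example.

```bash
# Hypothetical helper: run ./fetch once for every URL listed in a file.
# FETCH_CMD is an assumption added for this sketch so the command can be
# swapped out (for a dry run, say); it is not a cdx setting.
fetch_all() {
  local crawl="$1"
  local url_file="$2"
  local fetch_cmd="${FETCH_CMD:-./fetch}"
  local url
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with '#'.
    case "$url" in
      ''|'#'*) continue ;;
    esac
    # Redirect stdin so the command cannot consume the rest of the URL list.
    "$fetch_cmd" "$crawl" "$url" < /dev/null
  done < "$url_file"
}
```

Called as `fetch_all CC-MAIN-2018-51 urls.txt`, this would run `./fetch` once per listed URL, in file order.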
## Cleanup
Files are stored in `./cache/{cdx,warc,misc}` by default.
You can change the default path of `./cache` by overriding the `CDX_ROOT` environment variable.
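A minimal cleanup sketch, assuming that `CDX_ROOT` (when set) replaces `./cache` as the cache root and that the `cdx`, `warc`, and `misc` subdirectories sit directly beneath it. The function name `clean_cdx_cache` is hypothetical and is not provided by this repository.

```bash
# Hypothetical helper: remove the cached files described above.
clean_cdx_cache() {
  # Assumption: CDX_ROOT, when set, takes the place of ./cache.
  local root="${CDX_ROOT:-./cache}"
  local sub
  for sub in cdx warc misc; do
    # Remove only the three subdirectories named in this README;
    # anything else under the cache root is left alone.
    rm -rf -- "${root:?}/${sub}"
  done
}
```

Running `clean_cdx_cache` from the repository root would clear `./cache/{cdx,warc,misc}`. With `CDX_ROOT` exported, it would clear the same subdirectories under that path instead.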