https://github.com/anjackson/cdx-db
Generating Parquet files containing CDX data for SQL queries
- Host: GitHub
- URL: https://github.com/anjackson/cdx-db
- Owner: anjackson
- Created: 2022-11-30T23:34:06.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-26T15:58:27.000Z (almost 2 years ago)
- Last Synced: 2023-04-17T16:06:15.653Z (over 1 year ago)
- Language: Python
- Size: 439 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
# cdx-db
Generating Parquet files containing CDX data for SQL queries.

This works by querying an [OutbackCDX](https://github.com/nla/outbackcdx) service and using [fastparquet](https://fastparquet.readthedocs.io/) to build up a copy of the data in [Apache Parquet](https://parquet.apache.org/) files, then using [DuckDB](https://duckdb.org/docs/data/parquet.html) to query those files with SQL.

The `grab.py` script queries the UKWA API and builds up a Parquet file. The `query.py` script then runs some SQL queries against the file, e.g. to get a data frame of status codes in the dataset:

```
cursor.execute("SELECT statuscode,count(*) FROM cdx GROUP BY statuscode ORDER BY count(*) DESC").df()
```

Which looks like:

```
   statuscode  count_star()
0         200        460121
1         404         42536
2         302          1358
3           0          1034
4         301           841
5         400           263
6         304            73
7         403            73
8         502             3
```

## Questions
- What is the SQL dialect?
- How does this compare with plain text and with OutbackCDX index sizes?

## Dead URL Scanner
This repository also contains an experimental dead URL identification procedure.
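
The procedure itself is not described here. Purely as an illustration of how such a scan might be expressed in SQL over CDX-like rows, the sketch below flags URLs whose most recent capture returned a 404. Everything in it is an assumption rather than the repository's actual method: the `find_dead_urls` helper, the three-column `cdx` table layout, the sample rows, and the "latest capture is 404" rule are all hypothetical. It also uses Python's built-in `sqlite3` module as a stand-in for DuckDB and Parquet so that it runs without extra dependencies.

```python
import sqlite3

# Hypothetical CDX-like rows: (url, 14-digit timestamp, HTTP status code).
# Sample data for illustration only, not taken from the real dataset.
ROWS = [
    ("http://example.org/a", "20200101000000", 200),
    ("http://example.org/a", "20220101000000", 404),
    ("http://example.org/b", "20200101000000", 404),
    ("http://example.org/b", "20220101000000", 200),
    ("http://example.org/c", "20210101000000", 200),
]


def find_dead_urls(rows):
    """Return URLs whose most recent capture has status 404 (assumed rule)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE cdx (url TEXT, timestamp TEXT, statuscode INTEGER)")
    con.executemany("INSERT INTO cdx VALUES (?, ?, ?)", rows)
    # Fixed-width timestamps sort correctly as strings, so MAX() picks the
    # latest capture per URL; the join then keeps only that capture's row.
    query = """
        SELECT c.url
        FROM cdx AS c
        JOIN (SELECT url, MAX(timestamp) AS latest FROM cdx GROUP BY url) AS m
          ON c.url = m.url AND c.timestamp = m.latest
        WHERE c.statuscode = 404
        ORDER BY c.url
    """
    dead = [row[0] for row in con.execute(query)]
    con.close()
    return dead


if __name__ == "__main__":
    print(find_dead_urls(ROWS))
```

A URL that returned 404 at some point but was later captured successfully (like `http://example.org/b` in the sample rows) is not flagged, since only the latest capture is considered. A real scan would probably need a more careful rule, for example also treating repeated 5xx responses or failed connections as signs of a dead URL.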