https://github.com/atomotic/digipres-index-parquet
- Host: GitHub
- URL: https://github.com/atomotic/digipres-index-parquet
- Owner: atomotic
- Created: 2024-07-14T08:46:58.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-07-14T08:55:15.000Z (10 months ago)
# Digipres practice index as parquet
Download the SQLite file
```
wcurl https://github.com/digipres/digipres-practice-index/raw/main/releases/practice.db
```

Launch DuckDB
```
$ duckdb
```

Install the [SQLite extension](https://duckdb.org/docs/extensions/sqlite.html)
```sql
INSTALL sqlite;
LOAD sqlite;
```

Attach the SQLite database
```sql
ATTACH 'practice.db' (TYPE SQLITE);
```

Inspect the schema
```sql
.schema
CREATE TABLE publications(source_name VARCHAR, landing_page_url VARCHAR, document_url VARCHAR, slides_url VARCHAR, notes_url VARCHAR, "year" BIGINT, title VARCHAR, abstract VARCHAR, "language" VARCHAR, creators VARCHAR, institutions VARCHAR, license VARCHAR, size BIGINT, "type" VARCHAR, date VARCHAR, keywords VARCHAR);
CREATE TABLE publications_fts(title BLOB, creators BLOB, abstract BLOB, keywords BLOB, institutions BLOB, "type" BLOB);
CREATE TABLE publications_fts_config(k BLOB PRIMARY KEY, v BLOB);
CREATE TABLE publications_fts_data(id BIGINT PRIMARY KEY, block BLOB);
CREATE TABLE publications_fts_docsize(id BIGINT PRIMARY KEY, sz BLOB);
CREATE TABLE publications_fts_idx(segid BLOB, term BLOB, pgno BLOB, PRIMARY KEY(segid, term));
```

Export to Parquet
```sql
COPY practice.publications TO 'digipres_index.parquet' (FORMAT PARQUET);
```

Example queries
```sql
SELECT title, landing_page_url FROM 'digipres_index.parquet' WHERE keywords LIKE '%database%';
```

Use the remote Parquet file: https://atomotic.github.io/digipres-index-parquet/digipres_index.parquet
```sql
SELECT title FROM 'https://atomotic.github.io/digipres-index-parquet/digipres_index.parquet' limit 10;
```

Use it in the browser with the [DuckDB shell](https://shell.duckdb.org/#queries=v0,select-title-from-'https%3A%2F%2Fatomotic.github.io%2Fdigipres%20index%20parquet%2Fdigipres_index.parquet'-limit-10~,select-title%2Clanding_page_url-from-'https%3A%2F%2Fatomotic.github.io%2Fdigipres%20index%20parquet%2Fdigipres_index.parquet'-where-keywords-like-'%25warc%25'~)
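
As an optional sanity check on the two files used in the steps above (the downloaded `practice.db` and the exported `digipres_index.parquet`), their file signatures can be inspected from the shell. This is a minimal sketch, not part of the original workflow: `is_sqlite` and `is_parquet` are hypothetical helper names, and the check relies only on the fact that a SQLite database starts with the ASCII string `SQLite format 3` and a Parquet file starts and ends with the marker `PAR1`. It assumes a `head` and `tail` that support `-c`.

```sh
# Hypothetical helpers (not part of the repository): they only look at
# file signatures and do not validate the contents of the files.

# A SQLite database begins with the ASCII string "SQLite format 3".
is_sqlite() {
  [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}

# A Parquet file begins and ends with the 4-byte marker "PAR1".
is_parquet() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "PAR1" ] &&
    [ "$(tail -c 4 "$1" 2>/dev/null)" = "PAR1" ]
}
```

With the functions defined, `is_sqlite practice.db && echo ok` can be run after the download step and `is_parquet digipres_index.parquet && echo ok` after the export. A failing check usually means the download returned an HTML page or the export was interrupted.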