https://github.com/cablehead/agent-crates-io
Exploring the csv data behind crates.io
- Host: GitHub
- URL: https://github.com/cablehead/agent-crates-io
- Owner: cablehead
- Created: 2023-05-14T15:12:46.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-05-14T15:12:55.000Z (about 3 years ago)
- Last Synced: 2025-03-25T05:05:47.282Z (about 1 year ago)
- Language: Shell
- Size: 1000 Bytes
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# agent: crates.io
## To get started
```
curl -L https://static.crates.io/db-dump.tar.gz | gunzip -c | tar xvf -
```
## Contents
- get-embedding.sh: get a text-embedding-ada-002 embedding for content on stdin
- q*.sql: some SQL queries to use with duckdb, a first pass at exploring the csv data
## Resources
- https://crates.io/data-access
- To explore the csv files:
  - https://duckdb.org
  - https://github.com/jqnatividad/qsv#qsv-ultra-fast-csv-data-wrangling-toolkit
- https://github.com/kamu-data/kamu-cli/ (Mike: I'm curious whether we could ingest the crates.io csv files into this tool)
  - New generation decentralized data warehouse and streaming data pipeline
  - ~/.local/bin/kamu
## Method
- generate a csv file that's a summary of all crate information
- text-embedding-ada-002: used to generate the embeddings
- turn user prompt into an embedding
- table scan to find best matches
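
The last two steps (compare the prompt embedding against every row, keep the best matches) could be sketched roughly as below. This is only an illustration under stated assumptions, not the repository's implementation: the function name `best_matches`, the input layout (one row per crate, `name,v1,v2,...,vn`), and the choice of cosine similarity as the match score are all hypothetical.

```shell
#!/bin/sh
# Brute-force "table scan": score every row against the query vector and
# print the highest-scoring rows.
#
# Assumptions (hypothetical, not taken from the repository):
#   - the table is a CSV with one crate per row: name,v1,v2,...,vn
#   - the query is a comma-separated vector of the same length n
#   - "best match" means highest cosine similarity

best_matches() {
    query="$1"       # e.g. "0.1,0.9,0.0"
    table="$2"       # path to the CSV of crate embeddings
    limit="${3:-5}"  # number of matches to print (default 5)

    LC_ALL=C awk -F, -v q="$query" '
        BEGIN {
            # Parse the query vector once and precompute its length.
            n = split(q, qv, ",")
            for (i = 1; i <= n; i++) qnorm += qv[i] * qv[i]
            qnorm = sqrt(qnorm)
        }
        # Only score rows that have a name plus exactly n components.
        NF == n + 1 {
            dot = 0; norm = 0
            for (i = 1; i <= n; i++) {
                dot  += qv[i] * $(i + 1)
                norm += $(i + 1) * $(i + 1)
            }
            norm = sqrt(norm)
            # Skip zero vectors (this also skips a non-numeric header row).
            if (norm > 0 && qnorm > 0)
                printf "%.4f\t%s\n", dot / (qnorm * norm), $1
        }
    ' "$table" | LC_ALL=C sort -k1,1nr | head -n "$limit"
}
```

Each output line is `score<TAB>name`, highest score first. A call such as `best_matches "$(cat query.csv)" embeddings.csv 10` (both file names hypothetical) would print the ten closest crates. Every row is read on every query, so the cost grows linearly with the number of crates. Vectors from text-embedding-ada-002 have 1536 dimensions.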