https://github.com/teaxyz/chai
tea’s package dataset
- Host: GitHub
- URL: https://github.com/teaxyz/chai
- Owner: teaxyz
- License: mit
- Created: 2024-09-27T19:44:45.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-14T17:54:30.000Z (9 months ago)
- Last Synced: 2025-05-14T18:51:50.115Z (9 months ago)
- Topics: data, packages
- Language: Python
- Homepage:
- Size: 781 KB
- Stars: 191
- Watchers: 6
- Forks: 101
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# CHAI
CHAI is an attempt at an open-source data pipeline for package managers. The
goal is to have a pipeline that can use the data from any package manager and
provide a normalized data source for a wide range of use cases.
## Getting Started
Use [Docker](https://docker.com)
1. Install Docker
2. [Clone the chai repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
3. Using a terminal, navigate to the cloned repository directory
4. Run `docker compose build` to create the latest Docker images
5. Then, run `docker compose up` to launch.
> [!NOTE]
>
> This will run CHAI for all package managers. As an example, crates by
> itself will take over an hour and consume more than 5 GB of storage.
>
> Currently, we support only two package managers:
>
> - crates
> - Homebrew
>
> You can run a single package manager by running
> `docker compose up -e ... `
>
> We are planning on supporting `NPM`, `PyPI`, and `rubygems` next.
### Arguments
Specify these as environment variables, e.g. `FOO=bar docker compose up`:
- `FREQUENCY`: Sets how often (in hours) the pipeline should run.
- `TEST`: Runs the loader in test mode when set to true, skipping certain data insertions.
- `FETCH`: Determines whether to fetch new data from the source when set to true.
- `NO_CACHE`: When set to true, deletes temporary files after processing.
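As a rough illustration of how flags like these are commonly consumed, the Python sketch below reads the four variables from the environment. The `PipelineConfig` class, the `env_flag` helper, the accepted "true" spellings, and the default values are all assumptions made for this example; they are not taken from CHAI's loader code.

```python
import os
from dataclasses import dataclass


def env_flag(name: str, default: bool = False, env=os.environ) -> bool:
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as "true"; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class PipelineConfig:
    frequency_hours: int  # FREQUENCY: hours between pipeline runs
    test: bool            # TEST: run the loader in test mode
    fetch: bool           # FETCH: fetch new data from the source
    no_cache: bool        # NO_CACHE: delete temporary files after processing

    @classmethod
    def from_env(cls, env=os.environ) -> "PipelineConfig":
        """Build a config from a mapping of environment variables."""
        return cls(
            frequency_hours=int(env.get("FREQUENCY", "24")),  # assumed default
            test=env_flag("TEST", env=env),
            fetch=env_flag("FETCH", default=True, env=env),   # assumed default
            no_cache=env_flag("NO_CACHE", env=env),
        )
```

Calling `PipelineConfig.from_env()` with no arguments reads the real process environment; passing a plain dict instead makes the parsing easy to exercise in isolation.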
> [!NOTE]
> The flag `NO_CACHE` does not mean that files will not get downloaded to your local
> storage (specifically, the ./data directory). It only means that we'll
> delete these temporary files from ./data once we're done processing them.
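
The behaviour described in this note can be pictured with a small Python sketch: files are still written under the data directory, and `NO_CACHE` only controls whether they are removed afterwards. The function name, the `crates/dump.csv` path, and the fake download are assumptions for illustration, not CHAI's implementation.

```python
import shutil
import tempfile
from pathlib import Path


def process_download(data_dir: Path, no_cache: bool) -> int:
    """Write a stand-in download into data_dir, process it, then optionally delete it."""
    download = data_dir / "crates" / "dump.csv"  # hypothetical file layout
    download.parent.mkdir(parents=True, exist_ok=True)
    download.write_text("name\nserde\ntokio\n")  # stands in for the real fetch

    # "Processing": count the data rows, skipping the header line.
    rows = len(download.read_text().splitlines()) - 1

    # The file was downloaded either way; NO_CACHE only decides the cleanup.
    if no_cache:
        shutil.rmtree(download.parent)
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        print(process_download(data_dir, no_cache=True))
        print((data_dir / "crates").exists())
```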
These arguments are all configurable in the `docker-compose.yml` file.
### Docker Services Overview
1. `db`: [PostgreSQL] database for the reduced package data
2. `alembic`: handles migrations
3. `package_managers`: fetches and writes data for each package manager
4. `api`: a simple REST API for reading from the db
### Hard Reset
Stuff happens. Start over:
`rm -rf ./data`: removes all the data the fetcher has written.
## Goals
Our goal is to build a data schema that looks like this:

You can read more about specific data models in the db [readme](db/README.md).
Our specific application extracts the dependency graph to understand which
pieces of the open-source graph are critical. We also built a simple example that displays
[sbom-metadata](examples/sbom-meta) for your repository.
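One simple way to think about "critical pieces" of a dependency graph is to count how many packages depend on each package, directly or transitively. The Python sketch below does that on a made-up toy graph; the package names, the `DEPENDENCIES` mapping, and the ranking criterion are illustrative assumptions and not necessarily how CHAI's own application ranks packages.

```python
from collections import defaultdict, deque

# Toy data (made up for illustration): package -> packages it depends on.
DEPENDENCIES = {
    "app": ["web", "cli"],
    "web": ["http", "json"],
    "cli": ["json"],
    "http": ["tls"],
    "json": [],
    "tls": [],
}


def transitive_dependents(dependencies):
    """Map each package to the set of packages that depend on it, directly or not."""
    # Invert the edges: dependency -> direct dependents.
    direct = defaultdict(set)
    for package, deps in dependencies.items():
        for dep in deps:
            direct[dep].add(package)

    result = {}
    for package in dependencies:
        seen = set()
        queue = deque(direct[package])
        # Breadth-first walk "upwards" through the dependents.
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            queue.extend(direct[current])
        result[package] = seen
    return result


def rank_by_dependents(dependencies):
    """Return (package, dependent_count) pairs, most depended-upon first."""
    dependents = transitive_dependents(dependencies)
    return sorted(
        ((pkg, len(deps)) for pkg, deps in dependents.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

On the toy graph, `rank_by_dependents(DEPENDENCIES)` puts `json` and `tls` first, since three other packages ultimately rely on each of them.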
There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
> [!TIP]
> Help us add the above to the examples folder.
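
To give a flavour of what "translating semver" in the last use case involves, the Python sketch below turns a caret requirement such as `^1.2.3` into an explicit version range, following the caret rule used by Cargo and npm (the leftmost non-zero component must not change). It assumes full `major.minor.patch` versions with no pre-release or build metadata, and the function names are hypothetical; a real tool would need to cover the other operators and each package manager's own rules.

```python
def caret_bounds(requirement: str):
    """Translate a caret requirement like "^1.2.3" into (inclusive lower, exclusive upper)."""
    major, minor, patch = (int(part) for part in requirement.lstrip("^").split("."))
    lower = (major, minor, patch)
    if major > 0:
        upper = (major + 1, 0, 0)   # ^1.2.3 allows anything below 2.0.0
    elif minor > 0:
        upper = (0, minor + 1, 0)   # ^0.2.3 allows anything below 0.3.0
    else:
        upper = (0, 0, patch + 1)   # ^0.0.3 allows only 0.0.3
    return lower, upper


def satisfies(version: str, requirement: str) -> bool:
    """Check whether a plain "x.y.z" version falls inside the caret range."""
    lower, upper = caret_bounds(requirement)
    candidate = tuple(int(part) for part in version.split("."))
    return lower <= candidate < upper
```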
## FAQs / Common Issues
1. The database URL is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and
it is used as `CHAI_DATABASE_URL` in the environment. With that variable set,
`psql "$CHAI_DATABASE_URL"` will connect you to the database.
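If it helps to see what each part of that URL means (for instance when adapting it for a deployment), the Python sketch below splits it into its components using only the standard library. The `describe_database_url` function is a hypothetical helper written for this example.

```python
from urllib.parse import urlsplit

# The local development URL quoted above.
DEFAULT_URL = "postgresql://postgres:s3cr3t@localhost:5435/chai"


def describe_database_url(url: str = DEFAULT_URL) -> dict:
    """Break a PostgreSQL connection URL into its named parts."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }
```

For the default URL this yields the user `postgres`, the password `s3cr3t`, the host `localhost`, the port `5435`, and the database `chai`.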
## Deployment
```sh
export CHAI_DATABASE_URL=postgresql://:@host.docker.internal:/chai
export PGPASSWORD=
docker compose up alembic
```
## Tasks
These are tasks that can be run using [xcfile.dev](https://xcfile.dev). If you use [`pkgx`], typing
`dev` loads the environment. Alternatively, run them manually.
### reset
```sh
rm -rf db/data data .venv
```
### build
```sh
docker compose build
```
### start
Requires: build
```sh
docker compose up -d
```
### test
Env: TEST=true
Env: DEBUG=true
```sh
docker compose up
```
### full-test
Requires: build
Env: TEST=true
Env: DEBUG=true
```sh
docker compose up
```
### stop
```sh
docker compose down
```
### logs
```sh
docker compose logs
```
### db-start
Runs migrations and starts up the database
```sh
docker compose build --no-cache db alembic
docker compose up -d alembic
```
### db-reset
Requires: stop
```sh
rm -rf db/data
```
### db-generate-migration
Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
```sh
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
```
### db-upgrade
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
```sh
cd alembic
alembic upgrade head
```
### db-downgrade
Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
```sh
cd alembic
alembic downgrade -$STEP
```
### db
```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
```
### db-list-packages
```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
```
### db-list-history
```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
```
### restart-api
Refreshes table knowledge from the db.
```sh
docker compose restart api
```
### remove-orphans
```sh
docker compose down --remove-orphans
```
### run-pipeline
Inputs: SERVICE
Requires: build
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
```sh
docker compose up $SERVICE
```
[PostgreSQL]: https://www.postgresql.org
[`pkgx`]: https://pkgx.sh