https://github.com/teaxyz/chai
tea’s package dataset
- Host: GitHub
- URL: https://github.com/teaxyz/chai
- Owner: teaxyz
- License: mit
- Created: 2024-09-27T19:44:45.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-14T17:54:30.000Z (5 months ago)
- Last Synced: 2025-05-14T18:51:50.115Z (5 months ago)
- Topics: data, packages
- Language: Python
- Homepage:
- Size: 781 KB
- Stars: 191
- Watchers: 6
- Forks: 101
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CHAI
CHAI is an attempt at an open-source data pipeline for package managers. The
goal is to have a pipeline that can use the data from any package manager and
provide a normalized data source for a myriad of use cases.

## Getting Started
Use [Docker](https://docker.com)
1. Install Docker
2. Clone the chai repository (https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
3. Using a terminal, navigate to the cloned repository directory
4. Run `docker compose build` to create the latest Docker images
5. Then, run `docker compose up` to launch.

> [!NOTE]
>
> This will run CHAI for all package managers. As an example, crates by
> itself will take over an hour and consume more than 5 GB of storage.
>
> Currently, we support only two package managers:
>
> - crates
> - Homebrew
>
> You can run a single package manager by running
> `docker compose up -e ... `
>
> We are planning on supporting `NPM`, `PyPI`, and `rubygems` next.

### Arguments
Specify these as environment variables, e.g. `FOO=bar docker compose up`:
- `FREQUENCY`: Sets how often (in hours) the pipeline should run.
- `TEST`: Runs the loader in test mode when set to true, skipping certain data insertions.
- `FETCH`: Determines whether to fetch new data from the source when set to true.
- `NO_CACHE`: When set to true, deletes temporary files after processing.

> [!NOTE]
> The flag `NO_CACHE` does not mean that files will not get downloaded to your local
> storage (specifically, the ./data directory). It only means that we'll
> delete these temporary files from ./data once we're done processing them.

These arguments are all configurable in the `docker-compose.yml` file.
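
As a rough illustration of how a service could read the four arguments above from its environment, here is a minimal Python sketch. It is not CHAI's actual loader code: `PipelineConfig`, `load_config`, and every default value shown are hypothetical names and assumptions chosen for this example only.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: Optional[str], default: bool) -> bool:
    """Treat 'true' (any letter case) as True; use the default when unset or empty."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() == "true"


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the arguments described above."""

    frequency_hours: int  # FREQUENCY: how often (in hours) the pipeline runs
    test: bool            # TEST: run the loader in test mode
    fetch: bool           # FETCH: fetch new data from the source
    no_cache: bool        # NO_CACHE: delete temporary files after processing


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    # The defaults below are assumptions for illustration, not CHAI's real defaults.
    return PipelineConfig(
        frequency_hours=int(env.get("FREQUENCY", "24")),
        test=_as_bool(env.get("TEST"), default=False),
        fetch=_as_bool(env.get("FETCH"), default=True),
        no_cache=_as_bool(env.get("NO_CACHE"), default=False),
    )
```

Calling `load_config({"FREQUENCY": "6", "TEST": "true"})` yields a config that runs every 6 hours in test mode, with the assumed defaults for the other two flags. Passing a plain mapping instead of `os.environ` keeps the sketch easy to try without touching your shell environment.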
### Docker Services Overview
1. `db`: [PostgreSQL] database for the reduced package data
2. `alembic`: handles migrations
3. `package_managers`: fetches and writes data for each package manager
4. `api`: a simple REST API for reading from the db

### Hard Reset
Stuff happens. Start over:
`rm -rf ./data`: removes all the data the fetcher has written to disk.
## Goals
Our goal is to build a data schema that looks like this:

You can read more about specific data models in the db's [readme](db/README.md).

Our specific application extracts the dependency graph to understand which
pieces of the open-source graph are critical. We also built a simple example that displays
[sbom-metadata](examples/sbom-meta) for your repository.

There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)

> [!TIP]
> Help us add the above to the examples folder.

## FAQs / Common Issues
1. The database URL is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and
is used as `CHAI_DATABASE_URL` in the environment. `psql "$CHAI_DATABASE_URL"`
will connect you to the database.

## Deployment
```sh
export CHAI_DATABASE_URL=postgresql://:@host.docker.internal:/chai
export PGPASSWORD=
docker compose up alembic
```

## Tasks
These are tasks that can be run using [xcfile.dev]. If you use [`pkgx`], typing
`dev` loads the environment. Alternatively, run them manually.

### reset
```sh
rm -rf db/data data .venv
```

### build
```sh
docker compose build
```

### start
Requires: build
```sh
docker compose up -d
```

### test
Env: TEST=true
Env: DEBUG=true

```sh
docker compose up
```

### full-test
Requires: build
Env: TEST=true
Env: DEBUG=true

```sh
docker compose up
```

### stop
```sh
docker compose down
```

### logs
```sh
docker compose logs
```

### db-start
Runs migrations and starts up the database
```sh
docker compose build --no-cache db alembic
docker compose up alembic -d
```

### db-reset
Requires: stop
```sh
rm -rf db/data
```

### db-generate-migration
Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
```

### db-upgrade
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
```sh
cd alembic
alembic upgrade head
```

### db-downgrade
Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic downgrade -$STEP
```

### db
```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
```

### db-list-packages
```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
```

### db-list-history
```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
```

### restart-api
Refreshes table knowledge from the db.
```sh
docker compose restart api
```

### remove-orphans
```sh
docker compose down --remove-orphans
```

### run-pipeline
Inputs: SERVICE
Requires: build
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
docker compose up $SERVICE
```

[PostgreSQL]: https://www.postgresql.org
[`pkgx`]: https://pkgx.sh