An open API service indexing awesome lists of open source software.

https://github.com/teaxyz/chai

tea’s package dataset
https://github.com/teaxyz/chai

data packages

Last synced: 5 months ago
JSON representation

tea’s package dataset

Awesome Lists containing this project

README

          

# CHAI

CHAI is an attempt at an open-source data pipeline for package managers. The
goal is to have a pipeline that can use the data from any package manager and
provide a normalized data source for myriads of different use cases.

## Getting Started

Use [Docker](https://docker.com)

1. Install Docker
2. Clone the chai repository (https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
3. Using a terminal, navigate to the cloned repository directory
4. Run `docker compose build` to create the latest Docker images
5. Then, run `docker compose up` to launch.

> [!NOTE]
>
> This will run CHAI for all package managers. As an example crates by
> itself will take over an hour and consume >5GB storage.
>
> Currently, we support only two package managers:
>
> - crates
> - Homebrew
>
> You can run a single package manager by running
> `docker compose up -e ... `
>
> We are planning on supporting `NPM`, `PyPI`, and `rubygems` next.

### Arguments

Specify these eg. `FOO=bar docker compose up`:

- `FREQUENCY`: Sets how often (in hours) the pipeline should run.
- `TEST`: Runs the loader in test mode when set to true, skipping certain data insertions.
- `FETCH`: Determines whether to fetch new data from the source when set to true.
- `NO_CACHE`: When set to true, deletes temporary files after processing.

> [!NOTE]
> The flag `NO_CACHE` does not mean that files will not get downloaded to your local
> storage (specifically, the ./data directory). It only means that we'll
> delete these temporary files from ./data once we're done processing them.

These arguments are all configurable in the `docker-compose.yml` file.

### Docker Services Overview

1. `db`: [PostgreSQL] database for the reduced package data
2. `alembic`: handles migrations
3. `package_managers`: fetches and writes data for each package manager
4. `api`: a simple REST API for reading from the db

### Hard Reset

Stuff happens. Start over:

`rm -rf ./data`: removes all the data the fetcher is putting.

## Goals

Our goal is to build a data schema that looks like this:

![db/CHAI_ERD.png](db/CHAI_ERD.png)

You can read more about specific data models in the dbs [readme](db/README.md)

Our specific application extracts the dependency graph understand what are
critical pieces of the open-source graph. We also built a simple example that displays
[sbom-metadata](examples/sbom-meta) for your repository.

There are many other potential use cases for this data:

- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)

> [!TIP]
> Help us add the above to the examples folder.

## FAQs / Common Issues

1. The database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and
is used as `CHAI_DATABASE_URL` in the environment. `psql CHAI_DATABASE_URL`
will connect you to the database.

## Deployment

```sh
export CHAI_DATABASE_URL=postgresql://:@host.docker.internal:/chai
export PGPASSWORD=
docker compose up alembic
```

## Tasks

These are tasks that can be run using [xcfile.dev]. If you use `pkgx`, typing
`dev` loads the environment. Alternatively, run them manually.

### reset

```sh
rm -rf db/data data .venv
```

### build

```sh
docker compose build
```

### start

Requires: build

```sh
docker compose up -d
```

### test

Env: TEST=true
Env: DEBUG=true

```sh
docker compose up
```

### full-test

Requires: build
Env: TEST=true
Env: DEBUG=true

```sh
docker compose up
```

### stop

```sh
docker compose down
```

### logs

```sh
docker compose logs
```

### db-start

Runs migrations and starts up the database

```sh
docker compose build --no-cache db alembic
docker compose up alembic -d
```

### db-reset

Requires: stop

```sh
rm -rf db/data
```

### db-generate-migration

Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
```

### db-upgrade

Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic upgrade head
```

### db-downgrade

Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
cd alembic
alembic downgrade -$STEP
```

### db

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
```

### db-list-packages

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
```

### db-list-history

```sh
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
```

### restart-api

Refreshes table knowledge from the db.

```sh
docker-compose restart api
```

### remove-orphans

```sh
docker compose down --remove-orphans
```

### run-pipeline

Inputs: SERVICE
Requires: build
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai

```sh
docker compose up $SERVICE
```

[PostgreSQL]: https://www.postgresql.org
[`pkgx`]: https://pkgx.sh