https://github.com/unytics/catalog_builder

Data Catalogs Made Easy

bigquery data-catalog data-discovery databricks dbt redshift snowflake


![logo](https://github.com/unytics/catalog_builder/assets/111615732/bdb75e70-c7cd-4c7b-aa28-f015011f1edb)


Build a custom data-catalog in minutes

---


## 🔍️ 1. What is CatalogBuilder?

- CatalogBuilder is a simple tool to **generate & deploy a documentation website for your data assets**.
- It enables anyone at your company to **quickly find the trusted data they are looking for**.


## 💡 2. Why CatalogBuilder?

> There are **many open-source projects** (*amundsen, open-metadata, datahub, metacat, atlas*) for building such a catalog in-house. But because they offer many advanced features, they are **hard to manage and deploy** if you're not a tech expert. They can be even **harder to customize**.
>
> **dbt docs** is great to generate a documentation website on top of your dbt assets but:
>
> - it focuses on dbt only (while you may be interested in other sources and metadata),
> - it is very hard to customize (unless you're an Angular expert),
> - it can be slow.


👉 CatalogBuilder aims to offer a **lightweight alternative** for generating a documentation website on top of your data assets. It focuses on **read-only data discovery** and:

1. ✔️ can be easily customized and deployed by people with little technical background
2. ✔️ can therefore handle the very specific needs of your company
3. ✔️ is fast and lightweight
4. ✔️ is built on top of [mkdocs-material](https://github.com/squidfunk/mkdocs-material), a widely used Python library for publishing documentation websites (*such as [fastapi](https://fastapi.tiangolo.com/)*).


## 💥 3. Getting Started with `catalog` CLI

> `catalog` is the CLI (command-line interface) of CatalogBuilder. It generates, shows and deploys the documentation.

### 3.1 Install `catalog` CLI 🛠️

``` sh
pip install catalog-builder
```

### 3.2 Create your first documentation configuration 👨‍💻

``` sh
catalog download dbt_gitlab_data_team
```

To get started, let's download an example catalog configuration from the GitHub repo and play with it. The above command downloads the [`catalogs/dbt_gitlab_data_team`](https://github.com/unytics/catalog_builder/tree/main/catalogs/dbt_gitlab_data_team) folder to your machine.

> You will find in the folder:
>
> - `assets file`: a file containing the list of the assets you want to put in your documentation. It can be a parquet file named `assets.parquet` or a [json lines file](https://medium.com/@sujathamudadla1213/difference-between-ordinary-json-and-json-lines-fc746f93d75e) named `assets.jsonl`. Each asset in the file must have the following fields:
>     - `asset_type`: for example `table`.
>     - `documentation_path`: the path of the asset page in the generated documentation. For example `dataset_name/table_name`.
>     - `data`: a dict of attributes used to generate the documentation. For example `{"name": "foo"}`.
> - `generate_assets_file.py`: the python script used to (re)generate the `assets file`.
> - `requirements.txt`: the python requirements needed by `generate_assets_file.py`.
> - `templates`: a folder which includes a jinja-template markdown-file for each `asset_type`. These templates are used to generate a markdown documentation file for each asset.
> - `source_docs`: a folder which includes files to include as-is in the documentation.
> - `mkdocs.yml`: the mkdocs configuration file used by mkdocs to build the documentation website from the generated markdown files.
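
To make the `assets.jsonl` format concrete, here is a minimal sketch that writes and reads back a file with the three required fields. The helper names (`write_assets_file`, `read_assets_file`) and the second example asset are illustrative assumptions, not part of CatalogBuilder.

``` python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("asset_type", "documentation_path", "data")

# Hypothetical example assets; the content of `data` is free-form.
assets = [
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/table_name",
        "data": {"name": "foo"},
    },
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/other_table",
        "data": {"name": "bar", "description": "An illustrative second table."},
    },
]


def write_assets_file(assets, path):
    """Write one JSON object per line, checking the required fields first."""
    for asset in assets:
        missing = [field for field in REQUIRED_FIELDS if field not in asset]
        if missing:
            raise ValueError(f"asset is missing fields: {missing}")
    with open(path, "w", encoding="utf-8") as f:
        for asset in assets:
            f.write(json.dumps(asset) + "\n")


def read_assets_file(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    assets_path = Path(tmp) / "assets.jsonl"
    write_assets_file(assets, assets_path)
    loaded = read_assets_file(assets_path)

print(loaded[0]["documentation_path"])
```

In a real catalog folder, the file would be written next to `mkdocs.yml`, typically by `generate_assets_file.py`.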

### 3.3 Build your catalog website 👾

``` sh
catalog build dbt_gitlab_data_team
```

> 1. For each asset in the `assets file`, the jinja template matching its `asset_type` is rendered with the asset's `data` to generate a markdown file, which is written into `catalogs/dbt_gitlab_data_team/docs/` at `documentation_path`.
> 2. All files in `catalogs/dbt_gitlab_data_team/source_docs/` are copied into `catalogs/dbt_gitlab_data_team/docs/`.
> 3. Mkdocs then builds the documentation website from the markdown files into `catalogs/dbt_gitlab_data_team/site` (using the `mkdocs.yml` configuration file).
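
The first step above can be sketched roughly as follows. This is not CatalogBuilder's implementation: to keep the example dependency-free it uses `string.Template` from the standard library as a stand-in for a jinja template, and the `render_assets` helper and the `.md` suffix on the output path are assumptions.

``` python
import tempfile
from pathlib import Path
from string import Template

# Stand-in for jinja templates (one per asset_type).
# string.Template uses `$name` placeholders instead of jinja's `{{ name }}`.
templates = {
    "table": Template("# $name\n\n$description\n"),
}

# Hypothetical asset, in the format of the assets file.
assets = [
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/table_name",
        "data": {"name": "foo", "description": "An illustrative table."},
    },
]


def render_assets(assets, templates, docs_dir):
    """Render each asset with the template of its type into docs_dir."""
    written = []
    for asset in assets:
        template = templates[asset["asset_type"]]
        markdown = template.substitute(asset["data"])
        # Assumption: the page is written as `<documentation_path>.md`.
        target = Path(docs_dir) / (asset["documentation_path"] + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written


# Demo in a temporary directory standing in for the `docs/` folder.
with tempfile.TemporaryDirectory() as docs_dir:
    pages = render_assets(assets, templates, docs_dir)
    rendered = pages[0].read_text(encoding="utf-8")
    relative_page = pages[0].relative_to(docs_dir).as_posix()

print(relative_page)
print(rendered)
```

Keeping one template per `asset_type` means a new kind of asset only needs a new template file, not a change to the rendering loop.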

### 3.4 Run your catalog website locally ⚡

``` sh
catalog serve dbt_gitlab_data_team
```

> You can now see the generated documentation website at http://localhost:8000.

### 3.5 Deploy the documentation website! 🚀

**A. To deploy on GitHub pages**:

``` sh
catalog deploy github-pages dbt_gitlab_data_team
```

> Mkdocs will [deploy the site on GitHub Pages](https://www.mkdocs.org/user-guide/deploying-your-docs/) (this only works if you run the command from inside a GitHub repository).

**B. To deploy on Google Cloud Storage Bucket**:

``` sh
catalog deploy gcs dbt_gitlab_data_team
```

> Mkdocs will copy all the files in `catalogs/dbt_gitlab_data_team/site` to the bucket defined by the `site_url` value in `catalogs/dbt_gitlab_data_team/mkdocs.yml`. For instance, if the site url is `http://catalogs.unytics.io/dbt_gitlab_data_team/`, it will copy all files under `catalogs/dbt_gitlab_data_team/site` to `gs://catalogs.unytics.io/dbt_gitlab_data_team/`.
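
The mapping from `site_url` to the bucket destination described above can be illustrated with a short sketch. The `gcs_destination` function is hypothetical; it only reproduces the rule from the example (host becomes the bucket name, URL path becomes the prefix) and is not taken from CatalogBuilder's code.

``` python
from urllib.parse import urlparse


def gcs_destination(site_url):
    """Map a site URL to a gs:// destination (host -> bucket, path -> prefix)."""
    parsed = urlparse(site_url)
    if not parsed.netloc:
        raise ValueError(f"site_url has no host: {site_url!r}")
    return f"gs://{parsed.netloc}{parsed.path}"


destination = gcs_destination("http://catalogs.unytics.io/dbt_gitlab_data_team/")
print(destination)  # gs://catalogs.unytics.io/dbt_gitlab_data_team/
```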

**C. To deploy elsewhere**:

You can follow [these instructions](https://www.mkdocs.org/user-guide/deploying-your-docs/#other-providers) from mkdocs.


## 💎 4. Generate your dbt documentation

To generate a documentation website for your own dbt project, do the following:

1. Change directory to your dbt project directory.
2. Download `catalogs/dbt` documentation example by running `catalog download dbt`.
3. Run `dbt docs generate` to compute `target/manifest.json` and `target/catalog.json`.
4. Generate the assets file by running `python catalogs/dbt/generate_assets_file.py`. The script will parse `target/manifest.json` and `target/catalog.json` to generate the `assets file` in the expected format.
5. Run `catalog serve dbt` to build the website and show it locally.
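
To give an idea of what step 4 involves, here is a simplified sketch of turning dbt model nodes from a manifest into asset records. It is not the actual `generate_assets_file.py`: the `manifest_to_assets` function and the choice of `schema/name` as `documentation_path` are assumptions, and the manifest is a tiny hand-written stand-in for the content of `target/manifest.json`.

``` python
def manifest_to_assets(manifest):
    """Turn dbt model nodes into asset records (illustrative mapping only)."""
    assets = []
    for node in manifest.get("nodes", {}).values():
        # Skip tests, seeds, snapshots, etc. and keep models only.
        if node.get("resource_type") != "model":
            continue
        assets.append({
            "asset_type": "table",
            # Assumption: one page per model, grouped by schema.
            "documentation_path": f"{node['schema']}/{node['name']}",
            "data": {
                "name": node["name"],
                "description": node.get("description", ""),
                "columns": [
                    {"name": col["name"], "description": col.get("description", "")}
                    for col in node.get("columns", {}).values()
                ],
            },
        })
    return assets


# Hand-written stand-in for a small part of `target/manifest.json`.
manifest = {
    "nodes": {
        "model.my_project.orders": {
            "resource_type": "model",
            "schema": "analytics",
            "name": "orders",
            "description": "One row per order.",
            "columns": {
                "order_id": {"name": "order_id", "description": "Primary key."},
            },
        },
        "test.my_project.not_null_orders_order_id": {
            "resource_type": "test",
            "schema": "analytics",
            "name": "not_null_orders_order_id",
        },
    }
}

dbt_assets = manifest_to_assets(manifest)
print(dbt_assets[0]["documentation_path"])  # analytics/orders
```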


## Keep in touch 🧑‍💻

[Join our Slack](https://join.slack.com/t/unytics/shared_invite/zt-1gbv491mu-cs03EJbQ1fsHdQMcFN7E1Q) to ask questions, get help getting started, report a bug, suggest improvements, or simply have a chat 🙂.


## 👋 Contribute

Any contribution is more than welcome 🤗!

- Add a ⭐ on the repo to show your support
- [Join our Slack](https://join.slack.com/t/unytics/shared_invite/zt-1gbv491mu-cs03EJbQ1fsHdQMcFN7E1Q) and talk with us
- Open an issue to report a bug or suggest improvements
- Open a PR!
