{"id":14065670,"url":"https://github.com/zsvoboda/dbd","last_synced_at":"2025-09-11T05:17:39.911Z","repository":{"id":37691713,"uuid":"443827971","full_name":"zsvoboda/dbd","owner":"zsvoboda","description":"dbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.","archived":false,"fork":false,"pushed_at":"2022-02-13T18:30:58.000Z","size":2943,"stargazers_count":57,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-08T13:58:48.233Z","etag":null,"topics":["bigquery","csv","database","database-schemas","elt","etl","excel","json","mysql","parquet","postgresql","python","python3","redshift","snowflake","sql","sqlite","xls","xlsx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zsvoboda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-02T17:28:59.000Z","updated_at":"2024-11-21T21:06:42.000Z","dependencies_parsed_at":"2022-07-18T01:16:55.212Z","dependency_job_id":null,"html_url":"https://github.com/zsvoboda/dbd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zsvoboda/dbd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zsvoboda%2Fdbd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zsvoboda%2Fdbd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zsvoboda%2Fdbd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zsvoboda%2Fdbd/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zsvoboda","download_url":"https://codeload.github.com/zsvoboda/dbd/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zsvoboda%2Fdbd/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274580636,"owners_count":25311212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-11T02:00:13.660Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","csv","database","database-schemas","elt","etl","excel","json","mysql","parquet","postgresql","python","python3","redshift","snowflake","sql","sqlite","xls","xlsx"],"created_at":"2024-08-13T07:04:37.548Z","updated_at":"2025-09-11T05:17:39.875Z","avatar_url":"https://github.com/zsvoboda.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# dbd: database prototyping tool\ndbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.\n\ndbd helps you with following tasks:\n- Loading CSV, JSON, Excel, and Parquet data to database. It supports both local and online files (HTTP URLs). Data can be loaded incrementally or in full. 
\n- Transforming data in existing database tables using insert-from-sql statements.\n- Executing DDL (Data Definition Language) SQL scripts (statements like `CREATE SCHEMA`, etc.).    \n\n## How dbd works\ndbd processes a model directory that contains directories and files:\n\n- **Directories** create new database schemas.\n- **Files** create new database table or view. The new table's or view's name is the same as the data file name.\n  - `.csv`, `.json`, `.xlsx`, and `.parquet` data files are introspected and loaded to database as tables.   \n  - `.sql` files that contain SQL SELECT statements are executed and the result is loaded to database as table or view.\n  - `.ref` files contain one or more local paths or URLs pointing to supported data files. The referenced files are loaded to database as tables.  \n  - `.yaml` files contain metadata for the files above. The `.yaml` file has the same name as a data, `.sql`, or `.ref` file and specifies details of target table's columns (data types, constraints, indexes, etc.). `.yaml` files are optional. If not specified, dbd uses defaults (e.g. `TEXT` data types for CSV columns)\n  - `.ddl` files contain multiple SQL statements separated by semicolon that are executed against the database.\n\ndbd knows the correct order in which to process files in the model directory to respect mutual dependencies between \ncreated objects.\n\n![How dbd works](https://raw.githubusercontent.com/zsvoboda/dbd/master/img/dbd.infographic.png)\n\ndbd currently supports Postgres, MySQL/MariaDB, SQLite, Snowflake, BigQuery, and Redshift databases.\n\n## Getting started and Examples\nA short 5-minute getting started tutorial is available \n[here](https://zsvoboda.medium.com/analyze-covid-data-in-less-than-5-minutes-9176f440dd1a).\n\nYou can also check out dbd's [examples here](https://github.com/zsvoboda/dbd/tree/master/examples). 
\nThe easiest way to run them is to clone or download dbd's GitHub repository and start with the\n[SQLite examples](https://github.com/zsvoboda/dbd/tree/master/examples/sqlite).\n\n```shell\npython3 -m venv dbd-env\nsource dbd-env/bin/activate\npip3 install dbd\ngit clone https://github.com/zsvoboda/dbd.git\ncd dbd/examples/sqlite/basic\ndbd run . \n```\n\nThese commands should create a new `basic.db` SQLite database with `area`, `population`, and `state` tables that are created and loaded from the corresponding files in the `model` directory.\n\n## Installing dbd\ndbd requires Python 3.8 or higher. \n\n### Prerequisites\nCheck that you have Python 3.8 or higher installed.\n\n```shell\npython3 -V\n```\n\nIf not, use a package manager to install the latest Python:\n\nOn Fedora run:\n\n```shell\nsudo yum install python3\n```\n\nOn Ubuntu run:\n\n```shell\nsudo apt install python3\n```\n\nInstall the Python virtual environment package:\n\nOn Fedora run:\n\n```shell\nsudo yum install python3-virtualenv\n```\n\nOn Ubuntu run:\n\n```shell\nsudo apt install python3-venv\n```\n\nOn Windows just install Python 3.8 or higher from the Store.\n\nThen create and activate the virtual environment:\n\nOn Linux run:\n\n```shell\npython3 -m venv dbd-env\nsource dbd-env/bin/activate\n```\n\nOn Windows run:\n\n```shell\npython3 -m venv dbd-env\ncall dbd-env\\Scripts\\activate.bat\n```\n\n### PyPI\n\n```shell\npip3 install dbd\n```\n\nOR\n\n```shell\ngit clone https://github.com/zsvoboda/dbd.git\ncd dbd\npip3 install .\n```\n\n### Running dbd\n`dbd` installs a command line executable that must reside on your `PATH`. Sometimes Python places the executable \n(called `dbd`) outside of your `PATH`. Try to execute `dbd` after the installation. If the command cannot be found, \ntry to execute\n\n```shell\nexport PATH=~/.local/bin:$PATH\n```\n\nand run `dbd` again. `pip3` usually warns that the directory where it places the executable is \nnot in `PATH`. 
You need to take the scripts directory that it suggests and put it on your `PATH`.\n\nOnce you can execute the `dbd` command, clone the dbd repository and start with the SQLite examples:\n\n```shell\ngit clone https://github.com/zsvoboda/dbd.git\ncd dbd/examples/sqlite/basic\ndbd run . \n```\n\nYou can also start with [this tutorial](https://zsvoboda.medium.com/analyze-covid-data-in-less-than-5-minutes-9176f440dd1a). \n\n## Starting a new dbd project\nYou can generate the initial layout of a dbd project by executing the `init` command:\n\n```shell\ndbd init \u003cnew-project-name\u003e\n```\n\nThe `init` command generates a new dbd project directory with the following content: \n\n- `model` directory that contains the content files.   \n- `dbd.profile` configuration file that defines database connections. The profile file is usually shared by multiple dbd projects. \n- `dbd.project` project configuration file that references one of the connections from the profile file and defines the `model` directory location.  \n\n## dbd profile configuration file\ndbd stores database connections in the `dbd.profile` configuration file. dbd searches for it in the current directory or in your home directory. You can use the `--profile` option to point it to a profile file in a different location.   \n\nThe profile file is a YAML file with the following structure:\n\n```yaml\ndatabases:\n  db1:\n    db.url: \u003csql-alchemy-database-url\u003e\n  db2:\n    db.url: \u003csql-alchemy-database-url\u003e\n  db3:\n    db.url: \u003csql-alchemy-database-url\u003e\n```\n\nRead [this document](https://docs.sqlalchemy.org/en/14/core/engines.html) for more details about specific SQLAlchemy database URL formats.  \n\n## dbd project configuration file\ndbd stores project configuration in a project configuration file that is usually located in your dbd project directory. dbd searches for the `dbd.project` file in your project's directory root. 
You can also use the `--project` option of the `dbd` command to specify a custom project configuration file. \n\nThe project configuration file also uses YAML format and references the dbd model directory and a database connection from a profile config file. All paths in the project file are either absolute or relative to the directory where the profile file is located. \n\nFor example:\n\n```yaml\nmodel: ./model\ndatabase: db2\n```\n\n## Model directory\nThe model directory contains directories and files. Directories represent database schemas. Files, in most cases, represent database tables. \n\nFor example, this `model` directory layout\n\n```text\ndbd-model-directory\n+- schema1\n +-- us_states.csv\n+- schema2\n +-- us_counties.csv\n```\n\ncreates two database schemas: `schema1` and `schema2` and two database tables: `us_states` in `schema1` and `us_counties` in `schema2`. Both tables are populated with the data from the CSV files.  \n\ndbd supports the following files located in the `model` directory:\n\n* __DATA files:__ `.csv`, `.json`, `.xls`, `.xlsx`, `.parquet` files are loaded to database as tables\n* __REF files:__ `.ref` files contain one or more absolute or relative paths to local files or URLs of online data files that are loaded to database as tables. All referenced files must have the same structure as they are loaded to the same table.  \n* __SQL files:__ `.sql` files with SQL SELECT statements are executed using the insert-from-select SQL construct. dbd generates the INSERT command (the SQL file contains only a SQL SELECT statement).\n* __DDL files:__ contain a sequence of SQL statements separated by semicolons. The DDL files can be named `prolog.ddl` and `epilog.ddl`. The `prolog.ddl` is executed before all other files in a specific schema. The `epilog.ddl` is executed last. The `prolog.ddl` and `epilog.ddl` in the top-level model directory are executed as the very first or the very last files in the model. 
\n* __YAML files:__ specify additional configuration for the __DATA__, __SQL__, and __REF__ files.\n\n## REF files\n`.ref` file contains one or more references to files that dbd loads to the database as tables. The references can be URLs, absolute file paths or paths relative to the `.ref` file. All referenced data files must have the same structure as they are loaded to the same database table.\n\nHere is an example of a `.ref` file: \n\n```\nhttps://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-03-2022.csv\nhttps://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-04-2022.csv\n../data/01-05-2022.csv\n../data/01-06-2022.csv\n```\n\nThe paths and URLs can point to data files with different formats (e.g. CSV or JSON) as long as the files have the \nsame structure (number of columns and column types).\n\n### Referencing files inside ZIP archives\nREF files support paths that reference files inside ZIP archives using the `\u003e` path separator. For example:\n\n`../data/archive.zip\u003ecovid-variants.csv`\n\nOR\n\n`https://raw.githubusercontent.com/zsvoboda/dbd/master/tests/fixtures/capabilities/zip_local/data/archive.zip\u003ecovid-variants.csv`\n\n### Kaggle datasets\nYou can reference a kaggle dataset using the `kaggle://kaggle-dataset-name\u003edataset-file` url. \n\nFor example:\n\n`kaggle://kalilurrahman/new-york-times-covid19-dataset\u003eus.csv`\n\n#### Kaggle authentication\nTo use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab \nof your user profile (`https://www.kaggle.com/\u003cusername\u003e/account`) and select 'Create API Token'. \nThis will trigger the download of kaggle.json, a file containing your API credentials. 
\nPlace this file in the location `~/.kaggle/kaggle.json` (on Windows in the location \n`C:\\Users\\\u003cWindows-username\u003e\\.kaggle\\kaggle.json` - you can check the exact location, \nsans drive, with `echo %HOMEPATH%`). \nYou can define a shell environment variable `KAGGLE_CONFIG_DIR` to change this location to `$KAGGLE_CONFIG_DIR/kaggle.json` \n(on Windows it will be `%KAGGLE_CONFIG_DIR%\\kaggle.json`).\n\nFor your security, ensure that other users of your computer do not have read access to your credentials. \nOn Unix-based systems you can do this with the following command:\n\n`chmod 600 ~/.kaggle/kaggle.json`\n\nYou can also choose to export your Kaggle username and token to the environment:\n\n```shell\nexport KAGGLE_USERNAME=datadinosaur\nexport KAGGLE_KEY=xxxxxxxxxxxxxx\n```\n\n## SQL files \n`.sql` file performs SQL data transformation in the target database. It contains a SQL SELECT statement that \ndbd wraps in insert-from-select statement, executes it, and stores the result into a table or view that inherits its name from the SQL file name.\n\nHere is an example of `us_states.sql` file that creates a new `us_states` database table:\n\n```sqlite\nSELECT\n        state.abbrev AS state_code,\n        state.state AS state_name,\n        population.population AS state_population,\n        area.area_sq_mi  AS state_area_sq_mi\n    FROM state\n        JOIN population ON population.state = state.abbrev\n        JOIN area on area.state_name = state.state\n```\n\n## YAML files\n`.yaml` file specifies additional configuration for a corresponding __DATA__, __REF__ or __SQL__ file with the same base file name. 
Here is a YAML configuration example for the `us_states.sql` file above:\n\n```yaml\ntable:\n  columns:\n    state_code:\n      nullable: false\n      primary_key: true\n      type: CHAR(2)\n    state_name:\n      nullable: false\n      index: true\n      type: VARCHAR(50)\n    state_population:\n      nullable: false\n      type: INTEGER\n    state_area_sq_mi:\n      nullable: false\n      type: INTEGER\nprocess:\n  materialization: table\n  mode: drop\n```\n\nThis `.yaml` file re-types the `state_population` and the `state_area_sq_mi` columns to INTEGER, disallows NULL values in all columns, and makes the `state_code` column the table's primary key. \n\nYou don't have to describe all of the table's columns. Columns that you leave out get the default TEXT data type\nin the case of DATA files, or the type produced by the insert-from-select statement in the case of SQL files.    \n\nThe `us_states` table is dropped and its data are reloaded in full every time dbd executes this model. \n\n### Table section\nThe columns in the `.yaml` file are mapped to the columns of the table that dbd creates from the corresponding __DATA__, __REF__ or __SQL__ file. For example, CSV header columns or the `AS` aliases of SQL SELECT columns. \n\ndbd supports the following column parameters:\n\n* __type:__ column's SQL type.\n* __primary_key:__ is the column part of the table's primary key (true|false)?\n* __foreign_keys:__ all other database table columns that are referenced from a column in the table (in format `foreign-table`.`referenced-column`).\n* __nullable:__ does the column allow null values (true|false)?\n* __index:__ is the column indexed (true|false)?\n* __unique:__ does the column store unique values (true|false)?\n\n### Process section\nThe `process` section defines the following processing options:\n\n* __materialization:__ specifies whether dbd creates a physical `table` or a `view` when processing a SQL file. The __REF__ and __DATA__ files always yield a physical table. 
\n* __mode:__ specifies what dbd does with the table's data. You can specify values `drop`, `truncate`, or `keep`. The __mode__ option is ignored for views.\n\n## Iterative development\nThe `dbd` tool's `--only` parameter supports iterative development by letting you specify a subset of the \ntables to process. The `--only` parameter accepts a comma-separated list of fully qualified table names (`schema`.`table-name`).\nFor example:\n\n`dbd run --only stage.ext_country`\n\nonly processes the `ext_country` table in the `stage` schema.\n\nThe `--only` parameter also processes the dependencies of all listed tables. You can skip the dependencies with the `--no-deps` option.\n\n## Jinja templates\nMost model files support [Jinja2 templates](https://jinja.palletsprojects.com/en/3.0.x/). For example, this __REF__ file loads 6 CSV files into the database (4 online files from a URL and 2 from a local filesystem):\n\n```jinja\n{% for n in range(4) %}\nhttps://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-0{{ n+1 }}-2022.csv\n{% endfor %}\n../data/01-05-2022.csv\n../data/01-06-2022.csv\n```\nProfile and project configuration files also use Jinja2 templates. 
You can expand any environment variable with the `{{ environment-variable-name }}` syntax.\nFor example, you can define your database connection parameters like username or password in environment variables and use this profile configuration file:\n\n```yaml\ndatabases:\n  states_snowflake:\n    db.url: \"snowflake://{{ SNOWFLAKE_USER }}:{{ SNOWFLAKE_PASSWORD }}@{{ SNOWFLAKE_ACCOUNT_IDENTIFIER }}/{{ SNOWFLAKE_DB }}/{{ SNOWFLAKE_SCHEMA }}?warehouse={{SNOWFLAKE_WAREHOUSE }}\"\n  covid:\n    db.url: \"snowflake://{{ SNOWFLAKE_USER }}:{{ SNOWFLAKE_PASSWORD }}@{{ SNOWFLAKE_ACCOUNT_IDENTIFIER }}/{{ SNOWFLAKE_DB }}/{{ SNOWFLAKE_SCHEMA }}?warehouse={{SNOWFLAKE_WAREHOUSE }}\"\n```\n\n## Fast data loading mode\nAll supported database engines except SQLite support fast data loading mode. In this mode, data are loaded to a \ndatabase table using a bulk load (SQL COPY) command instead of individual INSERT statements.\n\nMySQL and Redshift require additional configuration to enable fast data loading mode. \nWithout this extra configuration dbd reverts to slow inserting mode via INSERT statements.\n\n### MySQL \nTo enable fast loading mode, you need to specify the `local_infile=1` query parameter in the MySQL connection URL.\nYou must also enable the LOCAL INFILE mode on your MySQL server. For example, you can do this by executing this \nSQL statement:\n\n```mysql\nSET GLOBAL local_infile = true\n```\n\n### Redshift\nTo enable fast loading mode, you need to specify the `copy_stage` parameter in the `dbd.project` configuration file. \nThe `copy_stage` parameter must reference a storage definition in your `dbd.profile` configuration file.\nCheck the example configuration files in the `examples/redshift/covid_cz` directory. 
Here are the example definitions of the \nenvironment variables that these configuration files use:\n\n```shell\nexport AWS_COVID_STAGE_S3_URL=\"s3://covid/stage\"\nexport AWS_COVID_STAGE_S3_ACCESS_KEY=\"AKIA43SWERQGXMUYFIGMA\"\nexport AWS_COVID_STAGE_S3_S3_SECRET_KEY=\"iujI78eDuFFGJF6PSjY/4CIhEJdMNkuS3g4t0BRwX\"\n```\n\n## License\ndbd code is open-sourced under [BSD 3-clause license](LICENSE). \n\n## Resources and References\n- [dbd getting started](https://zsvoboda.medium.com/analyze-covid-data-in-less-than-5-minutes-9176f440dd1a)\n- [dbd github repo](https://github.com/zsvoboda/dbd)\n- [dbd PyPi](https://pypi.org/project/dbd/)\n- [Submit issue](https://github.com/zsvoboda/dbd/issues)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzsvoboda%2Fdbd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzsvoboda%2Fdbd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzsvoboda%2Fdbd/lists"}