{"id":17793711,"url":"https://github.com/darenasc/aeda","last_synced_at":"2026-03-05T05:30:49.234Z","repository":{"id":45229804,"uuid":"377449717","full_name":"darenasc/aeda","owner":"darenasc","description":"Build a data catalog by running a single line of code","archived":false,"fork":false,"pushed_at":"2025-03-12T13:17:02.000Z","size":3397,"stargazers_count":17,"open_issues_count":17,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-29T17:18:14.788Z","etag":null,"topics":["data-catalog","data-exploration","database","eda","metadata","metadata-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/darenasc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-06-16T09:57:54.000Z","updated_at":"2025-03-12T13:17:06.000Z","dependencies_parsed_at":"2024-10-27T11:12:30.297Z","dependency_job_id":"f0ac0d42-0075-42fd-95b1-3647001202de","html_url":"https://github.com/darenasc/aeda","commit_stats":{"total_commits":92,"total_committers":3,"mean_commits":"30.666666666666668","dds":0.03260869565217395,"last_synced_commit":"9c51382b165df22927896ce30046a0ad032447c5"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/darenasc/aeda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darenasc%2Faeda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darenasc%2Faeda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/darenasc%2Faeda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darenasc%2Faeda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/darenasc","download_url":"https://codeload.github.com/darenasc/aeda/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darenasc%2Faeda/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30111743,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T03:40:26.266Z","status":"ssl_error","status_checked_at":"2026-03-05T03:39:15.902Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-catalog","data-exploration","database","eda","metadata","metadata-extraction"],"created_at":"2024-10-27T11:12:25.063Z","updated_at":"2026-03-05T05:30:49.216Z","avatar_url":"https://github.com/darenasc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AEDA stands for Automated Exploratory Data Analysis\n\n![](https://img.shields.io/github/license/darenasc/aeda)\n![](https://img.shields.io/github/last-commit/darenasc/aeda)\n![](https://img.shields.io/github/stars/darenasc/aeda?style=social)\n\n**AEDA** will automatically profile any [supported database](documentation/supported_databases.md) \nusing reading 
access privileges. The results of the profiling will be stored \nin a second [supported database](documentation/supported_databases.md) with write \nprivileges.\n\nProfiling a database means **metadata extraction** from all the tables of a \ngiven database and storing this information in a second metadata database \nthat can be used to query information about the source database. The metadata \ndatabase is a **data catalog**.\n\n**AEDA** generates SQL queries to be executed in the source database and \nstores the results in a metadata database. The structure of the metadata \ndatabase can be found in this [document](documentation/sql_code.md).\n\n## Usage\n\n### 1. Clone and install the repository\n\nDownload or clone this repository and install the dependencies.\n\n```bash\ngit clone https://github.com/darenasc/aeda.git\ncd aeda\n```\n\nIf you don't have [pipenv](https://pipenv.pypa.io/en/latest/) installed, you \ncan install it with:\n\n```bash\npip install pipenv\n```\n\nThen, you can install the dependencies with:\n\n```bash\npipenv install Pipfile\n```\n\n### 2. Create a database connection file\n\n`aeda` requires a `databases.ini` file in the `src/aeda/connection_strings/` \nfolder to store the connections to databases. You can rename the \n[`databases.ini.template`](src/aeda/connection_strings/databases.ini.template) \nfile that is included with the repo and then add your connections there. \nThe `databases.ini` file is not synchronised with the repo.\n\n### 3. Add database connections\n\nThe database connections have the following format. 
\n\n```ini\n# databases.ini\n[my-source-database]\ndb_engine = \u003cA-SUPPORTED-DB-ENGINE\u003e\nhost = \u003cIP-OR-HOSTNAME-SOURCE-DATABASE\u003e\nschema = \u003cSCHEMA-SOURCE-DATABASE\u003e\ncatalog = \u003cCATALOG-SOURCE-DATABASE\u003e\nuser = \u003cSOURCE-USER\u003e\npassword = \u003cSOURCE-PASSWORD\u003e\nport = \u003cSOURCE-PORT\u003e\n\n[my-metadata-database]\ndb_engine = \u003cA-SUPPORTED-DB-ENGINE\u003e\nhost = \u003cIP-OR-HOSTNAME-METADATA-DATABASE\u003e\nschema = \u003cSCHEMA-METADATA-DATABASE\u003e\ncatalog = \u003cCATALOG-METADATA-DATABASE\u003e\nuser = \u003cMETADATA-USER\u003e\npassword = \u003cMETADATA-PASSWORD\u003e\nport = \u003cMETADATA-PORT\u003e\nmetadata_database = yes # yes or no optional parameter\n\n[\u003cSQLITE3-REFERENCE-NAME\u003e]\ndb_engine = sqlite3\nschema = \u003cSQLITE3-DATABASE-NAME\u003e\nfolder = \u003cPATH/TO/THE/FOLDER/OF/THE/SQLITE3/DATABASE\u003e\nmetadata_database = yes\n```\n\nEach connection has a **`[connection-name]`** in square brackets that `aeda` uses to identify \nwhich database you want to use. The example above defines \n`[my-source-database]` and `[my-metadata-database]`, plus a SQLite3 connection.\n\n`[my-source-database]` is the database that we want to profile, so we need read \nprivileges on that database.\n`[my-metadata-database]` is the database where we will store the metadata from \n`[my-source-database]`. The database defined by `[my-metadata-database]` \nrequires write privileges.\n\nYou can check the [SQL Code](docs/sql_code.md) documentation file to learn \nabout the database structure of the metadata database and what metadata is \nextracted from the profiled sources.\n\n\u003e Note: Do not use quotes in the `databases.ini` file and remove the '\u003c' and '\u003e' characters.\n\nThe `metadata_database` parameter is optional.\n
It is used by the streamlit app, which \npresents the connections marked as `metadata_database` in a dropdown list.\n\nThe supported values for the `db_engine` property in the `databases.ini` \nfile are:\n\n* [x] `sqlite3`\n* [x] `mysql`\n* [x] `postgres`\n* [x] `mssqlserver`\n* [x] `mariadb`\n* [x] `snowflake`\n* [x] `aurora`\n* [x] `saphana`\n* [x] `saphana_odbc`\n\n#### 3.1 Create the metadata database\n\nYou can create a local SQLite3 database, or create a metadata database in \n`MySQL`, `PostgreSQL`, or `MS SQL Server`, by running the following commands from \nthe terminal in the `src/aeda` folder:\n\n```shell\npython aeda_.py create_db sqlite3 --section \u003cYOUR-SQLITE3-DATABASE\u003e  # Creates a sqlite3 database, or\npython aeda_.py create_db mysql --section \u003cmy-metadata-database\u003e\n```\n\nA connection definition for a SQLite3 database needs only three properties:\n\n```CONF\n[\u003cSQLITE3-REFERENCE-NAME\u003e]\ndb_engine = sqlite3\nschema = \u003cSQLITE3-DATABASE-NAME\u003e\nfolder = \u003cPATH/TO/THE/FOLDER/OF/THE/SQLITE3/DATABASE\u003e\n```\n\n#### 3.2 Check connections\n\nYou can check which connections are available with `list-connections`.\n
You can use the name in the `section` column to refer to that specific connection.\n\n```bash\npython aeda_.py list-connections\n```\n\n#### 3.3 Test the connections\n\nTo test the connections to the databases you have created, you can use the \nfollowing command:\n\n```bash\ncd src/aeda\npython aeda_.py test-connection my-source-database # or\npython aeda_.py test-connections my-source-database my-metadata-database # list of connection names from `databases.ini` separated by spaces\n```\n\nWhere `my-source-database` and `my-metadata-database` are the names of the \nconnection definitions in the `databases.ini` configuration file.\n\nThis should print the following:\n\n```bash\n[ OK ]  Connection to the ****.****.**** source tested successfully...\n[ OK ]  Connection to the ****.****.**** source tested successfully...\n```\n\n### 4. Exploring the source database\n\nTo explore a database you need to run the following command from the terminal \nin the `src/aeda` folder:\n\n```bash\ncd src/aeda\npython aeda_.py explore --source my-source-database --metadata my-metadata-database\n```\n\nWhere `my-source-database` and `my-metadata-database` are the names of the \nconnection definitions in the `databases.ini` configuration file.\n\n### 5. Relax and wait for the results.\n\nThe profiling process has six stages and prints `Done!` when it finishes:\n\n1. Get all the columns from the metadata.\n2. Compute the number of columns and number of rows per table.\n3. Compute the number of unique values and number of `NULL` values per column.\n4. Compute the data value frequency per column.\n5. Compute the monthly frequency of the timestamp or date type columns.\n6. Compute statistics of the numeric type columns.\n\nThe tables are processed in order of row count, so from step 3 onwards the tables with fewer rows are processed first.\n\n### 6. Visualising the results\n\nYou can query the resulting database or use a minimalistic user interface \ndeveloped with [streamlit](https://streamlit.io) from the `src/aeda/streamlit` \nfolder. It will publish the report on port `5000` of your `localhost`.\n\n```bash\ncd src/aeda/streamlit\nstreamlit run aeda_app.py\n```\n\n## Feedback is appreciated!\n\n- Any questions or feedback? Just create an [issue](https://github.com/darenasc/aeda/issues).\n- There are issues labelled `help wanted` to test commercial databases.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarenasc%2Faeda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdarenasc%2Faeda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarenasc%2Faeda/lists"}