{"id":18684816,"url":"https://github.com/j-sephb-lt-n/auto-database-structure-discovery","last_synced_at":"2026-05-06T03:31:41.547Z","repository":{"id":238751026,"uuid":"797403249","full_name":"J-sephB-lt-n/auto-database-structure-discovery","owner":"J-sephB-lt-n","description":"Automated Database Structure Discovery","archived":false,"fork":false,"pushed_at":"2024-05-14T21:02:35.000Z","size":386,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-18T18:17:20.898Z","etag":null,"topics":["database","dataviz","dbvisualizer","discovery","graph-algorithms","join","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/J-sephB-lt-n.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-07T19:01:07.000Z","updated_at":"2024-05-15T08:14:30.000Z","dependencies_parsed_at":"2024-11-07T10:19:46.477Z","dependency_job_id":"62bed010-b4c2-4bb1-ad68-721e64e37124","html_url":"https://github.com/J-sephB-lt-n/auto-database-structure-discovery","commit_stats":null,"previous_names":["j-sephb-lt-n/database-schema-discovery"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/J-sephB-lt-n/auto-database-structure-discovery","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J-sephB-lt-n%2Fauto-database-structure-discovery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J-sephB-lt-n%2Fauto-database-structure-discovery/tags","releases_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J-sephB-lt-n%2Fauto-database-structure-discovery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J-sephB-lt-n%2Fauto-database-structure-discovery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/J-sephB-lt-n","download_url":"https://codeload.github.com/J-sephB-lt-n/auto-database-structure-discovery/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J-sephB-lt-n%2Fauto-database-structure-discovery/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32677892,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-06T02:33:58.958Z","status":"ssl_error","status_checked_at":"2026-05-06T02:33:39.611Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","dataviz","dbvisualizer","discovery","graph-algorithms","join","sql"],"created_at":"2024-11-07T10:19:26.409Z","updated_at":"2026-05-06T03:31:41.529Z","avatar_url":"https://github.com/J-sephB-lt-n.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Automatic Database Structure Discovery\n\nThis code arose out of a very squeezed consulting project, in which my team was required to provide automated transaction monitoring 
for a financial client in a very short period of time. \n\nTheir database contained 66 tables, and was in general a giant disaster - no data dictionary or data expert within the company, inconsistent column naming, NULL value traps everywhere.\n\nThis repo contains the code which I wrote to perform automated structure discovery within the database. Specifically, the code aims to discover the specific ID columns which can be used to join tables to one another.\n\nHere is a visualisation of the table connections which were discovered:\n\n![The discovered table connections visualised in dbvisualizer](./join_keys_visualised.png)\n\nThe code in this repo does the following:\n\n1. Evaluate every possible pair of columns in the entire database as potential matching join keys.\n\n2. Calculate various matching metrics for each potentially matching column pair.\n\n3. Apply thresholding criteria to decide which pairs of columns constitute a useful joining pair (i.e. matching IDs useful for joining tables).\n\n4. Export the discovered structure as a SQLite database containing empty tables with foreign key constraints, so that the table relationships can be visualised, navigated and explored using any chosen database visualisation tool.\n\n5. Model the connected tables as a graph, so that multi-step join paths between tables can be discovered.\n\nThis code was written with great haste, and as a result the code is sparsely documented (and there are no tests). I would love to return to this code again at a later stage and optimise, scale and improve it.\n\nIn particular, the join-key discovery part of the code uses for loops in native Python and will be prohibitively slow on larger databases (I'd like to rewrite it in DuckDB, or something similar). The code is fast enough for .jsonl files up to a few hundred megabytes each. 
\n\nI will definitely use this code again whenever I am faced with a giant mysterious documentationless database.\n\nThe input data format expected by this codebase is:\n\n* .jsonl files in the folder /data_input/\n\n* One .jsonl file per table in the database\n\n* Each line of the .jsonl file is a single row of the table e.g. {\"col1\":\"value1\", \"col2\":\"value2\"} etc. \n\nBelow is the (Python) code which I used to run the process:\n\nSet up the logger:\n```python\nimport logging\nlogging.basicConfig(\n    level=logging.INFO,\n    format=\"%(asctime)s - %(name)s - %(levelname)s - %(message)s\",\n)\n```\n\nPivot the row-wise .jsonl files into a column format:\n```python\nimport pathlib\nimport time\nimport src.transform_data\n\nstart_time = time.perf_counter()\nfor path in pathlib.Path(\"data_input\").iterdir():\n    if path.is_file():\n        src.transform_data.pivot_jsonl(\n            input_filepath=path,\n            output_filepath=f\"temp_storage/{path.stem}.json\",\n        )\n    elif path.is_dir():\n        print(f\"Skipped directory {path}/\")\n\nprint(f\"Finished pivoting .jsonl files in {(time.perf_counter()-start_time)/60:,.1f} minutes\")\n```\n\nCompare all possible column pairs:\n```python\nimport json\nimport pathlib\nimport time\n\nimport src.discover\n\ntable_data = {}\n\nstart_time = time.perf_counter()\nfor path in pathlib.Path(\"temp_storage\").iterdir():\n    if path.is_file() and path.suffix == \".json\":\n        with open(path, \"r\", encoding=\"utf-8\") as file:\n            table_data[path.stem] = json.load(file)\n\nprint(f\"Finished reading in data in {(time.perf_counter()-start_time)/60:,.1f} minutes\")\n\nstart_time = time.perf_counter()\nsrc.discover.join_keys(\n    tbl_contents=table_data,\n    n_samples=500,\n    allowed_key_types=(int, str),\n    output_path=\"output/discover/join_keys/all_matches.json\",\n)\nprint(f\"Finished discovering join keys in {(time.perf_counter()-start_time)/60:,.1f} minutes\")\n```\n\nDecide 
which column pairs match well enough to be considered useful join keys:\n```python\nimport time\n\nimport src.decision\nfrom src.decision.comparison_operators import greater_than, less_than\n\nstart_time = time.perf_counter()\nsrc.decision.join_keys(\n    input_data_filepath=\"output/discover/join_keys/all_matches.json\",\n    min_match_criteria=(\n        (\"matches\", \"exactly_1_match_in_lookup\", \"percent\", (greater_than, 0.1)),\n        (\"sampled_col\", \"sample_size\", \"percent_null\", (less_than, 0.95)),\n        (\"sampled_col\", \"sample_size\", \"n_unique/n_rows\", (greater_than, 0.5)),\n        (\"lookup_col\", \"size\", \"percent_null\", (less_than, 0.95)),\n        (\"sampled_col\", \"sample_size\", \"n_unique\", (greater_than, 4)),\n        (\"lookup_col\", \"size\", \"n_unique\", (greater_than, 4)),\n    ),\n    output_filepath=\"output/decision/join_keys/identified_join_keys.json\"\n)\n\nprint(f\"Finished making join key decisions in {(time.perf_counter()-start_time)/60:,.1f} minutes\")\n```\n\nSave the identified join keys in a useable CSV format:\n```python\nimport src.dataviz\nsrc.dataviz.join_key_decisions_to_csv(\n    input_data_filepath=\"output/decision/join_keys/identified_join_keys.json\",\n    output_filepath=\"output/dataviz/join_key_decisions_to_csv/identified_join_keys.csv\",\n)\n```\n\nCreate a SQLite database of empty tables with the identified join keys defined as foreign key constraints - this can be visualised using any database visualisation tool\n(I used [DbVisualizer](https://www.dbvis.com/), which worked amazingly):\n```python\nimport json\nimport time\n\nimport src.dataviz\n\nstart_time = time.perf_counter()\nwith open(\"output/decision/join_keys/identified_join_keys.json\", \"r\", encoding=\"utf-8\") as file:\n    join_cols_identified = json.load(file)\n\nsrc.dataviz.make_sqlite_skeleton(\n    col_pairs=join_cols_identified,\n    output_db_path=\"output/dataviz/make_sqlite_skeleton/identified_join_keys.db\",\n) 
\nprint(f\"Exported SQLite db skeleton in {(time.perf_counter()-start_time)/60:,.1f} minutes\")\n```\n\nDiscover multi-step table connections by modelling the whole system as a graph:\n```python\nimport src.discover.table_links\nimport src.transform_data.table_link_paths_to_csv\n\nsrc.discover.table_links.create_db(\n    input_data_filepath=\"output/decision/join_keys/identified_join_keys.json\",\n    max_path_len=4,\n    output_filepath=\"output/discover/table_links/create_db/table_link_paths.pickle\",\n)\n\nsrc.transform_data.table_link_paths_to_csv(\n    input_data_filepath=\"output/discover/table_links/create_db/table_link_paths.pickle\",\n    output_filepath=\"output/transform_data/table_links_to_csv/table_link_paths.csv\",\n)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj-sephb-lt-n%2Fauto-database-structure-discovery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fj-sephb-lt-n%2Fauto-database-structure-discovery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj-sephb-lt-n%2Fauto-database-structure-discovery/lists"}