{"id":22338783,"url":"https://github.com/tracesql/tracesql-py","last_synced_at":"2025-07-30T00:31:07.009Z","repository":{"id":262337164,"uuid":"886928379","full_name":"TraceSQL/tracesql-py","owner":"TraceSQL","description":"Python client for TraceSQL lineage analyzer","archived":false,"fork":false,"pushed_at":"2024-12-12T16:13:30.000Z","size":69,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-23T07:47:33.620Z","etag":null,"topics":["data-lineage","database","lineage","sql","sql-lineage"],"latest_commit_sha":null,"homepage":"https://tracesql.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TraceSQL.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-11T21:31:27.000Z","updated_at":"2025-07-07T01:00:38.000Z","dependencies_parsed_at":"2024-11-11T22:33:09.923Z","dependency_job_id":"90c3bac1-b459-4356-a6ea-397c10ce6a5f","html_url":"https://github.com/TraceSQL/tracesql-py","commit_stats":null,"previous_names":["tracesql/tracesql-py"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/TraceSQL/tracesql-py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TraceSQL%2Ftracesql-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TraceSQL%2Ftracesql-py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TraceSQL%2Ftracesql-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/TraceSQL%2Ftracesql-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TraceSQL","download_url":"https://codeload.github.com/TraceSQL/tracesql-py/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TraceSQL%2Ftracesql-py/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267785725,"owners_count":24144118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-lineage","database","lineage","sql","sql-lineage"],"created_at":"2024-12-04T07:05:14.789Z","updated_at":"2025-07-30T00:31:06.982Z","avatar_url":"https://github.com/TraceSQL.png","language":"Python","readme":"\n## TOC\n\n- [TraceSQL Python Package](#tracesql-python-package)\n  - [Features](#features)\n  - [Installation](#installation)\n  - [Usage](#usage)\n    - [Simple example](#simple-example)\n  - [`analyze_lineage` method](#analyze_lineage-method)\n    - [Parameters](#parameters)\n    - [Response - `ApiResponse`](#response---apiresponse)\n- [Data lineage](#data-lineage)\n  - [Data lineage in SQL (technical view)](#data-lineage-in-sql-technical-view)\n    - [Simple `SELECT`](#simple-select)\n    - [Wildcard](#wildcard)\n    - [Ambiguous Queries](#ambiguous-queries)\n\n\n\n# TraceSQL Python Package\n\nThe `tracesql` Python package is a 
client for [TraceSQL](https://tracesql.com). It allows you to easily analyze SQL code for data lineage.\n\n\nYou can currently use this client and the API it wraps without any limitations or tokens. This may change in the future.\n\n\n## Features\n\n- Connects to the TraceSQL API.\n- Analyzes SQL code to generate data lineage.\n- Outputs the lineage in JSON format.\n- Generates an SVG image of the lineage.\n\n## Installation\n\nYou can install the `tracesql` package via pip:\n\n```bash\npip install tracesql\n```\n\n## Usage\n\n### Simple example\n\n```python\nfrom tracesql import analyze_lineage\n\ncode = \"\"\"\nCREATE TABLE active_customers AS\nSELECT customer_id, first_name || ' ' || last_name as fullname, email\nFROM customers\nWHERE status = 'active';\n\"\"\"\nresponse = analyze_lineage(code)\n\n# Save the SVG image of the lineage\nwith open(\"image.svg\", \"w\") as fw:\n    fw.write(response.svg)\n\n# Save the lineage data in JSON format\nwith open(\"lineage.json\", \"w\") as fw:\n    fw.write(response.lineage.model_dump_json(indent=2))\n\nprint(\"Lineage successfully saved in files.\")\n```\n\nHere is the output for this example:\n![simple](examples/output/image.svg)\n\n\nOptionally, you can provide a database model, which helps resolve ambiguous queries:\n```python\n# DbModel and DbModelTable come from the tracesql package; import them before use.\n# The columns listed here mirror the query in the simple example above.\ndb_model = DbModel(\n    tables=[\n        DbModelTable(\n            name=\"ACTIVE_CUSTOMERS\",\n            columns=[\"customer_id\", \"fullname\", \"email\"]\n        ),\n        DbModelTable(\n            name=\"CUSTOMERS\",\n            columns=[\"customer_id\", \"first_name\", \"last_name\", \"email\", \"status\"]\n        )\n    ]\n)\nresponse = analyze_lineage(code, db_model=db_model)\n```\n\nWhen submitting a database model, make sure you use the correct case (upper/lower). 
The system is case-sensitive, and mismatches in casing (e.g., \"Customers\" vs \"CUSTOMERS\") may result in processing errors or unexpected behavior.\n\n## `analyze_lineage` method\n\nThis method provides the most basic interface for analyzing lineage. Check the underlying code if you want to build something more capable.\n\n### Parameters\n\n- `query (str)`: The SQL query whose lineage you want to analyze.\n- `db_model (Optional[DbModel])`: The database model containing the tables and columns used in the SQL query.\n\n### Response - `ApiResponse`\n- `svg`: A string containing the SVG image of the lineage.\n- `lineage`: A pydantic object containing the lineage data.\n\n\nEach relation includes an attribute named `source_positions`, which identifies the parts of the input code that are relevant to this relation:\n```json\n\"source_positions\": [\n    {\n        \"start_idx\": 109,\n        \"end_idx\": 118\n    }\n    ...\n]\n```\nThe `start_idx` and `end_idx` values are character indices into the input SQL code. Together, they define a range that pinpoints the section of the code corresponding to the relation. These indices effectively serve as \"pointers\" to the relevant portion of the analyzed SQL query.\n\n\n# Data lineage\n\nLineage traditionally refers to a person’s or group’s ancestry, tracing their origins and heritage through generations. It embodies the historical path that defines their roots and connections.\n\nSimilarly, data lineage traces the lifecycle of data, mapping its origins, transformations, and usage. 
It answers critical questions such as:\n\n- Where did this data originate?\n- How is this data used across systems or processes?\n- What steps were involved in constructing this data?\n\nMore specific examples include:\n\n- Can I safely delete this column?\n- Are there scripts or processes that depend on this table?\n- Who created or modified this data?\n\nIn a world where data is often more valuable than gold, it’s crucial to answer these questions quickly and accurately. This is why data lineage plays a vital role in effective data governance.\n\n## Data lineage in SQL (technical view)\n\nHow is lineage created in SQL? It is as simple as creating a table:\n```sql\nCREATE TABLE NEW_TABLE\nAS SELECT first_name, last_name FROM OLD_TABLE;\n```\nThis query creates lineage from `OLD_TABLE` to `NEW_TABLE`. At the column level, it looks like this:\n```\nOLD_TABLE.first_name -\u003e NEW_TABLE.first_name\nOLD_TABLE.last_name -\u003e NEW_TABLE.last_name\n```\n\nLineage is created whenever data is moved or transformed. In SQL, this typically involves a `SELECT` statement to retrieve data before moving it, making `SELECT` the cornerstone of lineage analysis. Analyzing a `SELECT` statement takes three steps:\n\n1. **Analyze `targets`** – Identify the destination of the `SELECT` statement, usually a single table.\n2. **Analyze `sources`** – Identify the data sources, typically multiple tables referenced in the `FROM` clause.\n3. **Connect `sources` and `targets`** – Establish relationships between sources and their corresponding targets.\n\n### Simple `SELECT`\n\nWhat happens when a `SELECT` statement has no explicit target?\n\n```sql\nSELECT name FROM accounts;\n```\n\nIn most IDEs, this query simply displays the results. 
To model this behavior in lineage, we create an artificial target called `SELECT-RESULT`:\n\n```\naccounts.name -\u003e SELECT-RESULT.name\n```\n\nThis approach keeps the lineage consistent, even without a defined target.\n\n\n### Wildcard\n\n```sql\nSELECT * FROM events;\n```\n\nThis query cannot be analyzed without extra information: the analyzer needs the database model to know which columns the `events` table has. You can supply it either by providing the `CREATE TABLE` statement or by passing the database model in JSON format directly to the API.\n\n\n### Ambiguous Queries\n\nConsider the following query:\n```sql\nSELECT price, name FROM products NATURAL JOIN suppliers;\n```\nAnalyzing lineage in this case is challenging because the query uses neither table aliases nor fully qualified column names. Without these, it is unclear which table each column originates from.\n\nWhile providing the database model to the lineage analyzer can help resolve this ambiguity, the best practice is to use explicit column references and table aliases to avoid confusion:\n\n```sql\nSELECT p.price, s.name FROM products p\nNATURAL JOIN suppliers s\n```\n\nThis approach ensures clearer lineage analysis and reduces the risk of misinterpreting the data's origins.\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftracesql%2Ftracesql-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftracesql%2Ftracesql-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftracesql%2Ftracesql-py/lists"}