{"id":17981135,"url":"https://github.com/starlake-ai/starlake","last_synced_at":"2025-04-05T22:03:38.080Z","repository":{"id":37015747,"uuid":"407764022","full_name":"starlake-ai/starlake","owner":"starlake-ai","description":"Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.","archived":false,"fork":false,"pushed_at":"2025-03-28T22:09:59.000Z","size":178001,"stargazers_count":84,"open_issues_count":42,"forks_count":23,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-29T21:02:47.179Z","etag":null,"topics":["bigquery","data-engineering","data-integration","data-pipeline","etl","hdfs","redshift","snowflake","spark","synapse"],"latest_commit_sha":null,"homepage":"http://starlake.ai/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/starlake-ai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-18T05:20:26.000Z","updated_at":"2025-03-28T22:10:03.000Z","dependencies_parsed_at":"2023-10-10T22:13:55.977Z","dependency_job_id":"bdcb1415-c67f-42be-8f5c-b85afae1756b","html_url":"https://github.com/starlake-ai/starlake","commit_stats":null,"previous_names":[],"tags_count":66,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starlake-ai%2Fstarlake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starlake-ai%2Fstarlake/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/starlake-ai%2Fstarlake/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/starlake-ai%2Fstarlake/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/starlake-ai","download_url":"https://codeload.github.com/starlake-ai/starlake/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247406084,"owners_count":20933803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","data-engineering","data-integration","data-pipeline","etl","hdfs","redshift","snowflake","spark","synapse"],"created_at":"2024-10-29T18:07:51.482Z","updated_at":"2025-04-05T22:03:38.046Z","avatar_url":"https://github.com/starlake-ai.png","language":"Scala","funding_links":[],"categories":["大数据"],"sub_categories":[],"readme":"![Build Status](https://github.com/starlake-ai/starlake/workflows/Build/badge.svg)\n[![Maven Central Starlake Spark 3](https://maven-badges.herokuapp.com/maven-central/ai.starlake/starlake-core_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/ai.starlake/starlake-core_2.12)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n# Sister projects\n\n- [Starlake Docker](https://github.com/starlake-ai/starlake-docker): Run Starlake in a docker container\n- [Starlake JSQLTranspiler](https://github.com/starlake-ai/jsqltranspiler): Starlake powerful toolbox, also available at [labs.starlake.ai](https://labs.starlake.ai)\n\n\n\n\u003cimg src=\"docs/static/img/intent.png\" 
/\u003e\n\nStarlake is a declarative text-based tool that enables analysts and engineers to extract, load, transform and orchestrate their data pipelines.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/static/img/starlake-draw.png\" /\u003e\n\u003c/p\u003e\n\n\nStarlake is a configuration-only Extract, Load, Transform and Orchestration Declarative Data Pipeline Tool.\nThe workflow below is a typical use case:\n* **Extract** your data as a set of Fixed Position, DSV (Delimiter-separated values), JSON or XML files\n* Define or infer table schemas from text files (csv, json, xml, fixed-width ...)\n* **Load**: Define transformations at load time using YAML and start **loading** files into your data warehouse.\n* **Transform**: Build aggregates using regular SQL SELECT statements and let Starlake build your tables with respect to your selected strategy (Append, Overwrite, Merge ...).\n* **Orchestrate**: Let Starlake handle your data lineage and run your data pipelines on your favorite orchestrator (Airflow, Dagster ... 
).\n\nYou may use Starlake for Extract, Load and Transform steps or any combination of these steps.\n\n# How it works\n\nThe advent of declarative programming, exemplified by tools like Ansible and Terraform,\nhas revolutionized infrastructure deployment by allowing developers to express intended goals without specifying the order of code execution.\nThis paradigm shift brings benefits such as fewer error-prone coding tasks, significantly shortened development cycles,\nenhanced code readability, and increased accessibility for developers of all levels.\n\nStarlake is a YAML-based declarative tool designed for expressing Extract, Load, Transform, and Orchestration tasks.\nDrawing inspiration from the successes of declarative programming in infrastructure,\nStarlake aims to bring similar advantages to the realm of data engineering.\n\nThis paradigm shift encourages a focus on defining goals for data warehouses,\nrather than the intricacies of implementation details.\n\n\nThe YAML DSL is self-explanatory and easy to understand. 
This is best explained with an example:\n\n## Extract\n\nLet's say we want to extract data from a Postgres Server database on a daily basis:\n```yaml\nextract:\n  connectionRef: \"pg-adventure-works-db\" # or mssql-adventure-works-db if extracting from SQL Server\n  jdbcSchemas:\n    - schema: \"sales\"\n      tables:\n        - name: \"salesorderdetail\"              # table name or simply \"*\" to extract all tables\n          partitionColumn: \"salesorderdetailid\" # (optional)  you may parallelize the extraction based on this field\n          fetchSize: 100                        # (optional)  the number of rows to fetch at a time\n          timestamp: salesdatetime              # (optional) the timestamp field to use for incremental extraction\n      tableTypes:\n        - \"TABLE\"\n        #- \"VIEW\"\n        #- \"SYSTEM TABLE\"\n        #- \"GLOBAL TEMPORARY\"\n        #- \"LOCAL TEMPORARY\"\n        #- \"ALIAS\"\n        #- \"SYNONYM\"\n```\n\nThat's it, we have defined our extraction pipeline.\n\n## Load\n\nLet's say we want to load the data extracted from the previous example into a data warehouse:\n\n```yaml\n---\ntable:\n  pattern: \"salesorderdetail.*.psv\" # This property is a regular expression that will be used to match the file name.\n  schedule: \"when_available\"        # (optional) cron expression to schedule the loading\n  metadata:\n    mode: \"FILE\"\n    format: \"CSV\"       # (optional) auto-detected if not specified\n    encoding: \"UTF-8\"\n    withHeader: yes     # (optional) auto-detected if not specified\n    separator: \"|\"      # (optional) auto-detected if not specified\n    writeStrategy:\n      type: \"UPSERT_BY_KEY_AND_TIMESTAMP\"\n      timestamp: signup\n      key: [id]\n                        # Please replace it by the adequate file pattern e.g. 
customers-.*.psv if required\n  attributes:           # Description of the fields to recognize\n    - name: \"id\"        # attribute name and column name in the destination table if no rename attribute is defined\n      type: \"string\"    # expected type\n      required: false   # Is this field required in the source (false by default, change it accordingly)?\n      privacy: \"NONE\"   # Should we encrypt this field before loading to the warehouse (no encryption by default)?\n      ignore: false     # Should this field be excluded (false by default)?\n    - name: \"signup\"    # second attribute\n      type: \"timestamp\" # auto-detected if not specified\n    - name: \"contact\"\n      type: \"string\"\n      ...\n```\n\nThat's it, we have defined our loading pipeline.\n\n\n## Transform\n\nLet's say we want to build aggregates from the previously loaded data:\n\n```yaml\n\ntransform:\n  default:\n    writeStrategy:\n      type: \"OVERWRITE\"\n  tasks:\n    - name: most_profitable_products\n      writeStrategy:\n        type: \"UPSERT_BY_KEY_AND_TIMESTAMP\"\n        timestamp: signup\n        key: [id]\n```\n```sql\nSELECT          # the SQL query will be translated into the appropriate MERGE INTO or INSERT OVERWRITE statement\n    productid,\n    SUM(unitprice * orderqty) AS total_revenue\nFROM salesorderdetail\nGROUP BY productid\nORDER BY total_revenue DESC\n```\n\nStarlake will automatically apply the right merge strategy (INSERT OVERWRITE or MERGE INTO) based on the `writeStrategy` property and the input/output tables.\n\n## Orchestrate\n\nStarlake will take care of generating the corresponding DAG (Directed Acyclic Graph) and will run it\nwhenever the tables referenced in the SQL query are updated.\n\nStarlake comes with a set of DAG templates that can be used to orchestrate your data pipelines on your favorite orchestrator (Airflow, Dagster, ...).\nSimply reference them in your YAML files and optionally customize them to your needs.\n\n\nThe following 
dependencies are extracted from your SQL query and used to generate the corresponding DAG:\n![](docs/static/img/transform-viz.svg)\n\n\nThe resulting DAG is shown below:\n\n![](docs/static/img/transform-dags.png)\n\n# Supported platforms\n\nThe Load \u0026 Transform steps support multiple configurations for inputs and outputs.\n\n![Anywhere](docs/static/img/data-star.png \"Anywhere\")\n\n\n# Documentation\nComplete documentation available [here](https://docs.starlake.ai/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstarlake-ai%2Fstarlake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstarlake-ai%2Fstarlake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstarlake-ai%2Fstarlake/lists"}