{"id":24112453,"url":"https://github.com/surister/datasaurus","last_synced_at":"2025-10-06T09:31:21.374Z","repository":{"id":167888020,"uuid":"637055799","full_name":"surister/datasaurus","owner":"surister","description":"Data Engineering framework written in Python based in Polars.","archived":false,"fork":false,"pushed_at":"2024-05-01T14:46:23.000Z","size":386,"stargazers_count":14,"open_issues_count":16,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-20T21:57:40.633Z","etag":null,"topics":["classes","data","dataframes","datamodeling","framework","library","orm","polars","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/surister.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-06T11:19:17.000Z","updated_at":"2024-12-30T22:28:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"fe9bff22-c761-4bbf-af87-505704a5d5b5","html_url":"https://github.com/surister/datasaurus","commit_stats":null,"previous_names":["surister/datasaurus"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surister%2Fdatasaurus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surister%2Fdatasaurus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surister%2Fdatasaurus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surister%2Fdatasaurus/manifests","owner_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/owners/surister","download_url":"https://codeload.github.com/surister/datasaurus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235515431,"owners_count":19002481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classes","data","dataframes","datamodeling","framework","library","orm","polars","python"],"created_at":"2025-01-11T03:31:49.612Z","updated_at":"2025-10-06T09:31:16.040Z","avatar_url":"https://github.com/surister.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Datasaurus is a Data Engineering framework written in Python (3.8, 3.9, 3.10 and 3.11).\n\nIt is based on Polars and heavily influenced by Django.\n\nDatasaurus offers an opinionated, feature-rich and powerful framework to help you write\ndata pipelines, ETLs or data manipulation programs.\n\n[Documentation]() (TODO)\n## It supports:\n- ✅ Full support for read/write operations.\n- ⭕ Not supported yet, but will be implemented.\n- 💀 Won't be implemented in the near future.\n\n### Storages:\n- SQLite ✅\n- PostgreSQL ✅\n- MySQL ✅\n- MariaDB ✅\n- Local Storage ✅\n- Azure blob storage ⭕\n- AWS S3 ⭕\n\n\n### Formats:\n- CSV ✅\n- JSON ✅\n- PARQUET ✅\n- EXCEL ✅\n- AVRO ✅\n- TSV ⭕\n- SQL ⭕ (such as SQL inserts)\n\n### Features:\n- Delta Tables ⭕\n- Field validations ⭕\n\n## Simple example\n```python\n# settings.py\nimport os\n\nfrom datasaurus.core.storage import PostgresStorage, StorageGroup, SqliteStorage\nfrom datasaurus.core.models import StringColumn, IntegerColumn\n\n# We set the environment that will be 
used.\nos.environ['DATASAURUS_ENVIRONMENT'] = 'dev'\n\nclass ProfilesData(StorageGroup):\n    dev = SqliteStorage(path='/data/data.sqlite')\n    live = PostgresStorage(username='user', password='user', host='localhost', database='postgres')\n\n\n# models.py\nfrom datasaurus.core.models import Model, StringColumn, IntegerColumn\n\nclass ProfileModel(Model):\n    id = IntegerColumn()\n    username = StringColumn()\n    mail = StringColumn()\n    sex = StringColumn()\n\n    class Meta:\n        storage = ProfilesData\n        table_name = 'PROFILE'\n\n```\n\nWe can access the raw Polars dataframe with `Model.df`. It is lazy, meaning the data is only\nloaded when the attribute is accessed.\n\n```py\n\u003e\u003e\u003e ProfileModel.df\nshape: (100, 4)\n┌─────┬────────────────────┬──────────────────────────┬─────┐\n│ id  ┆ username           ┆ mail                     ┆ sex │\n│ --- ┆ ---                ┆ ---                      ┆ --- │\n│ i64 ┆ str                ┆ str                      ┆ str │\n╞═════╪════════════════════╪══════════════════════════╪═════╡\n│ 1   ┆ ehayes             ┆ colleen63@hotmail.com    ┆ F   │\n│ 2   ┆ thompsondeborah    ┆ judyortega@hotmail.com   ┆ F   │\n│ 3   ┆ orivera            ┆ iperkins@hotmail.com     ┆ F   │\n│ 4   ┆ ychase             ┆ sophia92@hotmail.com     ┆ F   │\n│ …   ┆ …                  ┆ …                        ┆ …   │\n│ 97  ┆ mary38             ┆ sylvia80@yahoo.com       ┆ F   │\n│ 98  ┆ charlessteven      ┆ usmith@gmail.com         ┆ F   │\n│ 99  ┆ plee               ┆ powens@hotmail.com       ┆ F   │\n│ 100 ┆ elliottchristopher ┆ wilsonbenjamin@yahoo.com ┆ M   │\n└─────┴────────────────────┴──────────────────────────┴─────┘\n\n```\n\nWe can now create a new model whose data is derived from `ProfileModel`:\n\n```python\nimport polars as pl\n\nclass FemaleProfiles(Model):\n    id = IntegerColumn()\n    profile_id = IntegerColumn()\n    mail = StringColumn()\n\n    def calculate_data(self):\n        return (\n            ProfileModel.df\n      
      .filter(ProfileModel.sex == 'F')\n            .with_row_count('new_id')\n            .with_columns(\n                pl.col('new_id')\n            )\n            .with_columns(\n                pl.col('id').alias('profile_id')\n            )\n        )\n\n    class Meta:\n        recalculate = 'if_no_data_in_storage'\n        storage = ProfilesData\n        table_name = 'PROFILE_FEMALES'\n```\nEt voilà! The columns are selected automatically from the column definitions (id, profile_id and mail).\n\nIf we now call:\n```python\nFemaleProfiles.df\n```\n\nDatasaurus checks whether the dataframe exists in the storage. If it does not, the data is calculated\nfrom `calculate_data` and saved to the storage. This behavior is controlled by the `recalculate`\noption in `Meta`, which can also be set to 'always'.\n\n\nYou can also move data to different environments or storages, making it easy to change formats or\nmove data around:\n\n```python\nFemaleProfiles.save(to=ProfilesData.live)\n```\n\nThis effectively moves the data from SQLite (dev) to PostgreSQL (live).\n\n```python\n# You can also change formats\nFemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.JSON)\nFemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.CSV)\nFemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.PARQUET)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsurister%2Fdatasaurus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsurister%2Fdatasaurus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsurister%2Fdatasaurus/lists"}