{"id":19488820,"url":"https://github.com/mrpowers-io/falsa","last_synced_at":"2025-04-25T18:32:50.951Z","repository":{"id":251275153,"uuid":"835589907","full_name":"mrpowers-io/falsa","owner":"mrpowers-io","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-24T16:32:29.000Z","size":570,"stargazers_count":6,"open_issues_count":12,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-04-04T02:11:30.332Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrpowers-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T06:31:02.000Z","updated_at":"2025-02-24T16:32:04.000Z","dependencies_parsed_at":"2025-01-26T15:23:05.980Z","dependency_job_id":"245c030c-53c2-4146-a958-8c9d32cf4fc5","html_url":"https://github.com/mrpowers-io/falsa","commit_stats":null,"previous_names":["mrpowers-io/falsa"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ffalsa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ffalsa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ffalsa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ffalsa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrpowers-io","download_url":"https://codeload.github.com/mrpowers-io/falsa/tar.gz/refs/heads/main"
,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250872413,"owners_count":21500814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T21:05:22.513Z","updated_at":"2025-04-25T18:32:50.606Z","avatar_url":"https://github.com/mrpowers-io.png","language":"Python","readme":"# falsa\n\nfalsa makes it easy to generate sample datasets.\n\nFor example, here is how to generate a Parquet file with 100 million rows and 9 columns of data:\n\n```\nfalsa groupby --path-prefix=~/data --size MEDIUM\n```\n\n![falsa example](https://github.com/mrpowers-io/falsa/blob/main/images/falsa_example.png)\n\nHere are the first three rows of data in the file:\n\n```\n┌───────┬──────────┬──────────────┬─────┬─────┬────────┬─────┬─────┬───────────┐\n│ id1   ┆ id2      ┆ id3          ┆ id4 ┆ id5 ┆ id6    ┆ v1  ┆ v2  ┆ v3        │\n│ ---   ┆ ---      ┆ ---          ┆ --- ┆ --- ┆ ---    ┆ --- ┆ --- ┆ ---       │\n│ str   ┆ str      ┆ str          ┆ i64 ┆ i64 ┆ i64    ┆ i64 ┆ i64 ┆ f64       │\n╞═══════╪══════════╪══════════════╪═════╪═════╪════════╪═════╪═════╪═══════════╡\n│ id038 ┆ id850817 ┆ id0000837021 ┆ 90  ┆ 8   ┆ 898164 ┆ 4   ┆ 15  ┆ 28.133477 │\n│ id095 ┆ id73309  ┆ id0000312443 ┆ 3   ┆ 75  ┆ 177193 ┆ 1   ┆ 12  ┆ 91.555302 │\n│ id055 ┆ id248099 ┆ id0000141631 ┆ 12  ┆ 94  ┆ 132406 ┆ 1   ┆ 3   ┆ 64.543029 │\n└───────┴──────────┴──────────────┴─────┴─────┴────────┴─────┴─────┴───────────┘\n```\n\nWith falsa, you can generate many sample datasets.\n\n## Installation\n\n### Pip install\n\nIn a virtualenv with Python 3.9+:\n\n```sh\npip install 
git+https://github.com/mrpowers-io/falsa.git@main\nfalsa --help\n```\n\n### Maturin build\n\nIn a virtualenv with Python 3.9+:\n\n```sh\nmaturin develop --release\nfalsa --help\n```\n\n## h2o datasets\n\nThe h2o datasets are used to benchmark query engines on a single machine, [see here](https://duckdblabs.github.io/db-benchmark/).\n\nHere are [the original R Scripts](https://github.com/duckdblabs/db-benchmark/tree/main/_data) to generate the sample datasets.  These still work if you know how to run R (the large dataset generation can error out if your machine doesn't have sufficient memory).\n\nfalsa is good if you want to generate these datasets with a Python interface or if you are facing memory issues with the R scripts.\n\n### h2o groupby dataset\n\nThe h2o groupby dataset has 9 columns and 10 million/100 million/1 billion rows of data.\n\nHere are three representative rows of data:\n\n```\n┌───────┬──────────┬──────────────┬─────┬─────┬────────┬─────┬─────┬───────────┐\n│ id1   ┆ id2      ┆ id3          ┆ id4 ┆ id5 ┆ id6    ┆ v1  ┆ v2  ┆ v3        │\n│ ---   ┆ ---      ┆ ---          ┆ --- ┆ --- ┆ ---    ┆ --- ┆ --- ┆ ---       │\n│ str   ┆ str      ┆ str          ┆ i64 ┆ i64 ┆ i64    ┆ i64 ┆ i64 ┆ f64       │\n╞═══════╪══════════╪══════════════╪═════╪═════╪════════╪═════╪═════╪═══════════╡\n│ id038 ┆ id850817 ┆ id0000837021 ┆ 90  ┆ 8   ┆ 898164 ┆ 4   ┆ 15  ┆ 28.133477 │\n│ id095 ┆ id73309  ┆ id0000312443 ┆ 3   ┆ 75  ┆ 177193 ┆ 1   ┆ 12  ┆ 91.555302 │\n│ id055 ┆ id248099 ┆ id0000141631 ┆ 12  ┆ 94  ┆ 132406 ┆ 1   ┆ 3   ┆ 64.543029 │\n└───────┴──────────┴──────────────┴─────┴─────┴────────┴─────┴─────┴───────────┘\n```\n\nHere's a short description of the columns:\n\n* id1: 100 distinct values between id001 and id100\n* id2: 100 distinct values between id001 and id100\n* id3: 1_000_000 distinct values\n* id4: random integer values between 1 and 100\n* id5: random integer values between zero and 100\n* id6: random integer values between 1 and 1_000_000\n* v1: 
integer values between 1 and 5\n* v2: integer values between 1 and 15\n* v3: floating-point values between zero and 100\n\nHere's the detailed description of the table:\n\n```\n┌────────────┬───────────┬───────────┬──────────────┬───────────┬───┬───────────────┬──────────┬───────────┬───────────┐\n│ statistic  ┆ id1       ┆ id2       ┆ id3          ┆ id4       ┆ … ┆ id6           ┆ v1       ┆ v2        ┆ v3        │\n│ ---        ┆ ---       ┆ ---       ┆ ---          ┆ ---       ┆   ┆ ---           ┆ ---      ┆ ---       ┆ ---       │\n│ str        ┆ str       ┆ str       ┆ str          ┆ f64       ┆   ┆ f64           ┆ f64      ┆ f64       ┆ f64       │\n╞════════════╪═══════════╪═══════════╪══════════════╪═══════════╪═══╪═══════════════╪══════════╪═══════════╪═══════════╡\n│ count      ┆ 100000000 ┆ 100000000 ┆ 100000000    ┆ 1e8       ┆ … ┆ 1e8           ┆ 1e8      ┆ 1e8       ┆ 1e8       │\n│ null_count ┆ 0         ┆ 0         ┆ 0            ┆ 0.0       ┆ … ┆ 0.0           ┆ 0.0      ┆ 0.0       ┆ 0.0       │\n│ mean       ┆ null      ┆ null      ┆ null         ┆ 50.500471 ┆ … ┆ 499977.133559 ┆ 3.000173 ┆ 8.0002679 ┆ 50.000731 │\n│ std        ┆ null      ┆ null      ┆ null         ┆ 28.864911 ┆ … ┆ 288668.423121 ┆ 1.414225 ┆ 4.320694  ┆ 28.868118 │\n│ min        ┆ id001     ┆ id001     ┆ id0000000001 ┆ 1.0       ┆ … ┆ 1.0           ┆ 1.0      ┆ 1.0       ┆ 0.000002  │\n│ 25%        ┆ null      ┆ null      ┆ null         ┆ 26.0      ┆ … ┆ 249956.0      ┆ 2.0      ┆ 4.0       ┆ 24.999205 │\n│ 50%        ┆ null      ┆ null      ┆ null         ┆ 51.0      ┆ … ┆ 499949.0      ┆ 3.0      ┆ 8.0       ┆ 50.002307 │\n│ 75%        ┆ null      ┆ null      ┆ null         ┆ 75.0      ┆ … ┆ 749987.0      ┆ 4.0      ┆ 12.0      ┆ 75.002693 │\n│ max        ┆ id100     ┆ id999999  ┆ id0001000000 ┆ 100.0     ┆ … ┆ 1e6           ┆ 5.0      ┆ 15.0      ┆ 100.0     
│\n└────────────┴───────────┴───────────┴──────────────┴───────────┴───┴───────────────┴──────────┴───────────┴───────────┘\n```\n\nThe h2o groupby dataset is useful for group by benchmarks.  For example, you can use id1 to do an aggregation on a low-cardinality column and id3 to do an aggregation on a high-cardinality column.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers-io%2Ffalsa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrpowers-io%2Ffalsa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers-io%2Ffalsa/lists"}