{"id":21964473,"url":"https://github.com/ul-mds/gecko","last_synced_at":"2025-04-24T02:23:22.757Z","repository":{"id":227424428,"uuid":"770932669","full_name":"ul-mds/gecko","owner":"ul-mds","description":"Python library for the generation and mutation of realistic personal identification data at scale","archived":false,"fork":false,"pushed_at":"2025-01-30T16:22:02.000Z","size":5779,"stargazers_count":6,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-04-12T18:09:59.799Z","etag":null,"topics":["data-science","numpy","pandas","python","record-linkage"],"latest_commit_sha":null,"homepage":"https://ul-mds.github.io/gecko/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ul-mds.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-12T12:12:25.000Z","updated_at":"2025-03-28T12:22:49.000Z","dependencies_parsed_at":"2024-03-28T10:51:02.656Z","dependency_job_id":"c2fa6d2c-0105-4093-b185-00337397982b","html_url":"https://github.com/ul-mds/gecko","commit_stats":null,"previous_names":["ul-mds/gecko"],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ul-mds%2Fgecko","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ul-mds%2Fgecko/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ul-mds%2Fgecko/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ul-mds%2Fgecko/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ul-mds","download_url":"https://codeload.github.com/ul-mds/gecko/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250546546,"owners_count":21448355,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","numpy","pandas","python","record-linkage"],"created_at":"2024-11-29T12:23:03.023Z","updated_at":"2025-04-24T02:23:22.751Z","avatar_url":"https://github.com/ul-mds.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Gecko is a Python library for the bulk generation and mutation of realistic personal data.\nIt is a spiritual successor to the GeCo framework which was initially published by Tran, Vatsalan and Christen.\nGecko reimplements the most promising aspects of the original framework for modern Python with a simplified API, adds\nextra features and massively improves performance thanks to NumPy and Pandas.\n\n# Installation\n\nInstall with pip:\n\n```bash\npip install gecko-syndata\n```\n\nInstall with [Poetry](https://python-poetry.org/):\n\n```bash\npoetry add gecko-syndata\n```\n\n# Basic usage\n\n[Please see the docs for an in-depth guide on how to use the library.](https://ul-mds.github.io/gecko/)\n\nWriting a data generation script with Gecko is usually split into two consecutive steps.\nIn the first step, data is generated based on information that you provide.\nMost commonly, Gecko pulls the information it needs from frequency tables, although other means of generating data\nare 
possible.\nGecko will then output a dataset to your specifications.\n\nIn the second step, a copy of this dataset is mutated.\nGecko provides functions which deliberately introduce errors into your dataset.\nThese errors can take the shape of typos, edit errors and other common error sources.\nBy the end, you will have a generated dataset and a mutated copy thereof.\n\n![Common workflow with Gecko](https://ul-mds.github.io/gecko/img/gecko-workflow.png)\n\nGecko exposes two modules, `generator` and `mutator`, to help you write data generation scripts.\nBoth contain built-in functions covering the most common use cases for generating data from frequency information and\nmutating data based on common error sources, such as typos, OCR errors and much more.\n\nThe following example gives a very brief overview of what a data generation script with Gecko might look like.\nIt uses frequency tables from the [Gecko data repository](https://github.com/ul-mds/gecko-data) which has been cloned\ninto a directory next to the script itself.\n\n```python\nfrom pathlib import Path\n\nimport numpy as np\n\nfrom gecko import generator, mutator\n\n# create a RNG with a set seed for reproducible results\nrng = np.random.default_rng(727)\n# path to the Gecko data repository\ngecko_data_dir = Path(\"gecko-data\")\n\n# create a data frame with 10,000 rows and a single column called \"last_name\" \n# which sources its values from the frequency table with the same name\ndf_generated = generator.to_data_frame(\n    [\n        (\"last_name\", generator.from_frequency_table(\n            gecko_data_dir / \"de_DE\" / \"last-name.csv\",\n            value_column=\"last_name\",\n            freq_column=\"count\",\n            rng=rng,\n        )),\n    ],\n    10_000,\n)\n\n# mutate this data frame by randomly deleting characters in 1% of all rows\ndf_mutated = mutator.mutate_data_frame(\n    df_generated,\n    [\n        (\"last_name\", (.01, mutator.with_delete(rng))),\n    ],\n)\n\n# export both 
data frames using Pandas' to_csv function\ndf_generated.to_csv(\"german-generated.csv\", index_label=\"id\")\ndf_mutated.to_csv(\"german-mutated.csv\", index_label=\"id\")\n```\n\nFor a more extensive usage guide, [refer to the docs](https://ul-mds.github.io/gecko/).\n\n# Rationale\n\nThe GeCo framework was originally conceived to facilitate the generation and mutation of personal data to validate\nrecord linkage algorithms.\nIn the field of record linkage, real-world personal data to test new algorithms on is hard to come by.\nHence, GeCo went for a synthetic approach using statistical models from publicly available data.\nGeCo was built for Python 2.7 and has not seen any active development since its last publication in 2013.\nThe general idea of providing shareable and reproducible Python scripts to generate personal data, however, still holds a\nlot of promise.\nThis has led to the development of the Gecko library.\n\nA lot of GeCo's weaknesses were rectified with this library.\nVectorized functions from Pandas and NumPy provide significant performance boosts and aid integration into existing\ndata science applications.\nA simplified API makes developing custom generators and mutators much easier.\nUsing NumPy's random number generation routines instead of Python's built-in `random` module makes fine-tuned reproducible\nresults a breeze.\nGecko therefore seeks to be GeCo's \"bigger brother\" and aims to provide a much more refined experience to generate\nrealistic personal data.\n\n# Disclaimer\n\nGecko is still very much in a \"beta\" state.\nAs it stands, it satisfies our internal use cases within the Medical Data Science group, but we also seek wider\nadoption.\nIf you find any issues with the library or have suggestions for improvement, do not hesitate to contact us.\n\n# Citing Gecko\n\nIf you find Gecko useful, we highly appreciate proper citations of our work in your own publications.\nGitHub supports the [Citation File Format 
(CFF)](https://citation-file-format.github.io/) and can parse the \ncorresponding file contained within this project. \nSimply click \"Cite this repository\" on this project's GitHub page.\n[We also provide extensive information on how to cite Gecko in our documentation](https://ul-mds.github.io/gecko/citing-gecko/), \nas well as links to all of our original publications and presentations. \n\n# License\n\nGecko is released under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ful-mds%2Fgecko","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ful-mds%2Fgecko","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ful-mds%2Fgecko/lists"}