{"id":15063984,"url":"https://github.com/contextlab/data-wrangler","last_synced_at":"2025-04-10T11:26:39.997Z","repository":{"id":48236258,"uuid":"382869106","full_name":"ContextLab/data-wrangler","owner":"ContextLab","description":"Wrangle messy numerical, image, and text data into consistent well-organized formats","archived":false,"fork":false,"pushed_at":"2022-07-25T21:00:26.000Z","size":1321,"stargazers_count":10,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T10:12:21.363Z","etag":null,"topics":["data","data-analysis","data-science","data-wrangling","hugging-face","image-data","machine-learning","nlp","numpy","pandas","python","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ContextLab.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null}},"created_at":"2021-07-04T14:15:02.000Z","updated_at":"2024-12-27T15:43:24.000Z","dependencies_parsed_at":"2022-08-24T18:30:53.971Z","dependency_job_id":null,"html_url":"https://github.com/ContextLab/data-wrangler","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ContextLab%2Fdata-wrangler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ContextLab%2Fdata-wrangler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ContextLab%2Fdata-wrangler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ContextLab%2Fdata-wrangler/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ContextLab","download_url":"https://codeload.github.com/ContextLab/data-wrangler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248208634,"owners_count":21065203,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-analysis","data-science","data-wrangling","hugging-face","image-data","machine-learning","nlp","numpy","pandas","python","scikit-learn"],"created_at":"2024-09-25T00:09:44.434Z","updated_at":"2025-04-10T11:26:39.980Z","avatar_url":"https://github.com/ContextLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Overview\n================\n\n|build-status|  |docs|  |doi|\n\nDatasets come in all shapes and sizes, and are often *messy*:\n\n  - Observations come in different formats\n  - There are missing values\n  - Labels are missing and/or aren't consistent\n  - Datasets need to be wrangled 🐄 🐑 🚜\n\nThe main goal of ``data-wrangler`` is to turn messy data into clean(er) data, defined as either a ``DataFrame`` or a\nlist of ``DataFrame`` objects.  The package provides code for easily wrangling data from a variety of formats into\n``DataFrame`` objects, manipulating ``DataFrame`` objects in useful ways (that can be tricky to implement, but that\napply to many analysis scenarios), and decorating Python functions to make them more flexible and/or easier to write.\n\nThe ``data-wrangler`` package supports a variety of datatypes.  
There is a special emphasis on text data, whereby\n``data-wrangler`` provides a simple API for interacting with natural language processing tools and datasets provided by\n``scikit-learn``, ``hugging-face``, and ``flair``.  The package is designed to provide sensible defaults, but also\nimplements convenient ways of deeply customizing how different datatypes are wrangled.\n\nFor more information, including a formal API and tutorials, check out https://data-wrangler.readthedocs.io\n\nQuick start\n================\n\nInstall datawrangler using:\n\n.. code-block:: console\n\n    $ pip install pydata-wrangler\n\nSome quick natural language processing examples::\n\n    import datawrangler as dw\n\n    # load in sample text\n    text_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/home_on_the_range.txt'\n    text = dw.io.load(text_url)\n\n    # embed text using scikit-learn's implementation of Latent Dirichlet Allocation, trained on a curated subset of\n    # Wikipedia, called the 'minipedia' corpus.  Return the fitted model so that it can be applied to new text.\n    lda = {'model': ['CountVectorizer', 'LatentDirichletAllocation'], 'args': [], 'kwargs': {}}\n    lda_embeddings, lda_fit = dw.wrangle(text, text_kwargs={'model': lda, 'corpus': 'minipedia'}, return_model=True)\n\n    # apply the minipedia-trained LDA model to new text\n    new_text = 'how much wood could a wood chuck chuck if a wood chuck could check wood?'\n    new_embeddings = dw.wrangle(new_text, text_kwargs={'model': lda_fit})\n\n    # embed text using hugging-face's pre-trained GPT2 model\n    gpt2 = {'model': 'TransformerDocumentEmbeddings', 'args': ['gpt2'], 'kwargs': {}}\n    gpt2_embeddings = dw.wrangle(text, text_kwargs={'model': gpt2})\n\nThe ``data-wrangler`` package also provides powerful decorators that can modify existing functions to support new\ndatatypes.  
Just write your function as though its inputs are guaranteed to be Pandas DataFrames, and decorate it with\n``datawrangler.decorate.funnel`` to enable support for other datatypes without any new code::\n\n  import numpy as np\n  import datawrangler as dw\n\n  image_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/wrangler.jpg'\n  image = dw.io.load(image_url)\n\n  # define your function and decorate it with \"funnel\"\n  @dw.decorate.funnel\n  def binarize(x):\n    return x \u003e np.mean(x.values)\n\n  binarized_image = binarize(image)  # rgb channels will be horizontally concatenated to create a 2D DataFrame\n\n\nSupported data formats\n----------------------\n\nOne package can't accommodate every foreseeable format or input source, but ``data-wrangler`` provides a framework for adding support for new datatypes in a straightforward way.  Essentially, adding support for a new datatype entails writing two functions:\n\n  - An ``is_\u003cdatatype\u003e`` function, which should return ``True`` if an object is compatible with the given datatype (or format), and ``False`` otherwise\n  - A ``wrangle_\u003cdatatype\u003e`` function, which should take in an object of the given type or format and return a ``pandas`` ``DataFrame`` with numerical entries\n\nCurrently supported datatypes are limited to:\n\n  - ``array``-like objects (including images)\n  - ``DataFrame``-like or ``Series``-like objects\n  - text data (text is embedded using natural language processing models)\n\nor lists of mixtures of the above.\n\nMissing observations (e.g., NaNs, empty strings, etc.) may be filled in using imputation and/or interpolation.\n\n.. |build-status| image:: https://github.com/ContextLab/data-wrangler/actions/workflows/ci.yaml/badge.svg\n    :alt: build status\n    :target: https://github.com/ContextLab/data-wrangler\n\n.. |docs| image:: https://readthedocs.org/projects/data-wrangler/badge/\n    :alt: docs status\n    :target: https://data-wrangler.readthedocs.io/\n\n.. 
|doi| image:: https://zenodo.org/badge/DOI/10.5281/zenodo.5123310.svg\n   :target: https://doi.org/10.5281/zenodo.5123310\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcontextlab%2Fdata-wrangler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcontextlab%2Fdata-wrangler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcontextlab%2Fdata-wrangler/lists"}