{"id":15009492,"url":"https://github.com/martinthoma/edapy","last_synced_at":"2025-04-09T17:25:01.006Z","repository":{"id":27218054,"uuid":"112925601","full_name":"MartinThoma/edapy","owner":"MartinThoma","description":"Exploratory Data Analysis with Python","archived":false,"fork":false,"pushed_at":"2023-02-08T02:39:52.000Z","size":279,"stargazers_count":22,"open_issues_count":16,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-23T19:38:55.952Z","etag":null,"topics":["csv","data-analysis","data-analytics","data-science","eda","exploratory-data-analysis","pandas","pdf","python","python-3","python-3-5"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MartinThoma.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-03T12:22:53.000Z","updated_at":"2024-08-14T12:04:03.000Z","dependencies_parsed_at":"2024-10-12T09:21:59.053Z","dependency_job_id":null,"html_url":"https://github.com/MartinThoma/edapy","commit_stats":{"total_commits":60,"total_committers":2,"mean_commits":30.0,"dds":"0.050000000000000044","last_synced_commit":"782e3c1bb799f4deecadf9a91ea997f9a03aee0e"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinThoma%2Fedapy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinThoma%2Fedapy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinThoma%2Fedapy/releases
","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinThoma%2Fedapy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MartinThoma","download_url":"https://codeload.github.com/MartinThoma/edapy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248075765,"owners_count":21043646,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","data-analysis","data-analytics","data-science","eda","exploratory-data-analysis","pandas","pdf","python","python-3","python-3-5"],"created_at":"2024-09-24T19:25:39.582Z","updated_at":"2025-04-09T17:25:00.973Z","avatar_url":"https://github.com/MartinThoma.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI version](https://badge.fury.io/py/edapy.svg)](https://badge.fury.io/py/edapy)\n[![Python Support](https://img.shields.io/pypi/pyversions/edapy.svg)](https://pypi.org/project/edapy/)\n[![Build Status](https://travis-ci.org/MartinThoma/edapy.svg?branch=master)](https://travis-ci.org/MartinThoma/edapy)\n[![Coverage Status](https://coveralls.io/repos/github/MartinThoma/edapy/badge.svg?branch=master)](https://coveralls.io/github/MartinThoma/edapy?branch=master)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n![GitHub last commit](https://img.shields.io/github/last-commit/MartinThoma/edapy)\n![GitHub commits since latest release (by 
SemVer)](https://img.shields.io/github/commits-since/MartinThoma/edapy/0.3.0)\n[![CodeFactor](https://www.codefactor.io/repository/github/martinthoma/edapy/badge/master)](https://www.codefactor.io/repository/github/martinthoma/edapy/overview/master)\n\nedapy is a first resource to analyze a new dataset.\n\n## Installation\n\n```\n$ pip install git+https://github.com/MartinThoma/edapy.git\n```\n\nFor the pdf part, you also need `pdftotext`:\n\n```\n$ sudo apt-get install poppler-utils\n```\n\n\n## Usage\n\n```\n$ edapy --help\nUsage: edapy [OPTIONS] COMMAND [ARGS]...\n\n  edapy is a tool for exploratory data analysis with Python.\n\n  You can use it to get a first idea what a CSV is about or to get an\n  overview over a directory of PDF files.\n\nOptions:\n  --version  Show the version and exit.\n  --help     Show this message and exit.\n\nCommands:\n  csv     Analyze CSV files.\n  images  Analyze image files.\n  pdf     Analyze PDF files.\n```\n\nThe workflow is as follows:\n\n* `edapy pdf find --path . --output results.csv` creates a `results.csv`\n  for you. This `results.csv` contains metadata about all PDF files in the\n  `path` directory.\n* `edapy csv predict --csv_path my-new.csv --types types.yaml` will start /\n  resume a process in which the user is led through a series of questions. 
In\n  those questions, the user has to decide which delimiter, quotechar is used\n  and which types the columns have.\n* `edapy` generates a `types.yaml` file which can be used to load the CSV in\n  other applications with `df = edapy.load_csv(csv_path, yaml_path)`.\n\n\n## Example types.yaml\n\nFor the [Titanic Dataset](https://www.kaggle.com/c/titanic/data), the resulting\n`types.yaml` looks as follows:\n\n```\ncolumns:\n- dtype: other\n  name: Name\n- dtype: int\n  name: Parch\n- dtype: float\n  name: Age\n- dtype: other\n  name: Ticket\n- dtype: float\n  name: Fare\n- dtype: int\n  name: PassengerId\n- dtype: other\n  name: Cabin\n- dtype: other\n  name: Embarked\n- dtype: int\n  name: Pclass\n- dtype: int\n  name: Survived\n- dtype: other\n  name: Sex\n- dtype: int\n  name: SibSp\ncsv_meta:\n  delimiter: ','\n```\n\nA sample run then would look like this:\n\n```\n$ edapy csv predict --types types_titanik.yaml --csv_path train.csv\nNumber of datapoints: 891\n2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'\n2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'\n2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'\n2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'\n2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'\n2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). 
You might want to make it a 'category'\n\n## Integer Columns\nColumn name: Non-nan  mean   std   min   25%   50%   75%   max\nPassengerId:     891  446.00  257.35     1   224   446   668   891\nSurvived   :     891  0.38  0.49     0     0     0     1     1\nPclass     :     891  2.31  0.84     1     2     3     3     3\nSibSp      :     891  0.52  1.10     0     0     0     1     8\nParch      :     891  0.38  0.81     0     0     0     0     6\n\n## Float Columns\nColumn name: Non-nan   mean    std    min    25%    50%    75%    max\nAge        :     714  29.70  14.53   0.42  20.12  28.00  38.00  80.00\nFare       :     891  32.20  49.69   0.00   7.91  14.45  31.00  512.33\n\n## Other Columns\nColumn name: Non-nan   unique   top (count)\nName       :     891      891   Goldschmidt, Mr. George B (1)\nSex        :     891        2   male (577)\nTicket     :     891      681   347082 (7)\nCabin      :     204      148   C23 C25 C27 (4)\nEmbarked   :     889        4   S (644)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartinthoma%2Fedapy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmartinthoma%2Fedapy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartinthoma%2Fedapy/lists"}