{"id":13468158,"url":"https://github.com/alan-turing-institute/CleverCSV","last_synced_at":"2025-03-26T05:30:56.403Z","repository":{"id":34835313,"uuid":"168330293","full_name":"alan-turing-institute/CleverCSV","owner":"alan-turing-institute","description":"CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.","archived":false,"fork":false,"pushed_at":"2024-10-28T18:39:55.000Z","size":3746,"stargazers_count":1255,"open_issues_count":14,"forks_count":73,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-10-29T14:56:34.623Z","etag":null,"topics":["csv","csv-converter","csv-export","csv-files","csv-format","csv-import","csv-parser","csv-parsing","csv-reader","csv-reading","data-analysis","data-mining","data-science","datascience","machine-learning","python","python-library","python3"],"latest_commit_sha":null,"homepage":"https://clevercsv.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alan-turing-institute.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-30T11:15:17.000Z","updated_at":"2024-10-23T20:02:00.000Z","dependencies_parsed_at":"2024-01-13T15:21:44.004Z","dependency_job_id":"e8fcdc1a-25c3-4e8b-ae0d-75e0d5bbc791","html_url":"https://github.com/alan-turing-institute/CleverCSV","commit_stats":{"total_commits":748,"total_committers":7,"mean_commits":"106.8571428571428
6","dds":0.03074866310160429,"last_synced_commit":"f48ab1a4b187b3f870e40c50071ab18db2f6f8b2"},"previous_names":[],"tags_count":106,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alan-turing-institute%2FCleverCSV","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alan-turing-institute%2FCleverCSV/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alan-turing-institute%2FCleverCSV/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alan-turing-institute%2FCleverCSV/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alan-turing-institute","download_url":"https://codeload.github.com/alan-turing-institute/CleverCSV/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245597195,"owners_count":20641859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","csv-converter","csv-export","csv-files","csv-format","csv-import","csv-parser","csv-parsing","csv-reader","csv-reading","data-analysis","data-mining","data-science","datascience","machine-learning","python","python-library","python3"],"created_at":"2024-07-31T15:01:06.385Z","updated_at":"2025-03-26T05:30:55.732Z","avatar_url":"https://github.com/alan-turing-institute.png","language":"Python","funding_links":[],"categories":["Python","Datenbankwerkzeuge","📦 Additional Python Libraries","Data Gathering"],"sub_categories":["Documentation \u0026 File 
Processing","Ranking/Recommender"],"readme":"\u003cp align=\"center\"\u003e\n        \u003cimg width=\"60%\" src=\"https://raw.githubusercontent.com/alan-turing-institute/CleverCSV/eea72549195e37bd4347d87fd82bc98be2f1383d/.logo.png\"\u003e\n        \u003cbr\u003e\n        \u003ca href=\"https://github.com/alan-turing-institute/CleverCSV/actions\"\u003e\n                \u003cimg src=\"https://github.com/alan-turing-institute/CleverCSV/workflows/build/badge.svg\" alt=\"Github Actions Build Status\"\u003e\n        \u003c/a\u003e\n        \u003ca href=\"https://pypi.org/project/clevercsv/\"\u003e\n                \u003cimg src=\"https://badge.fury.io/py/clevercsv.svg\" alt=\"PyPI version\"\u003e\n        \u003c/a\u003e\n        \u003ca href=\"https://clevercsv.readthedocs.io/en/latest/?badge=latest\"\u003e\n                \u003cimg src=\"https://readthedocs.org/projects/clevercsv/badge/?version=latest\" alt=\"Documentation Status\"\u003e\n        \u003c/a\u003e\n        \u003ca href=\"https://pepy.tech/project/clevercsv\"\u003e\n                \u003cimg src=\"https://pepy.tech/badge/clevercsv\" alt=\"Downloads\"\u003e\n        \u003c/a\u003e\n        \u003ca href=\"https://mybinder.org/v2/gh/alan-turing-institute/CleverCSVDemo/master?filepath=CSV_dialect_detection_with_CleverCSV.ipynb\"\u003e\n                \u003cimg src=\"https://mybinder.org/badge_logo.svg\" alt=\"Binder\"\u003e\n        \u003c/a\u003e\n        \u003ca href=\"https://rdcu.be/bLVur\"\u003e\n                \u003cimg src=\"https://img.shields.io/badge/DOI-10.1007%2Fs10618--019--00646--y-blue\"\u003e\n        \u003c/a\u003e\n\u003c/p\u003e\n\n*CleverCSV provides a drop-in replacement for the Python* ``csv`` *package \nwith improved dialect detection for messy CSV files. 
It also provides a handy \ncommand line tool that can standardize a messy file or generate Python code to \nimport it.*\n\n**Useful links:**\n\n- [CleverCSV on Github](https://github.com/alan-turing-institute/CleverCSV)\n- [CleverCSV on PyPI](https://pypi.org/project/clevercsv/)\n- [Documentation on ReadTheDocs](https://clevercsv.readthedocs.io/en/latest/)\n- [Demo of CleverCSV on Binder (interactive!)](https://mybinder.org/v2/gh/alan-turing-institute/CleverCSVDemo/master?filepath=CSV_dialect_detection_with_CleverCSV.ipynb)\n- [Research Paper on CSV dialect detection \n  (PDF)](https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf) \n- [Reproducible Research Repo](https://github.com/alan-turing-institute/CSV_Wrangling/)\n- [Blog post on messy CSV files](https://towardsdatascience.com/handling-messy-csv-files-2ef829aa441d)\n- [Discussion \n  forum](https://github.com/alan-turing-institute/CleverCSV/discussions): a \n  place to ask questions and share ideas!\n\n---\n\n*Contents:* \u003ca href=\"#quick-start\"\u003e\u003cb\u003eQuick Start\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"#introduction\"\u003e\u003cb\u003eIntroduction\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"#installation\"\u003e\u003cb\u003eInstallation\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"#usage\"\u003e\u003cb\u003eUsage\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"#python-library\"\u003ePython Library\u003c/a\u003e | \u003ca href=\"#command-line-tool\"\u003eCommand-Line Tool\u003c/a\u003e | \u003ca href=\"#version-control-integration\"\u003eVersion Control Integration\u003c/a\u003e | \u003ca href=\"#contributing\"\u003e\u003cb\u003eContributing\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"#notes\"\u003e\u003cb\u003eNotes\u003c/b\u003e\u003c/a\u003e\n\n---\n\n## Quick Start\n\n[Click here](#introduction) to go to the introduction with more details about \nCleverCSV. 
If you're in a hurry, below is a quick overview of how to get \nstarted with the CleverCSV Python package and the command line interface. \n\nFor the Python package:\n\n```python\n# Import the package\n\u003e\u003e\u003e import clevercsv\n\n# Load the file as a list of rows\n# This uses the imdb.csv file in the examples directory\n\u003e\u003e\u003e rows = clevercsv.read_table('./imdb.csv')\n\n# Load the file as a Pandas DataFrame\n# Note that df = pd.read_csv('./imdb.csv') would fail here\n\u003e\u003e\u003e df = clevercsv.read_dataframe('./imdb.csv')\n\n# Use CleverCSV as a drop-in replacement for the Python CSV module\n# This follows the Sniffer example: https://docs.python.org/3/library/csv.html#csv.Sniffer\n# Note that csv.Sniffer would fail here\n\u003e\u003e\u003e with open('./imdb.csv', newline='') as csvfile:\n...     dialect = clevercsv.Sniffer().sniff(csvfile.read())\n...     csvfile.seek(0)\n...     reader = clevercsv.reader(csvfile, dialect)\n...     rows = list(reader)\n```\n\nAnd for the command line interface:\n\n```text\n# Install the full version of CleverCSV (this includes the command line interface)\n$ pip install clevercsv[full]\n\n# Detect the dialect\n$ clevercsv detect ./imdb.csv\nDetected: SimpleDialect(',', '', '\\\\')\n\n# Generate code to import the file\n$ clevercsv code ./imdb.csv\n\nimport clevercsv\n\nwith open(\"./imdb.csv\", \"r\", newline=\"\", encoding=\"utf-8\") as fp:\n    reader = clevercsv.reader(fp, delimiter=\",\", quotechar=\"\", escapechar=\"\\\\\")\n    rows = list(reader)\n\n# Explore the CSV file as a Pandas DataFrame\n$ clevercsv explore -p imdb.csv\nDropping you into an interactive shell.\nCleverCSV has loaded the data into the variable: df\n\u003e\u003e\u003e df\n```\n\n## Introduction\n\n- CSV files are awesome! They are lightweight, easy to share, human-readable, \n  version-controllable, and supported by many systems and tools!\n- CSV files are terrible! 
They can have many different formats, multiple \n  tables, headers or no headers, escape characters, and there's no support for \n  recording metadata!\n\nCleverCSV is a Python package that aims to solve some of the pain points of \nCSV files, while maintaining many of the good things. The package \nautomatically detects (with high accuracy) the format (*dialect*) of CSV \nfiles, thus making it easier to simply point to a CSV file and load it, \nwithout the need for human inspection. In the future, we hope to solve some of \nthe other issues of CSV files too.\n\nCleverCSV is [based on \nscience](https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf). \nWe investigated thousands of real-world CSV files to find a robust way to \nautomatically detect the dialect of a file. This may seem like an easy \nproblem, but to a computer a CSV file is simply a long string, and every \ndialect will give you *some* table. In CleverCSV we use a technique based on \nthe patterns of row lengths of the parsed file and the data type of the \nresulting cells. With our method we achieve 97% accuracy for dialect \ndetection, with a 21% improvement on non-standard (*messy*) CSV files compared \nto the Python standard library.\n\nWe think this kind of work can be very valuable for working data scientists \nand programmers, and we hope that you find CleverCSV useful (if there's a \nproblem, please open an issue!). Since the academic world counts citations, \nplease **cite CleverCSV if you use the package**. Here's a BibTeX entry you \ncan use:\n\n```bib\n@article{van2019wrangling,\n        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},\n        author = {{van den Burg}, G. J. J. and Naz{\\'a}bal, A. 
and Sutton, C.},\n        journal = {Data Mining and Knowledge Discovery},\n        year = {2019},\n        volume = {33},\n        number = {6},\n        pages = {1799--1820},\n        issn = {1573-756X},\n        doi = {10.1007/s10618-019-00646-y},\n}\n```\n\nAnd of course, if you like the package please *spread the word!* You can do \nthis by Tweeting about it \n([#CleverCSV](https://twitter.com/hashtag/clevercsv)) or clicking the ⭐️ [on \nGitHub](https://github.com/alan-turing-institute/CleverCSV)!\n\n## Installation\n\nCleverCSV is available on PyPI. You can install either the full version, which \nincludes the command line interface and all optional dependencies, using\n\n```bash\n$ pip install clevercsv[full]\n```\n\nor you can install a lighter, core version of CleverCSV with\n\n```bash\n$ pip install clevercsv\n```\n\n## Usage\n\nCleverCSV consists of a Python library and a command line tool called \n``clevercsv``.\n\n### Python Library\n\nWe designed CleverCSV to provide a drop-in replacement for the built-in CSV \nmodule, with some useful functionality added to it. Therefore, if you simply \nwant to replace the builtin CSV module with CleverCSV, you can import \nCleverCSV as follows, and use it as you would use the builtin [csv \nmodule](https://docs.python.org/3/library/csv.html).\n\n```python\nimport clevercsv\n```\n\nCleverCSV provides an improved version of the dialect sniffer in the CSV \nmodule, but it also adds some useful wrapper functions. These functions \nautomatically detect the dialect and aim to make working with CSV files \neasier. 
We currently have the following helper functions:\n\n* [detect_dialect](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.detect_dialect): \n  takes a path to a CSV file and returns the detected dialect.\n* [read_table](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.read_table): \n  automatically detects the dialect and encoding of the file, and returns the \n  data as a list of rows. A version that returns a generator is also \n  available: \n  [stream_table](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.stream_table).\n* [read_dataframe](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.read_dataframe): \n  detects the dialect and encoding of the file and then uses \n  [Pandas](https://pandas.pydata.org/) to read the CSV into a DataFrame. Note \n  that this function requires Pandas to be installed.\n* [read_dicts](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.read_dicts): \n  detects the dialect and returns the rows of the file as dictionaries, assuming \n  the first row contains the headers. 
A streaming version called \n  [stream_dicts](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.stream_dicts) \n  is also available.\n* [write_table](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.write_table): \n  writes a table (a list of lists) to a file using the \n  [RFC-4180](https://tools.ietf.org/html/rfc4180) dialect.\n* [write_dicts](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.write_dicts): \n  writes a list of dictionaries to a file using the \n  [RFC-4180](https://tools.ietf.org/html/rfc4180) dialect.\n\nOf course, you can also use the traditional way of loading a CSV file, as in \nthe Python CSV module:\n\n```python\nimport clevercsv\n\nwith open(\"data.csv\", \"r\", newline=\"\") as fp:\n    # you can use verbose=True to see what CleverCSV does\n    dialect = clevercsv.Sniffer().sniff(fp.read(), verbose=False)\n    fp.seek(0)\n    reader = clevercsv.reader(fp, dialect)\n    rows = list(reader)\n```\n\nSince CleverCSV v0.8.0, dialect detection is a lot faster than in previous \nversions. However, for **large files**, you can speed up detection even more \nby supplying a sample of the document to the sniffer instead of the whole \nfile, for example:\n\n```python\ndialect = clevercsv.Sniffer().sniff(fp.read(10000))\n```\n\nYou can also speed up encoding detection by installing \n[cCharDet](https://github.com/PyYoshi/cChardet), which is used automatically \nwhen it is available on the system.\n\nThat's the basics! If you want more details, you can look at the code of the \npackage, the test suite, or the [API \ndocumentation](https://clevercsv.readthedocs.io/en/latest/source/modules.html). 
\nIf you run into any issues or have comments or suggestions, please open an \nissue [on GitHub](https://github.com/alan-turing-institute/CleverCSV).\n\n### Command-Line Tool\n\n*To use the command line tool, make sure that you install the full version of \nCleverCSV (see above).*\n\nThe ``clevercsv`` command line application has a number of handy features to \nmake working with CSV files easier. For instance, it can be used to view a CSV \nfile on the command line while automatically detecting the dialect. It can \nalso generate Python code for importing data from a file with the correct \ndialect. The full help text is as follows:\n\n```text\nusage: clevercsv [-h] [-V] [-v] command ...\n\nAvailable commands:\n  help         Display help information\n  detect       Detect the dialect of a CSV file\n  view         View the CSV file on the command line using TabView\n  standardize  Convert a CSV file to one that conforms to RFC-4180\n  code         Generate Python code to import a CSV file\n  explore      Explore the CSV file in an interactive Python shell\n```\n\nEach of the commands has further options (for instance, the ``code`` and \n``explore`` commands have support for importing the CSV file as a Pandas \nDataFrame). Use ``clevercsv help \u003ccommand\u003e`` or ``man clevercsv \u003ccommand\u003e`` \nfor more information. Below are some examples for each command.\n\nNote that each command accepts the ``-n`` or ``--num-chars`` flag to set the \nnumber of characters used to detect the dialect. This can be especially \nhelpful to speed up dialect detection on large files.\n\n#### Code\n\nCode generation is useful when you don't want to detect the dialect of the \nsame file over and over again. 
You simply run the following command and copy \nthe generated code to a Python script!\n\n```text\n$ clevercsv code imdb.csv\n\n# Code generated with CleverCSV\n\nimport clevercsv\n\nwith open(\"imdb.csv\", \"r\", newline=\"\", encoding=\"utf-8\") as fp:\n    reader = clevercsv.reader(fp, delimiter=\",\", quotechar=\"\", escapechar=\"\\\\\")\n    rows = list(reader)\n```\n\nWe also have a version that reads a Pandas dataframe:\n\n```text\n$ clevercsv code --pandas imdb.csv\n\n# Code generated with CleverCSV\n\nimport clevercsv\n\ndf = clevercsv.read_dataframe(\"imdb.csv\", delimiter=\",\", quotechar=\"\", escapechar=\"\\\\\")\n```\n\n#### Detect\n\nDetection is useful when you only want to know the dialect.\n\n```text\n$ clevercsv detect imdb.csv\nDetected: SimpleDialect(',', '', '\\\\')\n```\n\nThe ``--plain`` flag gives the components of the dialect on separate lines, \nwhich makes combining it with ``grep`` easier.\n\n```text\n$ clevercsv detect --plain imdb.csv\ndelimiter = ,\nquotechar =\nescapechar = \\\n```\n\n#### Explore\n\nThe ``explore`` command is great for a command-line based workflow, or when \nyou quickly want to start working with a CSV file in Python. This command \ndetects the dialect of a CSV file and starts an interactive Python shell with \nthe file already loaded! You can either have the file loaded as a list of \nlists:\n\n```text\n$ clevercsv explore milk.csv\nDropping you into an interactive shell.\n\nCleverCSV has loaded the data into the variable: rows\n\u003e\u003e\u003e\n\u003e\u003e\u003e len(rows)\n381\n```\n\nor you can load the file as a Pandas dataframe:\n\n```text\n$ clevercsv explore -p imdb.csv\nDropping you into an interactive shell.\n\nCleverCSV has loaded the data into the variable: df\n\u003e\u003e\u003e\n\u003e\u003e\u003e df.head()\n                   fn        tid  ... War Western\n0  titles01/tt0012349  tt0012349  ...   0       0\n1  titles01/tt0015864  tt0015864  ...   0       0\n2  titles01/tt0017136  tt0017136  ...   
0       0\n3  titles01/tt0017925  tt0017925  ...   0       0\n4  titles01/tt0021749  tt0021749  ...   0       0\n\n[5 rows x 44 columns]\n```\n\n#### Standardize\n\nUse the ``standardize`` command when you want to rewrite a file using the \n[RFC-4180 standard](https://tools.ietf.org/html/rfc4180):\n\n```text\n$ clevercsv standardize --output imdb_standard.csv imdb.csv\n```\n\nIn this particular example the use of the escape character is replaced by \nusing quotes.\n\n#### View\n\nThis command allows you to view the file in the terminal. The dialect is of \ncourse detected using CleverCSV! Both this command and the ``standardize`` \ncommand support the ``--transpose`` flag, if you want to transpose the file \nbefore viewing or saving:\n\n```text\n$ clevercsv view --transpose imdb.csv\n```\n\n### Version Control Integration\n\nIf you'd like to make sure that you never commit a messy (non-standard) CSV \nfile to your repository, you can install a \n[pre-commit](https://pre-commit.com/) hook. First, install pre-commit using \nthe [installation instructions](https://pre-commit.com/#install). Next, add \nthe following configuration to the ``.pre-commit-config.yaml`` file in your \nrepository:\n\n```yaml\nrepos:\n  - repo: https://github.com/alan-turing-institute/CleverCSV-pre-commit\n    rev: v0.6.6   # or any later version\n    hooks:\n      - id: clevercsv-standardize\n```\n\nFinally, run ``pre-commit install`` to set up the git hook. 
Pre-commit will \nnow use CleverCSV to standardize your CSV files following \n[RFC-4180](https://tools.ietf.org/html/rfc4180) whenever you commit a CSV file \nto your repository.\n\n## Contributing\n\nIf you want to encourage development of CleverCSV, the best thing to do now is \nto *spread the word!*\n\nIf you encounter an issue in CleverCSV, please [open an \nissue](https://help.github.com/en/github/managing-your-work-on-github/creating-an-issue) \nor [submit a pull \nrequest](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request). \nDon't hesitate, you're helping to make this project better for everyone! If \nGitHub's not your thing but you still want to contact us, you can send an \nemail to ``gertjanvandenburg at gmail dot com`` instead. You can also ask \nquestions [on Gitter](https://gitter.im/alan-turing-institute/CleverCSV).\n\nNote that all contributions to the project must adhere to the [Code of \nConduct](https://github.com/alan-turing-institute/CleverCSV/blob/master/CODE_OF_CONDUCT.md).\n\nThe CleverCSV package was originally written by [Gertjan van den \nBurg](https://gertjan.dev) and came out of [scientific \nresearch](https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf) \non wrangling messy CSV files by [Gertjan van den Burg](https://gertjan.dev), \n[Alfredo Nazabal](https://scholar.google.com/citations?user=IanHvT4AAAAJ), and\n[Charles Sutton](https://homepages.inf.ed.ac.uk/csutton/).\n\n## Notes\n\nCleverCSV is licensed under the [MIT license](./LICENSE). 
Please [cite our \nresearch](https://link.springer.com/article/10.1007/s10618-019-00646-y) if you \nuse CleverCSV in your work.\n\nCopyright (c) 2018-2021 [The Alan Turing Institute](https://turing.ac.uk).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falan-turing-institute%2FCleverCSV","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falan-turing-institute%2FCleverCSV","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falan-turing-institute%2FCleverCSV/lists"}