{"id":37731761,"url":"https://github.com/graphext/lector","last_synced_at":"2026-01-16T13:51:41.193Z","repository":{"id":47124714,"uuid":"499447543","full_name":"graphext/lector","owner":"graphext","description":"A fast reader for messy CSV files with optional type inference.","archived":false,"fork":false,"pushed_at":"2025-06-04T13:52:45.000Z","size":251,"stargazers_count":17,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-09-04T23:55:04.319Z","etag":null,"topics":["apache-arrow","csv","data-types","parser","python","type-inference"],"latest_commit_sha":null,"homepage":"https://lector.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/graphext.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-06-03T09:09:16.000Z","updated_at":"2025-06-04T13:38:46.000Z","dependencies_parsed_at":"2023-02-02T15:48:26.909Z","dependency_job_id":"8cdc2a2c-0a75-4239-ad5b-b5666ff9cfc7","html_url":"https://github.com/graphext/lector","commit_stats":null,"previous_names":[],"tags_count":48,"template":false,"template_full_name":null,"purl":"pkg:github/graphext/lector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphext%2Flector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphext%2Flector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphext%2Flector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/graphext%2Flector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/graphext","download_url":"https://codeload.github.com/graphext/lector/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graphext%2Flector/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479034,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-arrow","csv","data-types","parser","python","type-inference"],"created_at":"2026-01-16T13:51:41.058Z","updated_at":"2026-01-16T13:51:41.177Z","avatar_url":"https://github.com/graphext.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/graphext/lector/HEAD?labpath=notebooks%2Fusage.ipynb)\n\n# Lector\n\n[Lector](https://github.com/graphext/lector) aims to be a fast reader for potentially messy CSV files with configurable column type inference. It combines automatic detection of file encodings, CSV dialects (separator, escaping etc.) and preambles (initial lines containing metadata or junk unrelated to the actual tabular data). 
Its goal is to just-read-the-effing-CSV file without manual configuration in most cases. Each of the detection components is configurable and can be swapped out easily with custom implementations.\n\nAlso, since both [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and Apache [Arrow](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html) will destructively cast columns to the wrong type in some cases (e.g. large ID-like integer strings to floats), it provides an alternative and customisable inference and casting mechanism.\n\nUnder the hood it uses pyarrow's [CSV parser](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html) for reading, and its [compute functions](https://arrow.apache.org/docs/python/api/compute.html) for optional type inference.\n\nLector is used at [Graphext](https://www.graphext.com) behind the scenes whenever a user uploads a new dataset, and so implicitly has been validated across 1000s of different CSV files from all kinds of sources. Note, however, that this is Graphext's first foray into open-sourcing our code and still _work-in-progress_. 
So at least initially we won't provide any guarantees as to support of this library.\n\nFor quick usage examples see the [Usage](#usage) section below or the [notebook](notebooks/usage.ipynb) in this repo.\n\nFor detailed documentation visit https://lector.readthedocs.io/.\n\n## Installing\n\nWhile this library is not available yet on pypi, you can easily install it from Github with\n\n```\npip install git+https://github.com/graphext/lector\n```\n\n## Usage\n\nLet's assume we receive a CSV file containing some initial metadata, using the semicolon as separator, having some missing fields, and being encoded in Latin-1 (you'd be surprised how common such files are in the real world).\n\n\u003cdetails\u003e\n\u003csummary\u003eCreate example CSV file\u003c/summary\u003e\n\n``` python\ncsv = \"\"\"\nSome preamble content here\nThis is still \"part of the metadata preamble\"\nid;genre;metric;count;content;website;tags;vecs;date\n1234982348728374;a;0.1;1;; http://www.graphext.com;\"[a,b,c]\";\"[1.3, 1.4, 1.67]\";11/10/2022\n;b;0.12;;\"Natural language text is different from categorical data.\"; https://www.twitter.com;[d];\"[0, 1.9423]\";01/10/2022\n9007199254740993;a;3.14;3;\"The Project · Gutenberg » EBook « of Die Fürstin.\";http://www.google.com;\"['e', 'f']\";[\"84.234, 12509.99\"];13/10/2021\n\"\"\".encode(\"ISO-8859-1\")\n\nwith open(\"example.csv\", \"wb\") as fp:\n    fp.write(csv)\n```\n\u003c/details\u003e\n\u003cbr\u003e\n\nTo read this with lector into a pandas DataFrame, simply use\n\n``` python\ndf = lector.read_csv(\"example.csv\", to_pandas=True)\n```\n\nPrinting the DataFrame and its column types produces the following output:\n\n```\n                 id genre  metric  count  \\\n0  1234982348728374     a    0.10      1\n1              \u003cNA\u003e     b    0.12   \u003cNA\u003e\n2  9007199254740993     a    3.14      3\n\n                                             content                  website  \\\n0                                               
\u003cNA\u003e  http://www.graphext.com\n1  Natural language text is different from catego...  https://www.twitter.com\n2  The Project · Gutenberg » EBook « of Die Fürstin.    http://www.google.com\n\n        tags                vecs       date\n0  [a, b, c]    [1.3, 1.4, 1.67] 2022-10-11\n1        [d]       [0.0, 1.9423] 2022-10-01\n2     [e, f]  [84.234, 12509.99] 2021-10-13\n\nid                  Int64\ngenre            category\nmetric            float64\ncount               UInt8\ncontent            string\nwebsite          category\ntags               object\nvecs               object\ndate       datetime64[ns]\ndtype: object\n```\n\nThis is pretty sweet, because\n\n- we didn't have to tell lector _how_ to read this file (text encoding, lines to skip, separator etc.)\n- we didn't have to tell lector the _data types_ of the columns, but it inferred the correct and most efficient ones automatically, e.g.:\n    - a nullable `Int64` extension type was necessary to correctly represent values in the `id` column\n    - the `genre` and `website` columns were automatically converted to the efficient `dictionary` (categorical) type\n    - the `count` column uses the _smallest_ integer type necessary\n    - the `content` column, containing natural language text, has _not_ been converted to a categorical type, but kept as string values (it is unlikely to benefit from dictionary-encoding)\n    - the `date` column was converted to datetimes correctly, even though the original\n      strings are not in an ISO format\n    - the `tags` and `vecs` columns have been imported with `object` dtype (since pandas\n      doesn't officially support iterables as elements in a column), but their values are in fact numpy arrays of the correct dtype!\n\nNeither pandas nor Arrow will do this. In fact, they cannot even import this data correctly, even _without_ attempting any smart type inference. Compare e.g. 
with pandas' attempt to read the same CSV file:\n\n\u003cdetails\u003e\n\u003csummary\u003ePandas and Arrow fail\u003c/summary\u003e\nFirstly, to get something close to the above, you'll have to spend a good amount of time manually inspecting the CSV file and come up with the following verbose pandas call:\n\n``` python\nimport pandas as pd\n\n# Column types worked out by manually inspecting the file\ndtypes = {\n    \"id\": \"Int64\",\n    \"genre\": \"category\",\n    \"metric\": \"float\",\n    \"count\": \"UInt8\",\n    \"content\": \"string\",\n    \"website\": \"category\",\n    \"tags\": \"object\",\n    \"vecs\": \"object\"\n}\n\ndf = pd.read_csv(\n    \"example.csv\",\n    encoding=\"ISO-8859-1\",\n    skiprows=3,\n    sep=\";\",\n    dtype=dtypes,\n    parse_dates=[\"date\"],\n    infer_datetime_format=True,  # deprecated since pandas 2.0\n)\n```\n\nWhile this _parses_ the CSV file alright, the result is, urm, lacking. Let's see:\n\n```\n                 id genre  metric  count  \\\n0  1234982348728374     a    0.10      1\n1              \u003cNA\u003e     b    0.12   \u003cNA\u003e\n2  9007199254740992     a    3.14      3\n\n                                             content  \\\n0                                               \u003cNA\u003e\n1  Natural language text is different from catego...\n2  The Project · Gutenberg » EBook « of Die Fürstin.\n\n                    website        tags                  vecs       date\n0   http://www.graphext.com     [a,b,c]      [1.3, 1.4, 1.67] 2022-11-10\n1   https://www.twitter.com         [d]           [0, 1.9423] 2022-01-10\n2     http://www.google.com  ['e', 'f']  [\"84.234, 12509.99\"] 2021-10-13\n\nid                  Int64\ngenre            category\nmetric            float64\ncount               UInt8\ncontent            string\nwebsite          category\ntags               object\nvecs               object\ndate       datetime64[ns]\ndtype: object\n```\n\nA couple of observations:\n\n- Pandas _will_ always cast numeric columns with missing data to the float type, before any of our custom types are applied. 
This is a big problem, as we can see in the `id` column, since not all integers can be represented exactly by a 64-bit floating-point type (the correct value in our file is `9007199254740993` 👀). It is also a sneaky problem, because this happens silently, and so you may not realize you've got wrong IDs, and may produce totally wrong analyses if you use them down the line for joins etc. The only way to import CSV files like this with pandas correctly is to inspect the actual data in a text editor, guess the best data type, import the data without any type inference, and then individually cast to the correct types. There is no way to configure pandas to import the data correctly.\n- Pandas has messed up the dates. While at least warning us about it, pandas doesn't try to infer a consistent date format across all rows. Although the CSV file contains all dates in a single consistent format (`%d/%m/%Y`), pandas has used mixed formats and so imported some dates wrongly.\n- Without the explicit `dtype` mapping, the `genre` and `content` columns would have been imported with the generic `object` dtype, which is not particularly useful, but not necessarily a problem either.\n- Since pandas doesn't support iterable dtypes, the `tags` and `vecs` columns haven't been parsed into any useful structures.\n\nNote that Arrow doesn't fare much better. It doesn't parse and infer its own `list` data type, it doesn't know how to parse dates in any format other than ISO 8601, and commits the same integer-as-float conversion error.\n\u003c/details\u003e\n\u003cbr\u003e\n\n## Development\n\nTo install a local copy for development, including all dependencies for testing, documentation and code quality, use the following commands:\n\n``` bash\ngit clone https://github.com/graphext/lector\ncd lector\npip install -v -e \".[dev]\"\npre-commit install\n```\n\nThe [pre-commit](https://pre-commit.com/) command will make sure that whenever you try to commit changes to this repo, code quality and formatting tools will be executed. This ensures e.g. 
a common coding style, such that any changes to be committed are functional changes only, not changes due to different personal coding style preferences. This in turn makes it easier to collaborate via pull requests etc.\n\nTo test the installation you may execute the [pytest](https://docs.pytest.org/) suite to make sure everything is set up correctly, e.g.:\n\n``` bash\npytest -v .\n```\n\n## Documentation\n\nThe documentation is created using Sphinx and is available here: https://lector.readthedocs.io/.\n\nYou can build and view the static HTML locally like any other Sphinx project:\n\n``` bash\n(cd docs \u0026\u0026 make clean html)\n(cd docs/build/html \u0026\u0026 python -m http.server)\n```\n\n\n## To Do\n\n- _Parallelize type inference_? While type inference is already pretty fast, it can potentially be sped up by processing columns in parallel.\n- _Testing_. The current pytest setup is terrible. I've given `hypothesis_csv` a try here,\nbut I'm probably making bad use of it. Tests are convoluted and probably not even good at catching corner cases.\n\n## License\n\nThis project is licensed under the terms of the Apache License 2.0.\n\n## Links\n\n- Documentation: https://lector.readthedocs.io/\n- Source: https://github.com/graphext/lector\n- Graphext: https://www.graphext.com\n- Graphext on Twitter: https://twitter.com/graphext\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraphext%2Flector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgraphext%2Flector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraphext%2Flector/lists"}