{"id":19938437,"url":"https://github.com/hunyadi/tsv2py","last_synced_at":"2025-08-23T01:07:35.087Z","repository":{"id":179936914,"uuid":"659951164","full_name":"hunyadi/tsv2py","owner":"hunyadi","description":"Parser and generator for PostgreSQL-compatible tab-separated values (TSV)","archived":false,"fork":false,"pushed_at":"2024-10-19T10:31:01.000Z","size":90,"stargazers_count":0,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-15T22:58:57.606Z","etag":null,"topics":["python-extension","python3","tsv","tsv-parser"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hunyadi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-06-28T23:33:17.000Z","updated_at":"2024-10-19T10:31:05.000Z","dependencies_parsed_at":"2023-11-20T13:46:49.816Z","dependency_job_id":"6803b416-7f13-4db5-860a-bb546a28463b","html_url":"https://github.com/hunyadi/tsv2py","commit_stats":null,"previous_names":["hunyadi/tsv2py"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/hunyadi/tsv2py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Ftsv2py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Ftsv2py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Ftsv2py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Ftsv2py/manifests","owner_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hunyadi","download_url":"https://codeload.github.com/hunyadi/tsv2py/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Ftsv2py/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271727585,"owners_count":24810561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python-extension","python3","tsv","tsv-parser"],"created_at":"2024-11-12T23:40:09.477Z","updated_at":"2025-08-23T01:07:35.062Z","avatar_url":"https://github.com/hunyadi.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parse and generate tab-separated values (TSV) data\n\n[Tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values) (TSV) is a simple and popular format for data storage, data transfer, exporting data from and importing data to relational databases. For example, PostgreSQL [COPY](https://www.postgresql.org/docs/current/sql-copy.html) moves data between PostgreSQL tables and standard file-system files or in-memory stores, and its `text` format (a text file with one line per table row) is a generic version of TSV. 
Meanwhile, packages like [asyncpg](https://magicstack.github.io/asyncpg/current/index.html) help efficiently insert, update or query data in bulk with binary data transfer between Python and PostgreSQL.\n\nThis package offers a high-performance alternative for converting data between a TSV text file and Python objects. The parser can read a TSV record into a Python tuple consisting of built-in Python types, one for each field. The generator can produce a TSV record from a tuple.\n\n## Installation\n\nEven though *tsv2py* contains native code, the package is already pre-built for several target architectures. In most cases, you can install directly from a binary wheel, selected automatically by `pip`:\n\n```sh\npython3 -m pip install tsv2py\n```\n\nIf a binary wheel is not available for the target platform, `pip` will attempt to install *tsv2py* from the source distribution. This builds the package on the fly as part of the installation process, which requires a C compiler such as `gcc` or `clang`. 
The following commands install a C compiler and the Python development headers on Amazon Linux:\n\n```sh\nsudo yum groupinstall -y \"Development Tools\"\nsudo yum install -y python3-devel python3-pip\n```\n\nIf you lack a C compiler or the Python development headers, you will get error messages similar to the following:\n\n```\nerror: command 'gcc' failed: No such file or directory\nlib/tsv_parser.c:2:10: fatal error: Python.h: No such file or directory\n```\n\n## Quick start\n\n```python\nfrom datetime import date, datetime\nfrom uuid import UUID\n\nfrom tsv.helper import Parser\n\n# placeholder path to a TSV file; replace with your own\ntsv_path = \"data.tsv\"\n\n# specify the column structure\nparser = Parser(fields=(bytes, date, datetime, float, int, str, UUID, bool))\n\n# read and parse an entire file\nwith open(tsv_path, \"rb\") as f:\n    py_records = parser.parse_file(f)\n\n# read and parse a file line by line\nwith open(tsv_path, \"rb\") as f:\n    for line in f:\n        py_record = parser.parse_line(line)\n```\n\n## TSV format\n\nText format is a simple tabular format in which each record (table row) occupies a single line.\n\n* Output always begins with a header row, which lists data field names.\n* Fields (table columns) are delimited by *tab* characters.\n* Non-printable characters and special values are escaped with *backslash* (`\\`), as shown below:\n\n| Escape | Interpretation               |\n| ------ | ---------------------------- |\n| `\\N`   | NULL value                   |\n| `\\0`   | NUL character (ASCII 0)      |\n| `\\b`   | Backspace (ASCII 8)          |\n| `\\f`   | Form feed (ASCII 12)         |\n| `\\n`   | Newline (ASCII 10)           |\n| `\\r`   | Carriage return (ASCII 13)   |\n| `\\t`   | Tab (ASCII 9)                |\n| `\\v`   | Vertical tab (ASCII 11)      |\n| `\\\\`   | Backslash (single character) |\n\nThis format allows data to be easily imported into a database engine, e.g. 
with PostgreSQL [COPY](https://www.postgresql.org/docs/current/sql-copy.html).\n\nOutput in this format is transmitted as media type `text/plain` or `text/tab-separated-values` in UTF-8 encoding.\n\n## Parser\n\nThe parser understands the following Python types:\n\n* `None`. This special value is returned for the TSV escape sequence `\\N`.\n* `bool`. A literal `true` or `false` is converted into a boolean value.\n* `bytes`. TSV escape sequences are reversed before the data is passed to Python as a `bytes` object. NUL bytes are permitted.\n* `datetime`. The input has to comply with RFC 3339 and ISO 8601. The timezone must be UTC (a.k.a. suffix `Z`).\n* `date`. The input has to conform to the format `YYYY-MM-DD`.\n* `time`. The input has to conform to the format `hh:mm:ssZ` with no fractional seconds, or `hh:mm:ss.ffffffZ` with fractional seconds. Fractional seconds allow up to 6 digits of precision.\n* `float`. Interpreted as double precision floating point numbers.\n* `int`. Arbitrary-length integers are allowed.\n* `str`. TSV escape sequences are reversed before the data is passed to Python as a `str`. NUL bytes are not allowed.\n* `uuid.UUID`. The input has to comply with RFC 4122, or be a string of 32 hexadecimal digits.\n* `decimal.Decimal`. Interpreted as arbitrary precision decimal numbers.\n* `ipaddress.IPv4Address`.\n* `ipaddress.IPv6Address`.\n* `list` and `dict`, which are understood as JSON, and invoke the equivalent of `json.loads` to parse a serialized JSON string.\n\nThe backslash character `\\` is both a TSV and a JSON escape sequence initiator. When JSON data is written to TSV, several backslash characters may be needed, e.g. `\\\\n` in a quoted JSON string translates to a single newline character. First, `\\\\` in `\\\\n` is understood as an escape sequence by the TSV parser to produce a single `\\` character followed by an `n` character, and in turn `\\n` is understood as a single newline embedded in a JSON string by the JSON parser. 
Specifically, you need four consecutive backslash characters in TSV to represent a single backslash in a JSON quoted string.\n\nInternally, the implementation uses AVX2 instructions to\n\n* parse RFC 3339 date-time strings into Python `datetime` objects,\n* parse RFC 4122 UUID strings or 32-digit hexadecimal strings into Python `UUID` objects,\n* and find `\\t` delimiters between fields in a line.\n\nFor parsing integers up to the range of the `long` type, the parser calls the C standard library function [strtol](https://en.cppreference.com/w/c/string/byte/strtol).\n\nFor parsing IPv4 and IPv6 addresses, the parser calls the C function [inet_pton](https://man7.org/linux/man-pages/man3/inet_pton.3.html) in libc or Windows Sockets (WinSock2).\n\nIf [orjson](https://github.com/ijl/orjson) is installed, the parser uses it to speed up parsing of nested JSON structures. If it is not available, the library falls back to the [built-in JSON decoder](https://docs.python.org/3/library/json.html).\n\n### Date-time format\n\n```\nYYYY-MM-DDThh:mm:ssZ\nYYYY-MM-DDThh:mm:ss.fZ\nYYYY-MM-DDThh:mm:ss.ffZ\nYYYY-MM-DDThh:mm:ss.fffZ\nYYYY-MM-DDThh:mm:ss.ffffZ\nYYYY-MM-DDThh:mm:ss.fffffZ\nYYYY-MM-DDThh:mm:ss.ffffffZ\n```\n\n### Date format\n\n```\nYYYY-MM-DD\n```\n\n### Time format\n\n```\nhh:mm:ssZ\nhh:mm:ss.fZ\nhh:mm:ss.ffZ\nhh:mm:ss.fffZ\nhh:mm:ss.ffffZ\nhh:mm:ss.fffffZ\nhh:mm:ss.ffffffZ\n```\n\n## Performance\n\nDepending on the field types, *tsv2py* parses TSV records up to 7 times faster than a functionally equivalent Python implementation based on the Python standard library. Savings in execution time are more substantial for dates, UUIDs and longer strings with special characters (up to 90% savings), and they are more moderate for simple types like small integers (approx. 
60% savings).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhunyadi%2Ftsv2py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhunyadi%2Ftsv2py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhunyadi%2Ftsv2py/lists"}