{"id":15583510,"url":"https://github.com/osiegmar/tabular-data-interchange-format","last_synced_at":"2025-03-29T08:43:39.092Z","repository":{"id":218107637,"uuid":"745451709","full_name":"osiegmar/tabular-data-interchange-format","owner":"osiegmar","description":"Home of the Tabular Data Interchange Format (TDIF)","archived":false,"fork":false,"pushed_at":"2024-01-28T12:31:17.000Z","size":102,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-20T21:15:57.879Z","etag":null,"topics":["csv"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/osiegmar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-19T11:12:17.000Z","updated_at":"2024-01-21T10:43:49.000Z","dependencies_parsed_at":"2024-01-27T17:26:07.480Z","dependency_job_id":"575f1c4c-d3fb-4fe0-957c-0170e9883b66","html_url":"https://github.com/osiegmar/tabular-data-interchange-format","commit_stats":null,"previous_names":["osiegmar/tabular-data-interchange-format"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/osiegmar%2Ftabular-data-interchange-format","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/osiegmar%2Ftabular-data-interchange-format/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/osiegmar%2Ftabular-data-interchange-format/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/osiegmar%2Ftabular-data-in
terchange-format/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/osiegmar","download_url":"https://codeload.github.com/osiegmar/tabular-data-interchange-format/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246162111,"owners_count":20733354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv"],"created_at":"2024-10-02T20:08:47.808Z","updated_at":"2025-03-29T08:43:39.074Z","avatar_url":"https://github.com/osiegmar.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tabular Data Interchange Format (TDIF)\n\nTabular Data Interchange Format (TDIF) is a lightweight and language-independent data interchange format for the\nexchange of tabular data.\n\n\u003e [!WARNING]\n\u003e This document is currently in draft form. 
Do not implement it yet.\n\n## Rationale\n\nWith formats like XML, JSON, YAML, CBOR and many others, there seems to be no shortage of standardized, hierarchical\ndata interchange and serialization formats.\nFor simple tabular data, however, there is a lack of standardized formats.\n\nThe closest thing to a standard for tabular data is CSV (comma-separated values) as described\nin [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180).\nUnfortunately, this RFC is only an informational, non-normative and thus nonbinding memo.\nIt does not strictly define the CSV format and leaves room for interpretation and implementation-specific behavior.\nThis is because CSV had been in use for decades before RFC 4180 was published.\n\nThose differences can lead to unexpected or even incorrect results. This is a continuous source of frustration for\ndevelopers and of financial damage for organizations.\n\nTypical differences between CSV implementations:\n\n- File encoding (US-ASCII, UTF-8, UTF-16, UTF-32, ...)\n- Use of a BOM (byte order mark) header\n- End-of-line character (LF, CR, CRLF)\n- Field separator (comma, semicolon, tab, ...)\n- Quoting character (double-quote, single-quote, ...)\n- Escaping of the quoting character (doubling, backslash, ...)\n- Presence of a header record\n- Uniqueness of header field names including case-sensitivity\n- Use of comments\n- When to quote a field\n- Handling of empty lines\n- Handling of empty fields\n- Interpretation of null values\n- Handling of whitespace characters\n- Handling of leading and trailing whitespace\n- Handling of linebreaks within a field\n- Handling of multibyte characters for field separator, quoting character, escape character, ...\n- Handling of non-CSV data within the file (e.g., information headers, copyright, ...)\n\nMany of these differences are incorporated in the DRAFT\nof [RFC-4180-bis](https://datatracker.ietf.org/doc/html/draft-shafranovich-rfc4180-bis)\nbut only to describe the current state of CSV 
implementations and related concerns.\nThe main problem remains: The RFC will stay non-normative and thus nonbinding, and when people talk about CSV, they\nwill still have different things in mind.\n\nTDIF, while similar to CSV, is a strict and unambiguous format.\nDue to its similarity to CSV, it is not only straightforward to read and write, but also compatible with many existing\ntools.\n\nBy definition, any TDIF file can be read by any existing RFC 4180-compliant CSV parser as long as no comments are used\nor the CSV parser supports comments in the most common way currently found in CSV implementations\n(lines starting with a hash character (`#`)).\n\n## Specification\n\n### Glossary\n\n- **TDIF file**: A file that contains a sequence of records and optionally comments.\n- **record**: A sequence of fields.\n- **header**: The first record in a TDIF file; it contains the field names.\n- **field**: A value within a record.\n- **comment**: A line in a TDIF file that carries commentary.\n\nThe term **TDIF file** is for descriptive purposes only. TDIF files can be stored in files, but they can also be\ntransmitted over the network or stored in a database. 
The term **TDIF file** is used throughout this document for\nsimplicity.\n\n### Encoding, Content, and Line Endings\n\nTDIF files MUST be encoded using the UTF-8 character encoding and MUST NOT include a BOM (byte order mark) header.\n\nFields and comments MAY contain binary data, including NUL characters.\n\nRecords and comments MUST be terminated by one of the following end-of-line characters:\n\n- Line-feed character (LF)\n- Carriage-return character (CR)\n- Carriage-return character followed by a line-feed character (CRLF)\n\nAs the end-of-line characters are dependent on the operating system, the TDIF format does not specify any preference.\nFor simplicity, the rest of this document uses the term \"linebreak\" to refer to any of the above characters.\n\nTDIF files MUST NOT contain any data outside of records and comments.\nThis includes empty lines that are not part of a multi-line field.\nAn \"empty line\" is defined as a sequence of linebreaks.\n\nTDIF files MUST contain at least the header record.\n\n### Records\n\nA record consists of fields separated by commas (`,`) and is terminated by a linebreak.\n\nThe first record in a TDIF file MUST be a header record, which contains the names of the fields for the succeeding\nrecords.\nField names MUST be unique (case-insensitive), and each field name MUST be enclosed in double quotation marks (`\"`).\nThe header record MUST have at least one field.\n\nEach record MUST have the same number of fields as the header record.\n\nFields MUST be either `\\N` to denote a null value or they MUST be enclosed in double quotation marks (`\"`).\nRecords MUST NOT contain empty fields (`,,`).\nWhitespace characters outside a field MUST NOT occur (`\"foo\", \"bar\"`).\nDouble quotation marks within a field MUST be escaped by doubling them (`\"\"`).\n\nField values may span multiple lines and may contain any character including binary data.\n\n**Examples:**\n\nThe following example contains a header record and three data records 
with one field each.\n\n```\n\"header\"CRLF\n\"123\"CRLF\n\"foo \"\"is\"\" bar\"CRLF\n\\NCRLF\n```\n\nThe first data record contains a field with the value `123`.\nThe second data record contains a field with the value `foo \"is\" bar`.\nThe quotation marks had to be escaped by doubling them.\nThe third and last data record contains a field with a null value.\n\nThe following example demonstrates the use of a multi-line field.\n\n```\n\"first name\",\"last name\",\"address\",\"country\"CRLF\n\"John\",\"Doe\",\"123 Maple StreetLF\nAnytownLF\nPA 17101\",\"US\"CRLF\n\"Max\",\"Mustermann\",\\N,\"DE\"CRLF\n```\n\nNote that the line-feed characters in the multi-line \"address\" field are part of the field value and not record\nseparators.\n\n### Comments\n\nA line starting with a hash character (`#`) is considered a comment.\nThe comment is terminated by a linebreak.\n\nThe hash character MUST be the first character of the line.\nAny character is allowed up to the terminating linebreak.\n\nComments MAY be inserted anywhere in the file, except within a record.\n\n**Example:**\n\n```\n# This is a commentCRLF\n\"header1\",\"header2\",\"header3\"CRLF\n# This is another commentCRLF\n\"value1\",\"value2\",\"value3\"CRLF\n\"# This is not a comment\",\\N,\"# also not a comment\"CRLF\n# This is a third commentCRLF\n```\n\n### ABNF Grammar\n\nThe following shows the specification in ABNF (Augmented Backus-Naur Form).\nIn case of ambiguity, the ABNF grammar is the authoritative source of the specification and takes precedence over the\ntext.\n\n```abnf\n;;; file\n\nfile            = *comment header *(comment / record)\n\n;;; header and record\n\nheader          = value *(comma value) linebreak\n\nrecord          = field *(comma field) linebreak\n\nfield           = null / value\n\nvalue           = DQUOTE *(textdata / 2DQUOTE) DQUOTE\n\ntextdata        = %x00-21 / %x23-7F / UTF8-data\n                    ; all characters except quotation mark\n\n;;; comment\n\ncomment         = 
hash *commentdata linebreak\n\ncommentdata     = %x00-09 / %x0B-0C / %x0E-7F / UTF8-data\n                    ; all characters except LF, CR\n\n;;; Common rules\n\nnull            = %x5C.4E       ; \\N\n\ncomma           = %x2C          ; ,\n\nhash            = %x23          ; #\n\nlinebreak       = CR / LF / CRLF\n\n;;; Basic rules\n\nLF              = %x0A\n                    ; as per section B.1 of [RFC5234]\n\nCR              = %x0D\n                    ; as per section B.1 of [RFC5234]\n\nCRLF            = CR LF\n                    ; as per section B.1 of [RFC5234]\n\nDQUOTE          = %x22\n                    ; as per section B.1 of [RFC5234]\n\nUTF8-data       = UTF8-2 / UTF8-3 / UTF8-4\n                    ; as per section 4 of [RFC3629]\n```\n\n## Considerations\n\nWhile TDIF aims to be as close to CSV as possible, there are some intentional differences. These differences are made\nto circumvent the ambiguities of CSV and address some very often used features that sometimes lead to interchange\nproblems and unexpected results (see [Rationale section](#rationale)).\n\n- Specify these features (mentioned but not specified in RFC 4180-bis)\n    - Explicit null values\n    - Comments\n- Make the format unambiguous\n    - Specify the file encoding\n    - Make the header record mandatory and its fields unique\n    - Require the same number of fields in each record\n    - Allow only one way to separate fields\n    - Specify the difference between null values and empty fields\n\nThese differences come with a few consequences:\n\n- Empty lines are no longer necessary and thus not allowed\n- Empty fields (`,,`) are no longer necessary/meaningful and thus not allowed\n- Fields are always enclosed in double quotation marks\n\n### Null values\n\nTDIF defines `\\N` for null values. This decision was made because other mechanisms, such as an empty unquoted field\n(`,,`), are ambiguous. 
This is especially true for files that contain only one field per record.\n\nIn the following example, it's not clear if the file contains a null value in the third line.\n\n```\nheader1CRLF\nvalue1CRLF\n```\n\nIn TDIF this is unambiguous:\n\n```\n\"header1\"CRLF\n\"value1\"CRLF\n\\NCRLF\n```\n\nThis approach is similar to that of the PostgreSQL database, which uses `\\N` by default to represent null values in the\ntext format of its `COPY` command.\n\n### Comments\n\nComments are not mentioned in RFC 4180. However, they are a common feature in CSV implementations. They are used to\nprovide additional information about the file like the author, the creation date, the source, usage instructions, etc.\nThe TDIF format specifies comments to make it easier to embed additional information in the file. In RFC 4180-bis,\ncomments are mentioned as a possible extension.\n\n### Always enclose fields in double quotation marks\n\nIn CSV, fields need to be enclosed in double quotation marks only if they contain special characters such as a field\nseparator, a linebreak or a quotation mark.\nIn TDIF, this list would have to be extended by the hash character (`#`) to support comments.\n\nStill, there are situations where ambiguity arises. For example:\n\n**Example 1**: Empty field in the last record of a file that contains only one field per record:\n\nIn the following example, it's not clear if the file contains an empty value in the third line.\n\n```\nheader1CRLF\nvalue1CRLF\n```\n\nIn TDIF this is unambiguous:\n\n```\n\"header1\"CRLF\n\"value1\"CRLF\n\"\"CRLF\n```\n\nRFC 4180-bis mentions this ambiguity and requires an end-of-line character even after the last field of the last\nrecord. But which application uses the new and which the old rule?\n\n**Example 2**: Desired whitespace:\n\n```\nheader1,_header2CRLF\n_foo,bar_\n```\n\nUnderscores are used to represent whitespace to prevent the Markdown editor/renderer from removing it.\n\nWhich whitespace characters are intended and which are not? This is not clear. 
The following example is unambiguous.\n\n```\n\"header1\",\" header2\"CRLF\n\" foo\",\"bar \"CRLF\n```\n\nFor the sake of simplicity and clarity, TDIF requires that all fields are enclosed in double quotation marks. This makes\nthe format unambiguous and easy to read and write.\n\n## Request for Comments\n\nThis document is currently in draft form. Comments and suggestions are welcome. Please start\na [discussion](https://github.com/osiegmar/tabular-data-interchange-format/discussions) or create a pull request.\n\n## Implementation\n\nSee [implementations](implementations.md).\n\n## License\n\nThis document is licensed under the [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](LICENSE) license.\n\nThe reference implementations in this repository are licensed under the [MIT](reference-impl/LICENSE) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fosiegmar%2Ftabular-data-interchange-format","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fosiegmar%2Ftabular-data-interchange-format","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fosiegmar%2Ftabular-data-interchange-format/lists"}