{"id":22546511,"url":"https://github.com/deephaven/deephaven-csv","last_synced_at":"2025-04-06T15:13:19.102Z","repository":{"id":37964301,"uuid":"449827139","full_name":"deephaven/deephaven-csv","owner":"deephaven","description":"Deephaven CSV","archived":false,"fork":false,"pushed_at":"2024-12-17T12:27:38.000Z","size":517,"stargazers_count":58,"open_issues_count":10,"forks_count":12,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-30T14:09:59.174Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deephaven.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-19T19:26:09.000Z","updated_at":"2025-02-28T08:31:41.000Z","dependencies_parsed_at":"2023-10-14T17:07:27.868Z","dependency_job_id":"9149644a-21fd-4ff5-a9de-ff7e5ebec5e3","html_url":"https://github.com/deephaven/deephaven-csv","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deephaven%2Fdeephaven-csv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deephaven%2Fdeephaven-csv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deephaven%2Fdeephaven-csv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deephaven%2Fdeephaven-csv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deephaven","download_url":"https://codeload.gith
ub.com/deephaven/deephaven-csv/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247500469,"owners_count":20948880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-07T15:08:02.887Z","updated_at":"2025-04-06T15:13:19.077Z","avatar_url":"https://github.com/deephaven.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Deephaven High-Performance CSV Parser\n\n## Introduction\n\nThe Deephaven CSV Library is a high-performance, column-oriented, type inferencing CSV parser. It differs from other CSV\nlibraries in that it organizes data into columns rather than rows, which allows for more efficient storage and\nretrieval. It also can dynamically infer the types of those columns based on the input, so the caller is not required to\nspecify the column types beforehand. Finally, it provides a way for the caller to specify the underlying data structures\nused for columnar storage. This allows the library to store its data directly in the caller's preferred data structure,\nwithout the inefficiency of going through intermediate temporary objects.\n\nThe Deephaven CSV Library is agnostic about what data sink you use, and it works equally well with Java arrays, your own\ncustom column type, or perhaps even streaming to a file. 
But along with this flexibility comes extra programming effort\non the part of the implementor: instead of telling the library what column data structures to use, the caller provides a\n\"factory\" capable of constructing any requested column type, and the library then dynamically decides which ones it\nneeds as it parses the input data. While it is tempting to just use ArrayList or some other catch-all collection, this\nis not as efficient as type-specific collectors, and makes a large impact on performance as data sizes increase.\nInstead, it is common practice in high-performance libraries to provide multiple, very similar but distinct\nimplementations, one for each primitive type. For example, your high-performance application might have\nYourCharColumnType, YourIntColumnType, YourDoubleColumnType, and the like. Unfortunately this translates into a certain\namount of tedium for the implementor, who needs to provide implementations for each type and code to move data from the\nCSV library to them.\n\nWith this guide we hope to make it clear what the caller needs to implement, and also to provide a reference\nimplementation for people to use as a starting point.\n\n## Using the Reference Implementation\n\nTo help you get started, the library provides a \"sink factory\" that uses Java arrays for the underlying column\nrepresentation. This version is best suited for simple examples and for learning how to use the library. Developers of\nproduction applications will likely want to define their own column representations and to create the sink factory that\nsupplies them. The documentation in [ADVANCED.md](ADVANCED.md) describes how to do this. 
For now, we show how to process\ndata using the builtin sink factory for arrays:\n\n```java\nfinal InputStream inputStream = ...;\nfinal CsvSpecs specs = CsvSpecs.csv();\nfinal CsvReader.Result result = CsvReader.read(specs, inputStream, SinkFactory.arrays());\nfinal long numRows = result.numRows();\nfor (CsvReader.ResultColumn col : result) {\n    switch (col.dataType()) {\n        case BOOLEAN_AS_BYTE: {\n            byte[] data = (byte[]) col.data();\n            // Process this boolean-as-byte column.\n            // Be sure to use numRows rather than data.length, because\n            // the underlying array might have excess capacity.\n            process(data, numRows);\n            break;\n        }\n        case SHORT: {\n            short[] data = (short[]) col.data();\n            // Process this short column.\n            process(data, numRows);\n            break;\n        }\n        // etc...\n    }\n}\n```\n\nIf your application uses reserved null sentinel values, there is an overload of SinkFactory.arrays() that allows you to\nspecify those values.\n\n\n## Using\n\nThis project produces two JARs:\n\n1. `deephaven-csv`: the primary dependency\n2. 
(optional, but recommended) `deephaven-csv-fast-double-parser`: a fast double parser\n\n### Gradle\n\nTo depend on Deephaven CSV from Gradle, add the following dependency(s) to your build.gradle file:\n\n```groovy\nimplementation 'io.deephaven:deephaven-csv:0.15.0'\n\n// Optional dependency for faster double parsing\n// runtimeOnly 'io.deephaven:deephaven-csv-fast-double-parser:0.15.0'\n```\n\n### Maven\n\nTo depend on Deephaven CSV from Maven, add the following dependency(s) to your pom.xml file:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eio.deephaven\u003c/groupId\u003e\n    \u003cartifactId\u003edeephaven-csv\u003c/artifactId\u003e\n    \u003cversion\u003e0.15.0\u003c/version\u003e\n\u003c/dependency\u003e\n\n\u003c!-- Optional dependency for faster double parsing --\u003e\n\u003c!--\u003cdependency\u003e--\u003e\n\u003c!--    \u003cgroupId\u003eio.deephaven\u003c/groupId\u003e--\u003e\n\u003c!--    \u003cartifactId\u003edeephaven-csv-fast-double-parser\u003c/artifactId\u003e--\u003e\n\u003c!--    \u003cversion\u003e0.15.0\u003c/version\u003e--\u003e\n\u003c!--    \u003cscope\u003eruntime\u003c/scope\u003e--\u003e\n\u003c!--\u003c/dependency\u003e--\u003e\n```\n\n## Testing\n\nTo run the main tests:\n\n```shell\n./gradlew check\n```\n\n## Building\n\n```shell\n./gradlew build\n```\n\n## Code style\n\n[Spotless](https://github.com/diffplug/spotless/tree/main/plugin-gradle) is used for code formatting.\n\nTo auto-format your code, you can run:\n```shell\n./gradlew spotlessApply\n```\n\n## Local development\n\nIf you are doing local development and want to consume `deephaven-csv` changes in other components, you can publish to maven local:\n\n```shell\n./gradlew publishToMavenLocal\n```\n\n## Benchmarks\n\nTo run all of the [JMH](https://github.com/openjdk/jmh) benchmarks locally, you can run:\n\n```shell\n./gradlew jmh\n```\n\nThis will produce a textual output to the screen, as well as machine-readable results at 
`build/results/jmh/results.json`.\n\nTo run specific JMH benchmarks, you can run:\n\n```shell\n./gradlew jmh -Pjmh.includes=\"\u003cregex\u003e\"\n```\n\nIf you prefer, you can run the benchmarks directly via the JMH jar:\n\n```shell\n./gradlew jmhJar\n```\n\n```shell\njava -jar build/libs/deephaven-csv-0.16.0-SNAPSHOT-jmh.jar -prof gc -rf JSON\n```\n\n```shell\njava -jar build/libs/deephaven-csv-0.16.0-SNAPSHOT-jmh.jar -prof gc -rf JSON \u003cregex\u003e\n```\n\nThe JMH jar is the preferred way to run official benchmarks, and provides a common bytecode for sharing the benchmarks\namong multiple environments.\n\n[JMH Visualizer](https://github.com/jzillmann/jmh-visualizer) is a convenient tool for visualizing JMH results.\n\n## Benchmark Tests\n\nThe benchmarks have tests to ensure that the benchmark implementations are producing the correct results.\nTo run the benchmark tests, run:\n\n```shell\n./gradlew jmhTest\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeephaven%2Fdeephaven-csv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeephaven%2Fdeephaven-csv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeephaven%2Fdeephaven-csv/lists"}