{"id":13400477,"url":"https://github.com/multiprocessio/dsq","last_synced_at":"2025-05-14T15:05:31.000Z","repository":{"id":36988529,"uuid":"446613820","full_name":"multiprocessio/dsq","owner":"multiprocessio","description":"Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.","archived":false,"fork":false,"pushed_at":"2023-09-30T14:49:59.000Z","size":27103,"stargazers_count":3801,"open_issues_count":22,"forks_count":163,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-04-11T20:44:28.691Z","etag":null,"topics":["cli","csv","excel","golang","json","openoffice-calc","parquet","sql","tsv"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/multiprocessio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-01-10T23:26:55.000Z","updated_at":"2025-04-07T08:40:39.000Z","dependencies_parsed_at":"2022-07-08T18:44:39.804Z","dependency_job_id":"252ef701-acef-4771-b270-4ecb10faa0e8","html_url":"https://github.com/multiprocessio/dsq","commit_stats":{"total_commits":95,"total_committers":14,"mean_commits":6.785714285714286,"dds":"0.17894736842105263","last_synced_commit":"7095c77c28cf7a9ffb0492ed5799b03b68cf06e4"},"previous_names":[],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multiprocessio%2Fdsq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multiprocessio%2Fdsq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multiprocessio%2Fdsq
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multiprocessio%2Fdsq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/multiprocessio","download_url":"https://codeload.github.com/multiprocessio/dsq/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254168753,"owners_count":22026206,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","csv","excel","golang","json","openoffice-calc","parquet","sql","tsv"],"created_at":"2024-07-30T19:00:52.454Z","updated_at":"2025-05-14T15:05:26.902Z","avatar_url":"https://github.com/multiprocessio.png","language":"Go","readme":"# Not under active development\n\nWhile development may continue in the future with a different\narchitecture, for the moment you should probably instead use\n[DuckDB](https://github.com/duckdb/duckdb),\n[ClickHouse-local](https://clickhouse.com/docs/en/operations/utilities/clickhouse-local),\nor [GlareDB (based on\nDataFusion)](https://github.com/GlareDB/glaredb).\n\nThese are built on stronger analytics foundations than projects like\ndsq based on SQLite. 
For example, column-oriented storage and\nvectorized execution, let alone JIT-compiled expression evaluation,\nare possible with these other projects.\n\n[More here](https://twitter.com/eatonphil/status/1708130091425784146).\n\n# Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more\n\nSince Github doesn't provide a great way for you to learn about new\nreleases and features, don't just star the repo, join the [mailing\nlist](https://docs.google.com/forms/d/e/1FAIpQLSfYF3AZivacRrQWanC-skd0iI23ermwPd17T_64Xc4etoL_Tw/viewform).\n\n- [About](#about)\n- [Install](#install)\n    - [macOS Homebrew](#macos-homebrew)\n    - [Binaries on macOS, Linux, WSL](#binaries-on-macos-linux-wsl)\n    - [Binaries on Windows (not WSL)](#binaries-on-windows-not-wsl)\n    - [Build and install from source](#build-and-install-from-source)\n- [Usage](#usage)\n    - [Pretty print](#pretty-print)\n    - [Piping data to dsq](#piping-data-to-dsq)\n    - [Multiple files and joins](#multiple-files-and-joins)\n    - [SQL query from file](#sql-query-from-file)\n    - [Transforming data to JSON without querying](#transforming-data-to-json-without-querying)\n    - [Array of objects nested within an object](#array-of-objects-nested-within-an-object)\n        - [Multiple Excel sheets](#multiple-excel-sheets)\n        - [Limitation: nested arrays](#limitation-nested-arrays)\n    - [Nested object values](#nested-object-values)\n        - [Caveat: PowerShell, CMD.exe](#caveat-powershell-cmdexe)\n        - [Nested objects explained](#nested-objects-explained)\n        - [Limitation: whole object retrieval](#limitation-whole-object-retrieval)\n    - [Nested arrays](#nested-arrays)\n        - [JSON operators](#json-operators)\n    - [REGEXP](#regexp)\n    - [Standard Library](#standard-library)\n    - [Output column order](#output-column-order)\n    - [Dumping inferred schema](#dumping-inferred-schema)\n    - [Caching](#caching)\n    - [Interactive REPL](#interactive-repl)\n    
- [Converting numbers in CSV and TSV files](#converting-numbers-in-csv-and-tsv-files)\n- [Supported Data Types](#supported-data-types)\n- [Engine](#engine)\n- [Comparisons](#comparisons)\n- [Benchmark](#benchmark)\n    - [Notes](#notes)\n- [Third-party integrations](#third-party-integrations)\n- [Community](#community)\n- [How can I help?](#how-can-i-help)\n- [License](#license)\n\n## About\n\nThis is a CLI companion to\n[DataStation](https://github.com/multiprocessio/datastation) (a GUI)\nfor running SQL queries against data files. So if you want the GUI\nversion of this, check out DataStation.\n\n## Install\n\nBinaries for amd64 (x86_64) are provided for each release.\n\n### macOS Homebrew\n\n`dsq` is available on macOS Homebrew:\n\n```bash\n$ brew install dsq\n```\n\n### Binaries on macOS, Linux, WSL\n\nOn macOS, Linux, and WSL you can run the following:\n\n```bash\n$ VERSION=\"v0.23.0\"\n$ FILE=\"dsq-$(uname -s | awk '{ print tolower($0) }')-x64-$VERSION.zip\"\n$ curl -LO \"https://github.com/multiprocessio/dsq/releases/download/$VERSION/$FILE\"\n$ unzip $FILE\n$ sudo mv ./dsq /usr/local/bin/dsq\n```\n\nOr install manually from the [releases\npage](https://github.com/multiprocessio/dsq/releases), unzip and add\n`dsq` to your `$PATH`.\n\n### Binaries on Windows (not WSL)\n\nDownload the [latest Windows\nrelease](https://github.com/multiprocessio/dsq/releases), unzip it,\nand add `dsq` to your `$PATH`.\n\n### Build and install from source\n\nIf you are on another platform or architecture or want to grab the\nlatest release, you can do so with Go 1.18+:\n\n```bash\n$ go install github.com/multiprocessio/dsq@latest\n```\n\n`dsq` will likely work on other platforms that Go is ported to such as\nAARCH64 and OpenBSD, but tests and builds are only run against x86_64\nWindows/Linux/macOS.\n\n## Usage\n\nYou can either pipe data to `dsq` or you can pass a file name to\nit. 
NOTE: piping data doesn't work on Windows.\n\nIf you are passing a file, it must have the usual extension for its\ncontent type.\n\nFor example:\n\n```bash\n$ dsq testdata.json \"SELECT * FROM {} WHERE x \u003e 10\"\n```\n\nOr:\n\n```bash\n$ dsq testdata.ndjson \"SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC\"\n```\n\n### Pretty print\n\nBy default `dsq` prints ugly JSON. This is the most efficient mode.\n\n```bash\n$ dsq testdata/userdata.parquet 'select count(*) from {}'\n[{\"count(*)\":1000}\n]\n```\n\nIf you want prettier JSON you can pipe `dsq` to `jq`.\n\n```bash\n$ dsq testdata/userdata.parquet 'select count(*) from {}' | jq\n[\n  {\n    \"count(*)\": 1000\n  }\n]\n```\n\nOr you can enable pretty printing with `-p` or `--pretty` in `dsq`\nwhich will display your results in an ASCII table.\n\n```bash\n$ dsq --pretty testdata/userdata.parquet 'select count(*) from {}'\n+----------+\n| count(*) |\n+----------+\n|     1000 |\n+----------+\n```\n\n### Piping data to dsq\n\nWhen piping data to `dsq` you need to set the `-s` flag and specify\nthe file extension or MIME type.\n\nFor example:\n\n```bash\n$ cat testdata.csv | dsq -s csv \"SELECT * FROM {} LIMIT 1\"\n```\n\nOr:\n\n```bash\n$ cat testdata.parquet | dsq -s parquet \"SELECT COUNT(1) FROM {}\"\n```\n\n### Multiple files and joins\n\nYou can pass multiple files to DSQ. As long as they are supported data\nfiles in a valid format, you can run SQL against all files as\ntables. 
Each table can be accessed by the string `{N}` where `N` is the\n0-based index of the file in the list of files passed on the\ncommandline.\n\nFor example this joins two datasets of differing origin types (CSV and\nJSON).\n\n```bash\n$ dsq testdata/join/users.csv testdata/join/ages.json \\\n  \"select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id\"\n[{\"age\":88,\"name\":\"Ted\"},\n{\"age\":56,\"name\":\"Marjory\"},\n{\"age\":33,\"name\":\"Micah\"}]\n```\n\nYou can also give file-table-names aliases since `dsq` uses standard\nSQL:\n\n```bash\n$ dsq testdata/join/users.csv testdata/join/ages.json \\\n  \"select u.name, a.age from {0} u join {1} a on u.id = a.id\"\n[{\"age\":88,\"name\":\"Ted\"},\n{\"age\":56,\"name\":\"Marjory\"},\n{\"age\":33,\"name\":\"Micah\"}]\n```\n\n### SQL query from file\n\nAs your query becomes more complex, it might be useful to store it in a file\nrather than specify it on the command line. To do so replace the query argument\nwith `--file` or `-f` and the path to the file. \n\n```bash\n$ dsq data1.csv data2.csv -f query.sql\n```\n\n### Transforming data to JSON without querying\n\nAs a shorthand for `dsq testdata.csv \"SELECT * FROM {}\"` to convert\nsupported file types to JSON you can skip the query and the converted\nJSON will be dumped to stdout.\n\nFor example:\n\n```bash\n$ dsq testdata.csv\n[{...some csv data...},{...some csv data...},...]\n```\n\n### Array of objects nested within an object\n\nDataStation and `dsq`'s SQL integration operates on an array of\nobjects. If your array of objects happens to be at the top-level, you\ndon't need to do anything. 
But if your array data is nested within an\nobject you can add a \"path\" parameter to the table reference.\n\nFor example if you have this data:\n\n```bash\n$ cat api-results.json\n{\n  \"data\": {\n    \"data\": [\n      {\"id\": 1, \"name\": \"Corah\"},\n      {\"id\": 3, \"name\": \"Minh\"}\n    ]\n  },\n  \"total\": 2\n}\n```\n\nYou need to tell `dsq` that the path to the array data is `\"data.data\"`:\n\n```bash\n$ dsq --pretty api-results.json 'SELECT * FROM {0, \"data.data\"} ORDER BY id DESC'\n+----+-------+\n| id | name  |\n+----+-------+\n|  3 | Minh  |\n|  1 | Corah |\n+----+-------+\n```\n\nYou can also use the shorthand `{\"path\"}` or `{'path'}` if you only have one table:\n\n```bash\n$ dsq --pretty api-results.json 'SELECT * FROM {\"data.data\"} ORDER BY id DESC'\n+----+-------+\n| id | name  |\n+----+-------+\n|  3 | Minh  |\n|  1 | Corah |\n+----+-------+\n```\n\nYou can use either single or double quotes for the path.\n\n#### Multiple Excel sheets\n\nExcel files with multiple sheets are stored as an object with key\nbeing the sheet name and value being the sheet data as an array of\nobjects.\n\nIf you have an Excel file with two sheets called `Sheet1` and `Sheet2`\nyou can run `dsq` on the second sheet by specifying the sheet name as\nthe path:\n\n```bash\n$ dsq data.xlsx 'SELECT COUNT(1) FROM {\"Sheet2\"}'\n```\n\n#### Limitation: nested arrays\n\nYou cannot specify a path through an array, only objects.\n\n### Nested object values\n\nIt's easiest to show an example. 
Let's say you have the following JSON file called `user_addresses.json`:\n\n```bash\n$ cat user_addresses.json\n[\n  {\"name\": \"Agarrah\", \"location\": {\"city\": \"Toronto\", \"address\": { \"number\": 1002 }}},\n  {\"name\": \"Minoara\", \"location\": {\"city\": \"Mexico City\", \"address\": { \"number\": 19 }}},\n  {\"name\": \"Fontoon\", \"location\": {\"city\": \"New London\", \"address\": { \"number\": 12 }}}\n]\n```\n\nYou can query the nested fields like so:\n\n```sql\n$ dsq user_addresses.json 'SELECT name, \"location.city\" FROM {}'\n```\n\nAnd if you need to disambiguate the table:\n\n```sql\n$ dsq user_addresses.json 'SELECT name, {}.\"location.city\" FROM {}'\n```\n\n#### Caveat: PowerShell, CMD.exe\n\nOn PowerShell and CMD.exe you must escape inner double quotes with backslashes:\n\n```powershell\n\u003e dsq user_addresses.json 'select name, \\\"location.city\\\" from {}'\n[{\"location.city\":\"Toronto\",\"name\":\"Agarrah\"},\n{\"location.city\":\"Mexico City\",\"name\":\"Minoara\"},\n{\"location.city\":\"New London\",\"name\":\"Fontoon\"}]\n```\n\n#### Nested objects explained\n\nNested objects are collapsed and their new column name becomes the\nJSON path to the value connected by `.`. Actual dots in the path must\nbe escaped with a backslash. Since `.` is a special character in SQL\nyou must quote the whole new column name.\n\n#### Limitation: whole object retrieval\n\nYou cannot query whole objects, you must ask for a specific path that\nresults in a scalar value.\n\nFor example in the `user_addresses.json` example above you CANNOT do this:\n\n```sql\n$ dsq user_addresses.json 'SELECT name, {}.\"location\" FROM {}'\n```\n\nBecause `location` is not a scalar value. It is an object.\n\n### Nested arrays\n\nNested arrays are converted to a JSON string when stored in\nSQLite. 
Since SQLite supports querying JSON strings you can access\nthat data as structured data even though it is a string.\n\nSo if you have data like this in `fields.json`:\n\n```json\n[\n  {\"field1\": [1]},\n  {\"field1\": [2]}\n]\n```\n\nYou can request the entire field:\n\n```\n$ dsq fields.json \"SELECT field1 FROM {}\" | jq\n[\n  {\n    \"field1\": \"[1]\"\n  },\n  {\n    \"field1\": \"[2]\"\n  }\n]\n```\n\n#### JSON operators\n\nYou can get the first value in the array using SQL JSON operators.\n\n```\n$ dsq fields.json \"SELECT field1-\u003e0 FROM {}\" | jq\n[\n  {\n    \"field1-\u003e0\": \"1\"\n  },\n  {\n    \"field1-\u003e0\": \"2\"\n  }\n]\n```\n\n### REGEXP\n\nSince DataStation and `dsq` are built on SQLite, you can filter using\n`x REGEXP 'y'` where `x` is some column or value and `y` is a REGEXP\nstring. SQLite doesn't ship with a regexp implementation. DataStation and\n`dsq` use Go's regexp implementation, which is more limited than PCRE2\nbecause Go support for PCRE2 is not yet very mature.\n\n```sql\n$ dsq user_addresses.json \"SELECT * FROM {} WHERE name REGEXP 'A.*'\"\n[{\"location.address.number\":1002,\"location.city\":\"Toronto\",\"name\":\"Agarrah\"}]\n```\n\n### Standard Library\n\ndsq registers\n[go-sqlite3-stdlib](https://github.com/multiprocessio/go-sqlite3-stdlib)\nso you get access to numerous statistics, url, math, string, and\nregexp functions that aren't part of the SQLite base.\n\nView that project's docs for all available extended functions.\n\n### Output column order\n\nWhen emitting JSON (i.e. without the `--pretty` flag) keys within an\nobject are unordered.\n\nIf order is important to you, you can filter with `jq`: `dsq x.csv\n'SELECT a, b FROM {}' | jq --sort-keys`.\n\nWith the `--pretty` flag, column order is purely alphabetical. 
It is\nnot possible at the moment for the order to depend on the SQL query\norder.\n\n### Dumping inferred schema\n\nFor any supported file you can dump the inferred schema rather than\ndumping the data or running a SQL query. Set the `--schema` flag to do\nthis.\n\nThe inferred schema is very simple, only JSON types are supported. If\nthe underlying format (like Parquet) supports finer-grained data types\n(like int64) this will not show up in the inferred schema. It will\nshow up just as `number`.\n\nFor example:\n\n```\n$ dsq testdata/avro/test_data.avro --schema --pretty\nArray of\n  Object of\n    birthdate of\n      string\n    cc of\n      Varied of\n        Object of\n          long of\n            number or\n        Unknown\n    comments of\n      string\n    country of\n      string\n    email of\n      string\n    first_name of\n      string\n    gender of\n      string\n    id of\n      number\n    ip_address of\n      string\n    last_name of\n      string\n    registration_dttm of\n      string\n    salary of\n      Varied of\n        Object of\n          double of\n            number or\n        Unknown\n    title of\n      string\n```\n\nYou can print this as a structured JSON string by omitting the\n`--pretty` flag when setting the `--schema` flag.\n\n### Caching\n\nSometimes you want to do some exploration on a dataset that isn't\nchanging frequently. By turning on the `--cache` or `-C` flag\nDataStation will store the imported data on disk and not delete it\nwhen the run is over.\n\nWith caching on, DataStation calculates a SHA1 sum of all the files you\nspecified. If the sum ever changes then it will reimport all the\nfiles. Otherwise when you run additional queries with the cache\nflag on it will reuse that existing database and not reimport the files.\n\nSince without caching on DataStation uses an in-memory database, the\ninitial query with caching on may take slightly longer than with\ncaching off. 
Subsequent queries will be substantially faster though\n(for large datasets).\n\nFor example, in the first run with caching on, this query might take 30s:\n\n```\n$ dsq some-large-file.json --cache 'SELECT COUNT(1) FROM {}'\n```\n\nBut when you run another query it might only take 1s.\n\n```\n$ dsq some-large-file.json --cache 'SELECT SUM(age) FROM {}'\n```\n\nNot because we cache any result but because we cache importing the\nfile into SQLite.\n\nSo even if you change the query, as long as the file doesn't change,\nthe cache is effective.\n\nTo make this permanent you can export `DSQ_CACHE=true` in your environment.\n\n### Interactive REPL\n\nUse the `-i` or `--interactive` flag to enter an interactive REPL\nwhere you can run multiple SQL queries.\n\n```\n$ dsq some-large-file.json -i\ndsq\u003e SELECT COUNT(1) FROM {};\n+----------+\n| COUNT(1) |\n+----------+\n|     1000 |\n+----------+\n(1 row)\ndsq\u003e SELECT * FROM {} WHERE NAME = 'Kevin';\n(0 rows)\n```\n\n### Converting numbers in CSV and TSV files\n\nCSV and TSV files do not allow specifying the type of the individual\nvalues contained in them. All values are treated as strings by default.\n\nThis can lead to unexpected results in queries. Consider the following\nexample:\n\n```\n$ cat scores.csv\nname,score\nFritz,90\nRainer,95.2\nFountainer,100\n\n$ dsq scores.csv \"SELECT * FROM {} ORDER BY score\"\n[{\"name\":\"Fountainer\",\"score\":\"100\"},\n{\"name\":\"Fritz\",\"score\":\"90\"},\n{\"name\":\"Rainer\",\"score\":\"95.2\"}]\n```\n\nNote how the `score` column contains numerical values only. Still,\nsorting by that column yields unexpected results because the values are\ntreated as strings, and sorted lexically. 
(You can tell that the\nindividual scores were imported as strings because they're quoted in the\nJSON result.)\n\nUse the `-n` or `--convert-numbers` flag to auto-detect and convert\nnumerical values (integers and floats) in imported files:\n\n```\n$ dsq ~/scores.csv --convert-numbers \"SELECT * FROM {} ORDER BY score\"\n[{\"name\":\"Fritz\",\"score\":90},\n{\"name\":\"Rainer\",\"score\":95.2},\n{\"name\":\"Fountainer\",\"score\":100}]\n```\n\nNote how the scores are imported as numbers now and how the records in\nthe result set are sorted by their numerical value. Also note that the\nindividual scores are no longer quoted in the JSON result.\n\nTo make this permanent you can export `DSQ_CONVERT_NUMBERS=true` in\nyour environment. Turning this on disables some optimizations.\n\n## Supported Data Types\n\n| Name                   | File Extension(s) | Mime Type                                        | Notes                                                                                                                                           |   |\n|------------------------|-------------------|--------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|---|\n| CSV                    | `csv`             | `text/csv`                                       |                                                                                                                                                 |   |\n| TSV                    | `tsv`, `tab`      | `text/tab-separated-values`                      |                                                                                                                                                 |   |\n| JSON                   | `json`            | `application/json`                               | Must be an array of objects or a [path to an array of 
objects](https://github.com/multiprocessio/dsq#array-of-objects-nested-within-an-object). |   |\n| Newline-delimited JSON | `ndjson`, `jsonl` | `application/jsonlines`                          |                                                                                                                                                 |   |\n| Concatenated JSON      | `cjson`           | `application/jsonconcat`                         |                                                                                                                                                 |   |\n| ORC                    | `orc`             | `orc`                                            |                                                                                                                                                 |   |\n| Parquet                | `parquet`         | `parquet`                                        |                                                                                                                                                 |   |\n| Avro                   | `avro`            | `application/avro`                               |                                                                                                                                                 |   |\n| YAML                   | `yaml`, `yml`     | `application/yaml`                               |                                                                                                                                                 |   |\n| Excel                  | `xlsx`, `xls`     | `application/vnd.ms-excel`                       | If you have multiple sheets, you must [specify a sheet path](https://github.com/multiprocessio/dsq#multiple-excel-sheets).                      
|   |\n| ODS                    | `ods`             | `application/vnd.oasis.opendocument.spreadsheet` | If you have multiple sheets, you must [specify a sheet path](https://github.com/multiprocessio/dsq#multiple-excel-sheets).                      |   |\n| Apache Error Logs      | NA                | `text/apache2error`                              | Currently only works if being piped in.                                                                                                         |   |\n| Apache Access Logs     | NA                | `text/apache2access`                             | Currently only works if being piped in.                                                                                                         |   |\n| Nginx Access Logs      | NA                | `text/nginxaccess`                               | Currently only works if being piped in.                                                                                                         |   |\n| LogFmt Logs            | `logfmt`          | `text/logfmt`                                    |                                                                                                                                                 |   |\n\n## Engine\n\nUnder the hood dsq uses\n[DataStation](https://github.com/multiprocessio/datastation) as a\nlibrary and under that hood DataStation uses SQLite to power these\nkinds of SQL queries on arbitrary (structured) data.\n\n## Comparisons\n\n| Name | Link | Caching | Engine | Supported File Types | Binary Size |\n|-|---|-|-|------------------------------------------------------------------------|-|\n| dsq | Here | Yes | SQLite | CSV, TSV, a few variations of JSON, Parquet, Excel, ODS (OpenOffice Calc), ORC, Avro, YAML, Logs | 49M |\n| q | http://harelba.github.io/q/ | Yes | SQLite | CSV, TSV | 82M |\n| textql | https://github.com/dinedal/textql | No | SQLite | CSV, TSV | 7.3M |\n| octosql | https://github.com/cube2222/octosql | No 
| Custom engine | JSON, CSV, Excel, Parquet | 18M |\n| csvq | https://github.com/mithrandie/csvq | No | Custom engine | CSV | 15M |\n| sqlite-utils | https://github.com/simonw/sqlite-utils | No | SQLite | CSV, TSV | N/A, Not a single binary |\n| trdsql | https://github.com/noborus/trdsql | No | SQLite, MySQL or PostgreSQL | Few variations of JSON, TSV, LTSV, TBLN, CSV | 14M |\n| spyql | https://github.com/dcmoura/spyql | No | Custom engine | CSV, JSON, TEXT | N/A, Not a single binary |\n| duckdb | https://github.com/duckdb/duckdb | ? | Custom engine | CSV, Parquet | 35M |\n\nNot included:\n\n* clickhouse-local: fastest of anything listed here but so gigantic (over 2GB) that it can't reasonably be considered a good tool for any environment\n* sqlite3: requires multiple commands to ingest CSV, not great for one-liners\n* datafusion-cli: very fast (slower only than clickhouse-local) but requires multiple commands to ingest CSV, so not great for one-liners\n\n## Benchmark\n\nThis benchmark was run June 19, 2022. It is run on a [dedicated bare\nmetal instance on\nOVH](https://us.ovhcloud.com/bare-metal/rise/rise-1/) with:\n\n* 64 GB DDR4 ECC 2,133 MHz\n* 2x450 GB SSD NVMe in Soft RAID\n* Intel Xeon E3-1230v6 - 4c/8t - 3.5 GHz/3.9 GHz\n\nIt runs a `SELECT passenger_count, COUNT(*), AVG(total_amount) FROM\ntaxi.csv GROUP BY passenger_count` query against the well-known NYC\nYellow Taxi Trip Dataset. Specifically, the CSV file from April 2021\nis used. It's a 200MB CSV file with ~2 million rows, 18 columns, and\nmostly numerical values.\n\nThe script is [here](./scripts/benchmark.sh). 
It is an adaptation of\nthe [benchmark that the octosql devs\nrun](https://github.com/cube2222/octosql#Benchmarks).\n\n| Program   | Version             |       Mean [s] | Min [s] | Max [s] |     Relative |\n|:----------|:--------------------|---------------:|--------:|--------:|-------------:|\n| dsq       | 0.20.1 (caching on) |  1.151 ± 0.010 |   1.131 |   1.159 |         1.00 |\n| duckdb    | 0.3.4               |  1.723 ± 0.023 |   1.708 |   1.757 |  1.50 ± 0.02 |\n| octosql   | 0.7.3               |  2.005 ± 0.008 |   1.991 |   2.015 |  1.74 ± 0.02 |\n| q         | 3.1.6 (caching on)  |  2.028 ± 0.010 |   2.021 |   2.055 |  1.76 ± 0.02 |\n| sqlite3 * | 3.36.0              |  4.204 ± 0.018 |   4.177 |   4.229 |  3.64 ± 0.04 |\n| trdsql    | 0.10.0              | 12.972 ± 0.225 |  12.554 |  13.392 | 11.27 ± 0.22 |\n| dsq       | 0.20.1 (default)    | 15.030 ± 0.086 |  14.895 |  15.149 | 13.06 ± 0.13 |\n| textql    | fca00ec             | 19.148 ± 0.183 |  18.865 |  19.500 | 16.63 ± 0.21 |\n| spyql     | 0.6.0               | 16.985 ± 0.105 |  16.854 |  17.161 | 14.75 ± 0.16 |\n| q         | 3.1.6 (default)     | 24.061 ± 0.095 |  23.954 |  24.220 | 20.90 ± 0.20 |\n\n\\* While dsq and q are built on top of sqlite3 there is not a builtin way in sqlite3 to cache ingested files without a bit of scripting\n\nNot included:\n* clickhouse-local: faster than any of these but over 2GB so not a reasonable general-purpose CLI\n* datafusion-cli: slower only than clickhouse-local but requires multiple commands to ingest CSV, can't do one-liners\n* sqlite-utils: takes minutes to finish\n\n### Notes\n\nOctoSQL, duckdb, and SpyQL implement their own SQL engines.\ndsq, q, trdsql, and textql copy data into SQLite and depend on the\nSQLite engine for query execution.\n\nTools that implement their own SQL engines can do better on 1)\ningestion and 2) queries that act on a subset of data (such as limited\ncolumns or limited rows). 
These tools implement ad-hoc subsets of SQL\nthat may be missing or differ from your favorite syntax. On the other\nhand, tools that depend on SQLite have the benefit of providing a\nwell-tested and well-documented SQL engine. DuckDB is exceptional\nsince there is a dedicated company behind it.\n\ndsq also comes with numerous [useful\nfunctions](https://github.com/multiprocessio/go-sqlite3-stdlib)\n(e.g. best-effort date parsing, URL parsing/extraction, statistics\nfunctions, etc.) on top of [SQLite builtins](https://www.sqlite.org/lang_corefunc.html).\n\n## Third-party integrations\n\n* [ob-dsq](https://github.com/fritzgrabo/ob-dsq)\n\n## Community\n\n[Join us at #dsq on the Multiprocess Discord](https://discord.gg/9BRhAMhDa5).\n\n## How can I help?\n\nDownload dsq and use it! Report bugs on\n[Discord](https://discord.gg/f2wQBc4bXX).\n\nIf you're a developer with some Go experience looking to hack on open\nsource, check out\n[GOOD_FIRST_PROJECTS.md](https://github.com/multiprocessio/datastation/blob/main/GOOD_FIRST_PROJECTS.md)\nin the DataStation repo.\n\n## License\n\nThis software is licensed under an Apache 2.0 license.\n","funding_links":[],"categories":["Go","HarmonyOS","Using SQL","Command-line tools","Other_Big Data","cli","SQL"],"sub_categories":["Windows Manager","CLI","Resource Transfer and Download","About SQL"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultiprocessio%2Fdsq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmultiprocessio%2Fdsq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultiprocessio%2Fdsq/lists"}