{"id":21620771,"url":"https://github.com/miku/makta","last_synced_at":"2025-08-10T15:15:58.029Z","repository":{"id":64307665,"uuid":"412089136","full_name":"miku/makta","owner":"miku","description":"Create an sqlite3 database from tabular data (2-TSV).","archived":false,"fork":false,"pushed_at":"2022-11-11T00:14:52.000Z","size":2766,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-11T10:06:22.074Z","etag":null,"topics":["database","sqlite","sqlite3","tsv"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-30T14:05:04.000Z","updated_at":"2024-07-01T20:59:12.000Z","dependencies_parsed_at":"2023-01-15T11:00:44.342Z","dependency_job_id":null,"html_url":"https://github.com/miku/makta","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/miku/makta","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fmakta","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fmakta/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fmakta/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fmakta/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miku","download_url":"https://codeload.github.com/miku/makta/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fmakta/sbom","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":269740472,"owners_count":24467781,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","sqlite","sqlite3","tsv"],"created_at":"2024-11-24T23:12:46.177Z","updated_at":"2025-08-10T15:15:57.974Z","avatar_url":"https://github.com/miku.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MAKTA\n\n\u003e **mak**e a database from **ta**bular data\n\n[![DOI](https://zenodo.org/badge/412089136.svg)](https://zenodo.org/badge/latestdoi/412089136)\n\n![](static/table.jpg)\n\nTurn [tabular data](https://en.wikipedia.org/wiki/Tab-separated_values) into a\nlookup table using [sqlite3](https://sqlite.org/). This is a working PROTOTYPE\nwith limitations, e.g. 
no customizations, the table definition is fixed, etc.\n\n\u003e CREATE TABLE IF NOT EXISTS map (k TEXT, v TEXT)\n\nAs a performance data point, an example dataset with 1B+ rows can be inserted\nand indexed in less than two hours (on a [recent\nCPU](https://ark.intel.com/content/www/us/en/ark/products/122589/intel-core-i7-8550u-processor-8m-cache-up-to-4-00-ghz.html)\nand an [nvme](https://en.wikipedia.org/wiki/NVM_Express) drive; database file\nsize: 400G).\n\n![](static/443238.gif)\n\n## Installation\n\n\u003e [https://github.com/miku/makta/releases](https://github.com/miku/makta/releases)\n\n```sh\n$ go install github.com/miku/makta/cmd/makta@latest\n```\n\n## How it works\n\nData is chopped up into smaller chunks (defaults to about 64MB) and imported with\nthe `.import` [command](https://www.sqlite.org/cli.html). Indexes are created\nonly after all data has been imported.\n\n## Example\n\n```sh\n$ cat fixtures/sample-xs.tsv | column -t\n10.1001/10-v4n2-hsf10003                    10.1177/003335490912400218\n10.1001/10-v4n2-hsf10003                    10.1097/01.bcr.0000155527.76205.a2\n10.1001/amaguidesnewsletters.1996.novdec01  10.1056/nejm199312303292707\n10.1001/amaguidesnewsletters.1996.novdec01  10.1016/s0363-5023(05)80265-5\n10.1001/amaguidesnewsletters.1996.novdec01  10.1001/jama.1994.03510440069036\n10.1001/amaguidesnewsletters.1997.julaug01  10.1097/00007632-199612150-00003\n10.1001/amaguidesnewsletters.1997.mayjun01  10.1164/ajrccm/147.4.1056\n10.1001/amaguidesnewsletters.1997.mayjun01  10.1136/thx.38.10.760\n10.1001/amaguidesnewsletters.1997.mayjun01  10.1056/nejm199507133330207\n10.1001/amaguidesnewsletters.1997.mayjun01  10.1378/chest.88.3.376\n\n$ makta -o xs.db \u003c fixtures/sample-xs.tsv\n2021/10/04 16:13:06 [ok] initialized database · xs.db\n2021/10/04 16:13:06 [io] written 679B · 361.3K/s\n2021/10/04 16:13:06 [ok] 1/2 created index · xs.db\n2021/10/04 16:13:06 [ok] 2/2 created index · xs.db\n\n$ sqlite3 xs.db 'select * from 
map'\n10.1001/10-v4n2-hsf10003|10.1177/003335490912400218\n10.1001/10-v4n2-hsf10003|10.1097/01.bcr.0000155527.76205.a2\n10.1001/amaguidesnewsletters.1996.novdec01|10.1056/nejm199312303292707\n10.1001/amaguidesnewsletters.1996.novdec01|10.1016/s0363-5023(05)80265-5\n10.1001/amaguidesnewsletters.1996.novdec01|10.1001/jama.1994.03510440069036\n10.1001/amaguidesnewsletters.1997.julaug01|10.1097/00007632-199612150-00003\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1164/ajrccm/147.4.1056\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1136/thx.38.10.760\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1056/nejm199507133330207\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1378/chest.88.3.376\n\n$ sqlite3 xs.db 'select * from map where k = \"10.1001/amaguidesnewsletters.1997.mayjun01\" '\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1164/ajrccm/147.4.1056\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1136/thx.38.10.760\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1056/nejm199507133330207\n10.1001/amaguidesnewsletters.1997.mayjun01|10.1378/chest.88.3.376\n```\n\n## Motivation\n\n\u003e SQLite is likely used more than all other database engines combined. Billions\n\u003e and billions of copies of SQLite exist in the wild. -- [https://www.sqlite.org/mostdeployed.html](https://www.sqlite.org/mostdeployed.html)\n\nSometimes, programs need lookup tables to map values between two domains. A\n[dictionary](https://xlinux.nist.gov/dads/HTML/dictionary.html) is a perfect\ndata structure as long as the data fits in memory. For larger sets (hundreds of\nmillions of entries), a dictionary may not work.\n\nThe *makta* tool currently takes a two-column TSV and turns it into an sqlite3\ndatabase, which you can query in your program. 
Depending on a couple of\nfactors, you may be able to query the lookup database with about 1-50K\nqueries per second.\n\nFinally, sqlite3 is just an awesome database and [recommended storage\nformat](https://www.sqlite.org/locrsf.html).\n\n## Usage\n\n```sh\n$ makta -h\nUsage of makta:\n  -B int\n        buffer size (default 67108864)\n  -C int\n        sqlite3 cache size, needs memory = C x page size (default 1000000)\n  -I int\n        index mode: 0=none, 1=k, 2=v, 3=kv (default 3)\n  -o string\n        output filename (default \"data.db\")\n  -version\n        show version and exit\n```\n\n## Performance\n\n```sh\n$ wc -l fixtures/sample-10m.tsv\n10000000 fixtures/sample-10m.tsv\n\n$ stat --format \"%s\" fixtures/sample-10m.tsv\n548384897\n\n$ time makta \u003c fixtures/sample-10m.tsv\n2021/09/30 16:58:07 [ok] initialized database -- data.db\n2021/09/30 16:58:17 [io] written 523M · 56.6M/s\n2021/09/30 16:58:21 [ok] 1/2 created index -- data.db\n2021/09/30 16:58:34 [ok] 2/2 created index -- data.db\n\nreal    0m26.267s\nuser    0m24.122s\nsys     0m3.224s\n```\n\n* 10M rows stored, with indexed keys and values in 27s, 370370 rows/s\n\n## TODO\n\n* [ ] allow tab-importing to be done programmatically, for any number of columns\n* [x] a better name: mktabdb, mktabs, dbize - go with makta for now\n* [ ] could write a tool for *burst* queries, e.g. split data into N shards,\n      create N databases and distribute queries across files - e.g. `dbize db.json`\n      with the same repl, etc. -- if we've seen 300K inserts per db, we may see 0.X x CPU x 300K, maybe millions/s.\n\n## Design ideas\n\nA design that works with 50M rows per database, e.g. 20 files for 1B rows;\ngrouped under a single directory. 
Every interaction only involves the\ndirectory, not the individual files.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fmakta","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiku%2Fmakta","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fmakta/lists"}