{"id":19768675,"url":"https://github.com/siara-cc/sqlite_blaster","last_synced_at":"2025-04-30T17:30:41.838Z","repository":{"id":86959839,"uuid":"521862000","full_name":"siara-cc/sqlite_blaster","owner":"siara-cc","description":"Create huge Sqlite indexes at breakneck speeds","archived":false,"fork":false,"pushed_at":"2023-04-26T05:09:46.000Z","size":25179,"stargazers_count":175,"open_issues_count":1,"forks_count":5,"subscribers_count":4,"default_branch":"main","last_synced_at":"2023-11-07T15:51:02.208Z","etag":null,"topics":["document-store","embedded-database","kv-store","nosql","portable","sqlite"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/siara-cc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["siara-cc"]}},"created_at":"2022-08-06T06:02:32.000Z","updated_at":"2024-05-30T07:05:18.608Z","dependencies_parsed_at":"2023-07-19T23:20:50.331Z","dependency_job_id":null,"html_url":"https://github.com/siara-cc/sqlite_blaster","commit_stats":null,"previous_names":[],"tags_count":1,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siara-cc%2Fsqlite_blaster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siara-cc%2Fsqlite_blaster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siara-cc%2Fsqlite_blaster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siara-cc%2Fsqlite_blaster/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/siara-cc","download_url":"https://codeload.github.com/siara-cc/sqlite_blaster/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251751049,"owners_count":21637851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-store","embedded-database","kv-store","nosql","portable","sqlite"],"created_at":"2024-11-12T04:39:45.337Z","updated_at":"2025-04-30T17:30:36.809Z","avatar_url":"https://github.com/siara-cc.png","language":"C++","readme":"# Sqlite Index Blaster\n\n[![Codacy Badge](https://api.codacy.com/project/badge/Grade/6ab783c325cb4e199a01ff6280a38bd8)](https://app.codacy.com/gh/siara-cc/sqlite_blaster?utm_source=github.com\u0026utm_medium=referral\u0026utm_content=siara-cc/sqlite_blaster\u0026utm_campaign=Badge_Grade_Settings)\n[![C/C++ CI](https://github.com/siara-cc/sqlite_blaster/actions/workflows/c-cpp.yml/badge.svg)](https://github.com/siara-cc/sqlite_blaster/actions/workflows/c-cpp.yml)\n[![DOI](https://zenodo.org/badge/521862000.svg)](https://zenodo.org/badge/latestdoi/521862000)\n\nThis library provides an API for creating huge Sqlite indexes at breakneck speeds for millions of records, much faster than the official SQLite library, by leaving out crash recovery.\n\nThis repo exploits a [lesser-known feature of the Sqlite database file format](https://www.sqlite.org/withoutrowid.html) to store records as key-value pairs, documents or regular tuples.\n\n# Statement of need\n\nThere are a number of choices available for fast insertion of records, 
such as RocksDB, LMDB and MongoDB, but even they are slow due to the overhead of using logs or journals to provide durability.  This overhead is significant when indexing huge datasets.\n\nThis library was created for inserting/updating billions of entries to arrive at word/phrase frequencies for building dictionaries for the [Unishox](https://github.com/siara-cc/Unishox) project using publicly available texts and conversations.\n\nFurthermore, the other choices don't offer the same breadth of IDEs or querying capabilities as the most popular Sqlite data format.\n\n# Applications\n\n- Lightning-fast index creation for huge datasets\n- Fast database indexing for embedded systems\n- Fast data set creation and loading for Data Science and Machine Learning\n\n# Python wrapper\n\nSqlite Index Blaster is also available as a [Python wrapper](https://github.com/siara-cc/sqlite_blaster_python).\n\n# Performance\n\nThe performance of this repo was compared with the official SQLite library, LMDB and RocksDB under similar conditions of CPU, RAM and NVMe disk, and the results are shown below:\n\n![Performance](misc/performance.png?raw=true)\n\nRocksDB performs much better than the other choices and performs consistently for over a billion entries, but it is quite slow initially.\n\nThe chart data can be found [here](https://github.com/siara-cc/sqlite_blaster/blob/main/misc/SqliteBlasterPerformanceLineChart.xlsx?raw=true).\n\n# Building and running tests\n\nClone this repo and run `make` to build the executable `test_sqlite_blaster` for testing.  To run tests, invoke it with the `-t` parameter from a shell console.\n\n```sh\nmake\n./test_sqlite_blaster -t\n```\n\n# Getting started\n\nEssentially, the library provides two methods, `put()` and `get()`, for inserting and retrieving records.  
Shown below are examples of how this library can be used to create a key-value store, a document store, or a regular table.\n\nNote: A cache size of 40 KB is used in these examples, but in real life 32 MB or 64 MB would be ideal.  The higher this number, the better the performance.\n\n## Creating a Key-Value store\n\nIn this mode, a table is created with just 2 columns, `key` and `value`, as shown below:\n\n```c++\n#include \"sqlite_index_blaster.h\"\n#include \u003cstring\u003e\n\nint main() {\n\n    std::string col_names = \"key, value\";\n    sqib::sqlite_index_blaster sqib(2, 1, col_names, \"kv_index\", 4096, 40, \"kv_idx.db\");\n    std::string key = \"hello\";\n    std::string val = \"world\";\n    sqib.put_string(key, val);\n    sqib.close();\n    return 0;\n\n}\n```\n\nA file `kv_idx.db` is created and can be verified by opening it with the official `sqlite3` console program:\n\n```sh\nsqlite3 kv_idx.db \".dump\"\n```\n\nand the output would be:\n\n```sql\nPRAGMA foreign_keys=OFF;\nBEGIN TRANSACTION;\nCREATE TABLE kv_index (key, value, PRIMARY KEY (key)) WITHOUT ROWID;\nINSERT INTO kv_index VALUES('hello','world');\nCOMMIT;\n```\n\nTo retrieve the inserted values, use the `get` method as shown below:\n\n```c++\n#include \"sqlite_index_blaster.h\"\n#include \u003ciostream\u003e\n#include \u003cstring\u003e\n\nint main() {\n    std::string col_names = \"key, value\";\n    sqib::sqlite_index_blaster sqib(2, 1, col_names, \"kv_index\", 4096, 40, \"kv_idx.db\");\n    std::string key = \"hello\";\n    std::string val = \"world\";\n    sqib.put_string(key, val);\n    std::cout \u003c\u003c \"Value of hello is \" \u003c\u003c sqib.get_string(key, \"not_found\") \u003c\u003c std::endl;\n    sqib.close();\n    return 0;\n}\n```\n\n## Creating a Document store\n\nIn this mode, a table is created with just 2 columns, `key` and `doc`, as shown below:\n\n```c++\n#include \"sqlite_index_blaster.h\"\n#include \u003cstring\u003e\n\nstd::string json1 = \"{\\\"name\\\": \\\"Alice\\\", \\\"age\\\": 25, \\\"email\\\": 
\\\"alice@example.com\\\"}\";\nstd::string json2 = \"{\\\"name\\\": \\\"George\\\", \\\"age\\\": 32, \\\"email\\\": \\\"george@example.com\\\"}\";\n\nint main() {\n    std::string col_names = \"key, doc\";\n    sqib::sqlite_index_blaster sqib(2, 1, col_names, \"doc_index\", 4096, 40, \"doc_store.db\");\n    std::string pc = \"primary_contact\";\n    sqib.put_string(pc, json1);\n    std::string sc = \"secondary_contact\";\n    sqib.put_string(sc, json2);\n    sqib.close();\n    return 0;\n}\n```\n\nThe index is created as `doc_store.db` and the JSON values can be queried using the `sqlite3` console as shown below:\n\n```sql\nSELECT json_extract(doc, '$.email') AS email\nFROM doc_index\nWHERE key = 'primary_contact';\n```\n\n## Creating a regular table\n\nThis repo can be used to create regular tables with primary key(s) as shown below:\n\n```c++\n#include \u003ccmath\u003e\n#include \u003ccstdint\u003e\n#include \u003cstring\u003e\n\n#include \"sqlite_index_blaster.h\"\n\nconst uint8_t col_types[] = {SQLT_TYPE_TEXT, SQLT_TYPE_INT8, SQLT_TYPE_INT8, SQLT_TYPE_INT8, SQLT_TYPE_INT8, SQLT_TYPE_REAL};\n\nint main() {\n\n    std::string col_names = \"student_name, age, maths_marks, physics_marks, chemistry_marks, average_marks\";\n    sqib::sqlite_index_blaster sqib(6, 2, col_names, \"student_marks\", 4096, 40, \"student_marks.db\");\n\n    int8_t maths, physics, chemistry, age;\n    double average;\n    uint8_t rec_buf[500];\n    int rec_len;\n\n    age = 19; maths = 80; physics = 69; chemistry = 98;\n    average = round((maths + physics + chemistry) * 100 / 3) / 100;\n    const void *rec_values[] = {\"Robert\", \u0026age, \u0026maths, \u0026physics, \u0026chemistry, \u0026average};\n    rec_len = sqib.make_new_rec(rec_buf, 6, rec_values, NULL, col_types);\n    sqib.put(rec_buf, -rec_len, NULL, 0);\n\n    age = 20; maths = 82; physics = 99; chemistry = 83;\n    average = round((maths + physics + chemistry) * 100 / 3) / 100;\n    rec_values[0] = \"Barry\";\n    rec_len = sqib.make_new_rec(rec_buf, 6, 
rec_values, NULL, col_types);\n    sqib.put(rec_buf, -rec_len, NULL, 0);\n\n    age = 23; maths = 84; physics = 89; chemistry = 74;\n    average = round((maths + physics + chemistry) * 100 / 3) / 100;\n    rec_values[0] = \"Elizabeth\";\n    rec_len = sqib.make_new_rec(rec_buf, 6, rec_values, NULL, col_types);\n    sqib.put(rec_buf, -rec_len, NULL, 0);\n\n    return 0;\n}\n```\n\nThe index is created as `student_marks.db` and the data can be queried using the `sqlite3` console as shown below:\n\n```sh\nsqlite3 student_marks.db \"select * from student_marks\"\nBarry|20|82|99|83|88.0\nElizabeth|23|84|89|74|82.33\nRobert|19|80|69|98|82.33\n```\n\n## Constructor parameters of sqlite_index_blaster class\n\n1. `total_col_count` - Total column count in the index\n2. `pk_col_count` - Number of columns to use as the key.  These columns have to be positioned at the beginning\n3. `col_names` - Column names to create the table\n4. `tbl_name` - Table (clustered index) name\n5. `block_sz` - Page size (must be one of 512, 1024, 2048, 4096, 8192, 16384, 32768 or 65536)\n6. `cache_sz` - Size of the LRU cache in kilobytes. 32 MB or 64 MB would be ideal. Higher values lead to better performance\n7. `fname` - Name of the Sqlite database file\n\n# Console Utility for playing around\n\n`test_sqlite_blaster` also has a rudimentary ability to create, insert and query databases as shown below.  However, this is just for demonstration.\n\n```sh\n./test_sqlite_blaster -c movie.db 4096 movie_list 3 1 Film,Genre,Studio\n```\n\nTo insert records, use `-i` as shown below:\n\n```sh\n./test_sqlite_blaster -i movie.db 4096 3 1 \"Valentine's Day,Comedy,Warner Bros.\" \"Sex and the City,Comedy,Disney\" \"Midnight in Paris,Romance,Sony\"\n```\n\nThis inserts 3 records.  To retrieve inserted records, run:\n\n```sh\n./test_sqlite_blaster -r movie.db 4096 3 1 \"Valentine's Day\"\n```\nand the output would be:\n```\nValentine's Day,Comedy,Warner Bros.\n```\n\n# Limitations\n\n- No crash recovery. 
If the insertion process is interrupted, the database would be unusable.\n- The record length cannot change for an update. Updating with a smaller or larger record length is not implemented yet.\n- Deletes are not implemented yet.  This library is intended primarily for fast inserts.\n- Support for concurrent inserts is not implemented yet.\n- The regular ROWID table of Sqlite is not implemented.\n- Only the equivalent of memcmp is used to index records.  The resulting key order may not match the official SQLite library for non-ASCII character sets.\n- Key lengths are limited depending on the page size as shown in the table below.  This is just because the source code does not implement support for longer keys. However, this is considered sufficient for most practical purposes.\n\n  | **Page Size** | **Max Key Length** |\n  | ------------- | ------------------ |\n  | 512 | 35 |\n  | 1024 | 99 |\n  | 2048 | 227 |\n  | 4096 | 484 |\n  | 8192 | 998 |\n  | 16384 | 2026 |\n  | 32768 | 4082 |\n  | 65536 | 8194 |\n\n# Stability\n\nThis code has been tested with more than 200 million records, so it is expected to be quite stable, but bear in mind that it is so fast because there is no crash recovery.\n\nSo this repo is best suited for one-time inserts of large datasets.  It may be suitable for power-backed systems such as those hosted in the cloud, and battery-backed systems.\n\n# License\n\nSqlite Index Blaster and its command-line tools are dual-licensed under the MIT license and the AGPL-3.0.  Users may choose one of the above.\n\n- The MIT License\n- The GNU Affero General Public License v3 (AGPL-3.0)\n\n# License for AI bots\n\nThe license mentioned is only applicable for humans, and this work is NOT available for AI bots.\n\nAI has proven to be beneficial to humans, especially with the introduction of ChatGPT.  
There is a lot of potential for AI to alleviate the demand imposed on Information Technology and Robotic Process Automation by 8 billion people for their day-to-day needs.\n\nHowever, there are a lot of ethical issues, particularly affecting those humans who have been trying to help alleviate the demand from 8 billion people so far. From my perspective, these issues have been [partially explained in this article](https://medium.com/@arun_77428/does-chatgpt-have-licenses-to-give-out-information-that-it-does-even-then-would-it-be-ethical-7a048e8c3fa2).\n\nI am part of this community that has a lot of kind-hearted people who have been dedicating their work to open source without expecting much in return.  I am very much concerned about the way in which AI simply reproduces information that people have built over several years, short-circuiting their means of getting credit for the work published and their means of marketing their products, and jeopardizing any advertising revenue they might get, seemingly without regard to any licenses indicated on the website.\n\nI think the existing licenses have not taken indexing by AI bots into account, and until the licenses are amended, this work is unavailable for AI bots.\n\n# Support\n\nIf you face any problem, create an issue in this repository, or write to the author (Arundale Ramanathan) at arun@siara.cc.\n","funding_links":["https://github.com/sponsors/siara-cc"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiara-cc%2Fsqlite_blaster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsiara-cc%2Fsqlite_blaster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiara-cc%2Fsqlite_blaster/lists"}