{"id":15692500,"url":"https://github.com/capjamesg/jamesql","last_synced_at":"2025-04-05T15:10:10.253Z","repository":{"id":253833159,"uuid":"844591313","full_name":"capjamesg/jamesql","owner":"capjamesg","description":"An in-memory NoSQL database implemented in Python.","archived":false,"fork":false,"pushed_at":"2025-02-10T11:01:56.000Z","size":869,"stargazers_count":83,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-28T19:16:39.328Z","etag":null,"topics":["document-search","nosql","nosql-database","python","web-search"],"latest_commit_sha":null,"homepage":"https://jamesg.blog/2024/08/19/nosql-database-python/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/capjamesg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-19T15:11:40.000Z","updated_at":"2025-03-22T14:41:14.000Z","dependencies_parsed_at":"2025-02-23T02:15:25.033Z","dependency_job_id":null,"html_url":"https://github.com/capjamesg/jamesql","commit_stats":null,"previous_names":["capjamesg/jamesql","capjamesg/nosql"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capjamesg%2Fjamesql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capjamesg%2Fjamesql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capjamesg%2Fjamesql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capjamesg%2Fjamesql/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/capjamesg","download_url":"https://codeload.github.com/capjamesg/jamesql/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247353749,"owners_count":20925329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-search","nosql","nosql-database","python","web-search"],"created_at":"2024-10-03T18:34:21.772Z","updated_at":"2025-04-05T15:10:10.218Z","avatar_url":"https://github.com/capjamesg.png","language":"Python","readme":"[![version](https://badge.fury.io/py/jamesql.svg)](https://badge.fury.io/py/jamesql)\n[![license](https://img.shields.io/pypi/l/jamesql)](https://github.com/capjamesg/knowledge-graph-language/blob/main/LICENSE.md)\n[![python-version](https://img.shields.io/pypi/pyversions/jamesql)](https://badge.fury.io/py/jamesql)\n[![test workflow](https://github.com/capjamesg/jamesql/actions/workflows/test.yml/badge.svg)](https://github.com/capjamesg/jamesql/actions/workflows/test.yml)\n[![test workflow](https://github.com/capjamesg/jamesql/actions/workflows/windows.yml/badge.svg)](https://github.com/capjamesg/jamesql/actions/workflows/windows.yml)\n\n# JameSQL\n\nAn in-memory, NoSQL database implemented in Python, with support for building custom ranking algorithms.\n\nYou can run full text search queries on thousands of documents with multiple fields in \u003c 1ms.\n\n[Try a site search engine built with JameSQL](https://jamesg.blog/search-pages/)\n\nHere is an example of a search engine with a JameSQL 
back-end:\n\nhttps://github.com/user-attachments/assets/f1bf931d-6601-4fc8-b43c-d284853bce8f\n\n## Installation\n\nTo install this project, run:\n\n```\npip install jamesql\n```\n\n## Quickstart\n\nHere is a quickstart with a string-based query:\n\n```python\nfrom jamesql import JameSQL\n\nindex = JameSQL.load()\n\nindex.add({\"title\": \"tolerate it\", \"lyric\": \"Use my best colors for your portrait\"})\n\n# results should return in \u003c 1ms, whether you have one or 1k documents\nresults = index.string_query_search(\"title:'tolerate it' colors\")\n\nprint(results)\n# {'documents': [{'title': 'tolerate it', 'lyric': 'Use my best colors for your portrait' ...}]\n```\n\n## Usage\n\n### Create a database\n\nTo create a database, use the following code:\n\n```python\nfrom jamesql import JameSQL\n\nindex = JameSQL()\n```\n\n### Load a database\n\nTo load the database you initialized in your last session, use the following code:\n\n```python\nfrom jamesql import JameSQL\n\nindex = JameSQL.load()\n```\n\n### Add documents to a database\n\nTo add documents to a database, use the following code:\n\n```python\nindex.add({\"title\": \"tolerate it\", \"artist\": \"Taylor Swift\"})\nindex.insert({\"title\": \"betty\", \"artist\": \"Taylor Swift\"})\n```\n\nValues within documents can have the following data types:\n\n- String\n- Integer\n- Float\n- List\n\nYou cannot currently index a document whose value is a dictionary.\n\nWhen documents are added, a `uuid` key is added for use in uniquely identifying the document.\n\n### Indexing strategies\n\nWhen you run a query on a field for the first time, JameSQL will automatically set up an index for the field. The index type will be chosen based on what is most likely to be effective at querying the type of data in the field.\n\nThere are six indexing strategies currently implemented:\n\n- `GSI_INDEX_STRATEGIES.CONTAINS`: Creates a reverse index for the field. This is useful for fields that contain longer strings (i.e. 
body text in a blog post). TF-IDF is used to search fields structured with the `CONTAINS` type.\n- `GSI_INDEX_STRATEGIES.NUMERIC`: Creates several buckets to allow for efficient search of numeric values, especially values with high cardinality.\n- `GSI_INDEX_STRATEGIES.FLAT`: Stores the field as the data type it is. A flat index is created for values that are not strings or numbers. This is the default. For example, if you are indexing document titles and don't need to do a `starts_with` query, you may choose a flat index to allow for efficient `equals` and `contains` queries.\n- `GSI_INDEX_STRATEGIES.PREFIX`: Creates a trie index for the field. This is useful for fields that contain short strings (i.e. titles).\n- `GSI_INDEX_STRATEGIES.CATEGORICAL`: Creates a categorical index for the field. This is useful for fields that contain specific categories (i.e. genres).\n- `GSI_INDEX_STRATEGIES.TRIGRAM_CODE`: Creates a character-level trigram index for the field. This is useful for efficient code search. See the \"Code Search\" documentation later in this README for more information about using code search with JameSQL.\n\nYou can manually set an index type by creating an index (called a GSI), like so:\n\n```python\nindex.create_gsi(\"title\", strategy=GSI_INDEX_STRATEGIES.PREFIX)\n```\n\nIf you manually set an indexing strategy, any document currently in or added to the database will be indexed according to the strategy provided.\n\n### Search for documents\n\nA query has the following format:\n\n```python\n{\n    \"query\": {},\n    \"limit\": 2,\n    \"sort_by\": \"song\",\n    \"skip\": 1\n}\n```\n\n- `query` is a dictionary that contains the fields to search for.\n- `limit` is the maximum number of documents to return. (default 10)\n- `sort_by` is the field to sort by. (default None)\n- `skip` is the number of documents to skip. This is useful for implementing pagination. 
(default 0)\n\n`limit`, `sort_by`, and `skip` are optional.\n\nWithin the `query` key you can query for documents that match one or more conditions.\n\nAn empty query returns no documents.\n\nYou can retrieve all documents by using a catch-all query, which uses the following syntax:\n\n```python\n{\n    \"query\": \"*\",\n    \"limit\": 2,\n    \"sort_by\": \"song\",\n    \"skip\": 1\n}\n```\n\nThis is useful if you want to page through documents. You should supply a `sort_by` field to ensure the order of documents is consistent.\n\n#### Response\n\nAll valid queries return responses in the following form:\n\n```json\n{\n    \"documents\": [\n        {\"uuid\": \"1\", \"title\": \"test\", \"artist\": \"...\"},\n        {\"uuid\": \"2\", \"title\": \"test\", \"artist\": \"...\"},\n        ...\n    ],\n    \"query_time\": 0.0001,\n    \"total_results\": 200\n}\n```\n\n`documents` is a list of documents that match the query. `query_time` is the amount of time it took to execute the query. `total_results` is the total number of documents that match the query before applying any `limit`.\n\n`total_results` is useful for implementing pagination.\n\nIf an error was encountered, the response will be in the following form:\n\n```json\n{\n    \"documents\": [],\n    \"query_time\": 0.0001,\n    \"error\": \"Invalid query\"\n}\n```\n\nThe `error` key contains a message describing the exact error encountered.\n\n### Document ranking\n\nBy default, documents are ranked in no order. If you provide a `sort_by` field, documents are sorted by that field.\n\nFor more advanced ranking, you can use the `boost` feature. 
This feature lets you boost the value of a field in a document to calculate a final score.\n\nThe default score for each field is `1`.\n\nTo use this feature, you must use `boost` on fields that have an index.\n\nHere is an example of a query that uses the `boost` feature:\n\n```python\n{\n    \"query\": {\n        \"or\": {\n            \"post\": {\n                \"contains\": \"taylor swift\",\n                \"strict\": False,\n                \"boost\": 1\n            },\n            \"title\": {\n                \"contains\": \"desk\",\n                \"strict\": True,\n                \"boost\": 25\n            }\n        }\n    },\n    \"limit\": 4,\n    \"sort_by\": \"_score\",\n}\n```\n\nThis query would search for documents whose `post` field contains `taylor swift` or whose `title` field contains `desk`. The `title` field is boosted by 25, so documents that match the `title` field are ranked higher.\n\nThe score for each document before boosting is equal to the number of times the query condition is satisfied. For example, if a post contains `taylor swift` twice, the score for that document is `2`; if a title contains `desk` once, the score for that document is `1`.\n\nDocuments are then ranked in decreasing order of score.\n\n#### Document ranking with script scores\n\nThe script score feature lets you write custom scripts to calculate the score for each document. 
This is useful if you want to calculate a score based on multiple fields, including numeric fields.\n\nScript scores are applied after all documents are retrieved.\n\nThe script score feature supports the following mathematical operations:\n\n- `+` (addition)\n- `-` (subtraction)\n- `*` (multiplication)\n- `/` (division)\n- `log` (logarithm)\n- `decay` (timeseries decay)\n\nYou can apply a script score at the top level of your query:\n\n```python\n{\n    \"query\": {\n        \"or\": {\n            \"post\": {\n                \"contains\": \"taylor swift\",\n                \"strict\": False,\n                \"boost\": 1\n            },\n            \"title\": {\n                \"contains\": \"desk\",\n                \"strict\": True,\n                \"boost\": 25\n            }\n        }\n    },\n    \"limit\": 4,\n    \"sort_by\": \"_score\",\n    \"script_score\": \"((post + title) * 2)\"\n}\n```\n\nThe above example will calculate the score of documents by adding the score of the `post` field and the `title` field, then multiplying the result by `2`.\n\nA script score is made up of terms. A term is a field name or number (float or int), followed by an operator, followed by another term or number. Terms can be nested.\n\nAll terms must be enclosed within parentheses.\n\nTo compute a score that adds the `post` score to `title` and multiplies the result by `2`, use the following code:\n\n```text\n((post + title) * 2)\n```\n\nInvalid forms of this query include:\n\n- `post + title * 2` (missing parentheses)\n- `(post + title * 2)` (terms can only include one operator)\n\nThe `decay` function lets you decay a value by `0.9 ** days_since_post / 30`. This is useful for gradually decreasing the rank for older documents as time passes. This may be particularly useful if you are working with data where you want more recent documents to be ranked higher. 
`decay` only works with timeseries.\n\nHere is an example of `decay` in use:\n\n```\n(_score * decay published)\n```\n\nThis will apply the `decay` function to the `published` field.\n\nData must be stored as a Python `datetime` object for the `decay` function to work.\n\n### Condition matching\n\nThere are three operators you can use for condition matching:\n\n- `equals`\n- `contains`\n- `starts_with`\n\nHere is an example of a query that searches for documents that have the `artist` field set to `Taylor Swift`:\n\n```python\nquery = {\n    \"query\": {\n        \"artist\": {\n            \"equals\": \"Taylor Swift\"\n        }\n    }\n}\n```\n\nThese operators can be used with three query types:\n\n- `and`\n- `or`\n- `not`\n\n### and\n\nYou can also search for documents that have the `artist` field set to `Taylor Swift` and the `title` field set to `tolerate it`:\n\n```python\nquery = {\n    \"query\": {\n        \"and\": [\n            {\n                \"artist\": {\n                    \"equals\": \"Taylor Swift\"\n                }\n            },\n            {\n                \"title\": {\n                    \"equals\": \"tolerate it\"\n                }\n            }\n        ]\n    }\n}\n```\n\n### or\n\nYou can nest conditions to create complex queries, like:\n\n```python\nquery = {\n    \"query\": {\n        \"or\": {\n            \"and\": [\n                {\"title\": {\"starts_with\": \"tolerate\"}},\n                {\"title\": {\"contains\": \"it\"}},\n            ],\n            \"lyric\": {\"contains\": \"kiss\"},\n        }\n    },\n    \"limit\": 2,\n    \"sort_by\": \"title\",\n}\n```\n\nThis will return a list of documents that match the query.\n\n### not\n\nYou can search for documents that do not match a query by using the `not` operator. 
Here is an example of a query that searches for lyrics that contain `sky` but not `kiss`:\n\n```python\nquery = {\n    \"query\": {\n        \"and\": {\n            \"or\": [\n                {\"lyric\": {\"contains\": \"sky\", \"boost\": 3}},\n            ],\n            \"not\": {\"lyric\": {\"contains\": \"kiss\"}},\n        }\n    },\n    \"limit\": 10,\n    \"sort_by\": \"title\",\n}\n```\n\n### Running a search\n\nTo search for documents that match a query, use the following code:\n\n```python\nresult = index.search(query)\n```\n\nThis returns a JSON payload with the following structure:\n\n```json\n{\n    \"documents\": [\n        {\"uuid\": \"1\", ...},\n        {\"uuid\": \"2\", ...},\n        ...\n    ],\n    \"query_time\": 0.0001,\n    \"total_results\": 200\n}\n```\n\nYou can search through multiple pages with the `scroll()` method:\n\n```python\nresult = index.scroll(query)\n```\n\n`scroll()` returns a generator that yields documents in the same format as `search()`.\n\n### Strict matching\n\nBy default, a search query on a text field will find any document where the field contains any word in the query string. For example, a query for `tolerate it` on a `title` field will match any document whose `title` contains `tolerate` or `it`. This is called a non-strict match.\n\nNon-strict matches are the default because they are faster to compute than strict matches.\n\nIf you want to find documents where terms appear next to each other in a field, you can do so with a strict match. Here is an example of a strict match:\n\n```python\nquery = {\n    \"query\": {\n        \"title\": {\n            \"contains\": \"tolerate it\",\n            \"strict\": True\n        }\n    }\n}\n```\n\nThis will return documents whose title contains `tolerate it` as a single phrase.\n\n### Fuzzy matching\n\nBy default, search queries look for the exact string provided. This means that if a query contains a typo (i.e. 
searching for `tolerate ip` instead of `tolerate it`), no documents will be returned.\n\nJameSQL implements a limited form of fuzzy matching. This means that if a query contains a typo, JameSQL will still return documents that match the query.\n\nThe fuzzy matching feature matches documents that contain one typo. If a document contains more than one typo, it will not be returned. A typo is an incorrectly typed character. JameSQL does not support fuzzy matching that accounts for missing or additional characters (i.e. `tolerate itt` will not match `tolerate it`).\n\nYou can enable fuzzy matching by setting the `fuzzy` key to `True` in the query. Here is an example of a query that uses fuzzy matching:\n\n```python\nquery = {\n    \"query\": {\n        \"title\": {\n            \"contains\": \"tolerate ip\",\n            \"fuzzy\": True\n        }\n    }\n}\n```\n\n### Wildcard matching\n\nYou can match documents using a single wildcard character. This character is represented by an asterisk `*`.\n\n```python\nquery = {\n    \"query\": {\n        \"title\": {\n            \"contains\": \"tolerat* it\",\n            \"fuzzy\": True\n        }\n    }\n}\n```\n\nThis query will look for all words that match the pattern `tolerat* it`, where the `*` character can be any single character.\n\n### Look for terms close to each other\n\nYou can find terms that appear close to each other with a `close_to` query. 
Here is an example of a query that looks for documents where `made` and `temple` appear within `7` words of each other and `my` appears within `7` words of `temple`:\n\n```python\nquery = {\n    \"query\": {\n        \"close_to\": [\n            {\"lyric\": \"made\"},\n            {\"lyric\": \"temple,\"},\n            {\"lyric\": \"my\"},\n        ],\n        \"distance\": 7\n    },\n    \"limit\": 10\n}\n```\n\n### Less than, greater than, less than or equal to, greater than or equal to\n\nYou can find documents where a field is less than, greater than, less than or equal to, or greater than or equal to a value with a range query. Here is an example of a query that looks for documents where the `year` field is greater than `2010`:\n\n```python\nquery = {\n    \"query\": {\n        \"year\": {\n            \"greater_than\": 2010\n        }\n    }\n}\n```\n\nThe following operators are supported:\n\n- `greater_than`\n- `less_than`\n- `greater_than_or_equal`\n- `less_than_or_equal`\n\n### Range queries\n\nYou can find values in a numeric range with a range query. Here is an example of a query that looks for documents where the `year` field is between `2010` and `2020`:\n\n```python\nquery = {\n    \"query\": {\n        \"year\": {\n            \"range\": [2010, 2020]\n        }\n    }\n}\n```\n\nThe first value in the range is the lower bound to use in the search, and the second value is the upper bound.\n\n### Highlight results\n\nYou can extract context around results. 
This data can be used to show a snippet of the document that contains the query term.\n\nHere is an example of a query that highlights context around all instances of the term \"sky\" in the `lyric` field:\n\n```python\nquery = {\n    \"query\": {\n        \"lyric\": {\n            \"contains\": \"sky\",\n            \"highlight\": True,\n            \"highlight_stride\": 3\n        }\n    }\n}\n```\n\n`highlight_stride` states how many words to retrieve before and after the match.\n\nAll documents returned by this query will have a `_context` key that contains the context around all instances of the term \"sky\".\n\n### Group by\n\nYou can group results by a single key. This is useful for presenting aggregate views of data.\n\nTo group results by a key, use the following code:\n\n```python\nquery = {\n    \"query\": {\n        \"lyric\": {\n            \"contains\": \"sky\"\n        }\n    },\n    \"group_by\": \"title\"\n}\n```\n\nThis query will search for all `lyric` fields that contain the term \"sky\" and group the results by the `title` field.\n\n### Aggregate metrics\n\nYou can find the total number of unique values for the fields returned by a query using an `aggregate` query. This is useful for presenting the total number of options available in a search space to a user.\n\nYou can use the following query to find the total number of unique values for each field in documents whose `lyric` field contains the term \"sky\":\n\n```python\nquery = {\n    \"query\": {\n        \"lyric\": {\n            \"contains\": \"sky\"\n        }\n    },\n    \"metrics\": [\"aggregate\"]\n}\n```\n\nThe aggregate results are presented in a `unique_record_values` key with the following structure:\n\n```python\n{\n    \"documents\": [...],\n    \"query_time\": 0.0001,\n    \"unique_record_values\": {\"title\": 2, \"lyric\": 2, \"listens\": 2, \"categories\": 3}\n}\n```\n\n### Update documents\n\nYou need a document UUID to update a document. 
You can retrieve a UUID by searching for a document.\n\nHere is an example showing how to update a document:\n\n```python\nresponse = index.search(\n    {\n        \"query\": {\"title\": {\"equals\": \"tolerate it\"}},\n        \"limit\": 10,\n        \"sort_by\": \"title\",\n    }\n)\n\nuuid = response[\"documents\"][0][\"uuid\"]\n\nindex.update(uuid, {\"title\": \"tolerate it (folklore)\", \"artist\": \"Taylor Swift\"})\n```\n\n`update` is an override operation. This means you must provide the full document that you want to save, instead of only the fields you want to update.\n\n### Delete documents\n\nYou need a document UUID to delete a document. You can retrieve a UUID by searching for a document.\n\nHere is an example showing how to delete a document:\n\n```python\nresponse = index.search(\n    {\n        \"query\": {\"title\": {\"equals\": \"tolerate it\"}},\n        \"limit\": 10,\n        \"sort_by\": \"title\",\n    }\n)\n\nuuid = response[\"documents\"][0][\"uuid\"]\n\nindex.remove(uuid)\n```\n\nYou can validate the document has been deleted using this code:\n\n```python\nresponse = index.search(\n    {\n        \"query\": {\"title\": {\"equals\": \"tolerate it\"}},\n        \"limit\": 10,\n        \"sort_by\": \"title\",\n    }\n)\n\nassert len(response[\"documents\"]) == 0\n```\n\n## String queries\n\nJameSQL supports string queries. 
String queries are single strings that use special syntax to assert the meaning of parts of a string.\n\nFor example, you could use the following query to find documents where the `title` field contains `tolerate it` and any field contains `mural`:\n\n```\ntitle:\"tolerate it\" mural\n```\n\nThe following operators are supported:\n\n- `-term`: Search for documents that do not contain `term`.\n- `term`: Search for documents that contain `term`.\n- `term1 term2`: Search for documents that contain `term1` and `term2`.\n- `'term1 term2'`: Search for the literal phrase `term1 term2` in documents.\n- `field:'term'`: Search for documents where the `field` field contains `term` (i.e. `title:\"tolerate it\"`).\n- `field^2 term`: Boost the score of documents where the `field` field matches the query `term` by `2`.\n\nThis feature turns a string query into a JameSQL query, which is then executed and the results returned.\n\nTo run a string query, use the following code:\n\n```python\nresults = index.string_query_search(\"title:'tolerate it' mural\")\n```\n\nWhen you run a string query, JameSQL will attempt to simplify the query to make it more efficient. For example, if you search for `-sky sky mural`, the query will be `mural` because `-sky` negates the `sky` mention.\n\n## Autosuggest\n\nYou can enable autosuggest using one or more fields in an index. This can be used to efficiently find records that start with a given prefix.\n\nTo enable autosuggest on an index, run:\n\n```python\nindex = JameSQL()\n\n...\n\nindex.enable_autosuggest(\"field\")\n```\n\nWhere `field` is the name of the field on which you want to enable autosuggest.\n\nYou can enable autosuggest on multiple fields:\n\n```python\nindex.enable_autosuggest(\"field1\")\nindex.enable_autosuggest(\"field2\")\n```\n\nWhen you enable autosuggest on a field, JameSQL will create a trie index for that field. 
This index is used to efficiently find records that start with a given prefix.\n\nTo run an autosuggest query, use the following code:\n\n```python\nsuggestions = index.autosuggest(\"started\", match_full_record=True, limit = 1)\n```\n\nThis will automatically return records that start with the prefix `started`.\n\nThe `match_full_record` parameter indicates whether to return full record names, or any records starting with a term.\n\n`match_full_record=True` means that the full record name will be returned. This is ideal to enable selection between full records.\n\n`match_full_record=False` means that any records starting with the term will be returned. This is ideal for autosuggesting single words.\n\nFor example, given the query `start`, matching against full records with `match_full_record=True` would return:\n\n- `Started with a kiss`\n\nThis is the content of a full document.\n\n`match_full_record=False`, on the other hand, would return:\n\n- `started`\n- `started with a kiss`\n\nThis contains both a root word starting with `start` and full documents starting with `start`.\n\nThis feature is case insensitive.\n\nThe `limit` argument limits the number of results returned.\n\n## Spelling correction\n\nIt is recommended that you check the spelling of words before you run a query. \n\nThis is because correcting the spelling of a word can improve the accuracy of your search results.\n\n### Correcting the spelling of a single word\n\nTo recommend a spelling correction for a query, use the following code:\n\n```python\nindex = ...\n\nsuggestion = index.spelling_correction(\"taylr swift\")\n```\n\nThis will return a single suggestion. 
The suggestion will be the word that is most likely to be the correct spelling of the word you provided.\n\nSpelling correction first generates segmentations of a word, like:\n\n- `t aylorswift`\n- `ta ylorswift`\n\nIf a segmentation is valid, it is returned.\n\nFor example, if the user types in `taylorswift`, one permutation would be segmented into `taylor swift`. If `taylor swift` is common in the index, `taylor swift` will be returned as the suggestion.\n\nSpelling correction works by transforming the input query by inserting, deleting, and transforming one character in every position in a string. The transformed strings are then looked up in the index to find if they are present and, if so, how common they are.\n\nThe most common suggestion is then returned.\n\nFor example, if you provide the word `tayloi` and `taylor` is common in the index, the suggestion will be `taylor`.\n\nIf correction was not possible after transforming one character, correction will be attempted with two transformations given the input string.\n\nIf the word you provided is already spelled correctly, the suggestion will be the word you provided. If spelling correction is not possible (i.e. the word is too distant from any word in the index), the suggestion will be `None`.\n\n### Correcting a string query\n\nIf you are correcting a string query submitted with the `string_query_search()` function, spelling will be automatically corrected using the algorithm above. 
No configuration is required.\n\n## Code Search\n\nYou can use JameSQL to efficiently search through code.\n\nTo do so, first create a `TRIGRAM_CODE` index on the field you want to search.\n\nWhen you add documents, include at least the following two fields:\n\n- `file_name`: The name of the file the code is in.\n- `code`: The code you want to index.\n\nWhen you search for code, all matching documents will have a `_context` key with the following structure:\n\n```python\n{\n    \"line\": \"1\",\n    \"code\": \"...\"\n}\n```\n\nThis tells you on what line your search matched, and the code that matched. This information is ideal to highlight specific lines relevant to your query.\n\n\n## Data Storage\n\nJameSQL indices are stored in memory and on disk.\n\nWhen you call the `add()` method, the document is appended to an `index.jamesql` file in the directory in which your program is running. This file is serialized as JSONL.\n\nWhen you load an index, all entries in the `index.jamesql` file will be read back into memory.\n\n_Note: You will need to manually reconstruct your indices using the `create_gsi()` method after loading an index._\n\n## Data Consistency\n\nWhen you call `add()`, a `journal.jamesql` file is created. This is used to store the contents of the `add()` operation you are executing. If JameSQL terminates during an `add()` call for any reason (i.e. system crash, program termination), this journal will be used to reconcile the database.\n\nNext time you initialize a JameSQL instance, your documents in `index.jamesql` will be read into memory. Then, the transactions in `journal.jamesql` will be replayed to ensure the index is consistent. 
Finally, the `journal.jamesql` file will be deleted.\n\nYou can access the JSON of the last transaction issued, sans the `uuid`, by calling `index.last_transaction`.\n\nIf you were in the middle of ingesting data, this could be used to resume the ingestion process from where you left off by allowing you to skip records that were already ingested.\n\n## Reducing Precision for Large Results Pages\n\nBy default, JameSQL assigns scores to the top 1,000 documents in each clause in a query. Consider the following query:\n\n```\nquery = {\n    \"query\": {\n        \"and\": [\n            {\n                \"artist\": {\n                    \"equals\": \"Taylor Swift\"\n                }\n            },\n            {\n                \"title\": {\n                    \"equals\": \"tolerate it\"\n                }\n            }\n        ]\n    },\n    \"limit\": 10\n}\n```\n\nThe `{ \"artist\": { \"equals\": \"Taylor Swift\" } }` clause will return the top 1,000 documents that match the query. The `{ \"title\": { \"equals\": \"tolerate it\" } }` clause will return the top 1,000 documents that match the query.\n\nThese will then be combined and sorted to return the 10 documents of the 2,000 processed that have the highest score.\n\nThis means that if you have a large number of documents that match a query, you may not get precisely the most relevant documents in the top 10 results, but rather an approximation of the most relevant documents.\n\nYou can override the number of documents to consider with:\n\n```\nindex.match_limit_for_large_result_pages = 10_000\n```\n\nThe higher this number, the longer it will take to process results with a large number of matching documents.\n\n## Web Interface\n\nJameSQL comes with a limited web interface designed for use in testing queries.\n\n_Note: You should not use the web interface if you are extending the query engine. 
Full error messages are only available in the console when you run the query engine._\n\nTo start the web interface, run:\n\n```\npython3 web.py\n```\n\nThe web interface will run on `localhost:5000`.\n\n## Testing\n\nYou can run the project unit tests with the following command:\n\n```\npytest tests/*.py\n```\n\nThe tests have three modes:\n\n1. Run all unit tests.\n2. Run all unit tests with an index of 30,000 small documents and ensure the query engine is fast.\n3. Run all unit tests with an index of 30,000 documents with a few dozen words and ensure the query engine is fast.\n\nTo run the 30,000 small documents benchmark tests, run:\n\n```\npytest tests/*.py --benchmark\n```\n\nTo run the 30,000 documents with a few dozen words benchmark tests, run:\n\n```\npytest tests/*.py --long-benchmark\n```\n\nIn development, the goal should be making the query engine as fast as possible. The performance tests are designed to monitor for performance regressions, not set a ceiling for acceptable performance.\n\n## Deployment considerations\n\nProgress is being made on making JameSQL thread safe, but there are still some issues to work out. It is recommended that you run JameSQL in a single-threaded environment.\n\nIt is recommended that you cache responses from JameSQL. While it takes \u003c 1ms to process many JameSQL queries, reading a set of results from a cache will be faster.\n\n## Development notes\n\nThe following are notes that describe limitations of which I am aware, and may fix in the future:\n\n- `boost` does not work with and/or queries.\n- The query engine relies on `uuid`s to uniquely identify items. But these are treated as the partition key, which is not appropriate. 
Two documents should be able to have the same partition key, as long as they have their own `uuid`.\n\n## License\n\nThis project is licensed under an [MIT license](LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapjamesg%2Fjamesql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcapjamesg%2Fjamesql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapjamesg%2Fjamesql/lists"}