{"id":15432925,"url":"https://github.com/simonw/sqlite-comprehend","last_synced_at":"2025-10-29T01:38:52.897Z","repository":{"id":44054130,"uuid":"511787166","full_name":"simonw/sqlite-comprehend","owner":"simonw","description":"Tools for running data in a SQLite database through AWS Comprehend","archived":false,"fork":false,"pushed_at":"2022-07-12T14:21:42.000Z","size":77,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-14T22:42:00.634Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-08T06:26:15.000Z","updated_at":"2025-06-27T17:13:44.000Z","dependencies_parsed_at":"2022-08-03T18:01:02.025Z","dependency_job_id":null,"html_url":"https://github.com/simonw/sqlite-comprehend","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/simonw/sqlite-comprehend","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fsqlite-comprehend","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fsqlite-comprehend/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fsqlite-comprehend/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fsqlite-comprehend/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonw","download_url":"https://codeload.github.com/simonw/sqlite-comprehend/tar.gz/refs
/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fsqlite-comprehend/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281544216,"owners_count":26519552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-28T02:00:06.022Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T18:29:22.389Z","updated_at":"2025-10-29T01:38:52.863Z","avatar_url":"https://github.com/simonw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sqlite-comprehend\n\n[![PyPI](https://img.shields.io/pypi/v/sqlite-comprehend.svg)](https://pypi.org/project/sqlite-comprehend/)\n[![Changelog](https://img.shields.io/github/v/release/simonw/sqlite-comprehend?include_prereleases\u0026label=changelog)](https://github.com/simonw/sqlite-comprehend/releases)\n[![Tests](https://github.com/simonw/sqlite-comprehend/workflows/Test/badge.svg)](https://github.com/simonw/sqlite-comprehend/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/sqlite-comprehend/blob/master/LICENSE)\n\nTools for running data in a SQLite database through [AWS Comprehend](https://aws.amazon.com/comprehend/)\n\nSee [sqlite-comprehend: run AWS entity extraction against content in a SQLite 
database](https://simonwillison.net/2022/Jul/11/sqlite-comprehend/) for background on this project.\n\n## Installation\n\nInstall this tool using `pip`:\n\n    pip install sqlite-comprehend\n\n## Demo\n\nYou can see examples of tables generated using this command here:\n\n- [comprehend_entities](https://datasette.simonwillison.net/simonwillisonblog/comprehend_entities) - the extracted entities, classified by type\n- [blog_entry_comprehend_entities](https://datasette.simonwillison.net/simonwillisonblog/blog_entry_comprehend_entities) - a table relating entities to the entries that they appear in\n- [comprehend_entity_types](https://datasette.simonwillison.net/simonwillisonblog/comprehend_entity_types) - a small lookup table of entity types\n\n## Configuration\n\nYou will need AWS credentials with the `comprehend:BatchDetectEntities` [IAM permission](https://docs.aws.amazon.com/comprehend/latest/dg/access-control-managing-permissions.html).\n\nYou can configure credentials [using these instructions](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html). You can also save them to a JSON or INI configuration file and pass them to the command using `-a credentials.ini`, or pass them using the `--access-key` and `--secret-key` options.\n\n## Entity extraction\n\nThe `sqlite-comprehend entities` command runs entity extraction against every row in the specified table and saves the results to your database.\n\nSpecify the database, the table and one or more columns containing text in that table. The following runs against the `text` column in the `pages` table of the `sfms.db` SQLite database:\n\n    sqlite-comprehend entities sfms.db pages text\n\nResults will be written into a `pages_comprehend_entities` table. 
Change the name of the output table by passing `-o other_table_name`.\n\nYou can run against a subset of rows by adding a `--where` clause:\n\n    sqlite-comprehend entities sfms.db pages text --where 'id \u003c 10'\n\nYou can also use named parameters in your `--where` clause:\n\n    sqlite-comprehend entities sfms.db pages text --where 'id \u003c :maxid' -p maxid 10\n\nOnly the first 5,000 characters for each row will be considered. Be sure to review [Comprehend's pricing](https://aws.amazon.com/comprehend/pricing/) - which starts at $0.0001 per hundred characters.\n\nIf your content includes HTML tags, you can strip them out before extracting entities by adding `--strip-tags`:\n\n    sqlite-comprehend entities sfms.db pages text --strip-tags\n\nRows that have been processed are recorded in the `pages_comprehend_entities_done` table. If you run the command more than once it will only process rows that have been newly added.\n\nYou can delete records from that `_done` table to run them again.\n\n### sqlite-comprehend entities --help\n\n\u003c!-- [[[cog\nfrom click.testing import CliRunner\nfrom sqlite_comprehend import cli\nrunner = CliRunner()\nresult = runner.invoke(cli.cli, [\"entities\", \"--help\"])\nhelp = result.output.replace(\"Usage: cli\", \"Usage: sqlite-comprehend\")\ncog.out(\n    \"```\\n{}\\n```\".format(help)\n)\n]]] --\u003e\n```\nUsage: sqlite-comprehend entities [OPTIONS] DATABASE TABLE COLUMNS...\n\n  Detect entities in columns in a table\n\n  To extract entities from columns text1 and text2 in mytable:\n\n      sqlite-comprehend entities my.db mytable text1 text2\n\n  To run against just a subset of the rows in the table, add:\n\n      --where \"id \u003c :max_id\" -p max_id 50\n\n  Results will be written to a table called mytable_comprehend_entities\n\n  To specify a different output table, use -o custom_table_name\n\nOptions:\n  --where TEXT                WHERE clause to filter table\n  -p, --param \u003cTEXT TEXT\u003e...  
Named :parameters for SQL query\n  -o, --output TEXT           Custom output table\n  -r, --reset                 Start from scratch, deleting previous results\n  --strip-tags                Strip HTML tags before extracting entities\n  --access-key TEXT           AWS access key ID\n  --secret-key TEXT           AWS secret access key\n  --session-token TEXT        AWS session token\n  --endpoint-url TEXT         Custom endpoint URL\n  -a, --auth FILENAME         Path to JSON/INI file containing credentials\n  --help                      Show this message and exit.\n\n```\n\u003c!-- [[[end]]] --\u003e\n\n## Schema\n\nAssuming an input table called `pages` the tables created by this tool will have the following schema:\n\n\u003c!-- [[[cog\nimport cog, json\nfrom sqlite_comprehend import cli\nfrom unittest.mock import patch\nfrom click.testing import CliRunner\nimport sqlite_utils\nimport tempfile, pathlib\ntmpdir = pathlib.Path(tempfile.mkdtemp())\ndb_path = str(tmpdir / \"data.db\")\ndb = sqlite_utils.Database(db_path)\ndb[\"pages\"].insert_all(\n    [\n        {\n            \"id\": 1,\n            \"text\": \"John Bob\",\n        },\n        {\n            \"id\": 2,\n            \"text\": \"Sandra X\",\n        },\n    ],\n    pk=\"id\",\n)\nwith patch('boto3.client') as client:\n    client.return_value.batch_detect_entities.return_value = {\n        \"ResultList\": [\n            {\n                \"Index\": 0,\n                \"Entities\": [\n                    {\n                        \"Score\": 0.8,\n                        \"Type\": \"PERSON\",\n                        \"Text\": \"John Bob\",\n                        \"BeginOffset\": 0,\n                        \"EndOffset\": 5,\n                    },\n                ],\n            },\n            {\n                \"Index\": 1,\n                \"Entities\": [\n                    {\n                        \"Score\": 0.8,\n                        \"Type\": \"PERSON\",\n                        
\"Text\": \"Sandra X\",\n                        \"BeginOffset\": 0,\n                        \"EndOffset\": 5,\n                    },\n                ],\n            },\n        ],\n        \"ErrorList\": [],\n    }\n    runner = CliRunner()\n    result = runner.invoke(cli.cli, [\n        \"entities\", db_path, \"pages\", \"text\"\n    ])\ncog.out(\"```sql\\n\")\ncog.out(db.schema)\ncog.out(\"\\n```\")\n]]] --\u003e\n```sql\nCREATE TABLE [pages] (\n   [id] INTEGER PRIMARY KEY,\n   [text] TEXT\n);\nCREATE TABLE [comprehend_entity_types] (\n   [id] INTEGER PRIMARY KEY,\n   [value] TEXT\n);\nCREATE TABLE [comprehend_entities] (\n   [id] INTEGER PRIMARY KEY,\n   [name] TEXT,\n   [type] INTEGER REFERENCES [comprehend_entity_types]([id])\n);\nCREATE TABLE [pages_comprehend_entities] (\n   [id] INTEGER REFERENCES [pages]([id]),\n   [score] FLOAT,\n   [entity] INTEGER REFERENCES [comprehend_entities]([id]),\n   [begin_offset] INTEGER,\n   [end_offset] INTEGER\n);\nCREATE UNIQUE INDEX [idx_comprehend_entity_types_value]\n    ON [comprehend_entity_types] ([value]);\nCREATE UNIQUE INDEX [idx_comprehend_entities_type_name]\n    ON [comprehend_entities] ([type], [name]);\nCREATE TABLE [pages_comprehend_entities_done] (\n   [id] INTEGER PRIMARY KEY REFERENCES [pages]([id])\n);\n```\n\u003c!-- [[[end]]] --\u003e\n\n## Development\n\nTo contribute to this tool, first checkout the code. Then create a new virtual environment:\n\n    cd sqlite-comprehend\n    python -m venv venv\n    source venv/bin/activate\n\nNow install the dependencies and test dependencies:\n\n    pip install -e '.[test]'\n\nTo run the tests:\n\n    pytest\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fsqlite-comprehend","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonw%2Fsqlite-comprehend","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fsqlite-comprehend/lists"}