{"id":15433000,"url":"https://github.com/simonw/llm-cluster","last_synced_at":"2025-04-14T07:08:21.928Z","repository":{"id":192394619,"uuid":"687084324","full_name":"simonw/llm-cluster","owner":"simonw","description":"LLM plugin for clustering embeddings","archived":false,"fork":false,"pushed_at":"2024-03-01T21:58:19.000Z","size":19,"stargazers_count":73,"open_issues_count":5,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-14T07:08:16.014Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-04T15:16:04.000Z","updated_at":"2025-04-13T21:40:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"5c70fd66-7721-44a8-9b8f-0c76b73a0b35","html_url":"https://github.com/simonw/llm-cluster","commit_stats":{"total_commits":9,"total_committers":1,"mean_commits":9.0,"dds":0.0,"last_synced_commit":"d2f93f1d17abd001454cecd8de21c75138053d7b"},"previous_names":["simonw/llm-cluster"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fllm-cluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fllm-cluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fllm-cluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fllm-cluster/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonw","download_url":"https://codeload.github.com/simonw/llm-cluster/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248837280,"owners_count":21169374,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T18:30:10.661Z","updated_at":"2025-04-14T07:08:21.899Z","avatar_url":"https://github.com/simonw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llm-cluster\n\n[![PyPI](https://img.shields.io/pypi/v/llm-cluster.svg)](https://pypi.org/project/llm-cluster/)\n[![Changelog](https://img.shields.io/github/v/release/simonw/llm-cluster?include_prereleases\u0026label=changelog)](https://github.com/simonw/llm-cluster/releases)\n[![Tests](https://github.com/simonw/llm-cluster/workflows/Test/badge.svg)](https://github.com/simonw/llm-cluster/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/llm-cluster/blob/main/LICENSE)\n\n[LLM](https://llm.datasette.io/) plugin for clustering embeddings\n\nBackground on this project: [Clustering with llm-cluster](https://simonwillison.net/2023/Sep/4/llm-embeddings/#llm-cluster).\n\n## Installation\n\nInstall this plugin in the same environment as LLM.\n```bash\nllm install llm-cluster\n```\n\n## Usage\n\nThe plugin adds a new command, `llm cluster`. 
This command takes the name of an [embedding collection](https://llm.datasette.io/en/stable/embeddings/cli.html#storing-embeddings-in-sqlite) and the number of clusters to return.\n\nFirst, use [paginate-json](https://github.com/simonw/paginate-json) and [jq](https://stedolan.github.io/jq/) to populate a collection. In this case we are embedding the title of every issue in the [llm repository](https://github.com/simonw/llm), and storing the result in an `issues.db` database:\n```bash\npaginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all\u0026filter=all' \\\n  | jq '[.[] | {id: .id, title: .title}]' \\\n  | llm embed-multi llm-issues - \\\n    --database issues.db --store\n```\nThe `--store` flag causes the content to be stored in the database along with the embedding vectors.\n\nNow we can cluster those embeddings into 10 groups:\n```bash\nllm cluster llm-issues 10 \\\n  -d issues.db\n```\nIf you omit the `-d` option the default embeddings database will be used.\n\nThe output should look something like this (truncated):\n```json\n[\n  {\n    \"id\": \"2\",\n    \"items\": [\n      {\n        \"id\": \"1650662628\",\n        \"content\": \"Initial design\"\n      },\n      {\n        \"id\": \"1650682379\",\n        \"content\": \"Log prompts and responses to SQLite\"\n      }\n    ]\n  },\n  {\n    \"id\": \"4\",\n    \"items\": [\n      {\n        \"id\": \"1650760699\",\n        \"content\": \"llm web command - launches a web server\"\n      },\n      {\n        \"id\": \"1759659476\",\n        \"content\": \"`llm models` command\"\n      },\n      {\n        \"id\": \"1784156919\",\n        \"content\": \"`llm.get_model(alias)` helper\"\n      }\n    ]\n  },\n  {\n    \"id\": \"7\",\n    \"items\": [\n      {\n        \"id\": \"1650765575\",\n        \"content\": \"--code mode for outputting code\"\n      },\n      {\n        \"id\": \"1659086298\",\n        \"content\": \"Accept PROMPT from --stdin\"\n      },\n      {\n        \"id\": 
\"1714651657\",\n        \"content\": \"Accept input from standard in\"\n      }\n    ]\n  }\n]\n```\nThe content displayed is truncated to 100 characters. Pass `--truncate 0` to disable truncation, or `--truncate X` to truncate to X characters.\n\n## Generating summaries for each cluster\n\nThe `--summary` flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the `--truncate` option) through a prompt to a Large Language Model.\n\nThis feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.\n\nBecause this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.\n\nThis feature only works for embeddings that have had their associated content stored in the database using the `--store` flag.\n\nYou can use it like this:\n\n```bash\nllm cluster llm-issues 10 \\\n  -d issues.db \\\n  --summary\n```\nThis uses the default prompt and the default model.\n\nPartial example output:\n```json\n[\n  {\n    \"id\": \"5\",\n    \"items\": [\n      {\n        \"id\": \"1650682379\",\n        \"content\": \"Log prompts and responses to SQLite\"\n      },\n      {\n        \"id\": \"1650757081\",\n        \"content\": \"Command for browsing captured logs\"\n      }\n    ],\n    \"summary\": \"Log Management and Interactive Prompt Tracking\"\n  },\n  {\n    \"id\": \"6\",\n    \"items\": [\n      {\n        \"id\": \"1650771320\",\n        \"content\": \"Mechanism for continuing an existing conversation\"\n      },\n      {\n        \"id\": \"1740090291\",\n        \"content\": \"-c option for continuing a chat (using new chat_id column)\"\n      },\n      {\n        \"id\": \"1784122278\",\n        \"content\": \"Figure out truncation strategy for continue conversation mode\"\n      }\n    ],\n    \"summary\": \"Continuing Conversation Mechanism and Management\"\n  }\n]\n```\n\nTo use a different model, 
e.g. GPT-4, pass the `--model` option:\n```bash\nllm cluster llm-issues 10 \\\n  -d issues.db \\\n  --summary \\\n  --model gpt-4\n```\nThe default prompt used is:\n\n\u003e Short, concise title for this cluster of related documents.\n\nTo use a custom prompt, pass `--prompt`:\n\n```bash\nllm cluster llm-issues 10 \\\n  -d issues.db \\\n  --summary \\\n  --model gpt-4 \\\n  --prompt 'Summarize this in a short line in the style of a bored, angry panda'\n```\nA `\"summary\"` key will be added to each cluster, containing the generated summary.\n\n## Development\n\nTo set up this plugin locally, first check out the code. Then create a new virtual environment:\n```bash\ncd llm-cluster\npython3 -m venv venv\nsource venv/bin/activate\n```\nNow install the dependencies and test dependencies:\n```bash\npip install -e '.[test]'\n```\nTo run the tests:\n```bash\npytest\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fllm-cluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonw%2Fllm-cluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fllm-cluster/lists"}