{"id":28240520,"url":"https://github.com/567-labs/kura","last_synced_at":"2025-06-12T05:30:46.078Z","repository":{"id":270996809,"uuid":"912106066","full_name":"567-labs/kura","owner":"567-labs","description":"Kura is a simple reproduction of the CLIO paper which uses language models to label user behaviour before clustering them based on embeddings recursively. This helps us understand user behaviour on a higher level without sacrificing PII.","archived":false,"fork":false,"pushed_at":"2025-06-09T18:59:41.000Z","size":2967,"stargazers_count":169,"open_issues_count":10,"forks_count":21,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-09T19:26:15.035Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/567-labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-04T16:12:18.000Z","updated_at":"2025-06-09T18:59:45.000Z","dependencies_parsed_at":"2025-01-28T06:23:27.711Z","dependency_job_id":"c31cc769-5c5c-4f1d-ad02-8c668446e665","html_url":"https://github.com/567-labs/kura","commit_stats":null,"previous_names":["ivanleomk/chatterbox","ivanleomk/open-clio","567-labs/kura","ivanleomk/kura"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/567-labs/kura","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/567-labs%2Fkura","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/567-labs%2Fkura/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/567-labs%2Fkura/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/567-labs%2Fkura/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/567-labs","download_url":"https://codeload.github.com/567-labs/kura/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/567-labs%2Fkura/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259404101,"owners_count":22852118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-19T04:06:21.862Z","updated_at":"2025-06-12T05:30:46.072Z","avatar_url":"https://github.com/567-labs.png","language":"Python","funding_links":[],"categories":["LLM"],"sub_categories":[],"readme":"# Kura: Procedural API for Chat Data Analysis\n\n![Kura Architecture](./kura.png)\n\n[![PyPI Downloads](https://img.shields.io/pypi/dm/kura?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/kura/)\n[![GitHub Stars](https://img.shields.io/github/stars/567-labs/kura?style=flat-square\u0026logo=github)](https://github.com/567-labs/kura/stargazers)\n[![Documentation](https://img.shields.io/badge/docs-available-brightgreen?style=flat-square\u0026logo=gitbook\u0026logoColor=white)](https://567-labs.github.io/kura/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)\n[![Python 
Version](https://img.shields.io/pypi/pyversions/kura?style=flat-square\u0026logo=python\u0026logoColor=white)](https://pypi.org/project/kura/)\n[![PyPI Version](https://img.shields.io/pypi/v/kura?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/kura/)\n\nKura is an open-source library for understanding chat data through machine learning, inspired by [Anthropic's CLIO](https://www.anthropic.com/research/clio). It provides a functional, composable API for clustering conversations to discover patterns and insights.\n\n## Why Analyze Conversation Data?\n\nAs AI assistants and chatbots become increasingly central to product experiences, understanding how users interact with these systems at scale becomes a critical challenge. Manually reviewing thousands of conversations is impractical, yet crucial patterns and user needs often remain hidden in this data.\n\nKura addresses this challenge by:\n\n- **Revealing user intent patterns** that may not be obvious from individual conversations\n- **Identifying common user needs** to prioritize feature development\n- **Discovering edge cases and failures** that require attention\n- **Tracking usage trends** over time as your product evolves\n- **Informing prompt engineering** by highlighting successful and problematic interactions\n\nBy clustering similar conversations and providing intuitive visualizations, Kura transforms raw chat data into actionable insights without compromising user privacy.\n\n## Installation\n\n```bash\nuv pip install kura\n```\n\n## Quick Start\n\n```python\nimport asyncio\nfrom rich.console import Console\nfrom kura import (\n    summarise_conversations,\n    generate_base_clusters_from_conversation_summaries,\n    reduce_clusters_from_base_clusters,\n    reduce_dimensionality_from_clusters,\n)\nfrom kura.checkpoints import (\n    HFDatasetCheckpointManager,\n    JSONLCheckpointManager,\n    ParquetCheckpointManager,\n)\nfrom kura.v1.visualization import 
visualise_pipeline_results\nfrom kura.types import Conversation\nfrom kura.summarisation import SummaryModel\nfrom kura.k_means import MiniBatchKmeansClusteringMethod\nfrom kura.cluster import ClusterDescriptionModel\nfrom kura.meta_cluster import MetaClusterModel\nfrom kura.dimensionality import HDBUMAP\n\nasync def main():\n    # Initialize models\n    console = Console()\n    summary_model = SummaryModel(console=console, max_concurrent_requests=100)\n\n    # Use MiniBatch K-means for better performance with large datasets\n    minibatch_kmeans_clustering = MiniBatchKmeansClusteringMethod(\n        clusters_per_group=10,  # Target items per cluster\n        batch_size=1000,  # Mini-batch size for processing\n        max_iter=100,  # Maximum iterations\n        random_state=42,  # Random seed for reproducibility\n    )\n\n    cluster_model = ClusterDescriptionModel(\n        console=console,\n    )\n    meta_cluster_model = MetaClusterModel(console=console, max_concurrent_requests=100)\n    dimensionality_model = HDBUMAP()\n\n    # Set up checkpointing - you can choose from multiple backends\n    # HuggingFace Datasets (advanced features, cloud sync)\n    checkpoint_manager = HFDatasetCheckpointManager(\"./checkpoints\", enabled=True)\n\n    # Alternative checkpoint managers:\n    # checkpoint_manager = ParquetCheckpointManager(\"./checkpoints\", enabled=True)  # 50% smaller files\n    # checkpoint_manager = JSONLCheckpointManager(\"./checkpoints\", enabled=True)    # Human-readable\n\n    # Load conversations from Hugging Face dataset\n    conversations = Conversation.from_hf_dataset(\n        \"ivanleomk/synthetic-gemini-conversations\",\n        split=\"train\"\n    )\n\n    # Process through the pipeline step by step\n    print(\"Step 1: Generating conversation summaries...\")\n    summaries = await summarise_conversations(\n        conversations,\n        model=summary_model,\n        checkpoint_manager=checkpoint_manager\n    )\n    print(f\"Generated {len(summaries)} 
summaries\")\n\n    print(\"Step 2: Generating base clusters from summaries...\")\n    clusters = await generate_base_clusters_from_conversation_summaries(\n        summaries,\n        model=cluster_model,\n        clustering_method=minibatch_kmeans_clustering,\n        checkpoint_manager=checkpoint_manager\n    )\n    print(f\"Generated {len(clusters)} base clusters\")\n\n    print(\"Step 3: Reducing clusters hierarchically...\")\n    reduced_clusters = await reduce_clusters_from_base_clusters(\n        clusters,\n        model=meta_cluster_model,\n        checkpoint_manager=checkpoint_manager\n    )\n    print(f\"Reduced to {len(reduced_clusters)} meta clusters\")\n\n    print(\"Step 4: Projecting clusters to 2D for visualization...\")\n    projected_clusters = await reduce_dimensionality_from_clusters(\n        reduced_clusters,\n        model=dimensionality_model,\n        checkpoint_manager=checkpoint_manager,\n    )\n    print(f\"Generated {len(projected_clusters)} projected clusters\")\n\n    # Visualize results\n    visualise_pipeline_results(reduced_clusters, style=\"enhanced\")\n\n    print(f\"\\nProcessed {len(conversations)} conversations\")\n    print(f\"Created {len(reduced_clusters)} meta clusters\")\n    print(f\"Checkpoints saved to: {checkpoint_manager.checkpoint_dir}\")\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\nThis example will:\n\n1. Load 190 synthetic programming conversations from Hugging Face\n2. Process them through the complete analysis pipeline step by step\n3. Generate hierarchical clusters organized into categories\n4. Display the results with enhanced visualization\n\n## Key Design Principles\n\nKura follows a function-based architecture where pipeline functions orchestrate the execution while models handle the core logic. Each function is designed with explicit inputs/outputs and no hidden state, working with any model that implements the required interface. 
The system supports various model types through polymorphic interfaces - from OpenAI to local models for summarization, different clustering algorithms, and various dimensionality reduction techniques.\n\nData can be loaded from multiple sources including Claude conversation history (`Conversation.from_claude_conversation_dump()`) and Hugging Face datasets (`Conversation.from_hf_dataset()`). The example uses a dataset of 190 synthetic programming conversations that form natural clusters across technical topics.\n\nThe pipeline architecture processes data through sequential stages: loading, summarization, embedding, base clustering, meta-clustering, and dimensionality reduction. All progress is automatically saved using checkpoints, and the system can be extended by implementing custom versions of any component model.\n\n## Documentation\n\n- **Getting Started**\n\n  - [Installation Guide](https://567-labs.github.io/kura/getting-started/installation/)\n  - [Quickstart Guide](https://567-labs.github.io/kura/getting-started/quickstart/)\n\n- **Core Concepts**\n\n  - [Conversations](https://567-labs.github.io/kura/core-concepts/conversations/)\n  - [Embedding](https://567-labs.github.io/kura/core-concepts/embedding/)\n  - [Clustering](https://567-labs.github.io/kura/core-concepts/clustering/)\n  - [Summarization](https://567-labs.github.io/kura/core-concepts/summarization/)\n  - [Meta-Clustering](https://567-labs.github.io/kura/core-concepts/meta-clustering/)\n  - [Dimensionality Reduction](https://567-labs.github.io/kura/core-concepts/dimensionality-reduction/)\n\n- **API Reference**\n  - [Procedural API Documentation](https://567-labs.github.io/kura/api/)\n\n## Checkpoint System\n\nKura provides three checkpoint managers for different use cases:\n\n| Checkpoint Manager             | Format      | Dependencies      | File Size   | Use Case                               |\n| ------------------------------ | ----------- | ----------------- | ----------- | 
-------------------------------------- |\n| **JSONLCheckpointManager**     | JSON Lines  | None              | Baseline    | Development, debugging, small datasets |\n| **ParquetCheckpointManager**   | Parquet     | PyArrow           | 50% smaller | Production workflows, analytics        |\n| **HFDatasetCheckpointManager** | HF Datasets | datasets, PyArrow | 7% smaller  | Large-scale ML, cloud workflows        |\n\n### Checkpoint Performance (190 conversations)\n\nBased on tutorial benchmarks:\n\n```text\nJSONL: 200KB total storage\nPARQUET: 100KB total storage (50% space savings)\nHF: 186KB total storage (7% space savings)\n```\n\n- **JSONL**: Human-readable, universal compatibility, no dependencies\n- **Parquet**: Best compression, fastest analytical queries, type safety\n- **HuggingFace**: Streaming support, cloud sync, versioning, advanced features\n\n## Comparison with Similar Tools\n\n| Feature                | Kura                                  | Traditional Analytics          | Manual Review          | Generic Clustering       |\n| ---------------------- | ------------------------------------- | ------------------------------ | ---------------------- | ------------------------ |\n| Semantic Understanding | ✅ Uses LLMs for deep understanding   | ❌ Limited to keywords         | ✅ Human understanding | ⚠️ Basic similarity only |\n| Scalability            | ✅ Handles thousands of conversations | ✅ Highly scalable             | ❌ Time intensive      | ✅ Works at scale        |\n| Visualization          | ✅ Interactive UI                     | ⚠️ Basic charts                | ❌ Manual effort       | ⚠️ Generic plots         |\n| Hierarchy Discovery    | ✅ Meta-clustering feature            | ❌ Flat categories             | ⚠️ Subjective grouping | ❌ Typically flat        |\n| Extensibility          | ✅ Custom models and extractors       | ⚠️ Limited customization       | ✅ Flexible but manual | ⚠️ Some algorithms       |\n| Privacy                | ✅ 
Self-hosted option                 | ⚠️ Often requires data sharing | ✅ Can be private      | ✅ Can be private        |\n\n## Future Roadmap\n\nKura is actively evolving with plans to add:\n\n- **Enhanced Topic Modeling**: More sophisticated detection of themes across conversations\n- **Temporal Analysis**: Tracking how conversation patterns evolve over time\n- **Advanced Visualizations**: Additional visual representations of conversation data\n- **Data Connectors**: More integrations with popular conversation data sources\n- **Multi-modal Support**: Analysis of conversations that include images and other media\n- **Export Capabilities**: Enhanced formats for sharing and presenting findings\n\n## Testing\n\nTo quickly test Kura and see it in action:\n\n```bash\nuv run python scripts/tutorial_procedural_api.py\n```\n\nThis script tests all three checkpoint managers and provides timing comparisons. Expected output:\n\n```text\nLoaded 190 conversations successfully!\n\nSaved 190 conversations to ./tutorial_checkpoints/conversations.json\n\nRunning with HFDatasetCheckpointManager\nStep 1: Generating summaries with checkpoints...\nGenerated 190 summaries using checkpoints\nStep 2: Generating clusters with checkpoints...\nGenerated 19 clusters using checkpoints\nStep 3: Meta clustering with checkpoints...\nReduced to 29 meta clusters using checkpoints\nStep 4: Dimensionality reduction with checkpoints...\nGenerated 29 projected clusters using HFDatasetCheckpointManager\n\nRunning with ParquetCheckpointManager\nStep 1: Generating summaries with checkpoints...\nGenerated 190 summaries using checkpoints\nStep 2: Generating clusters with checkpoints...\nGenerated 19 clusters using checkpoints\nStep 3: Meta clustering with checkpoints...\nReduced to 29 meta clusters using checkpoints\nStep 4: Dimensionality reduction with checkpoints...\nGenerated 29 projected clusters using ParquetCheckpointManager\n\nRunning with JSONLCheckpointManager\nStep 1: Generating summaries with 
checkpoints...\nGenerated 190 summaries using checkpoints\nStep 2: Generating clusters with checkpoints...\nGenerated 19 clusters using checkpoints\nStep 3: Meta clustering with checkpoints...\nReduced to 29 meta clusters using checkpoints\nStep 4: Dimensionality reduction with checkpoints...\nGenerated 29 projected clusters using JSONLCheckpointManager\n\n============================================================\n                    TIMING SUMMARY\n============================================================\nLoading sample conversations               1.23s ( 5.2%)\nSaving conversations to JSON               0.45s ( 1.9%)\nHFDatasetCheckpointManager - Summarization 8.45s (35.8%)\nHFDatasetCheckpointManager - Clustering    6.78s (28.7%)\nHFDatasetCheckpointManager - Meta clustering 4.32s (18.3%)\nHFDatasetCheckpointManager - Dimensionality 2.34s (9.9%)\n------------------------------------------------------------\nTotal Time                               23.57s\n============================================================\n```\n\nThis will:\n\n- Load 190 sample conversations from Hugging Face\n- Process them through the complete pipeline with each checkpoint manager\n- Compare timing and storage efficiency across formats\n- Generate 29 hierarchical clusters organized into categories\n- Save checkpoints to `./tutorial_checkpoints/` with subfolders for each format\n\n## Development\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and contribution guidelines.\n\n## License\n\n[MIT License](LICENSE)\n\n## About\n\nKura is under active development. If you face any issues or have suggestions, please feel free to [open an issue](https://github.com/567-labs/kura/issues) or a PR. 
For more details on the technical implementation, check out this [walkthrough of the code](https://ivanleo.com/blog/understanding-user-conversations).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F567-labs%2Fkura","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F567-labs%2Fkura","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F567-labs%2Fkura/lists"}