https://github.com/seonghobae/vector-topic-modeling
Standalone embedding-based vector topic modeling package
- Host: GitHub
- URL: https://github.com/seonghobae/vector-topic-modeling
- Owner: seonghobae
- License: MIT
- Created: 2026-03-25T00:41:30.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-30T04:20:52.000Z (2 months ago)
- Last Synced: 2026-04-04T10:52:21.641Z (2 months ago)
- Language: Python
- Size: 391 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Support: SUPPORT.md
- Agents: AGENTS.md
# Vector Topic Modeling
Standalone embedding-based topic modeling software for vector workflows.
## What it provides
- dependency-light clustering kernel
- session-aware representative selection
- safe text shaping/redaction for embedding input
- generic ingestion for DB column-value rows and JSON payloads
- provider-driven `TopicModeler` API
- JSONL-oriented CLI for standalone runs
## Install
Install from a local checkout today:
```bash
git clone https://github.com/seonghobae/vector-topic-modeling.git
cd vector-topic-modeling
uv sync
```
For local wheel installation during development or release validation:
```bash
python3.11 -m pip install dist/vector_topic_modeling-0.1.0-py3-none-any.whl
```
The package requires Python 3.11 or newer.
## Development install
```bash
uv sync --extra dev
```
## Verify locally
```bash
uv run pytest -q
uv run python scripts/docstring_coverage.py --min-percent 100
# Delete any previous build artifacts and smoke-test virtual environment.
# On POSIX shells: rm -rf dist .venv-smoke-cli
# On Windows PowerShell: Remove-Item -Recurse -Force dist, .venv-smoke-cli
uv run python -m build
uv run python scripts/smoke_installed_cli.py --dist-dir dist --venv-dir .venv-smoke-cli
```
The repository release gate runs the same smoke test against the installed
`vector-topic-modeling` console script:
```bash
uv run python scripts/smoke_installed_cli.py --dist-dir dist --venv-dir .venv-smoke-cli
```
## Quick start
```python
from vector_topic_modeling import TopicDocument, TopicModelConfig, TopicModeler


class FakeEmbeddingProvider:
    """Stand-in provider that returns the same 2-D vector for every text."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[1.0, 0.0] for _ in texts]


modeler = TopicModeler(
    embedding_provider=FakeEmbeddingProvider(),
    config=TopicModelConfig(similarity_threshold=0.85),
)
result = modeler.fit_predict([
    TopicDocument(id="1", text="refund duplicate billing"),
])
```
See [`examples/`](./examples/) for end-to-end local usage samples.
Detailed usage and troubleshooting guidance is in
[`docs/user-manual.md`](./docs/user-manual.md).
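
The quick start provider returns one constant vector, so every document embeds to the same point. For offline experiments where texts should differ, a deterministic stand-in can be sketched as below. It assumes only what the quick start shows, namely that a provider is any object with an `embed(texts) -> list[list[float]]` method. The class name `HashingEmbeddingProvider`, its `dimensions` parameter, and the token-hashing scheme are illustrative and not part of the package.

```python
import hashlib
import math


class HashingEmbeddingProvider:
    """Hypothetical offline provider: hashed bag-of-words, L2-normalised."""

    def __init__(self, dimensions: int = 64) -> None:
        self.dimensions = dimensions

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Same call shape as FakeEmbeddingProvider.embed in the quick start.
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # A stable hash keeps the vectors reproducible across runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        norm = math.sqrt(sum(value * value for value in vector))
        # Empty text yields an all-zero vector; skip normalisation in that case.
        return [value / norm for value in vector] if norm else vector
```

Passing `HashingEmbeddingProvider()` as `embedding_provider` in place of `FakeEmbeddingProvider()` should then give texts with shared tokens closer vectors than unrelated ones. Hash collisions make this a testing aid, not a substitute for a real embedding model.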
## JSONL CLI input shapes
### Legacy flat shape
Each line can contain:
```json
{"id":"1","text":"refund duplicate billing","session_id":"s1","question":"...","response":"...","count":1}
```
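
As a small sketch of how such a file could be produced, the helpers below serialise records to JSONL text and parse them back. The field names come from the example line above; the sample values and the helper names `to_jsonl` and `from_jsonl` are illustrative, and the README does not state which fields are required.

```python
import json

# Illustrative records using the field names from the flat shape above.
SAMPLE_RECORDS = [
    {"id": "1", "text": "refund duplicate billing", "session_id": "s1",
     "question": "Why was I billed twice?", "response": "A refund was issued.", "count": 1},
    {"id": "2", "text": "charged twice for one order", "session_id": "s2",
     "question": "I see two charges.", "response": "We are checking.", "count": 1},
]


def to_jsonl(records: list[dict]) -> str:
    """Serialise one JSON object per line, each terminated by a newline."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines."""
    # json.dumps escapes newlines inside strings, so splitting on "\n" is safe.
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Writing `to_jsonl(SAMPLE_RECORDS)` to `input.jsonl` with UTF-8 encoding gives a file in the flat shape.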
### Generic DB / JSON payload shape
You can also pass arbitrary rows (DB-export style columns or nested JSON payloads)
and map them with `--ingestion-config`.
Example config: [`examples/ingestion_config_db_columns.json`](./examples/ingestion_config_db_columns.json)
Example rows: [`examples/sample_db_rows.jsonl`](./examples/sample_db_rows.jsonl)
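
The schema accepted by `--ingestion-config` is defined by the example file linked above and is not reproduced here. As a rough mental model only, mapping a row means pulling values out by column name or nested path and emitting the flat shape. The sketch below does that in plain Python. `FIELD_PATHS`, `get_path`, `to_flat_record`, and the column names are all hypothetical and are not the package's API or config format.

```python
from typing import Any

# Hypothetical mapping: flat-shape field -> dotted path inside an input row.
FIELD_PATHS = {
    "id": "ticket_id",
    "text": "payload.message.body",
    "session_id": "payload.session",
}


def get_path(row: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current: Any = row
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_flat_record(row: dict[str, Any], field_paths: dict[str, str] = FIELD_PATHS) -> dict[str, Any]:
    """Build a flat record, omitting fields whose path is absent from the row."""
    flat = {field: get_path(row, path) for field, path in field_paths.items()}
    return {field: value for field, value in flat.items() if value is not None}
```

For example, a row such as `{"ticket_id": "42", "payload": {"message": {"body": "refund duplicate billing"}, "session": "s1"}}` would come out with `id`, `text`, and `session_id` keys.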
Run the CLI with an OpenAI-compatible embedding endpoint:
```bash
vector-topic-modeling cluster input.jsonl \
  --output topics.json \
  --base-url https://your-gateway.example.com \
  --api-key "$LITELLM_API_KEY" \
  --model text-embedding-3-large
```
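
For the Python API, a comparable provider can be sketched with the standard library alone. This assumes the gateway follows the OpenAI embeddings convention: a `POST` to `/v1/embeddings` with `model` and `input` in the JSON body, a bearer token, and a response whose `data` items carry `index` and `embedding`. Whether the CLI appends `/v1/embeddings` to `--base-url` in the same way is not stated in this README, and the names `build_request`, `parse_embeddings`, and `OpenAICompatibleProvider` are illustrative.

```python
import json
import urllib.request


def build_request(base_url: str, api_key: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build the POST request; the /v1/embeddings suffix is an assumption."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload: dict) -> list[list[float]]:
    """Return vectors ordered by the index field so they line up with the inputs."""
    rows = sorted(payload["data"], key=lambda item: item["index"])
    return [row["embedding"] for row in rows]


class OpenAICompatibleProvider:
    """Hypothetical provider exposing the embed() shape used in the quick start."""

    def __init__(self, base_url: str, api_key: str, model: str) -> None:
        self.base_url = base_url
        self.api_key = api_key
        self.model = model

    def embed(self, texts: list[str]) -> list[list[float]]:
        request = build_request(self.base_url, self.api_key, self.model, texts)
        with urllib.request.urlopen(request, timeout=30) as response:
            return parse_embeddings(json.load(response))
```

Keeping request building and response parsing as separate functions lets both be checked without network access. Batching, retries, and rate limiting are left out of this sketch.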
With generic ingestion mapping:
```bash
vector-topic-modeling cluster examples/sample_db_rows.jsonl \
  --output topics.json \
  --ingestion-config examples/ingestion_config_db_columns.json \
  --base-url https://your-gateway.example.com \
  --api-key "$LITELLM_API_KEY" \
  --model text-embedding-3-large
```
Sample files:
- [`examples/sample_queries.jsonl`](./examples/sample_queries.jsonl)
- [`examples/sample_db_rows.jsonl`](./examples/sample_db_rows.jsonl)
- [`examples/ingestion_config_db_columns.json`](./examples/ingestion_config_db_columns.json)
- [`examples/cli_openai_compat.sh`](./examples/cli_openai_compat.sh)
- [`examples/basic_in_memory_provider.py`](./examples/basic_in_memory_provider.py)
## Scope
- This package intentionally excludes web framework routes, persistence,
background jobs, export pipelines, and email delivery concerns.
## Repository guides
- [Contributing](./CONTRIBUTING.md)
- [User manual](./docs/user-manual.md) (Korean version: [사용자 매뉴얼](./docs/user-manual-ko.md))
- [Security policy](./SECURITY.md)
- [Support](./SUPPORT.md)
- [Changelog](./CHANGELOG.md)
- [Maintainer release guide](./docs/maintainers/releasing.md)