{"id":51317002,"url":"https://github.com/tiennm99/phow2sim","last_synced_at":"2026-07-01T08:30:58.321Z","repository":{"id":353249162,"uuid":"1218597904","full_name":"tiennm99/phow2sim","owner":"tiennm99","description":"Vietnamese word similarity API. Tiny stateless FastAPI service over VinAI's PhoW2V pretrained vectors. Endpoints: /similarity /neighbors /vocab /random. Docker-ready building block for Vietnamese Semantle-style games, search re-rankers, writing tools.","archived":false,"fork":false,"pushed_at":"2026-04-23T04:22:07.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-23T05:29:03.147Z","etag":null,"topics":["cosine-similarity","docker","fastapi","microservice","nearest-neighbors","nlp","phow2v","python","rest-api","semantic-similarity","semantle","vietnamese","vietnamese-nlp","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"https://phow2sim.sg.miti99.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tiennm99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-23T03:08:57.000Z","updated_at":"2026-04-23T04:22:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tiennm99/phow2sim","commit_stats":null,"previous_names":["tiennm99/phow2sim"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tiennm99/phow2sim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiennm99%2Fphow2sim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiennm99%2Fphow2sim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiennm99%2Fphow2sim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiennm99%2Fphow2sim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tiennm99","download_url":"https://codeload.github.com/tiennm99/phow2sim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiennm99%2Fphow2sim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34999790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-01T02:00:05.325Z","response_time":130,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cosine-similarity","docker","fastapi","microservice","nearest-neighbors","nlp","phow2v","python","rest-api","semantic-similarity","semantle","vietnamese","vietnamese-nlp","word-embeddings","word2vec"],"created_at":"2026-07-01T08:30:56.177Z","updated_at":"2026-07-01T08:30:58.316Z","avatar_url":"https://github.com/tiennm99.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# phow2sim\n\nTiny HTTP service that returns Vietnamese word2vec similarity and nearest\nneighbors — Vietnamese sibling of [`word2sim`](../word2sim). Same endpoint\nshapes; swap URLs and it's a drop-in replacement.\n\nBacked by [**PhoW2V**](https://github.com/datquocnguyen/PhoW2V) (VinAI /\nDat Quoc Nguyen), the largest pretrained Vietnamese word vectors\navailable. Chosen over PhoBERT for this purpose because word2vec's\nsimilarity distribution is wide enough to drive a Semantle-style warmth\nmeter, whereas raw transformer embeddings saturate at the top.\n\n\u003e **License note.** PhoW2V's research-only license forbids public\n\u003e redistribution, so this service doesn't — and can't — embed or\n\u003e auto-download the vectors from any public URL. You supply your own\n\u003e private mirror (typically a Nextcloud instance you control). See\n\u003e [Quick start](#quick-start).\n\n## Stack\n\n- FastAPI + uvicorn\n- gensim (loads PhoW2V `.txt` files; caches a binary `.bin` alongside for 5× faster restarts)\n\n## Variants\n\nPhoW2V ships in four flavors. Pick one per deployment.\n\n| Variant | Dims | Size | Best for |\n|---|---|---|---|\n| `word-100`    | 100 | ~400MB | low-RAM hosts, compound-aware |\n| `word-300`    | 300 | ~1.2GB | **default** — best quality, compound-aware |\n| `syllable-100`| 100 | ~50MB  | single-syllable guesses, tiny footprint |\n| `syllable-300`| 300 | ~150MB | single-syllable guesses, richer vectors |\n\nThe \"word\" variants expect underscore-joined compounds (`sinh_viên`);\nthe \"syllable\" variants have no multi-token keys. The canonicalizer\ntries both forms, but the client should pre-segment for the word variant\nif it wants reliable coverage of compounds.\n\n## Endpoints\n\n| Method | Path | Purpose |\n|---|---|---|\n| GET | `/health` | liveness probe |\n| GET | `/similarity?a=X\u0026b=Y` | cosine similarity between two keys |\n| GET | `/neighbors?word=X\u0026topn=10` | nearest-neighbor keys with scores |\n| GET | `/vocab?word=X` | check in-vocab; return canonical form |\n| GET | `/random` | random vocab key, filtered for game-friendliness |\n\nResponse shape is identical to word2sim.\n\n### Examples\n\n```bash\ncurl 'http://localhost:8001/similarity?a=con_chó\u0026b=con_mèo'\n# {\"a\":\"con_chó\",\"b\":\"con_mèo\",\"canonical_a\":\"con_chó\",\"canonical_b\":\"con_mèo\",\n#  \"in_vocab_a\":true,\"in_vocab_b\":true,\"similarity\":0.78}\n\ncurl 'http://localhost:8001/neighbors?word=đại_học\u0026topn=5'\n\ncurl 'http://localhost:8001/vocab?word=con%20ch%C3%B3'   # \"con chó\" → tries \"con_chó\"\n# {\"word\":\"con chó\",\"canonical\":\"con_chó\",\"in_vocab\":true}\n\ncurl 'http://localhost:8001/random?min_rank=500\u0026max_rank=20000\u0026min_len=3\u0026max_len=12'\n```\n\nOut-of-vocab returns `in_vocab:false` and `similarity:null`. Lookup\ntries exact → lowercase → space-to-underscore variants.\n\n## Quick start\n\n1. **Get the vectors once.** Download from the [upstream Google Drive\n   mirror](https://drive.google.com/drive/folders/1NZhZFYbcwKzLpvvGdJUdPbwEVdVW4E3j?usp=drive_link)\n   (the one linked from the PhoW2V README — the original\n   `public.vinai.io` URLs are dead). Four zips; keep the one matching\n   your chosen variant.\n\n2. **Host the zip somewhere a plain `GET` can reach it.** Options:\n   - **Nextcloud public share** with file upload, then use the\n     `/download` endpoint: `https://cloud.example.com/s/\u003ctoken\u003e/download`.\n     The share token acts as the capability; leave it unguessable and\n     unlisted.\n   - Any signed/pre-signed URL from your object store (S3, R2,\n     BackBlaze B2), or your own HTTP(S) endpoint.\n\n   The service sends **no auth headers** — any authentication must be\n   baked into the URL itself. This keeps the code minimal and puts\n   hosting policy on the operator.\n\n3. **Configure env.** Copy `.env.example` to `.env` and set `MODEL_URL`:\n   ```bash\n   cp .env.example .env\n   # edit .env:\n   #   MODEL_URL=https://cloud.example.com/s/abc123XYZ/download\n   ```\n\n4. **Boot.**\n   ```bash\n   docker compose up --build\n   ```\n   First boot streams ~1.2GB (word-300d) into the `phow2v-cache` volume,\n   then parses ~60s. A binary `.bin` is written alongside so later\n   restarts load in ~10s. Health check start period is 10 min to cover\n   the first-boot cost.\n\n### Alternative: mount a local file instead\n\nIf you've already downloaded the `.txt` locally and don't want to\nre-upload anywhere, skip `MODEL_URL` entirely and mount the file. In\n`docker-compose.yml`, uncomment the bind mount:\n\n```yaml\nvolumes:\n  - phow2v-cache:/data/phow2v\n  - ./models/word2vec_vi_words_300dims.txt:/data/phow2v/word2vec_vi_words_300dims.txt:ro\n```\n\nThen `docker compose up` boots straight into parse — no download step.\n\n## Switching variant\n\nHost the desired zip and update `.env`:\n\n```bash\nMODEL_URL=https://cloud.example.com/s/\u003ctoken-for-syllables-100\u003e/download\nMODEL_PATH=/data/phow2v/word2vec_vi_syllables_100dims.txt\n```\n\nDelete the `phow2v-cache` volume when switching, otherwise the stale\n`.bin` from the previous variant loads instead:\n\n```bash\ndocker compose down -v \u0026\u0026 docker compose up --build\n```\n\n## Config (env vars)\n\n| Var | Default | Meaning |\n|---|---|---|\n| `MODEL_URL` | `\"\"` | URL that serves the zip via a plain GET. Bake any auth into the URL. Optional if `MODEL_PATH` is already populated via a bind mount. |\n| `MODEL_PATH` | `/data/phow2v/word2vec_vi_words_300dims.txt` | Where the text-format vectors live. A binary `.bin` sibling is written on first parse. |\n\n## Auth\n\nThe service does not authenticate its callers. Put it behind a reverse\nproxy (Caddy, nginx, Cloudflare Tunnel) if you need access control.\n\n## Project layout\n\n```\nphow2sim/\n├── app/\n│   ├── main.py       # FastAPI routes\n│   └── vectors.py    # PhoW2V loader + canonicalize + similarity/neighbors/random\n├── Dockerfile\n├── docker-compose.yml\n├── requirements.txt\n└── .env.example      # copy to .env and set MODEL_URL\n```\n\n## Credits\n\n- Vectors: [PhoW2V](https://github.com/datquocnguyen/PhoW2V) by Dat Quoc Nguyen / VinAI Research (research-only license — see upstream).\n- API shape: sibling of [`word2sim`](../word2sim).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiennm99%2Fphow2sim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiennm99%2Fphow2sim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiennm99%2Fphow2sim/lists"}