{"id":31780636,"url":"https://github.com/ashly1991/word2vec-tf2","last_synced_at":"2026-05-18T19:03:03.115Z","repository":{"id":316242570,"uuid":"1062640907","full_name":"Ashly1991/word2vec-tf2","owner":"Ashly1991","description":"Word2Vec Skipgram with negative sampling in TensorFlow 2. Self-supervised embeddings, efficient sampled softmax, and analogies evaluation.","archived":false,"fork":false,"pushed_at":"2025-09-23T14:28:37.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-23T14:42:42.659Z","etag":null,"topics":["embeddings","jupyter-notebook","natural-language-processing","negative-sampling","nlp","self-supervised-learning","skipgram","tensorflow","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ashly1991.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-23T14:21:49.000Z","updated_at":"2025-09-23T14:30:44.000Z","dependencies_parsed_at":"2025-09-23T14:42:44.608Z","dependency_job_id":null,"html_url":"https://github.com/Ashly1991/word2vec-tf2","commit_stats":null,"previous_names":["ashly1991/word2vec-tf2"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Ashly1991/word2vec-tf2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashly1991%2Fword2vec-tf2","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/Ashly1991%2Fword2vec-tf2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashly1991%2Fword2vec-tf2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashly1991%2Fword2vec-tf2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ashly1991","download_url":"https://codeload.github.com/Ashly1991/word2vec-tf2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashly1991%2Fword2vec-tf2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279003305,"owners_count":26083555,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","jupyter-notebook","natural-language-processing","negative-sampling","nlp","self-supervised-learning","skipgram","tensorflow","word2vec"],"created_at":"2025-10-10T08:16:50.448Z","updated_at":"2025-10-10T08:16:53.010Z","avatar_url":"https://github.com/Ashly1991.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Word2Vec — Skipgram with Negative Sampling (TensorFlow 2)\n\nThis project implements **Word2Vec** to learn **word embeddings** from raw text using a **self‑supervised** objective. 
It focuses on **Skipgram** with **negative sampling**, discusses why full softmax is inefficient with large vocabularies, and explores simple intrinsic evaluations (nearest neighbors, analogies).\n\n## Key Points\n- Example of **self‑supervised learning**: define a prediction task directly from unlabeled text to learn useful representations.\n- Understand why **full softmax** is problematic with large vocabularies and how **negative sampling** / sampled losses help.\n- Build and analyze **word embeddings**.\n\n## Questions for Understanding\n1. Given “I like to cuddle dogs”, how many skipgrams are created with window size 2?  \n2. In general, how does the number of skipgrams relate to dataset size (as input‑target pairs)?  \n3. Why isn’t computing the **full softmax** a good idea?  \n4. For a fixed (target, context) pair, are the **negative samples** re‑drawn each time or reused?  \n5. For the Shakespeare dataset, do we create (target, context) pairs across **line breaks** (last word of one line plus first word of the next)?  \n6. Are skipgrams generated for **padding** tokens (index 0)?  \n7. The **sampling table** is created without reading the text. How does it decide probabilities?\n\n## Possible Improvements \u0026 Extensions\n- **Skip padding**: prevent generating skipgrams for padding tokens.  \n- **Re‑draw negatives** each iteration to reduce bias.  \n- **Avoid true‑context collisions** in negatives (e.g., use `tf.nn.sampled_softmax_loss` with `remove_accidental_hits=True`, its default; this requires refactoring).  \n- **Analogies**: e.g., `king - man + woman ≈ queen`, computed via **cosine similarity** over embeddings.  \n- **Scale up**: larger vocabulary/corpora; compare **naive full softmax** vs **negative sampling** efficiency as vocab grows.\n\n## Optional: CBOW Variant\n- Build **CBOW** (predict the center word from surrounding context).  \n- Create windows (e.g., with `tf.data.Dataset.window`), average context embeddings, keep negative sampling.  
\n- Compare CBOW vs Skipgram.\n\n## How to Run\n```bash\npython -m venv .venv \u0026\u0026 source .venv/bin/activate     # Windows: .venv\\Scripts\\activate\npip install -r requirements.txt\njupyter lab word2vec-skipgram.ipynb\n```\n\n## Requirements\n```\ntensorflow==2.13.0\nnumpy\nmatplotlib\njupyterlab\ntqdm\n```\n(Adjust versions as needed for your environment.)\n\n## Notes\n- For reproducible negatives: set a random seed and re‑sample within the training loop.  \n- Monitor embedding quality with **nearest neighbors** and **analogy tests**; results improve with data size and training time.\n\n## License\nMIT — see `LICENSE`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashly1991%2Fword2vec-tf2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashly1991%2Fword2vec-tf2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashly1991%2Fword2vec-tf2/lists"}