{"id":50658739,"url":"https://github.com/ftosoni/mediawiki-code2code-search","last_synced_at":"2026-06-08T01:06:08.613Z","repository":{"id":349577233,"uuid":"1190953479","full_name":"ftosoni/mediawiki-code2code-search","owner":"ftosoni","description":"MediaWiki Code2Code Search is a high-performance semantic search tool designed specifically for the MediaWiki open-source ecosystem, integrated with the Software Heritage archive. It utilises a single-stage neural retrieval architecture to help developers navigate complex codebases with high precision and minimal resource usage.","archived":false,"fork":false,"pushed_at":"2026-06-04T14:44:45.000Z","size":10405,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-04T16:06:39.717Z","etag":null,"topics":["code-search-engine","mediawiki","neural-search","software-heritage"],"latest_commit_sha":null,"homepage":"https://code2codesearch.toolforge.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ftosoni.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json","zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-24T19:27:58.000Z","updated_at":"2026-06-04T14:46:38.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ftosoni/mediawiki-code2code-search","commit_stats":null,"previous_names":["ftosoni/mediawiki-code2code-search"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ftosoni/mediawiki-code2code-search","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fmediawiki-code2code-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fmediawiki-code2code-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fmediawiki-code2code-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fmediawiki-code2code-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ftosoni","download_url":"https://codeload.github.com/ftosoni/mediawiki-code2code-search/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fmediawiki-code2code-search/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34043826,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-search-engine","mediawiki","neural-search","software-heritage"],"created_at":"2026-06-08T01:06:08.045Z","updated_at":"2026-06-08T01:06:08.608Z","avatar_url":"https://github.com/ftosoni.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MediaWiki Code2Code Search\n\nA high-performance semantic code search engine designed for the MediaWiki ecosystem. \nBuilt on the Qwen 0.6B neural retrieval model, optimized for large-scale codebases like MediaWiki Core, Extensions, and WMF Operations.\nMetadata is managed via indexed SQLite for sub-second responses and a low-memory footprint (Toolforge compatible).\n\nAs featured on [Wikimedia Diff](https://diff.wikimedia.org/2026/04/14/introducing-mediawiki-code2code-search-semantic-search-to-find-code-by-under-the-surface-similarity/).\n\n## ✨ Key Features\n\n- **📂 Global MediaWiki Indexing**: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).\n- **🧠 Single-Stage Neural Retrieval**: Uses `Qwen3-Embedding-0.6B` with FAISS `IndexIVFPQ` for lightning-fast results (approx. 0.3s).\n- **🌳 Granular Structural Filtering**: High-precision extraction and filtering of **Functions**, **Types**, **Template Functions**, and **Template Types** across 10 languages.\n- **🏗️ Split-Build Architecture**: Optimized for asymmetric hardware—run heavy extraction on a laptop and neural vectorization on a GPU.\n- **🌍 Massive Localization Footprint**: Fully localized UI supporting **17 languages**.\n- **🎨 Codex UI**: A clean, accessible frontend built with Wikimedia's **Codex Design System** for a native look and feel.\n- **🔍 Advanced Multi-select Filtering**: Granular control over results by repository group, programming language, and entry type.\n\n## 📂 Project Structure\n\n```\nmediawiki-code2code-search/\n├── frontend/                  # Codex-based Static Frontend\n│   ├── css/style.css          # Stylesheets using the Codex Design System\n│   ├── js/main.js             # Main frontend application logic\n│   └── i18n/                  # Localization JSONs supporting 17 languages\n├── backend/                   # FAISS Index, SQLite \u0026 Vector DB Management\n│   ├── generate_embeddings.py # Computes neural embeddings from raw snippets (saves embeddings.npy)\n│   ├── build_index.py         # Trains and builds the FAISS search index from saved embeddings\n│   ├── migrate_to_sqlite.py   # RAM optimization script (JSON metadata -\u003e SQLite)\n│   ├── snippets.db            # SQLite metadata store for fast lookups\n│   └── mediawiki.index        # Compiled FAISS vector index\n├── preprocessing/             # Global-Scale Indexing Pipeline (Phases 1-3)\n│   ├── list_repos.py          # Discovers and lists 2,400+ MediaWiki repositories\n│   ├── download_repos.py      # Handles shallow clones of target repositories\n│   ├── extract_entities.py    # Structural parsing \u0026 AST entity extraction\n│   ├── archive_to_swh.py      # Software Heritage archiving pipeline scripts\n│   └── resolve_swh_hashes.py  # Resolves local Git hashes to SWH SHA1 IDs\n├── tests/                     # Parser \u0026 API Verification Suite\n│   ├── test_api.py            # Backend API endpoint tests\n│   ├── test_*_parser.py       # Syntax extraction validations for 10+ languages\n│   └── example.*              # Target language snippets parsed during testing\n├── scripts/                   # Internal utilities \u0026 metadata migration helpers\n├── manuscript/                # Academic paper \u0026 System documentation (LaTeX)\n│   ├── main.tex               # Manuscript source file documenting architecture\n│   └── main.pdf               # Compiled system documentation/paper\n├── app.py                     # Root FastAPI web application entry point\n├── download_models.py         # Script to pre-download model weights locally\n├── requirements.txt           # Python backend dependencies\n└── CITATION.cff               # CITATION file for academic/repository reference\n```\n\n## 🚀 Scaling \u0026 Pipeline\n\nThe indexing pipeline is designed for a **mass-scale, distributed build**. \n\n## 🛠️ Setup\n\n### 💾 Pre-computed Artefacts (Recommended)\n\nTo run the search engine immediately without running the entire indexing pipeline (Phases 1-4) from scratch, you can download our pre-computed database and FAISS index from the **[Zenodo Dataset](https://doi.org/10.5281/zenodo.20586256)**:\n1. Download `snippets.db` and `mediawiki.index`.\n2. Place both files inside the `backend/` directory of the project.\n\nFor the frozen software source code release of the engine, see **[GitHub Release v2.0.0](https://github.com/ftosoni/mediawiki-code2code-search/releases/tag/v2.0.0)**.\n\n### Backend (Python)\nCreate and activate a virtual environment (optional but recommended), install dependencies, and pre-download the neural models:\n```bash\npython -m venv venv\n# Windows:\n.\\venv\\Scripts\\activate\n# Linux/macOS:\nsource venv/bin/activate\n\npip install -r requirements.txt\npython download_models.py\n```\n\n### Frontend (Static Assets)\n\nThe frontend is built with vanilla JavaScript and the Codex Design System. It consists of static HTML, CSS, and JS files located in the `frontend/` directory. These files are served directly by the FastAPI backend.\n\nThere is no compilation step required for the frontend.\n\n### Phase 1: Discovery \u0026 Mirroring (Local)\nFirst, discover the ecosystem and mirror it for processing:\n```bash\ncd preprocessing\npython list_repos.py      # Fetches 2,400+ repo URLs\npython download_repos.py  # Shallow clones (approx. 8GB disk space)\n```\n\n### Phase 2: Archiving (Global)\nEnsure all repositories are archived in Software Heritage for on-demand retrieval.\n\n\u003e [!NOTE]\n\u003e `archive_to_swh.py` requires a \"bulk_save\" token. For most users, it is recommended to use:\n```bash\npython archive_individual_to_swh.py\n```\n\n### Phase 3: Extraction (Local/CPU)\nPerform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., `Class::Method`) and handles complex language features.\n\n**Phase 3a: Structural Extraction**\n```bash\npython extract_structural_entities.py\n```\n\n**Phase 3b: Identity Resolution**\nResolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):\n\n*   **Option A: Local Resolution (Recommended)**\n    ```bash\n    python resolve_swh_hashes_local.py\n    ```\n*   **Option B: API-based Resolution**\n    ```bash\n    python resolve_swh_hashes.py\n    ```\n\n### Phase 4: Indexing (Remote/GPU)\nMove `raw_snippets.json` to a GPU-equipped environment to compute neural vectors and build the FAISS index.\n```bash\ncd backend\npython generate_embeddings.py  # Computes and saves embeddings to embeddings.npy\npython build_index.py          # Trains and builds FAISS index from embeddings.npy\n```\n\n### Phase 5: Memory Optimization \u0026 Deployment (Local/Toolforge)\nBefore deploying, convert the production metadata to SQLite to stay within 6GiB RAM limits:\n```bash\ncd backend\npython migrate_to_sqlite.py\n```\n\nOnce the index and database are ready, start the FastAPI backend from the root directory:\n\n```bash\n# From the project root\nuvicorn app:app --host 0.0.0.0 --port 8000\n```\nThe server will be available at `http://localhost:8000`. You can access the automatic API documentation at `http://localhost:8000/docs`.\n\n\n---\n\n## 🚀 Deployment (Toolforge)\n\nFollow these steps to deploy the application on Wikimedia Toolforge.\n\n\u003e [!NOTE]\n\u003e The examples below use `supnabla` as the username and `code2codesearch` as the project name. Replace these with your own Toolforge credentials where applicable.\n\n### 1. Upload Assets\nSince the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:\n\n```bash\n# From the project root\nscp -rp \"./models\" supnabla@login.toolforge.org:/data/project/code2codesearch/\nscp -rp \"./backend/mediawiki.index\" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/\nscp -rp \"./backend/snippets.db\" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/\n```\n\n### 2. Configure Permissions\nLog into Toolforge and set the necessary permissions:\n\n```bash\nssh supnabla@login.toolforge.org\n\nchmod -R a+rX /data/project/code2codesearch/models/\nchmod a+r /data/project/code2codesearch/backend/snippets.db\nchmod a+r /data/project/code2codesearch/backend/mediawiki.index\n```\n\n### 3. Deploy\nNow you are ready to deploy the webservice:\n\n```bash\n# Switch to the code2codesearch project\nbecome code2codesearch\n\n# Stop and clean existing build\ntoolforge webservice buildservice stop --mount=all\ntoolforge build clean -y\n\n# Start build from repository\ntoolforge build start https://github.com/ftosoni/mediawiki-code2code-search\n\n# Start webservice with 6GiB RAM\ntoolforge webservice buildservice start --mount=all -m 6Gi\n\n# Monitor logs\ntoolforge webservice logs -f\n```\n\n---\n\n## 🛠️ Technology Stack \u0026 Project Status\n\n\u003cp align=\"left\"\u003e\n  \u003c!-- Project Status \u0026 License --\u003e\n  \u003ca href=\"https://github.com/ftosoni/mediawiki-code2code-search/actions/workflows/python-ci.yml\"\u003e\u003cimg src=\"https://github.com/ftosoni/mediawiki-code2code-search/actions/workflows/python-ci.yml/badge.svg?branch=main\u0026style=flat-square\" alt=\"CI Status\"\u003e\u003c/a\u003e\n  \u003ca href=\"./LICENCE.txt\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache_2.0-blue?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.python.org/dev/peps/pep-0008/\"\u003e\u003cimg src=\"https://img.shields.io/badge/code%20style-pep8-orange.svg?style=flat-square\" alt=\"Code Style: PEP8\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://archive.softwareheritage.org/browse/origin/?origin_url=https://github.com/ftosoni/mediawiki-code2code-search\"\u003e\u003cimg src=\"https://archive.softwareheritage.org/badge/origin/https://github.com/ftosoni/mediawiki-code2code-search/\" alt=\"SWH Origin\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://archive.softwareheritage.org/swh:1:dir:fe86b58fb35118c474fce8f7a38b4bc541440653;origin=https://github.com/ftosoni/mediawiki-code2code-search;visit=swh:1:snp:0925f0ac8b48e9b46b741090d50781140d1e037b;anchor=swh:1:rev:fcfcb6a6bff6534ce0a7203d1553219b5947504a\"\u003e\u003cimg src=\"https://archive.softwareheritage.org/badge/swh:1:dir:c30104117db9bb6a8e488e698173fa2b302cbffc/\" alt=\"SWH Directory\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003c!-- Frontend \u0026 Design --\u003e\n  \u003ca href=\"https://doc.wikimedia.org/codex/main/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Codex-Design_System-3366cc?logo=wikimedia-commons\u0026logoColor=white\" alt=\"Codex\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript\"\u003e\u003cimg src=\"https://img.shields.io/badge/JavaScript-ES6+-f7df1e?logo=javascript\u0026logoColor=black\" alt=\"JavaScript\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003c!-- Backend \u0026 Core --\u003e\n  \u003ca href=\"https://fastapi.tiangolo.com/\"\u003e\u003cimg src=\"https://img.shields.io/badge/FastAPI-009688?logo=fastapi\u0026logoColor=white\" alt=\"FastAPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.python.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Python-3.11+-3776ab?logo=python\u0026logoColor=white\" alt=\"Python 3.11+\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.uvicorn.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Uvicorn-222?logo=gunicorn\u0026logoColor=white\" alt=\"Uvicorn\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003c!-- Vector Search \u0026 DB --\u003e\n  \u003ca href=\"https://github.com/facebookresearch/faiss\"\u003e\u003cimg src=\"https://img.shields.io/badge/FAISS-Vector_Index-blueviolet\" alt=\"FAISS\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Vector_Indexes-1024d-blueviolet\" alt=\"Vector indexes (1024d)\"\u003e\n  \u003ca href=\"https://www.sqlite.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/SQLite-metadata_store-003b57?logo=sqlite\u0026logoColor=white\" alt=\"SQLite\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003c!-- AI, Extraction \u0026 Archive --\u003e\n  \u003ca href=\"https://huggingface.co/Qwen/Qwen3-Embedding-0.6B\"\u003e\u003cimg src=\"https://img.shields.io/badge/Qwen3_Embedding-0.6B-5374ff?logo=huggingface\u0026logoColor=white\" alt=\"Qwen3 Embedding 0.6B\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://tree-sitter.github.io/tree-sitter/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Tree--sitter-parsers-green\" alt=\"Tree-sitter\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://archive.softwareheritage.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Software_Heritage-archive-002f56\" alt=\"Software Heritage\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003c!-- CI/CD \u0026 Deploy --\u003e\n  \u003ca href=\"https://wikitech.wikimedia.org/wiki/Portal:Toolforge\"\u003e\u003cimg src=\"https://img.shields.io/badge/Toolforge-deploy-3366cc\" alt=\"Toolforge\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ftosoni/mediawiki-code2code-search/actions\"\u003e\u003cimg src=\"https://img.shields.io/badge/GitHub_Actions-CI%2FCD-2088FF?logo=githubactions\u0026logoColor=white\" alt=\"GitHub Actions\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://docs.pytest.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/pytest-tests-0A9EDC?logo=pytest\u0026logoColor=white\" alt=\"pytest\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n## 📄 Licence\n[Apache 2.0 License](./LICENCE.txt). Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fftosoni%2Fmediawiki-code2code-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fftosoni%2Fmediawiki-code2code-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fftosoni%2Fmediawiki-code2code-search/lists"}