https://github.com/ftosoni/mediawiki-code2code-search
MediaWiki Code2Code Search is a high-performance semantic search tool designed specifically for the MediaWiki open-source ecosystem, integrated with the Software Heritage archive. It utilises a single-stage neural retrieval architecture to help developers navigate complex codebases with high precision and minimal resource usage.
- Host: GitHub
- URL: https://github.com/ftosoni/mediawiki-code2code-search
- Owner: ftosoni
- License: apache-2.0
- Created: 2026-03-24T19:27:58.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-06-04T14:44:45.000Z (11 days ago)
- Last Synced: 2026-06-04T16:06:39.717Z (11 days ago)
- Topics: code-search-engine, mediawiki, neural-search, software-heritage
- Language: Python
- Homepage: https://code2codesearch.toolforge.org/
- Size: 9.92 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Codemeta: codemeta.json
# MediaWiki Code2Code Search
A high-performance semantic code search engine designed for the MediaWiki ecosystem.
Built on the `Qwen3-Embedding-0.6B` neural retrieval model and optimized for large-scale codebases such as MediaWiki Core, Extensions, and WMF Operations.
Metadata is managed via indexed SQLite for sub-second responses and a low-memory footprint (Toolforge compatible).
As featured on [Wikimedia Diff](https://diff.wikimedia.org/2026/04/14/introducing-mediawiki-code2code-search-semantic-search-to-find-code-by-under-the-surface-similarity/).
## Key Features
- **Global MediaWiki Indexing**: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
- **Single-Stage Neural Retrieval**: Uses `Qwen3-Embedding-0.6B` with FAISS `IndexIVFPQ` for fast results (approx. 0.3s).
- **Granular Structural Filtering**: High-precision extraction and filtering of **Functions**, **Types**, **Template Functions**, and **Template Types** across 10 languages.
- **Split-Build Architecture**: Optimized for asymmetric hardware: run heavy extraction on a laptop and neural vectorization on a GPU.
- **Massive Localization Footprint**: Fully localized UI supporting **17 languages**.
- **Codex UI**: A clean, accessible frontend built with Wikimedia's **Codex Design System** for a native look and feel.
- **Advanced Multi-select Filtering**: Granular control over results by repository group, programming language, and entry type.
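The multi-select filtering listed above could be backed by an indexed SQLite query along the following lines. This is only a minimal sketch: the `snippets` table, its column names, and the `filter_candidates` helper are assumptions made for illustration and are not taken from the project's actual schema.

```python
import sqlite3


def placeholders(n):
    """Return n comma-separated SQL parameter markers, e.g. '?,?,?'."""
    return ",".join("?" * n)


def filter_candidates(conn, candidate_ids, repo_groups=(), languages=(), entry_types=()):
    """Keep only candidate ids whose metadata matches every selected filter.

    An empty selection for a filter means "no restriction".
    """
    if not candidate_ids:
        return []
    clauses = [f"id IN ({placeholders(len(candidate_ids))})"]
    params = list(candidate_ids)
    selections = (
        ("repo_group", repo_groups),
        ("language", languages),
        ("entry_type", entry_types),
    )
    for column, selected in selections:
        if selected:
            clauses.append(f"{column} IN ({placeholders(len(selected))})")
            params.extend(selected)
    sql = "SELECT id FROM snippets WHERE " + " AND ".join(clauses)
    matching = {row[0] for row in conn.execute(sql, params)}
    # Preserve the ranking order produced by the vector search.
    return [i for i in candidate_ids if i in matching]


# Hypothetical schema and rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snippets ("
    "id INTEGER PRIMARY KEY, repo_group TEXT, language TEXT, entry_type TEXT)"
)
conn.execute("CREATE INDEX idx_filters ON snippets (language, entry_type)")
conn.executemany(
    "INSERT INTO snippets VALUES (?, ?, ?, ?)",
    [
        (1, "core", "php", "function"),
        (2, "extensions", "javascript", "function"),
        (3, "core", "php", "type"),
        (4, "skins", "python", "function"),
    ],
)

ranked = [4, 3, 1, 2]  # ids as they might come back from the vector index
print(filter_candidates(conn, ranked, languages=["php", "python"], entry_types=["function"]))
# → [4, 1]
```

Filtering after retrieval keeps the vector index free of metadata; the trade-off is that a very selective filter may leave few results unless more candidates are requested from the index.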
## Project Structure
```
mediawiki-code2code-search/
├── frontend/                    # Codex-based Static Frontend
│   ├── css/style.css            # Stylesheets using the Codex Design System
│   ├── js/main.js               # Main frontend application logic
│   └── i18n/                    # Localization JSONs supporting 17 languages
├── backend/                     # FAISS Index, SQLite & Vector DB Management
│   ├── generate_embeddings.py   # Computes neural embeddings from raw snippets (saves embeddings.npy)
│   ├── build_index.py           # Trains and builds the FAISS search index from saved embeddings
│   ├── migrate_to_sqlite.py     # RAM optimization script (JSON metadata -> SQLite)
│   ├── snippets.db              # SQLite metadata store for fast lookups
│   └── mediawiki.index          # Compiled FAISS vector index
├── preprocessing/               # Global-Scale Indexing Pipeline (Phases 1-3)
│   ├── list_repos.py            # Discovers and lists 2,400+ MediaWiki repositories
│   ├── download_repos.py        # Handles shallow clones of target repositories
│   ├── extract_entities.py      # Structural parsing & AST entity extraction
│   ├── archive_to_swh.py        # Software Heritage archiving pipeline scripts
│   └── resolve_swh_hashes.py    # Resolves local Git hashes to SWH SHA1 IDs
├── tests/                       # Parser & API Verification Suite
│   ├── test_api.py              # Backend API endpoint tests
│   ├── test_*_parser.py         # Syntax extraction validations for 10+ languages
│   └── example.*                # Target language snippets parsed during testing
├── scripts/                     # Internal utilities & metadata migration helpers
├── manuscript/                  # Academic paper & System documentation (LaTeX)
│   ├── main.tex                 # Manuscript source file documenting architecture
│   └── main.pdf                 # Compiled system documentation/paper
├── app.py                       # Root FastAPI web application entry point
├── download_models.py           # Script to pre-download model weights locally
├── requirements.txt             # Python backend dependencies
└── CITATION.cff                 # CITATION file for academic/repository reference
```
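## Scaling & Pipeline
The indexing pipeline is designed for a **mass-scale, distributed build**: CPU-bound discovery and extraction (Phases 1-3) can run on a local machine, while embedding and index construction (Phase 4) run on a GPU-equipped host.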
## Setup
### Pre-computed Artefacts (Recommended)
To run the search engine immediately without running the entire indexing pipeline (Phases 1-4) from scratch, you can download our pre-computed database and FAISS index from the **[Zenodo Dataset](https://doi.org/10.5281/zenodo.20586256)**:
1. Download `snippets.db` and `mediawiki.index`.
2. Place both files inside the `backend/` directory of the project.
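Before starting the server, it can help to confirm that both artefacts are where the backend expects them. The helper below is a small sketch using only the standard library; the function name `missing_artefacts` is illustrative and not part of the project.

```python
from pathlib import Path

# File names as given in the steps above.
REQUIRED_ARTEFACTS = ("snippets.db", "mediawiki.index")


def missing_artefacts(backend_dir):
    """Return the names of required artefacts that are absent or empty."""
    backend = Path(backend_dir)
    missing = []
    for name in REQUIRED_ARTEFACTS:
        path = backend / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Run from the project root.
missing = missing_artefacts("backend")
if missing:
    print("Missing artefacts:", ", ".join(missing))
else:
    print("All pre-computed artefacts are in place.")
```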
For the frozen source code release of the engine, see **[GitHub Release v2.0.0](https://github.com/ftosoni/mediawiki-code2code-search/releases/tag/v2.0.0)**.
### Backend (Python)
Create and activate a virtual environment (optional but recommended), install dependencies, and pre-download the neural models:
```bash
python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate
pip install -r requirements.txt
python download_models.py
```
### Frontend (Static Assets)
The frontend is built with vanilla JavaScript and the Codex Design System. It consists of static HTML, CSS, and JS files located in the `frontend/` directory. These files are served directly by the FastAPI backend.
There is no compilation step required for the frontend.
### Phase 1: Discovery & Mirroring (Local)
First, discover the ecosystem and mirror it for processing:
```bash
cd preprocessing
python list_repos.py # Fetches 2,400+ repo URLs
python download_repos.py # Shallow clones (approx. 8GB disk space)
```
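For readers unfamiliar with shallow clones, the sketch below shows how one could be issued from Python with `git clone --depth 1`, which fetches only the latest commit and keeps disk usage low. It is an illustration only: the helper names are hypothetical, and the actual behaviour of `download_repos.py` is not documented here.

```python
import subprocess
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def clone_command(repo_url, dest_root):
    """Build a `git clone --depth 1` command for a single repository."""
    name = PurePosixPath(urlparse(repo_url).path).name
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return ["git", "clone", "--depth", "1", repo_url, str(Path(dest_root) / name)]


def shallow_clone(repo_url, dest_root):
    """Run the shallow clone; raises CalledProcessError if git fails."""
    subprocess.run(clone_command(repo_url, dest_root), check=True)


# Printing the command is a cheap dry run before cloning thousands of repos.
print(clone_command("https://github.com/ftosoni/mediawiki-code2code-search", "repos"))
```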
### Phase 2: Archiving (Global)
Ensure all repositories are archived in Software Heritage for on-demand retrieval.
> [!NOTE]
> `archive_to_swh.py` requires a "bulk_save" token. For most users, it is recommended to use:
```bash
python archive_individual_to_swh.py
```
### Phase 3: Extraction (Local/CPU)
Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., `Class::Method`) and handles complex language features.
**Phase 3a: Structural Extraction**
```bash
python extract_structural_entities.py
```
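To illustrate what extraction with qualified names means, the sketch below uses Python's standard `ast` module to list the functions and classes in a Python source string, joining nested names with `::` as in the `Class::Method` example. It covers a single language and is not the project's extractor; the `extract_entities` function and its output format are assumptions for demonstration.

```python
import ast


def extract_entities(source):
    """Return (kind, qualified_name, first_line) for each function and class."""
    entities = []

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                entities.append(("type", "::".join(scope + [child.name]), child.lineno))
                visit(child, scope + [child.name])
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                entities.append(("function", "::".join(scope + [child.name]), child.lineno))
                visit(child, scope + [child.name])
            else:
                # Descend through other nodes (if/try/with blocks, etc.).
                visit(child, scope)

    visit(ast.parse(source), [])
    return entities


example = '''
class Parser:
    def parse(self, text):
        return text.split()

def main():
    pass
'''

for entity in extract_entities(example):
    print(entity)
# → ('type', 'Parser', 2)
# → ('function', 'Parser::parse', 3)
# → ('function', 'main', 6)
```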
**Phase 3b: Identity Resolution**
Resolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):
* **Option A: Local Resolution (Recommended)**
```bash
python resolve_swh_hashes_local.py
```
* **Option B: API-based Resolution**
```bash
python resolve_swh_hashes.py
```
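The difference between the two hash forms can be made concrete with `hashlib`. Git identifies a blob by the SHA-1 of a `blob <size>\0` header followed by the content, whereas a standard SHA-1 is computed over the content alone. The sketch assumes that "Git-compatible hash" refers to this Git blob identifier; the function names are illustrative and not taken from the project's scripts.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 as Git computes it for a blob: header plus content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """Standard SHA-1 over the raw content only."""
    return hashlib.sha1(content).hexdigest()


# The two digests differ even for an empty file.
print(git_blob_sha1(b""))  # → e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(plain_sha1(b""))     # → da39a3ee5e6b4b0d3255bfef95601890afd80709
```

Local resolution is possible because both digests depend only on the file's bytes, so no network call is needed once the content is on disk.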
### Phase 4: Indexing (Remote/GPU)
Move `raw_snippets.json` to a GPU-equipped environment to compute neural vectors and build the FAISS index.
```bash
cd backend
python generate_embeddings.py # Computes and saves embeddings to embeddings.npy
python build_index.py # Trains and builds FAISS index from embeddings.npy
```
### Phase 5: Memory Optimization & Deployment (Local/Toolforge)
Before deploying, convert the production metadata to SQLite to stay within the 6 GiB RAM limit:
```bash
cd backend
python migrate_to_sqlite.py
```
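The idea behind this step is that SQLite reads rows from disk on demand, whereas a JSON metadata file has to be held in memory in full. The following sketch shows the general shape of such a migration with the standard library; the record fields, table layout, and `migrate` function are assumptions and do not reflect the actual contents of `migrate_to_sqlite.py`.

```python
import json
import sqlite3


def migrate(records, db_path):
    """Write snippet metadata records into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS snippets ("
            "id INTEGER PRIMARY KEY, repo TEXT, name TEXT, language TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO snippets VALUES (:id, :repo, :name, :language)",
            records,
        )
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_snippets_language ON snippets (language)"
        )
    return conn


# Hypothetical records standing in for the JSON metadata.
raw = json.loads(
    '[{"id": 0, "repo": "core", "name": "Parser::parse", "language": "php"},'
    ' {"id": 1, "repo": "skins", "name": "render", "language": "javascript"}]'
)
db = migrate(raw, ":memory:")

# A lookup by primary key touches one row instead of the whole dataset.
print(db.execute("SELECT name FROM snippets WHERE id = ?", (1,)).fetchone()[0])
# → render
```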
Once the index and database are ready, start the FastAPI backend from the root directory:
```bash
# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000
```
The server will be available at `http://localhost:8000`. You can access the automatic API documentation at `http://localhost:8000/docs`.
---
## Deployment (Toolforge)
Follow these steps to deploy the application on Wikimedia Toolforge.
> [!NOTE]
> The examples below use `supnabla` as the username and `code2codesearch` as the project name. Replace these with your own Toolforge credentials where applicable.
### 1. Upload Assets
Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:
```bash
# From the project root
scp -rp "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -rp "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -rp "./backend/snippets.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
```
### 2. Configure Permissions
Log into Toolforge and set the necessary permissions:
```bash
ssh supnabla@login.toolforge.org
chmod -R a+rX /data/project/code2codesearch/models/
chmod a+r /data/project/code2codesearch/backend/snippets.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index
```
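To double-check the result of the `chmod` commands, a short script can walk a directory and report anything that other users still cannot read. This is an optional sketch for POSIX systems, not part of the project; the helper names are hypothetical.

```python
import os
import stat


def is_world_readable(path):
    """True if 'others' can read the path (and traverse it, for directories)."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IROTH:
        return False
    if stat.S_ISDIR(mode):
        return bool(mode & stat.S_IXOTH)
    return True


def unreadable_paths(root):
    """List every path under root (inclusive) that is not world-readable."""
    problems = [] if is_world_readable(root) else [root]
    if os.path.isdir(root):
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                full = os.path.join(dirpath, name)
                if not is_world_readable(full):
                    problems.append(full)
    return problems


# Example (on Toolforge):
#   unreadable_paths("/data/project/code2codesearch/models")
# An empty list means the permissions are set as intended.
```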
### 3. Deploy
Now you are ready to deploy the webservice:
```bash
# Switch to the code2codesearch project
become code2codesearch
# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y
# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search
# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi
# Monitor logs
toolforge webservice logs -f
```
---
## Technology Stack & Project Status
## Licence
[Apache 2.0 License](./LICENCE.txt). Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.