https://github.com/ftosoni/mediawiki-code2code-search

MediaWiki Code2Code Search is a high-performance semantic search tool designed specifically for the MediaWiki open-source ecosystem, integrated with the Software Heritage archive. It utilises a single-stage neural retrieval architecture to help developers navigate complex codebases with high precision and minimal resource usage.

Host: GitHub
URL: https://github.com/ftosoni/mediawiki-code2code-search
Owner: ftosoni
License: apache-2.0
Created: 2026-03-24T19:27:58.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-06-04T14:44:45.000Z (about 1 month ago)
Last Synced: 2026-06-04T16:06:39.717Z (about 1 month ago)
Topics: code-search-engine, mediawiki, neural-search, software-heritage
Language: Python
Homepage: https://code2codesearch.toolforge.org/
Size: 9.92 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Codemeta: codemeta.json

README

# MediaWiki Code2Code Search

A high-performance semantic code search engine designed for the MediaWiki ecosystem.
Built on the `Qwen3-Embedding-0.6B` neural embedding model and optimized for large-scale codebases like MediaWiki Core, Extensions, and WMF Operations.
Metadata is managed via indexed SQLite for sub-second responses and a low-memory footprint (Toolforge compatible).

As featured on [Wikimedia Diff](https://diff.wikimedia.org/2026/04/14/introducing-mediawiki-code2code-search-semantic-search-to-find-code-by-under-the-surface-similarity/).

## ✨ Key Features

- **📂 Global MediaWiki Indexing**: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
- **🧠 Single-Stage Neural Retrieval**: Uses `Qwen3-Embedding-0.6B` with FAISS `IndexIVFPQ` for lightning-fast results (approx. 0.3s).
- **🌳 Granular Structural Filtering**: High-precision extraction and filtering of **Functions**, **Types**, **Template Functions**, and **Template Types** across 10 languages.
- **🏗️ Split-Build Architecture**: Optimized for asymmetric hardware: run CPU-heavy extraction on a laptop and neural vectorization on a GPU.
- **🌍 Massive Localization Footprint**: Fully localized UI supporting **17 languages**.
- **🎨 Codex UI**: A clean, accessible frontend built with Wikimedia's **Codex Design System** for a native look and feel.
- **🔍 Advanced Multi-select Filtering**: Granular control over results by repository group, programming language, and entry type.
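
One way to picture how multi-select filters can sit on top of a SQLite metadata store is a parameterised `IN (...)` query. The sketch below is illustrative only: the `snippets` table, its columns, the sample rows, and `build_filter_query` are hypothetical names chosen for this example, not the project's actual schema or API.

```python
import sqlite3


def build_filter_query(groups=(), languages=(), entry_types=()):
    """Build a parameterised SELECT for the selected filter values.

    Table and column names are hypothetical, chosen for illustration.
    """
    clauses, params = [], []
    for column, values in (
        ("repo_group", groups),
        ("language", languages),
        ("entry_type", entry_types),
    ):
        if values:  # an empty selection means "no restriction"
            placeholders = ", ".join("?" for _ in values)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(values)
    sql = "SELECT id, name FROM snippets"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY id", params


# Tiny in-memory demo with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snippets (id INTEGER PRIMARY KEY, name TEXT, "
    "repo_group TEXT, language TEXT, entry_type TEXT)"
)
conn.executemany(
    "INSERT INTO snippets (name, repo_group, language, entry_type) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Title::newFromText", "core", "php", "function"),
        ("mw.Api", "core", "javascript", "type"),
        ("Parser::parse", "extensions", "php", "function"),
    ],
)
sql, params = build_filter_query(groups=["core"], languages=["php", "javascript"])
rows = conn.execute(sql, params).fetchall()
```

Values are always passed as bound parameters rather than interpolated into the SQL string, so user-selected filter values cannot alter the query structure.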

## 📂 Project Structure

```
mediawiki-code2code-search/
├── frontend/                    # Codex-based Static Frontend
│   ├── css/style.css            # Stylesheets using the Codex Design System
│   ├── js/main.js               # Main frontend application logic
│   └── i18n/                    # Localization JSONs supporting 17 languages
├── backend/                     # FAISS Index, SQLite & Vector DB Management
│   ├── generate_embeddings.py   # Computes neural embeddings from raw snippets (saves embeddings.npy)
│   ├── build_index.py           # Trains and builds the FAISS search index from saved embeddings
│   ├── migrate_to_sqlite.py     # RAM optimization script (JSON metadata -> SQLite)
│   ├── snippets.db              # SQLite metadata store for fast lookups
│   └── mediawiki.index          # Compiled FAISS vector index
├── preprocessing/               # Global-Scale Indexing Pipeline (Phases 1-3)
│   ├── list_repos.py            # Discovers and lists 2,400+ MediaWiki repositories
│   ├── download_repos.py        # Handles shallow clones of target repositories
│   ├── extract_entities.py      # Structural parsing & AST entity extraction
│   ├── archive_to_swh.py        # Software Heritage archiving pipeline scripts
│   └── resolve_swh_hashes.py    # Resolves local Git hashes to SWH SHA1 IDs
├── tests/                       # Parser & API Verification Suite
│   ├── test_api.py              # Backend API endpoint tests
│   ├── test_*_parser.py         # Syntax extraction validations for 10+ languages
│   └── example.*                # Target language snippets parsed during testing
├── scripts/                     # Internal utilities & metadata migration helpers
├── manuscript/                  # Academic paper & System documentation (LaTeX)
│   ├── main.tex                 # Manuscript source file documenting architecture
│   └── main.pdf                 # Compiled system documentation/paper
├── app.py                       # Root FastAPI web application entry point
├── download_models.py           # Script to pre-download model weights locally
├── requirements.txt             # Python backend dependencies
└── CITATION.cff                 # CITATION file for academic/repository reference
```

## 🚀 Scaling & Pipeline

The indexing pipeline is designed for a **mass-scale, distributed build**.

## 🛠️ Setup

### 💾 Pre-computed Artefacts (Recommended)

To run the search engine immediately without running the entire indexing pipeline (Phases 1-4) from scratch, you can download our pre-computed database and FAISS index from the **[Zenodo Dataset](https://doi.org/10.5281/zenodo.20586256)**:
1. Download `snippets.db` and `mediawiki.index`.
2. Place both files inside the `backend/` directory of the project.

For the frozen software source code release of the engine, see **[GitHub Release v2.0.0](https://github.com/ftosoni/mediawiki-code2code-search/releases/tag/v2.0.0)**.

### Backend (Python)
Create and activate a virtual environment (optional but recommended), install dependencies, and pre-download the neural models:
```bash
python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

pip install -r requirements.txt
python download_models.py
```

### Frontend (Static Assets)

The frontend is built with vanilla JavaScript and the Codex Design System. It consists of static HTML, CSS, and JS files located in the `frontend/` directory. These files are served directly by the FastAPI backend.

There is no compilation step required for the frontend.

### Phase 1: Discovery & Mirroring (Local)
First, discover the ecosystem and mirror it for processing:
```bash
cd preprocessing
python list_repos.py # Fetches 2,400+ repo URLs
python download_repos.py # Shallow clones (approx. 8GB disk space)
```

### Phase 2: Archiving (Global)
Ensure all repositories are archived in Software Heritage for on-demand retrieval.

> [!NOTE]
> `archive_to_swh.py` requires a `bulk_save` token. Most users should run the individual archiving script instead:
```bash
python archive_individual_to_swh.py
```

### Phase 3: Extraction (Local/CPU)
Perform high-precision structural parsing on your local machine. This captures functions and types with qualified names (e.g., `Class::Method`) and handles language features such as template functions and template types.

**Phase 3a: Structural Extraction**
```bash
python extract_structural_entities.py
```
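
To illustrate what extracting entities with qualified names can look like, here is a minimal sketch for Python source only, using the standard-library `ast` module. It is not the project's extractor: `qualified_entities`, the `"function"`/`"type"` labels, and the sample source are assumptions made for this example, and the real pipeline may rely on different parsers for its supported languages.

```python
import ast


def qualified_entities(source):
    """Return (kind, qualified_name) pairs for functions and classes."""
    found = []

    def visit(node, prefix):
        # Only direct children are inspected, so definitions hidden inside
        # `if` or `try` blocks are not reported by this simplified walker.
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "function"
            elif isinstance(child, ast.ClassDef):
                kind = "type"
            else:
                continue
            name = f"{prefix}::{child.name}" if prefix else child.name
            found.append((kind, name))
            visit(child, name)  # descend to pick up methods and nested definitions

    visit(ast.parse(source), "")
    return found


example = '''
class Parser:
    def parse(self, text):
        return text

def helper():
    pass
'''
entities = qualified_entities(example)
```

Carrying the enclosing name down the recursion is what turns a bare `parse` into `Parser::parse`, which keeps same-named methods from different classes distinguishable in search results.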

**Phase 3b: Identity Resolution**
Resolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):

* **Option A: Local Resolution (Recommended)**
```bash
python resolve_swh_hashes_local.py
```
* **Option B: API-based Resolution**
```bash
python resolve_swh_hashes.py
```
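
The two hash flavours can be contrasted in a few lines. This sketch assumes that "Git-compatible hash" refers to the Git blob object ID and "standard SHA1" to the SHA-1 of the raw file bytes; the function names are hypothetical and the code is not taken from either resolution script.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash content the way Git hashes a blob: 'blob <size>' + NUL + bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """SHA-1 of the raw bytes, with no Git object header."""
    return hashlib.sha1(content).hexdigest()


data = b"hello world\n"
git_id = git_blob_sha1(data)   # matches `git hash-object` for the same bytes
sha1_id = plain_sha1(data)     # matches `sha1sum` for the same bytes
```

Under that assumption, both identifiers are derived from the same file contents, which is why a local pass over the cloned repositories can compute the mapping without network calls.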

### Phase 4: Indexing (Remote/GPU)
Move `raw_snippets.json` to a GPU-equipped environment to compute neural vectors and build the FAISS index.
```bash
cd backend
python generate_embeddings.py # Computes and saves embeddings to embeddings.npy
python build_index.py # Trains and builds FAISS index from embeddings.npy
```

### Phase 5: Memory Optimization & Deployment (Local/Toolforge)
Before deploying, convert the production metadata to SQLite so the service stays within the 6 GiB RAM limit:
```bash
cd backend
python migrate_to_sqlite.py
```
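
The idea behind such a migration can be sketched as follows: instead of holding all metadata in a Python structure in RAM, each record is stored as a row keyed by an integer ID and fetched on demand. The table layout, the `migrate` and `lookup` helpers, and the choice of the row ID as the lookup key are assumptions for illustration; `migrate_to_sqlite.py` may be organised differently.

```python
import json
import sqlite3


def migrate(records, db_path):
    """Write snippet metadata records into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS snippets ("
            "id INTEGER PRIMARY KEY, name TEXT, language TEXT, payload TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO snippets (id, name, language, payload) "
            "VALUES (?, ?, ?, ?)",
            (
                (i, r["name"], r["language"], json.dumps(r))
                for i, r in enumerate(records)
            ),
        )
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_snippets_language "
            "ON snippets(language)"
        )
    return conn


def lookup(conn, row_id):
    """Fetch one record by integer ID, or None if it does not exist."""
    row = conn.execute(
        "SELECT payload FROM snippets WHERE id = ?", (row_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Made-up records standing in for the real JSON metadata.
records = [
    {"name": "Title::newFromText", "language": "php"},
    {"name": "mw.Api", "language": "javascript"},
]
conn = migrate(records, ":memory:")
hit = lookup(conn, 1)
```

With this layout only the rows behind the current result page are loaded into memory, while the bulk of the metadata stays on disk.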

Once the index and database are ready, start the FastAPI backend from the root directory:

```bash
# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000
```
The server will be available at `http://localhost:8000`. You can access the automatic API documentation at `http://localhost:8000/docs`.

---

## 🚀 Deployment (Toolforge)

Follow these steps to deploy the application on Wikimedia Toolforge.

> [!NOTE]
> The examples below use `supnabla` as the username and `code2codesearch` as the project name. Replace these with your own Toolforge credentials where applicable.

### 1. Upload Assets
Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:

```bash
# From the project root
scp -rp "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -rp "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -rp "./backend/snippets.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
```

### 2. Configure Permissions
Log into Toolforge and set the necessary permissions:

```bash
ssh supnabla@login.toolforge.org

chmod -R a+rX /data/project/code2codesearch/models/
chmod a+r /data/project/code2codesearch/backend/snippets.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index
```

### 3. Deploy
Now you are ready to deploy the webservice:

```bash
# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f
```

---

## 🛠️ Technology Stack & Project Status