https://github.com/ftosoni/mediawiki-code2code-search

MediaWiki Code2Code Search is a high-performance semantic search tool designed specifically for the MediaWiki open-source ecosystem, integrated with the Software Heritage archive. It utilises a single-stage neural retrieval architecture to help developers navigate complex codebases with high precision and minimal resource usage.

# MediaWiki Code2Code Search

A high-performance semantic code search engine designed for the MediaWiki ecosystem.
Built on the `Qwen3-Embedding-0.6B` neural embedding model and optimized for large-scale codebases such as MediaWiki Core, Extensions, and WMF Operations.
Metadata is stored in an indexed SQLite database for sub-second responses and a low memory footprint (Toolforge compatible).

As featured on [Wikimedia Diff](https://diff.wikimedia.org/2026/04/14/introducing-mediawiki-code2code-search-semantic-search-to-find-code-by-under-the-surface-similarity/).

## ✨ Key Features

- **📂 Global MediaWiki Indexing**: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
- **🧠 Single-Stage Neural Retrieval**: Uses `Qwen3-Embedding-0.6B` with FAISS `IndexIVFPQ` for lightning-fast results (approx. 0.3s).
- **🌳 Granular Structural Filtering**: High-precision extraction and filtering of **Functions**, **Types**, **Template Functions**, and **Template Types** across 10 languages.
- **🏗️ Split-Build Architecture**: Optimized for asymmetric hardware: run heavy extraction on a laptop and neural vectorization on a GPU.
- **🌍 Massive Localization Footprint**: Fully localized UI supporting **17 languages**.
- **🎨 Codex UI**: A clean, accessible frontend built with Wikimedia's **Codex Design System** for a native look and feel.
- **🔍 Advanced Multi-select Filtering**: Granular control over results by repository group, programming language, and entry type.

## 📂 Project Structure

```
mediawiki-code2code-search/
├── frontend/                   # Codex-based Static Frontend
│   ├── css/style.css           # Stylesheets using the Codex Design System
│   ├── js/main.js              # Main frontend application logic
│   └── i18n/                   # Localization JSONs supporting 17 languages
├── backend/                    # FAISS Index, SQLite & Vector DB Management
│   ├── generate_embeddings.py  # Computes neural embeddings from raw snippets (saves embeddings.npy)
│   ├── build_index.py          # Trains and builds the FAISS search index from saved embeddings
│   ├── migrate_to_sqlite.py    # RAM optimization script (JSON metadata -> SQLite)
│   ├── snippets.db             # SQLite metadata store for fast lookups
│   └── mediawiki.index         # Compiled FAISS vector index
├── preprocessing/              # Global-Scale Indexing Pipeline (Phases 1-3)
│   ├── list_repos.py           # Discovers and lists 2,400+ MediaWiki repositories
│   ├── download_repos.py       # Handles shallow clones of target repositories
│   ├── extract_entities.py     # Structural parsing & AST entity extraction
│   ├── archive_to_swh.py       # Software Heritage archiving pipeline scripts
│   └── resolve_swh_hashes.py   # Resolves local Git hashes to SWH SHA1 IDs
├── tests/                      # Parser & API Verification Suite
│   ├── test_api.py             # Backend API endpoint tests
│   ├── test_*_parser.py        # Syntax extraction validations for 10+ languages
│   └── example.*               # Target language snippets parsed during testing
├── scripts/                    # Internal utilities & metadata migration helpers
├── manuscript/                 # Academic paper & System documentation (LaTeX)
│   ├── main.tex                # Manuscript source file documenting architecture
│   └── main.pdf                # Compiled system documentation/paper
├── app.py                      # Root FastAPI web application entry point
├── download_models.py          # Script to pre-download model weights locally
├── requirements.txt            # Python backend dependencies
└── CITATION.cff                # CITATION file for academic/repository reference
```

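## 🚀 Scaling & Pipeline

The indexing pipeline is designed for a **mass-scale, distributed build**: the phases described under Setup can run on different machines, with structural extraction on a local CPU and embedding on a remote GPU.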

## πŸ› οΈ Setup

### πŸ’Ύ Pre-computed Artefacts (Recommended)

To run the search engine immediately without running the entire indexing pipeline (Phases 1-4) from scratch, you can download our pre-computed database and FAISS index from the **[Zenodo Dataset](https://doi.org/10.5281/zenodo.20586256)**:
1. Download `snippets.db` and `mediawiki.index`.
2. Place both files inside the `backend/` directory of the project.

For the frozen source code release of the engine, see **[GitHub Release v2.0.0](https://github.com/ftosoni/mediawiki-code2code-search/releases/tag/v2.0.0)**.

### Backend (Python)
Create and activate a virtual environment (optional but recommended), install dependencies, and pre-download the neural models:
```bash
python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

pip install -r requirements.txt
python download_models.py
```

### Frontend (Static Assets)

The frontend is built with vanilla JavaScript and the Codex Design System. It consists of static HTML, CSS, and JS files located in the `frontend/` directory. These files are served directly by the FastAPI backend.

There is no compilation step required for the frontend.

### Phase 1: Discovery & Mirroring (Local)
First, discover the ecosystem and mirror it for processing:
```bash
cd preprocessing
python list_repos.py # Fetches 2,400+ repo URLs
python download_repos.py # Shallow clones (approx. 8GB disk space)
```
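
For readers who want to see what a shallow clone amounts to, here is a minimal sketch that builds the `git clone --depth 1` command for one repository. It is not taken from `download_repos.py`; the function name, the `repos` destination folder, the `__` naming scheme, and the example URL are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def shallow_clone_command(repo_url: str, dest_root: str = "repos") -> list[str]:
    """Build a `git clone --depth 1` command for one repository URL."""
    # Derive the folder name from the full URL path so that repositories
    # with the same basename in different groups do not collide.
    path = urlparse(repo_url).path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    dest = PurePosixPath(dest_root) / path.replace("/", "__")
    # --depth 1 fetches only the latest commit, which keeps disk usage low.
    return ["git", "clone", "--depth", "1", repo_url, str(dest)]


# Hypothetical repository URL, used only to show the resulting command.
cmd = shallow_clone_command("https://example.org/mediawiki/extensions/Foo.git")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the clone.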

### Phase 2: Archiving (Global)
Ensure all repositories are archived in Software Heritage for on-demand retrieval.

> [!NOTE]
> `archive_to_swh.py` requires a "bulk_save" token. For most users, it is recommended to use:
```bash
python archive_individual_to_swh.py
```

### Phase 3: Extraction (Local/CPU)
Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., `Class::Method`) and handles complex language features.

**Phase 3a: Structural Extraction**
```bash
python extract_structural_entities.py
```
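
To illustrate the idea of extracting entities with qualified names, the sketch below walks a Python syntax tree and records functions and types as `Class::Method`-style names with their line spans. The project lists Tree-sitter in its stack; this example swaps in Python's standard `ast` module so that it runs without extra dependencies, and the record fields (`kind`, `name`, `start_line`, `end_line`) are assumptions rather than the format produced by the extraction script.

```python
import ast


def extract_entities(source: str) -> list[dict]:
    """Return functions and classes with qualified names and line spans."""
    entities = []

    def visit(node: ast.AST, scope: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                kind = "type" if isinstance(child, ast.ClassDef) else "function"
                entities.append({
                    "kind": kind,
                    # Join enclosing scopes with '::' to form a qualified name.
                    "name": "::".join(scope + [child.name]),
                    "start_line": child.lineno,
                    "end_line": child.end_lineno,
                })
                # Descend with the entity's own name added to the scope.
                visit(child, scope + [child.name])
            else:
                visit(child, scope)

    visit(ast.parse(source), [])
    return entities


SAMPLE = '''\
class Parser:
    def parse(self, text):
        return text.strip()

def main():
    return Parser().parse(" hi ")
'''

entities = extract_entities(SAMPLE)
for entity in entities:
    print(entity["kind"], entity["name"], entity["start_line"], entity["end_line"])
```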

**Phase 3b: Identity Resolution**
Resolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):

* **Option A: Local Resolution (Recommended)**
```bash
python resolve_swh_hashes_local.py
```
* **Option B: API-based Resolution**
```bash
python resolve_swh_hashes.py
```
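
The difference between the two kinds of hash can be shown in a few lines. This sketch assumes that "Git-compatible hash" means the Git blob object ID (SHA-1 over a `blob <size>\0` header followed by the file bytes, which Software Heritage calls `sha1_git`) and that "standard SHA1" means the plain SHA-1 of the file bytes. The function names are illustrative and not taken from the resolution scripts.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Git blob ID: SHA-1 of a 'blob <size>' header, a NUL byte, then the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """Plain SHA-1 of the raw bytes, with no Git header."""
    return hashlib.sha1(content).hexdigest()


def resolve(content: bytes) -> dict:
    """Compute both identifiers for the same content."""
    return {"sha1_git": git_blob_sha1(content), "sha1": plain_sha1(content)}


ids = resolve(b"hello world\n")
print(ids["sha1_git"])
print(ids["sha1"])
```

Because both values are computed from the same bytes, a local checkout is enough to map one to the other without calling the API.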

### Phase 4: Indexing (Remote/GPU)
Move `raw_snippets.json` to a GPU-equipped environment to compute neural vectors and build the FAISS index.
```bash
cd backend
python generate_embeddings.py # Computes and saves embeddings to embeddings.npy
python build_index.py # Trains and builds FAISS index from embeddings.npy
```
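
The README names FAISS `IndexIVFPQ` as the index type. As a rough intuition for the "IVF" half of that name, the toy sketch below implements inverted-file search in plain Python: vectors are grouped under their nearest centroid, and a query scans only the `nprobe` closest groups. It is not the project's code and does not use FAISS; it omits product quantization, takes the centroids as given instead of training them, and uses invented 2-d vectors in place of 1024-d embeddings.

```python
import heapq


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Put each vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists


def search_ivf(query, vectors, centroids, lists, k=2, nprobe=1):
    """Scan only the nprobe inverted lists closest to the query."""
    probe = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: l2(query, centroids[c])
    )
    candidates = [i for c in probe for i in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda i: l2(query, vectors[i]))


# Toy data: three obvious clusters and pre-chosen (not trained) centroids.
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]

lists = build_ivf(vectors, centroids)
result = search_ivf((5.0, 5.0), vectors, centroids, lists, k=2, nprobe=1)
print(lists)
print(result)
```

Raising `nprobe` trades speed for recall; probing every list is equivalent to a brute-force scan.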

### Phase 5: Memory Optimization & Deployment (Local/Toolforge)
Before deploying, convert the production metadata from JSON to SQLite so the service stays within the 6 GiB RAM limit:
```bash
cd backend
python migrate_to_sqlite.py
```
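
The memory saving comes from reading only the rows a query needs instead of holding all metadata in RAM. A minimal sketch of that pattern follows, using Python's built-in `sqlite3`. The table name, columns, index, and sample records are assumptions for illustration and do not reflect the actual schema written by `migrate_to_sqlite.py`.

```python
import json
import sqlite3

# Invented sample records standing in for the JSON metadata.
RAW_JSON = json.dumps([
    {"name": "Parser::parse", "language": "php", "kind": "function", "repo": "core"},
    {"name": "Title", "language": "php", "kind": "type", "repo": "core"},
    {"name": "render", "language": "javascript", "kind": "function", "repo": "skins/Example"},
])


def migrate(raw_json: str, conn: sqlite3.Connection) -> int:
    """Load JSON records into SQLite; the row id mirrors the vector position."""
    conn.execute(
        "CREATE TABLE snippets ("
        "id INTEGER PRIMARY KEY, name TEXT, language TEXT, kind TEXT, repo TEXT)"
    )
    rows = [
        (i, r["name"], r["language"], r["kind"], r["repo"])
        for i, r in enumerate(json.loads(raw_json))
    ]
    conn.executemany("INSERT INTO snippets VALUES (?, ?, ?, ?, ?)", rows)
    # Secondary index to support filtering by language and entry type.
    conn.execute("CREATE INDEX idx_filter ON snippets (language, kind)")
    conn.commit()
    return len(rows)


def lookup(conn: sqlite3.Connection, ids, language=None):
    """Fetch metadata for a list of hit ids, optionally filtered by language."""
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, name FROM snippets WHERE id IN ({placeholders})"
    params = list(ids)
    if language is not None:
        sql += " AND language = ?"
        params.append(language)
    return conn.execute(sql + " ORDER BY id", params).fetchall()


conn = sqlite3.connect(":memory:")
count = migrate(RAW_JSON, conn)
hits = lookup(conn, [0, 2], language="php")
print(count, hits)
```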

Once the index and database are ready, start the FastAPI backend from the root directory:

```bash
# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000
```
The server will be available at `http://localhost:8000`. You can access the automatic API documentation at `http://localhost:8000/docs`.

---

## 🚀 Deployment (Toolforge)

Follow these steps to deploy the application on Wikimedia Toolforge.

> [!NOTE]
> The examples below use `supnabla` as the username and `code2codesearch` as the project name. Replace these with your own Toolforge credentials where applicable.

### 1. Upload Assets
Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:

```bash
# From the project root
scp -rp "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -rp "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -rp "./backend/snippets.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
```

### 2. Configure Permissions
Log into Toolforge and set the necessary permissions:

```bash
ssh supnabla@login.toolforge.org

chmod -R a+rX /data/project/code2codesearch/models/
chmod a+r /data/project/code2codesearch/backend/snippets.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index
```

### 3. Deploy
Now you are ready to deploy the webservice:

```bash
# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f
```

---

## πŸ› οΈ Technology Stack & Project Status



CI Status
License
Code Style: PEP8
SWH Origin
SWH Directory



Codex
JavaScript



FastAPI
Python 3.11+
Uvicorn



FAISS
Vector indexes (1024d)
SQLite



Qwen3 Embedding 0.6B
Tree-sitter
Software Heritage



Toolforge
GitHub Actions
pytest

## 📄 Licence
[Apache 2.0 License](./LICENCE.txt). Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.