https://github.com/commoncrawl/arabic-seed-processing

Turning 30,000 Arabic domains into a better crawl

Host: GitHub
URL: https://github.com/commoncrawl/arabic-seed-processing
Owner: commoncrawl
Created: 2026-06-17T14:18:42.000Z (10 days ago)
Default Branch: main
Last Pushed: 2026-06-17T15:00:29.000Z (10 days ago)
Last Synced: 2026-06-17T16:55:51.946Z (10 days ago)
Language: Jupyter Notebook
Size: 4.05 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Turning 30,000 Arabic domains into a better crawl
Code and data accompanying [this blog post](https://commoncrawl.org/blog/turning-30000-arabic-domains-into-a-better-crawl), which describes how we filtered, geolocated and categorised a donated list of Arabic seed domains.

## Contents
- `ArabicDomainQuality.xlsx`: The original data received from QCRI.
- `arabic_seeds.ipynb`: A notebook detailing the data processing and analysis.
- `crawl_lang_info.tsv`: Summarised language information for the domains found in the CC-MAIN-2026-{21,17,12} archives.
- `DomainQuality_Dashboard.ipynb`: Additional analysis of the quality of the pre-filtered domains, carried out by researchers at QCRI.
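
As a quick way to inspect `crawl_lang_info.tsv` outside the notebooks, the following is a minimal sketch using only the Python standard library. It assumes the file has a header row and is run from the repository root; the column names are not documented here, so the sketch only reports whatever header it finds. The helper name `load_tsv` is hypothetical and not part of this repository. `CRAWL_IDS` simply expands the `CC-MAIN-2026-{21,17,12}` shorthand into the three crawl identifiers.

```python
import csv
from pathlib import Path

# Crawl identifiers obtained by expanding CC-MAIN-2026-{21,17,12}.
CRAWL_IDS = [f"CC-MAIN-2026-{suffix}" for suffix in (21, 17, 12)]


def load_tsv(path):
    """Read a tab-separated file that starts with a header row (an assumption).

    Returns (column_names, rows), where each row is a dict keyed by column name.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    tsv_path = Path("crawl_lang_info.tsv")
    if tsv_path.exists():
        columns, rows = load_tsv(tsv_path)
        print(f"{len(rows)} rows; columns: {columns}")
    else:
        print(f"{tsv_path} not found; run this from the repository root")
```

The notebooks remain the reference for how the data is actually processed; this only shows the shape of the file.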

We use [uv](https://docs.astral.sh/uv/) to manage Python dependencies.

## Acknowledgements

Thank you to [Hamdy S. Hussein](https://elmi.hbku.edu.qa/en/persons/hamdy-soliman-mubarak-hussien/), [Dr. Kareem M. Darwish](https://kareemdarwish.com) and [Dr. Mohamed Ahmed Yassin Eltabakh](https://elmi.hbku.edu.qa/en/persons/mohamed-ahmed-yassin-eltabakh/) of the [Qatar Computing Research Institute](https://www.hbku.edu.qa/en/qcri) for providing the initial seed list, quality annotations and exploratory visualisations. These were created as part of the [Fanar Project](http://www.fanar.qa), an Arabic generative AI platform.
