https://github.com/commoncrawl/arabic-seed-processing
Turning 30,000 Arabic domains into a better crawl
- Host: GitHub
- URL: https://github.com/commoncrawl/arabic-seed-processing
- Owner: commoncrawl
- Created: 2026-06-17T14:18:42.000Z (10 days ago)
- Default Branch: main
- Last Pushed: 2026-06-17T15:00:29.000Z (10 days ago)
- Last Synced: 2026-06-17T16:55:51.946Z (10 days ago)
- Language: Jupyter Notebook
- Size: 4.05 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Turning 30,000 Arabic domains into a better crawl
Code and data accompanying [this blog post](https://commoncrawl.org/blog/turning-30000-arabic-domains-into-a-better-crawl), which describes how a donated list of Arabic seed domains was filtered, geolocated and categorised.
## Contents
- `ArabicDomainQuality.xlsx`: The original data received from QCRI.
- `arabic_seeds.ipynb`: A notebook detailing the data processing and analysis.
- `crawl_lang_info.tsv`: Summarised language information for the domains found in the CC-MAIN-2026-{21,17,12} archives.
- `DomainQuality_Dashboard.ipynb`: Additional analysis of the quality of the pre-filtered domains, carried out by researchers at QCRI.
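The `CC-MAIN-2026-{21,17,12}` shorthand in the list above is shell-style brace notation for three separate crawl archive IDs. As a minimal sketch (the helper name `expand_braces` is hypothetical and not part of this repository), the expansion can be written out in Python like this:

```python
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand a single shell-style {a,b,c} group into a list of strings.

    Illustrative helper only; it handles one brace group and no nesting.
    """
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    prefix = pattern[:match.start()]
    suffix = pattern[match.end():]
    # Preserve the order in which the options are written.
    return [f"{prefix}{option}{suffix}" for option in match.group(1).split(",")]


crawl_ids = expand_braces("CC-MAIN-2026-{21,17,12}")
print(crawl_ids)
# ['CC-MAIN-2026-21', 'CC-MAIN-2026-17', 'CC-MAIN-2026-12']
```

Each resulting ID names one Common Crawl archive in the usual `CC-MAIN-<year>-<week>` form.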
We use [uv](https://docs.astral.sh/uv/) to manage Python dependencies.
## Acknowledgements
Thank you to [Hamdy S. Hussein](https://elmi.hbku.edu.qa/en/persons/hamdy-soliman-mubarak-hussien/), [Dr. Kareem M. Darwish](https://kareemdarwish.com) and [Dr. Mohamed Ahmed Yassin Eltabakh](https://elmi.hbku.edu.qa/en/persons/mohamed-ahmed-yassin-eltabakh/) of the [Qatar Computing Research Institute](https://www.hbku.edu.qa/en/qcri) for providing the initial seed list, quality annotations and exploratory visualisations. These were created as part of the [Fanar Project](http://www.fanar.qa), an Arabic generative AI platform.