https://github.com/camel-lab/arabic_ala-lc_romanization

Romanizing Arabic bibliographic records in the ALA-LC standard.

Host: GitHub
URL: https://github.com/camel-lab/arabic_ala-lc_romanization
Owner: CAMeL-Lab
Created: 2021-01-14T10:02:24.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2023-12-19T19:37:30.000Z (over 2 years ago)
Last Synced: 2025-10-12T00:50:12.256Z (8 months ago)
Language: Jupyter Notebook
Homepage:
Size: 59.4 MB
Stars: 18
Watchers: 5
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Arabic ALA-LC Romanization Tool

A tool and dataset for the automatic Romanization of Arabic text in the ALA-LC Romanization standard.

Example of Arabic source and target Romanization with MARC field/subfield tag and tag description (see [publication](https://www.aclweb.org/anthology/2021.wanlp-1.23.pdf) for more details).

Example record with parallel entries along with combined MARC field and subfield tags
Source: Arabic Collections Online: http://hdl.handle.net/2333.1/m37pvs4b

## Publication

- Fadhl Eryani & Nizar Habash. [Automatic Romanization of Arabic Bibliographic Records.](https://www.aclweb.org/anthology/2021.wanlp-1.23.pdf) Proceedings of the Sixth Arabic Natural Language Processing Workshop. 2021.

## Demo

http://romanize-arabic.camel-lab.com/

This demo is based on the “MLE Simple” model described in Eryani & Habash (2021), with slight modifications. This model, selected for its speed, uses a large corpus of training data to find the most frequent Romanization of a given Arabic input. The system achieves 90% accuracy when capitalization errors are ignored and 84% exact-match accuracy. For out-of-vocabulary words (about 3% of words), the system falls back on a simple character transliteration technique that Romanizes the input as is, without guessing unmarked vowels or splitting proclitics.
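
The lookup-with-fallback behaviour described above can be illustrated with a minimal sketch. The table entries, the character map, and the function names below are hypothetical toy values chosen for illustration; they are not the demo's actual model or mapping files.

```python
# Hypothetical toy data: the real model and character map live in the repository.
MLE_TABLE = {"كتاب": "kitāb", "في": "fī"}  # most frequent Romanization per seen word
CHAR_MAP = {"ك": "k", "ت": "t", "ب": "b", "ا": "ā", "ل": "l", "م": "m"}


def romanize_word(word, table=MLE_TABLE, char_map=CHAR_MAP):
    """Return the table entry for a seen word, else transliterate character by character."""
    if word in table:
        return table[word]
    # Fallback: map each character as is; unmapped characters pass through unchanged.
    return "".join(char_map.get(ch, ch) for ch in word)


def romanize(text):
    """Romanize whitespace-separated words independently."""
    return " ".join(romanize_word(word) for word in text.split())


print(romanize("كتاب مكتب"))
```

The second word is not in the toy table, so it comes out as a bare consonant skeleton: the fallback does not guess unmarked vowels.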

## Data

Check out the [first release](https://github.com/CAMeL-Lab/Arabic_ALA-LC_Romanization/releases/tag/v1.0).

## Authors

- [Fadhl Eryani](https://github.com/fadhleryani/)
- Nizar Habash

## Setup

### Basic dependencies
This project was developed using Python 3.6 and tested on macOS and Linux. First, install the required packages:
> pip3 install -r requirements.txt

### MADAMIRA

To run the MADAMIRA morphological analyser and disambiguator, you must have a MADAMIRA distribution in the project directory, which you can obtain from [here](http://innovation.columbia.edu/technologies/cu14012_arabic-language-disambiguation-for-natural-language-processing-applications).

The database used is almor-msa-s31.db (see `documentation/MADAMIRA-UserManual` p6 for more info). If you have access to this database, place the database file inside `MADAMIRA/resources/` and set up the MADAMIRA config file located in `MADAMIRA/config/almor.properties` by updating the following flag:
> `ALMOR.text.MSA.database.name=almor-s31.db`

For info on the Buckwalter Part-of-Speech tag set used by MADAMIRA, see `/documentation/ATB-POSGuidelines-v3.7.pdf`

#### MADAMIRA and Java requirements
- MADAMIRA will not run on versions of Java above 9
- We used openjdk64-1.8.0.272. You can install it using `brew install --cask adoptopenjdk8`, or download from the website: https://adoptopenjdk.net/ (make sure to select jdk8)
- We recommend setting up Java with [jEnv](https://www.jenv.be/)

### Seq2Seq

To run the Seq2Seq model, you must obtain a copy of Shazal & Usman's [Seq2Seq Transliteration Tool](https://github.com/alishazal/seq2seq-transliteration-tool).

> git clone https://github.com/alishazal/seq2seq-transliteration-tool.git seq2seq

We ran our Seq2Seq systems on an NVIDIA Tesla V100 PCIe 32 GB GPU on Dalma, NYU Abu Dhabi's High Performance Computing cluster, with the memory flag set to 30GB. The `.sh` scripts that we ran can be found in the folders `/src/train/seq2seq_scripts/` and `/src/predict/seq2seq_scripts`.

## Data

Data for this project came from publicly available catalog databases stored in the [MARC (machine-readable cataloging) standard](https://www.loc.gov/marc/bibliographic/) xml format. If you only want to replicate our experimental setup, you can skip this section: you only need the TSV files provided in data.zip under `data/processed` (see release files). Read on for details on downloading the original marcxml dumps, collecting Arabic records, preprocessing, and splitting.

### Downloading Data

Catalog data in MARC xml format were downloaded from the following sources. To re-download the sources used in this project, simply run `make download_data`, or run each of the following commands:

Arabic Collections Online (ACO):
> git clone https://github.com/NYULibraries/aco-karms/ data/raw_records/aco/

Library of Congress (LOC):
> for val in {01..43}; do wget -nc -P data/raw_records/loc https://www.loc.gov/cds/downloads/MDSConnect/BooksAll.2016.part$val.xml.gz; done

> gunzip data/raw_records/loc/*

University of Michigan (UMICH):
> wget -nc -P data/raw_records/umich http://www.lib.umich.edu/files/umich_bib.xml.gz

> gunzip data/raw_records/umich/*


### Collecting Arabic Records

The first step is to read our MARC records from each data source and collect records tagged as Arabic into new marcxml collections.

UMICH and LOC consist of large marcxml collections containing thousands of records per file.

> python3 src/data/collect_arabic_records.py data/raw_records/umich

> python3 src/data/collect_arabic_records.py data/raw_records/loc

ACO records are in a parent folder named `aco/work` with thousands of individual xml files containing a single record each. Furthermore, each of ACO's partner institutions places its records inside a subfolder named `marcxml_out`, which we specify with `--sub_directory_filter`.

> python3 src/data/collect_arabic_records.py data/raw_records/aco/work --sub_directory_filter marcxml_out
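
As a rough illustration of what "collecting records tagged as Arabic" can look like, here is a minimal sketch using only the standard library. It assumes the language tag is read from the MARC 008 control field, whose character positions 35-37 hold the language code (`ara` for Arabic); the actual criterion used by `collect_arabic_records.py` may differ. The helper names and the toy collection are hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def make_008(lang):
    """Build a toy 40-character 008 field with `lang` at positions 35-37."""
    return "850101" + "s" + "1985" + " " * 4 + "ua " + " " * 17 + lang + " d"


# Toy collection with one Arabic and one English record (illustrative only).
SAMPLE = (
    '<collection xmlns="http://www.loc.gov/MARC21/slim">'
    f'<record><controlfield tag="008">{make_008("ara")}</controlfield></record>'
    f'<record><controlfield tag="008">{make_008("eng")}</controlfield></record>'
    "</collection>"
)


def is_arabic(record):
    """True if the record's 008 language code (positions 35-37) is 'ara'."""
    field = record.find("marc:controlfield[@tag='008']", MARC_NS)
    text = field.text if field is not None else None
    return bool(text) and len(text) >= 38 and text[35:38] == "ara"


def collect_arabic(root):
    """Return the records of a marcxml collection that are tagged as Arabic."""
    return [rec for rec in root.findall("marc:record", MARC_NS) if is_arabic(rec)]


arabic_records = collect_arabic(ET.fromstring(SAMPLE))
print(len(arabic_records))
```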

### Extract parallel lines

Once the Arabic marcxml collections are created, you can parse them and extract parallel Arabic and Romanized entries into a single tsv by running the following command:

> python3 src/loc_transcribe.py extract data/arabic_records/ data/extracted_lines/extracted_lines.tsv

or simply run `make extract_lines`
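
To give an idea of how parallel entries can be recovered from a record, the sketch below pairs a Romanized field with its original-script counterpart. It relies on the MARC convention that original-script text is stored in 880 fields linked to the regular field through subfield `$6` (for example `880-01` on field 245 and `245-01/(3/r` on the 880 field). This is an assumption about the general approach, not a description of how `loc_transcribe.py extract` is implemented; the sample record and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

# Toy record: a Romanized title (245) linked to its Arabic-script form (880).
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="6">880-01</subfield>
    <subfield code="a">Kitāb al-ḥayawān</subfield>
  </datafield>
  <datafield tag="880" ind1="1" ind2="0">
    <subfield code="6">245-01/(3/r</subfield>
    <subfield code="a">كتاب الحيوان</subfield>
  </datafield>
</record>"""


def link_key(field):
    """Return (regular_tag, occurrence) for a linked field, or None without $6."""
    six = field.find("m:subfield[@code='6']", NS)
    if six is None or not six.text:
        return None
    linked_tag, _, rest = six.text.partition("-")
    occurrence = rest[:2]
    if field.get("tag") == "880":
        return (linked_tag, occurrence)  # an 880's $6 names the regular field
    return (field.get("tag"), occurrence)  # a regular field's $6 names the 880


def parallel_lines(record):
    """Return (field+subfield tag, Arabic text, Romanized text) rows."""
    romanized, arabic = {}, {}
    for field in record.findall("m:datafield", NS):
        key = link_key(field)
        if key is not None:
            (arabic if field.get("tag") == "880" else romanized)[key] = field
    rows = []
    for key, rom_field in romanized.items():
        ar_field = arabic.get(key)
        if ar_field is None:
            continue
        for rom_sub in rom_field.findall("m:subfield", NS):
            code = rom_sub.get("code")
            if code == "6":
                continue  # skip the linkage subfield itself
            ar_sub = ar_field.find(f"m:subfield[@code='{code}']", NS)
            if ar_sub is not None:
                rows.append((key[0] + code, ar_sub.text, rom_sub.text))
    return rows


for row in parallel_lines(ET.fromstring(SAMPLE)):
    print("\t".join(row))
```

The sketch pairs subfields by code only, which is enough for the toy record; repeated subfield codes within one field would need positional matching.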

### Clean, preprocess, and split

Finally, we clean up `data/extracted_lines/extracted_lines.tsv` and split the records into Train, Dev, and Test sets.

> python3 src/loc_transcribe.py preprocess data/extracted_lines/extracted_lines.tsv data/processed/ --split

or simply run `make data_set`

## Running Arabic ALA-LC Romanization models

This section describes how to run the various prediction models reported in the [publication](#publication).
By default, predictions are run on the dev set, but you can replace the `dev` argument with `test` or a path to any tsv file containing an input column labelled `ar`.
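
If you want to run a model on your own text, a tsv with an `ar` column can be produced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and whether the prediction script expects any additional columns is not covered here.

```python
import csv
import io


def write_input_tsv(lines, handle):
    """Write Arabic input lines as a one-column tsv with the header `ar`."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ar"])
    for line in lines:
        writer.writerow([line])


# Demonstrated on an in-memory buffer; for a real file, pass a handle opened
# with open(path, "w", encoding="utf-8", newline="").
buffer = io.StringIO()
write_input_tsv(["كتاب الحيوان"], buffer)
print(buffer.getvalue())
```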

### 1. Rules Simple

The baseline model Romanizes any input Arabic text based on
- the regex rules mapped in `src/predict/ar2phon/ar2phon_map.tsv`
- exceptional spellings mapped in `src/predict/ar2phon/loc_exceptional_spellings.tsv`.

> python3 src/loc_transcribe.py predict simple dev
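
The general idea of combining ordered regex rules with a table of exceptional spellings can be sketched as follows. The rules and the exception below are toy examples invented for illustration; they are not the contents of `ar2phon_map.tsv` or `loc_exceptional_spellings.tsv`, and the real files may be applied differently.

```python
import re

# Hypothetical exceptional spellings, checked before any rule is applied.
EXCEPTIONS = {"الله": "Allāh"}

# Hypothetical ordered (pattern, replacement) rules; earlier rules run first.
RULES = [
    (r"^ال", "al-"),  # word-initial definite article
    (r"ك", "k"),
    (r"ت", "t"),
    (r"ا", "ā"),
    (r"ب", "b"),
]


def apply_rules(word):
    """Romanize one word: exceptions first, then each regex rule in order."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for pattern, replacement in RULES:
        word = re.sub(pattern, replacement, word)
    return word


print(apply_rules("الكتاب"))
print(apply_rules("الله"))
```

Rule order matters in this sketch: the article rule has to run before the single-character rules, otherwise the leading alif would already have been rewritten.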

### 2. Rules Morph

This model first runs MADAMIRA on the Arabic input and Romanizes the diacritized and segmented MADAMIRA output.

> python3 src/loc_transcribe.py predict morph dev

### 3. MLE

#### Train

The `--size` flag is used to specify the proportion of training data to use from the Train set. The following command will generate the 7 differently sized models we report on for **MLE Simple**.

> python3 src/loc_transcribe.py train mle --size 1, 0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625
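
The seven sizes are successive halvings of the Train set. As a minimal sketch of what maximum likelihood estimation means here, the code below counts Romanizations per Arabic word and keeps the most frequent one. It assumes word-aligned (Arabic, Romanized) pairs are already available and takes the first fraction of them as the subsample; the function name, the toy pairs, and the subsampling strategy are all assumptions, not the repository's implementation.

```python
from collections import Counter, defaultdict

# The seven proportions used above: 1, 1/2, 1/4, ..., 1/64.
SIZES = [1 / 2**k for k in range(7)]


def train_mle(pairs, size=1.0):
    """Map each Arabic word to its most frequent Romanization.

    pairs: list of (arabic_word, romanized_word) tuples.
    size:  fraction of the pairs to use, taken from the start of the list.
    """
    keep = int(len(pairs) * size)
    counts = defaultdict(Counter)
    for arabic, romanized in pairs[:keep]:
        counts[arabic][romanized] += 1
    return {arabic: c.most_common(1)[0][0] for arabic, c in counts.items()}


# Hypothetical toy training pairs.
PAIRS = [("في", "fī"), ("في", "fī"), ("في", "fi"), ("كتاب", "kitāb")]

print(SIZES)
print(train_mle(PAIRS))
print(train_mle(PAIRS, size=0.5))
```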

#### Predict **MLE Simple** and **MLE Morph**

This will predict the dev set using the model trained on the full Training set.

- MLE Simple
> python3 src/loc_transcribe.py predict mle dev --mle_model models/mle/size1.0.tsv --backoff predictions_out/simple/dev/simple.out

- MLE Morph
> python3 src/loc_transcribe.py predict mle dev -m models/mle/size1.0.tsv -b predictions_out/morph/dev/morph.out

To generate predictions for all model sizes we report on, simply run `make predict_mle`

### 4. Seq2Seq

As mentioned, we ran the Seq2Seq model on Dalma.

#### Prep data

This command will prepare the intermediate files and Dalma scripts required for training the 7 differently sized Seq2Seq models.

> python3 src/loc_transcribe.py train seq2seq --prep --size 1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625

or simply run `make prep_seq2seq`

#### Train Seq2Seq and predict dev

This command will 1) train Seq2Seq models using the specified portions of the Train set and 2) predict Dev using the specified models.

> python3 src/loc_transcribe.py train seq2seq --train --size 1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625

#### Predict test

This command generates predictions for the test set using the full Train model.

> python3 src/loc_transcribe.py predict seq2seq --predict_test -s 1.0

### 5. Seq2Seq aligned

To align Seq2Seq predictions and fill missing output using **MLE Morph**, we can run the following command:

> python3 src/loc_transcribe.py predict seq2seq --align_backoff predictions_out/seq2seq/dev/seq2seq_size1.0.out predictions_out/mle_morph/dev/mle_morph_size1.0.out

or run `make align_seq2seq` and `make align_seq2seq_test` to replicate the predictions we report on.

## Evaluation

- Evaluate the dev predictions of Seq2Seq aligned with MLE Morph
> python3 src/loc_transcribe.py evaluate predictions_out/aligned_seq2seq/dev/seq2seq_size1.0Xmle_morph_size1.0.out data/processed/dev.tsv

- Evaluate the test predictions of Seq2Seq aligned with MLE Morph
> python3 src/loc_transcribe.py evaluate predictions_out/aligned_seq2seq/test/seq2seq_size1.0Xmle_morph_size1.0.out data/processed/test.tsv

or run `make evaluate` to evaluate all models we report on.
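
The two figures quoted for the demo (accuracy ignoring capitalization errors, and exact-match accuracy) can be illustrated with a small sketch. It assumes predictions and references are already aligned word by word; the function name and toy data are hypothetical, and the repository's evaluation script may compute its scores differently.

```python
def accuracy(predictions, references, ignore_case=False):
    """Fraction of aligned words that match, optionally ignoring capitalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for predicted, reference in zip(predictions, references):
        if ignore_case:
            predicted, reference = predicted.lower(), reference.lower()
        correct += predicted == reference
    return correct / len(references)


# Hypothetical toy data: one capitalization error and one vowel-length error.
predicted = ["Kitāb", "al-ḥayawān", "fi"]
reference = ["kitāb", "al-ḥayawān", "fī"]

print(accuracy(predicted, reference))
print(accuracy(predicted, reference, ignore_case=True))
```

Here only one of three words matches exactly, while two of three match once case is ignored.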

## Print Evaluation Scores

To retrieve scores for all evaluated models (contained in `evaluation/`), run the following:

> python3 src/loc_transcribe.py score all

or specify the evaluated files you want to retrieve scores for, e.g.:

> python3 src/loc_transcribe.py score evaluation/dev/mle_morph_size1.0.tsv evaluation/dev/simple.tsv
