https://github.com/yahiazakaria445/arabic-answer-grading-using-sequence-models
An NLP project on Arabic data using an LSTM model
matplotlib nltk numpy pandas scikit-learn tensorflow
- Host: GitHub
- URL: https://github.com/yahiazakaria445/arabic-answer-grading-using-sequence-models
- Owner: yahiazakaria445
- Created: 2025-03-10T13:33:13.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-10T14:17:51.000Z (11 months ago)
- Last Synced: 2025-04-03T23:18:11.772Z (10 months ago)
- Topics: matplotlib, nltk, numpy, pandas, scikit-learn, tensorflow
- Language: Jupyter Notebook
- Homepage:
- Size: 136 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This project focuses on automatically grading short Arabic answers using advanced Natural Language Processing (NLP) techniques and sequence models such as LSTM.
# Dataset
This project uses the AR-ASAG Dataset (ARabic Dataset for Automatic Short Answer Grading Evaluation).
The dataset is intended to support research in the field of Automatic Short Answer Grading (ASAG).
[Dataset Link](https://github.com/leilaouahrani/AR-ASAG-Dataset)
# Libraries Used
- numpy
- pandas
- matplotlib
- seaborn
- nltk
- scikit-learn
- keras
- tensorflow
# Text Preprocessing
Arabic text is cleaned, normalized, and tokenized to handle linguistic variation such as diacritics and punctuation; semantic variation such as synonymy is handled later by the word embeddings. Using `Tokenizer`, the text is converted into sequences of integer tokens, and `pad_sequences` is then applied so that all input sequences have a uniform length, which is required for feeding data into the model.
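A minimal sketch of this preprocessing step, assuming TensorFlow/Keras utilities; the function name `normalize_arabic`, the regex rules, the placeholder answers, and `MAX_LEN` are illustrative assumptions, not the repository's exact code:

```python
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

ARABIC_DIACRITICS = re.compile(r'[\u0617-\u061A\u064B-\u0652]')   # tashkeel marks

def normalize_arabic(text: str) -> str:
    """Strip diacritics and punctuation and unify common letter variants."""
    text = ARABIC_DIACRITICS.sub('', text)             # remove diacritics
    text = re.sub(r'[إأآا]', 'ا', text)                 # unify alef variants
    text = re.sub(r'ى', 'ي', text)                      # alef maqsura -> yaa
    text = re.sub(r'[^\u0621-\u064A\s]', ' ', text)     # keep Arabic letters and whitespace only
    return re.sub(r'\s+', ' ', text).strip()

answers = ["هذا مثال لإجابة قصيرة", "إجابة أخرى للتوضيح"]   # placeholder answers
cleaned = [normalize_arabic(a) for a in answers]

MAX_LEN = 50   # assumed maximum sequence length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
sequences = tokenizer.texts_to_sequences(cleaned)
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding='post')
```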
# texts_to_sequences & Word Embedding
After preprocessing, the `Tokenizer` converts words into numerical sequences via the `texts_to_sequences` method, and the sequences are padded to a consistent length. The padded sequences are fed through a word embedding layer whose vectors come from a `Word2Vec` model, so each word is represented in a dense vector space that captures semantic and contextual relationships. This encoding forms the input to the LSTM sequence model.
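A hedged sketch of combining a `Word2Vec` model with a Keras `Embedding` layer, continuing from the preprocessing sketch above (`cleaned` and `tokenizer`); `EMBED_DIM`, the gensim training call, and freezing the embedding weights are assumptions for illustration:

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

EMBED_DIM = 100   # assumed embedding size

# Train (or load) a Word2Vec model on the tokenized answers.
tokenized_answers = [text.split() for text in cleaned]   # `cleaned` from the preprocessing sketch
w2v = Word2Vec(sentences=tokenized_answers, vector_size=EMBED_DIM, window=5, min_count=1)

# Build an embedding matrix aligned with the Tokenizer's word index.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=EMBED_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,   # keep pretrained vectors fixed (a design choice, not confirmed by the repo)
)
```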
# Model Architecture
| Model | Optimizer | Metrics  | Loss Function             | Accuracy |
| ----- | --------- | -------- | ------------------------- | -------- |
| LSTM  | Adam      | accuracy | categorical cross-entropy | 85%      |
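A minimal sketch of an LSTM classifier matching the table above (Adam optimizer, categorical cross-entropy loss, accuracy metric); the layer sizes and number of grade classes are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

NUM_CLASSES = 4   # assumed number of grade levels

model = Sequential([
    embedding_layer,                      # Word2Vec-initialized embedding from the previous sketch
    LSTM(128),                            # assumed hidden size; reads the padded answer sequence
    Dropout(0.3),                         # assumed regularization
    Dense(NUM_CLASSES, activation='softmax'),
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
# model.fit(padded, one_hot_labels, epochs=..., validation_split=...)  # labels must be one-hot encoded
```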
# Testing Sample
| Answer | Rate |
| ------ | ---- |
| 1      | B    |
| 2      | A    |
| 3      | D    |
| a      | B    |
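A hedged sketch of how a new answer might be graded with the trained model, reusing the earlier sketches (`normalize_arabic`, `tokenizer`, `MAX_LEN`, `model`); the grade-letter ordering is an assumption, since the README does not show the label encoding:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

GRADE_LABELS = ['A', 'B', 'C', 'D']   # assumed class order, not confirmed by the repo

def grade_answer(raw_answer: str) -> str:
    """Preprocess one answer and return the predicted grade letter."""
    text = normalize_arabic(raw_answer)                     # preprocessing sketch above
    seq = tokenizer.texts_to_sequences([text])
    padded_seq = pad_sequences(seq, maxlen=MAX_LEN, padding='post')
    probs = model.predict(padded_seq)                       # softmax over grade classes
    return GRADE_LABELS[int(np.argmax(probs, axis=-1)[0])]

print(grade_answer("إجابة الطالب هنا"))   # placeholder answer text
```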