https://github.com/nabilaagha/word2vec-for-arabic-bert-corpus
This project involves training a Word2Vec model on an Arabic BERT corpus to generate high-quality word embeddings, enhancing natural language processing (NLP) applications in Arabic text analysis.
arabic-dataset arabic-nlp jupyter-notebook nlp nlp-machine-learning python word2vec-model
- Host: GitHub
- URL: https://github.com/nabilaagha/word2vec-for-arabic-bert-corpus
- Owner: NabilaAgha
- Created: 2025-02-02T22:59:58.000Z (8 days ago)
- Default Branch: main
- Last Pushed: 2025-02-02T23:03:35.000Z (8 days ago)
- Last Synced: 2025-02-03T00:17:43.514Z (8 days ago)
- Topics: arabic-dataset, arabic-nlp, jupyter-notebook, nlp, nlp-machine-learning, python, word2vec-model
- Language: Jupyter Notebook
- Homepage: https://github.com/NabilaAgha
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🔠 **Word2Vec for Arabic BERT Corpus**
## 📌 Project Overview
This project focuses on training a **Word2Vec model** using the **Arabic BERT Corpus** to generate high-quality **word embeddings**, improving **natural language processing (NLP)** applications for Arabic text analysis. By leveraging **Gensim**, we train a model that captures semantic relationships between words, making it valuable for **text classification, sentiment analysis, and machine translation**.
## 📂 **Dataset**
The dataset used for training is the **[Arabic BERT Corpus](https://www.kaggle.com/datasets/abedkhooli/arabic-bert-corpus)**, a large collection of **preprocessed Arabic text** designed for deep learning models.
## 🚀 **Key Features**
- ✔️ **Preprocess Arabic Text** – Tokenization, normalization, and stopword removal.
- ✔️ **Train Word2Vec Model** – Learn vector representations for words.
- ✔️ **Visualize Word Embeddings** – Explore relationships between words.
- ✔️ **Optimize for NLP Tasks** – Improve Arabic text-based machine learning models.
## 🛠️ **Technologies Used**
- Python 🐍
- Gensim 🧠
- NLP Libraries (NLTK, spaCy) 📖
- Pandas & NumPy 📊
## 📥 **Getting Started**
Clone this repository and install the dependencies listed under Technologies Used (Gensim, NLTK, spaCy, Pandas, and NumPy) to start training your own **Word2Vec model** on Arabic text!
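The preprocessing steps named under Key Features (tokenization, normalization, and stopword removal) might look roughly like the sketch below. It uses only the Python standard library. The normalization rules (unifying alef variants, mapping alef maqsura to yaa and taa marbuta to haa) are common conventions assumed here for illustration, not rules confirmed by the notebook, and the `STOPWORDS` set, `normalize`, and `tokenize` are hypothetical names with a deliberately tiny stopword list.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus the tatweel stretch mark (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is neither an Arabic letter (U+0621..U+064A) nor whitespace.
_NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "و"}


def normalize(text):
    """Strip diacritics and unify common spelling variants (assumed conventions)."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[إأآ]", "ا", text)   # alef variants -> bare alef
    text = text.replace("ى", "ي")        # alef maqsura -> yaa
    text = text.replace("ة", "ه")        # taa marbuta -> haa
    return _NON_ARABIC.sub(" ", text)    # punctuation, digits, Latin -> space


def tokenize(text, stopwords=STOPWORDS):
    """Normalize, split on whitespace, and drop stopwords."""
    # Normalize the stopwords too, so they match the normalized tokens.
    normalized_stopwords = {normalize(word) for word in stopwords}
    return [tok for tok in normalize(text).split() if tok not in normalized_stopwords]


if __name__ == "__main__":
    print(tokenize("ذهب الولد إلى المدرسة."))
```

Each call to `tokenize` returns a list of tokens for one sentence, which is the input shape Gensim's Word2Vec expects (a sequence of token lists).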
![dataset-cover](https://github.com/user-attachments/assets/2ca3c459-ebb7-4980-bbec-29ce0190f0fd)