Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/nabilaagha/word2vec-for-arabic-bert-corpus

This project involves training a Word2Vec model on an Arabic BERT corpus to generate high-quality word embeddings, enhancing natural language processing (NLP) applications in Arabic text analysis.

Host: GitHub
URL: https://github.com/nabilaagha/word2vec-for-arabic-bert-corpus
Owner: NabilaAgha
Created: 2025-02-02T22:59:58.000Z
Default Branch: main
Last Pushed: 2025-02-02T23:03:35.000Z
Last Synced: 2025-02-03T00:17:43.514Z
Topics: arabic-dataset, arabic-nlp, jupyter-notebook, nlp, nlp-machine-learning, python, word2vec-model
Language: Jupyter Notebook
Homepage: https://github.com/NabilaAgha
Size: 16.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# 🔠 **Word2Vec for Arabic BERT Corpus**

## 📌 **Project Overview**
This project trains a **Word2Vec model** on the **Arabic BERT Corpus** with **Gensim** to produce **word embeddings** for Arabic **natural language processing (NLP)**. The trained vectors capture semantic relationships between words, so they can serve as input features for tasks such as **text classification, sentiment analysis, and machine translation**.
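
As a rough illustration of the training step, the sketch below wraps Gensim's `Word2Vec` class in a small helper. It assumes Gensim 4.x (where the dimensionality parameter is named `vector_size`). The helper name `train_word2vec`, the skip-gram setting (`sg=1`), the hyperparameter values, and the toy sentences are all assumptions made for illustration; the README does not state the settings used in the notebook.

```python
def train_word2vec(sentences, vector_size=100, window=5, min_count=1, seed=42):
    """Train a skip-gram Word2Vec model on pre-tokenized sentences.

    `sentences` is an iterable of token lists. All hyperparameters here are
    illustrative assumptions, not the values used in the project notebook.
    """
    # Imported inside the function so the module loads without Gensim installed.
    from gensim.models import Word2Vec  # requires gensim >= 4.0

    return Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # dimensionality of each word vector
        window=window,            # context words considered on each side
        min_count=min_count,      # 1 keeps every word of the tiny toy corpus
        sg=1,                     # 1 = skip-gram, 0 = CBOW
        workers=1,
        seed=seed,
    )


# Hypothetical toy corpus standing in for the real Arabic BERT Corpus.
toy_sentences = [
    ["القطة", "تجلس", "على", "السجادة"],
    ["الكلب", "يجلس", "على", "الأرض"],
    ["القطة", "تشرب", "الحليب"],
]
```

Calling `model = train_word2vec(toy_sentences, vector_size=50)` returns a model whose vectors are available through `model.wv`, for example `model.wv["القطة"]` for a single vector or `model.wv.most_similar("القطة", topn=2)` for nearest neighbours. On a corpus this small the neighbours are not meaningful, and on the full corpus a higher `min_count` (Gensim's default is 5) is usually a better choice.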

## 📂 **Dataset**
The dataset used for training is the **[Arabic BERT Corpus](https://www.kaggle.com/datasets/abedkhooli/arabic-bert-corpus)**, a large collection of **preprocessed Arabic text** designed for deep learning models.
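
Because the corpus is large, it is usually better to stream it from disk than to load it into memory. The sketch below shows one way to do that, assuming the corpus is available as a UTF-8 text file with one sentence or passage per line. The class name `LineSentences` and the file layout are assumptions; check the actual files in the Kaggle download before relying on them.

```python
from pathlib import Path


class LineSentences:
    """Restartable iterable that yields one whitespace-tokenized line at a time.

    Word2Vec training makes several passes over the data, so the corpus has to
    be an object that can be iterated repeatedly, not a one-shot generator.
    """

    def __init__(self, path, encoding="utf-8"):
        self.path = Path(path)
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass so each iteration starts from the top.
        with self.path.open(encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

An instance such as `LineSentences("corpus.txt")` (a hypothetical path) can be passed wherever a list of token lists is expected. Gensim also ships a comparable helper, `gensim.models.word2vec.LineSentence`.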

## 🚀 **Key Features**
- ✔️ **Preprocess Arabic Text**: Tokenization, normalization, and stopword removal.
- ✔️ **Train Word2Vec Model**: Learn vector representations for words.
- ✔️ **Visualize Word Embeddings**: Explore relationships between words.
- ✔️ **Optimize for NLP Tasks**: Use the embeddings to improve Arabic text-based machine learning models.
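
The preprocessing feature listed above (normalization, tokenization, stopword removal) could look roughly like the following standard-library sketch. The specific normalization rules (stripping diacritics and tatweel, unifying alef variants), the regex tokenizer, and the six-word stopword list are assumptions chosen for illustration; the notebook may use NLTK or spaCy and different rules.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for justification
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # آ أ إ
TOKEN = re.compile(r"[\u0621-\u064A]+")  # runs of Arabic letters


def normalize(text):
    """Strip diacritics and tatweel, and map alef variants to bare alef."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub("\u0627", text)


# Tiny illustrative stopword list; a real list would be much longer.
# It is normalized too, so it matches tokens after normalization.
STOPWORDS = {normalize(w) for w in ["في", "من", "على", "إلى", "عن", "و"]}


def preprocess(text):
    """Return the normalized, stopword-filtered Arabic tokens of `text`."""
    tokens = TOKEN.findall(normalize(text))
    return [t for t in tokens if t not in STOPWORDS]
```

Applying `preprocess` to each line of the corpus yields token lists in the form Word2Vec training expects. Normalizing the stopword list with the same function matters here: without it, a stopword such as "إلى" would no longer match its normalized form in the text.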

## 🛠️ **Technologies Used**
- Python 🐍
- Gensim 🧠
- NLP Libraries (NLTK, spaCy) 📖
- Pandas & NumPy 📊

## 📥 **Getting Started**
Clone this repository and install the dependencies listed above (Gensim, NLTK, spaCy, Pandas, and NumPy) to train your own **Word2Vec model** on Arabic text.

![dataset-cover](https://github.com/user-attachments/assets/2ca3c459-ebb7-4980-bbec-29ce0190f0fd)