https://github.com/farbodbj/persian-gender-by-name
A comprehensive dataset for determining gender based on Persian names, enriched with English representations.
dataset farsi farsi-datasets nlp persian persian-dataset
- Host: GitHub
- URL: https://github.com/farbodbj/persian-gender-by-name
- Owner: farbodbj
- License: apache-2.0
- Created: 2025-02-05T20:39:54.000Z (8 months ago)
- Default Branch: github-master
- Last Pushed: 2025-03-29T11:01:17.000Z (7 months ago)
- Last Synced: 2025-03-29T12:19:15.019Z (7 months ago)
- Topics: dataset, farsi, farsi-datasets, nlp, persian, persian-dataset
- Homepage:
- Size: 847 KB
- Stars: 32
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Persian Gender Detection by Name
A comprehensive dataset for determining gender based on Persian names, enriched with English representations.
## Overview
The **Persian Gender Detection by Name** dataset is the largest of its kind, comprising approximately **27,000** entries. Each entry includes a Persian name, its corresponding gender, and the English transliteration. This dataset is designed to facilitate accurate gender detection and enhance searchability through multiple name representations.
## Features
- **Extensive Data**: ~27,000 name-gender-English tuples.
- **Multiple Representations**: Various spellings and formats for each name to improve search flexibility.
- **High Quality**: Aggregated from reliable sources and meticulously hand-cleaned for accuracy.
- **Expandable**: Plans to incorporate more names and data sources in the future.

## Data Sources
This dataset aggregates information from the following primary sources:
- [Iranian Names Database By Gender](https://github.com/nikahd99/iranian-Names-Database-By-Gender)
- [Persian Names Gender Dataset on Kaggle](https://www.kaggle.com/datasets/misssahar75/persian-names-gender)
- [Persian Names with Gender and Transliteration Data](https://www.kaggle.com/datasets/titanz123/persian-names)

Additionally, supplementary data was scraped and manually cleaned to ensure consistency and completeness.
## Data Structure
The dataset is organized in a CSV format with the following columns:
- **Name**: The Persian name.
- **Gender**: Assigned gender (male or female; the example below uses the codes `M` and `F`).
- **English Representation**: The transliterated version of the Persian name.

**Example:**

| Name | Gender | English Representation |
|-------|--------|------------------------|
| علی | M | Ali |
| زهرا | F | Zahra |

## Usage
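A minimal sketch of how the three columns might be turned into a name-to-gender lookup, using only Python's standard library. The header names (`Name`, `Gender`, `English Representation`) are assumed to match the column list above, the sample rows are the two from the example table, and `load_gender_lookup` and `guess_gender` are hypothetical helpers, not part of the dataset or any library.

```python
import csv
import io
from collections import defaultdict

# Two sample rows copied from the example table. The header names are an
# assumption: the real CSV may label its columns differently.
SAMPLE_CSV = """Name,Gender,English Representation
علی,M,Ali
زهرا,F,Zahra
"""


def load_gender_lookup(handle):
    """Map each Persian name and English spelling to the set of genders seen for it."""
    lookup = defaultdict(set)
    for row in csv.DictReader(handle):
        gender = row["Gender"].strip()
        lookup[row["Name"].strip()].add(gender)
        # English keys are lowercased so lookups are case-insensitive.
        lookup[row["English Representation"].strip().lower()].add(gender)
    return dict(lookup)


def guess_gender(lookup, name):
    """Return the gender code when the name maps to exactly one gender, else None."""
    key = name.strip()
    genders = lookup.get(key) or lookup.get(key.lower()) or set()
    return next(iter(genders)) if len(genders) == 1 else None


lookup = load_gender_lookup(io.StringIO(SAMPLE_CSV))
print(guess_gender(lookup, "علی"))
print(guess_gender(lookup, "zahra"))
print(guess_gender(lookup, "unknown"))
```

To read the actual file instead of the in-memory sample, pass a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the CSV was saved locally. Names that appear with both genders return `None` here rather than a guess.
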
This dataset is ideal for:
- Developing gender prediction models based on Persian names.
- Academic research in linguistics, gender studies, and natural language processing.
- Enhancing search algorithms with multilingual name representations.

## Future Enhancements
Future updates will focus on:
- Expanding the dataset with additional names and gender associations.
- Incorporating more diverse sources to cover a broader range of names.
- Refining data quality through ongoing cleaning and validation processes.

## Citation
```
@dataset{bijary_persian_gender_by_name_2024,
author = {Farbod Bijary},
title = {Persian Gender Detection by Name},
year = {2024},
publisher = {Hugging Face},
license = {Apache-2.0},
url = {https://huggingface.co/datasets/farbodbij/persian-gender-by-name},
}
```
## Acknowledgments

Thanks to the contributors of the original datasets and those who assisted in data aggregation and cleaning.