
# Text Classification for Resumes

https://github.com/r-mahesh45/hr---resume-text-classification

Conducted Exploratory Data Analysis (EDA) on a large collection of resumes, represented the text with Bag of Words (BoW) and TF-IDF features, and built and evaluated multiple models, with Logistic Regression delivering standout performance. Created word clouds and histograms to explore the data.
Topics: data, datacleaning, extract-transform-load, feature-extraction, nlp, nltk-tokenizer, text-mining, text-processing


### Step-by-Step Explanation 🌟

#### 1. Data Preparation 📋
- **Objective**: Load the dataset and prepare it for analysis.
- **Actions**:
  - Create a sample DataFrame containing document information such as file names, text content, and categories (see the sketch below).
  - This DataFrame is used for all subsequent processing.
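
A minimal sketch of this step. The column names and sample rows here are hypothetical placeholders, not the repository's actual data:

```python
import pandas as pd

# Hypothetical stand-in for the real resume corpus; file names, texts,
# and categories are illustrative only.
df = pd.DataFrame({
    "file_name": [f"resume_{i:02d}.txt" for i in range(1, 7)],
    "text": [
        "Experienced Python developer with NLP and machine learning projects.",
        "Software engineer building backend services and REST APIs in Java.",
        "HR generalist skilled in recruiting, onboarding, and payroll.",
        "HR manager experienced in employee relations and benefits administration.",
        "Data analyst proficient in SQL, Excel, and statistical modeling.",
        "Data scientist working on predictive models and interactive dashboards.",
    ],
    "category": ["Engineering", "Engineering", "HR", "HR", "Data Science", "Data Science"],
})

print(df.head())
```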

#### 2. Text Preprocessing 🧹
- **Objective**: Clean and transform the raw text into a form suitable for analysis.
- **Actions** (see the sketch after this list):
  - **Convert to Lowercase**: Lowercase all text to ensure uniformity.
  - **Tokenize**: Split the text into individual words.
  - **Remove Stopwords**: Drop common words (e.g., "and", "the") that carry little meaning.
  - **Stemming**: Reduce words to their root forms (e.g., "running" to "run").
  - **Lemmatization**: Map words to their dictionary base forms (e.g., "better" to "good").
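
One way to implement this pipeline with NLTK, applying both stemming and lemmatization as listed above (many pipelines pick one or the other):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("punkt_tab")  # needed by word_tokenize on newer NLTK releases
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    tokens = word_tokenize(text.lower())                 # lowercase + tokenize
    tokens = [t for t in tokens if t.isalpha()]          # keep alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    tokens = [stemmer.stem(t) for t in tokens]           # stem to root forms
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # lemmatize to base forms

df["clean_text"] = df["text"].apply(preprocess)
```

Note that `WordNetLemmatizer` only maps "better" to "good" when told the word is an adjective (`lemmatize("better", pos="a")`); with the default noun part of speech it leaves the word unchanged.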

#### 3. Feature Extraction 🔍
- **Objective**: Convert the processed text into numerical features using TF-IDF.
- **Actions**:
  - **TF-IDF Vectorization**: Transform the text into a matrix of TF-IDF features (see the sketch below). TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how important it is to a document relative to the entire corpus.
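
A sketch using scikit-learn's `TfidfVectorizer`; the `max_features` cap is an assumed setting, not taken from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)  # cap vocabulary size (assumed setting)
X = vectorizer.fit_transform(df["clean_text"])   # sparse matrix: documents x terms
print(X.shape)
```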

#### 4. Encode the Target Variable 🔢
- **Objective**: Convert the categorical target variable (document category) into numerical values.
- **Actions**:
  - Use `LabelEncoder` to encode the categories as integers, as shown below.
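
A short sketch of the encoding step:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df["category"])  # e.g., "Data Science" -> 0, "Engineering" -> 1, ...

# Mapping from class label to integer code (alphabetical order by default).
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```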

#### 5. Split the Data ✂️
- **Objective**: Split the dataset into training and testing sets for model evaluation.
- **Actions**:
  - Use `train_test_split` to split the data into training (80%) and testing (20%) sets (see the sketch below).
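
A sketch of the split; `random_state=42` is an assumed value chosen for reproducibility:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # 80/20 split; fixed seed for reproducibility
)
```

On imbalanced data, passing `stratify=y` keeps the class proportions similar in both sets.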

#### 6. Model Training 🧠
- **Objective**: Train a machine learning model on the training data.
- **Actions**:
  - **Model Selection**: Use a Logistic Regression model.
  - **Training**: Fit the model on the training data, as shown below.
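
A minimal training sketch; `max_iter=1000` is an assumed setting to head off convergence warnings on TF-IDF features:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raised iteration cap (assumed setting)
model.fit(X_train, y_train)
```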

#### 7. Model Evaluation 📊
- **Objective**: Evaluate the model's performance on both the training and testing sets.
- **Actions** (see the sketch after this list):
  - **Predictions**: Make predictions on the training and testing sets.
  - **Accuracy**: Calculate the model's accuracy on both sets.
  - **Classification Report**: Generate a classification report with per-class precision, recall, and F1-score.
  - **Bar Plot**: Visualize the training and testing accuracy with a bar plot.
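
A sketch of the evaluation step; the plot labels and limits are arbitrary choices:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report

train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
print(f"Train accuracy: {train_acc:.3f}  Test accuracy: {test_acc:.3f}")

# Per-class precision, recall, and F1 on the held-out set.
print(classification_report(y_test, test_pred, zero_division=0))

# Bar plot comparing accuracy on the two sets.
plt.bar(["Train", "Test"], [train_acc, test_acc])
plt.ylabel("Accuracy")
plt.ylim(0, 1)
plt.title("Accuracy on training vs. testing sets")
plt.show()
```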

#### 8. Create Word Cloud ☁️
- **Objective**: Visualize the most important words in the dataset based on their TF-IDF scores.
- **Actions** (see the sketch after this list):
  - **TF-IDF Means**: Compute the average TF-IDF score of each feature (word) across all documents.
  - **Word Cloud Generation**: Generate a word cloud in which the size of each word reflects its importance (mean TF-IDF score).
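
A sketch using the `wordcloud` package, sizing words by their mean TF-IDF score:

```python
import matplotlib.pyplot as plt
import numpy as np
from wordcloud import WordCloud

# Average TF-IDF score of each term across all documents.
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
frequencies = dict(zip(vectorizer.get_feature_names_out(), mean_tfidf))

# Word size is proportional to mean TF-IDF score.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(frequencies)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```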

### Summary 🌟

This walkthrough covers preparing text data, extracting TF-IDF features, training and evaluating a Logistic Regression classifier, and visualizing the most important words with a word cloud. Together, these steps reveal which words carry the most weight in the dataset and how well the classification model performs.