
# Text Classification for Resumes

https://github.com/r-mahesh45/hr---resume-text-classification

Conducted Exploratory Data Analysis (EDA) on a large collection of resumes, represented the text with Bag of Words (BoW) and TF-IDF features, and built and evaluated multiple models, with Logistic Regression delivering standout performance. Created word clouds and histograms to explore the data.
Topics: data, datacleaning, extract-transform-load, feature-extraction, nlp, nltk-tokenizer, text-mining, text-processing


### Step-by-Step Explanation 🌟

#### 1. Data Preparation 📋
- **Objective**: Load the dataset and prepare it for analysis.
- **Actions**:
  - Create a sample DataFrame containing document information such as file names, text content, and categories (see the sketch below).
  - This DataFrame is used for all subsequent processing.
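
A minimal sketch of this step. The column names and sample rows here are hypothetical placeholders, not the repository's actual data:

```python
import pandas as pd

# Hypothetical stand-in for the real resume corpus; file names, texts,
# and categories are illustrative only.
df = pd.DataFrame({
    "file_name": [f"resume_{i:02d}.txt" for i in range(1, 7)],
    "text": [
        "Experienced Python developer with NLP and machine learning projects.",
        "Software engineer building backend services and REST APIs in Java.",
        "HR generalist skilled in recruiting, onboarding, and payroll.",
        "HR manager experienced in employee relations and benefits administration.",
        "Data analyst proficient in SQL, Excel, and statistical modeling.",
        "Data scientist working on predictive models and interactive dashboards.",
    ],
    "category": ["Engineering", "Engineering", "HR", "HR", "Data Science", "Data Science"],
})

print(df.head())
```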

#### 2. Text Preprocessing 🧹
- **Objective**: Clean and transform the raw text into a form suitable for analysis.
- **Actions** (see the sketch after this list):
  - **Convert to Lowercase**: Lowercase all text to ensure uniformity.
  - **Tokenize**: Split the text into individual words.
  - **Remove Stopwords**: Drop common words (e.g., "and", "the") that carry little meaning.
  - **Stemming**: Reduce words to their root forms (e.g., "running" to "run").
  - **Lemmatization**: Map words to their dictionary base forms (e.g., "better" to "good").
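
One way to implement this pipeline with NLTK, applying both stemming and lemmatization as listed above (many pipelines pick one or the other):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("punkt_tab")  # needed by word_tokenize on newer NLTK releases
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    tokens = word_tokenize(text.lower())                 # lowercase + tokenize
    tokens = [t for t in tokens if t.isalpha()]          # keep alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    tokens = [stemmer.stem(t) for t in tokens]           # stem to root forms
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # lemmatize to base forms

df["clean_text"] = df["text"].apply(preprocess)
```

Note that `WordNetLemmatizer` only maps "better" to "good" when told the word is an adjective (`lemmatize("better", pos="a")`); with the default noun part of speech it leaves the word unchanged.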

#### 3. Feature Extraction 🔍
- **Objective**: Convert the processed text into numerical features using TF-IDF.
- **Actions**:
  - **TF-IDF Vectorization**: Transform the text into a matrix of TF-IDF features (see the sketch below). TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how important it is to a document relative to the entire corpus.
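
A sketch using scikit-learn's `TfidfVectorizer`; the `max_features` cap is an assumed setting, not taken from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)  # cap vocabulary size (assumed setting)
X = vectorizer.fit_transform(df["clean_text"])   # sparse matrix: documents x terms
print(X.shape)
```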

#### 4. Encode the Target Variable 🔢
- **Objective**: Convert the categorical target variable (document category) into numerical values.
- **Actions**:
  - Use `LabelEncoder` to encode the categories as integers, as shown below.
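
A short sketch of the encoding step:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df["category"])  # e.g., "Data Science" -> 0, "Engineering" -> 1, ...

# Mapping from class label to integer code (alphabetical order by default).
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```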

#### 5. Split the Data ✂️
- **Objective**: Split the dataset into training and testing sets for model evaluation.
- **Actions**:
  - Use `train_test_split` to split the data into training (80%) and testing (20%) sets (see the sketch below).
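
A sketch of the split; `random_state=42` is an assumed value chosen for reproducibility:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # 80/20 split; fixed seed for reproducibility
)
```

On imbalanced data, passing `stratify=y` keeps the class proportions similar in both sets.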

#### 6. Model Training 🧠
- **Objective**: Train a machine learning model on the training data.
- **Actions**:
  - **Model Selection**: Use a Logistic Regression model.
  - **Training**: Fit the model on the training data, as shown below.
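
A minimal training sketch; `max_iter=1000` is an assumed setting to head off convergence warnings on TF-IDF features:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raised iteration cap (assumed setting)
model.fit(X_train, y_train)
```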

#### 7. Model Evaluation 📊
- **Objective**: Evaluate the model's performance on both the training and testing sets.
- **Actions** (see the sketch after this list):
  - **Predictions**: Make predictions on the training and testing sets.
  - **Accuracy**: Calculate the model's accuracy on both sets.
  - **Classification Report**: Generate a classification report with per-class precision, recall, and F1-score.
  - **Bar Plot**: Visualize the training and testing accuracy with a bar plot.
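
A sketch of the evaluation step; the plot labels and limits are arbitrary choices:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report

train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
print(f"Train accuracy: {train_acc:.3f}  Test accuracy: {test_acc:.3f}")

# Per-class precision, recall, and F1 on the held-out set.
print(classification_report(y_test, test_pred, zero_division=0))

# Bar plot comparing accuracy on the two sets.
plt.bar(["Train", "Test"], [train_acc, test_acc])
plt.ylabel("Accuracy")
plt.ylim(0, 1)
plt.title("Accuracy on training vs. testing sets")
plt.show()
```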

#### 8. Create Word Cloud ☁️
- **Objective**: Visualize the most important words in the dataset based on their TF-IDF scores.
- **Actions** (see the sketch after this list):
  - **TF-IDF Means**: Compute the average TF-IDF score of each feature (word) across all documents.
  - **Word Cloud Generation**: Generate a word cloud in which the size of each word reflects its importance (mean TF-IDF score).
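
A sketch using the `wordcloud` package, sizing words by their mean TF-IDF score:

```python
import matplotlib.pyplot as plt
import numpy as np
from wordcloud import WordCloud

# Average TF-IDF score of each term across all documents.
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
frequencies = dict(zip(vectorizer.get_feature_names_out(), mean_tfidf))

# Word size is proportional to mean TF-IDF score.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(frequencies)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```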

### Summary 🌟

This walkthrough covers preparing text data, extracting TF-IDF features, training and evaluating a Logistic Regression classifier, and visualizing the most important words with a word cloud. Together, these steps reveal which words carry the most weight in the dataset and how well the classification model performs.