https://github.com/r-mahesh45/hr---resume-text-classification
Text Classification for Resumes: Conducted Exploratory Data Analysis (EDA) on a vast collection of resumes. Organized the data using Bag of Words (BoW) and TF-IDF techniques. Built and evaluated multiple models, with Logistic Regression delivering standout performance. Created Word Clouds and Histograms.
- Host: GitHub
- URL: https://github.com/r-mahesh45/hr---resume-text-classification
- Owner: R-Mahesh45
- Created: 2024-06-15T06:21:03.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-05T18:01:26.000Z (8 months ago)
- Last Synced: 2025-06-08T08:03:52.584Z (4 months ago)
- Topics: data, datacleaning, extract-transform-load, feature-extraction, nlp, nltk-tokenizer, text-mining, text-processing
- Language: Jupyter Notebook
- Homepage:
- Size: 11.2 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
### Step-by-Step Explanation
#### 1. Data Preparation
- **Objective**: Load the dataset and prepare it for analysis.
- **Actions**:
- Create a sample DataFrame containing document information such as file names, text content, and categories.
- This DataFrame will be used for further processing.
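A minimal sketch of this step, using made-up resume snippets; the real project loads actual resume files, and the column names (`file_name`, `text`, `category`) are assumptions, not taken from the notebook:

```python
import pandas as pd

# Hypothetical sample data; the actual dataset is a collection of resume files.
df = pd.DataFrame({
    "file_name": ["resume_001.txt", "resume_002.txt", "resume_003.txt"],
    "text": [
        "Experienced data scientist skilled in Python and machine learning.",
        "HR professional with expertise in recruitment and employee relations.",
        "Software engineer focused on backend development with Java.",
    ],
    "category": ["Data Science", "HR", "Engineering"],
})
print(df.head())
```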
#### 2. Text Preprocessing
- **Objective**: Clean and transform the raw text data into a format suitable for analysis.
- **Actions**:
- **Convert to Lowercase**: Convert all text to lowercase to ensure uniformity.
- **Tokenize**: Split the text into individual words.
- **Remove Stopwords**: Remove common words (e.g., "and", "the") that do not carry significant meaning.
- **Stemming**: Reduce words to their root forms (e.g., "running" to "run").
- **Lemmatization**: Further reduce words to their base forms (e.g., "better" to "good").
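A rough preprocessing sketch with NLTK, assuming the `df` DataFrame from the previous step; applying both stemming and lemmatization mirrors the list above, although many pipelines pick one or the other:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads for the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()                                    # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                  # drop punctuation and digits
    tokens = nltk.word_tokenize(text)                      # tokenize into words
    tokens = [t for t in tokens if t not in stop_words]    # remove stopwords
    tokens = [stemmer.stem(t) for t in tokens]             # stemming
    tokens = [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization
    return " ".join(tokens)

df["clean_text"] = df["text"].apply(preprocess)
```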
#### 3. Feature Extraction
- **Objective**: Convert the processed text data into numerical features using TF-IDF.
- **Actions**:
- **TF-IDF Vectorization**: Transform the text data into a matrix of TF-IDF features. TF-IDF (Term Frequency-Inverse Document Frequency) captures the importance of each word in a document relative to the entire corpus.
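A possible TF-IDF step with scikit-learn, assuming the cleaned text column from the preprocessing sketch; the `max_features` cap is an assumed setting, not confirmed from the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Cap the vocabulary size (assumed value) and build the document-term matrix.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["clean_text"])   # sparse matrix: documents x terms
print(X.shape)
```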
#### 4. Encode the Target Variable
- **Objective**: Convert the categorical target variable (document category) into numerical values.
- **Actions**:
- Use `LabelEncoder` to encode the categories into numerical values.
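A short sketch of the encoding step, assuming a `category` column as in the data-preparation example:

```python
from sklearn.preprocessing import LabelEncoder

# Map each category string to an integer label (assigned in sorted order).
le = LabelEncoder()
y = le.fit_transform(df["category"])
print(dict(zip(le.classes_, range(len(le.classes_)))))
```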
#### 5. Split the Data
- **Objective**: Split the dataset into training and testing sets for model evaluation.
- **Actions**:
- Use `train_test_split` to split the data into training (80%) and testing (20%) sets.
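An 80/20 split with scikit-learn, assuming `X` and `y` from the previous sketches; `random_state=42` is just an assumed value for reproducibility:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the documents for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```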
#### 6. Model Training
- **Objective**: Train a machine learning model on the training data.
- **Actions**:
- **Model Selection**: Use a Logistic Regression model.
- **Training**: Fit the model on the training data.
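Fitting a Logistic Regression classifier on the TF-IDF features; raising `max_iter` is an assumed precaution so the solver converges on sparse, high-dimensional input, not a setting taken from the notebook:

```python
from sklearn.linear_model import LogisticRegression

# Train a multiclass logistic regression model on the TF-IDF features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```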
#### 7. Model Evaluation
- **Objective**: Evaluate the model's performance on both the training and testing sets.
- **Actions**:
- **Predictions**: Make predictions on the training and testing sets.
- **Accuracy**: Calculate the accuracy of the model on both sets.
- **Classification Report**: Generate a classification report to provide detailed metrics (precision, recall, F1-score) for each class.
- **Bar Plot**: Visualize the accuracy on the training and testing sets using a bar plot.
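An evaluation sketch covering the accuracy comparison, classification report, and bar plot described above, assuming `model` and the split data from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report

# Predictions on both splits.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Accuracy on training vs. testing data.
train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
print(f"Train accuracy: {train_acc:.3f}  Test accuracy: {test_acc:.3f}")

# Per-class precision, recall, and F1 on the test set.
print(classification_report(y_test, test_pred, zero_division=0))

# Bar plot comparing the two accuracies.
plt.bar(["Train", "Test"], [train_acc, test_acc], color=["steelblue", "orange"])
plt.ylabel("Accuracy")
plt.title("Logistic Regression accuracy")
plt.show()
```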
#### 8. Create Word Cloud
- **Objective**: Visualize the most important words in the dataset based on their TF-IDF scores.
- **Actions**:
- **TF-IDF Means**: Calculate the average TF-IDF scores for each feature (word).
- **Word Cloud Generation**: Generate a word cloud where the size of each word indicates its importance (TF-IDF score).
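A word-cloud sketch based on mean TF-IDF scores, assuming `X` and `vectorizer` from the feature-extraction step; it requires the `wordcloud` package (`pip install wordcloud`):

```python
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Average TF-IDF score of each term across all documents.
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
word_scores = dict(zip(vectorizer.get_feature_names_out(), mean_tfidf))

# Word size reflects the term's mean TF-IDF weight.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(word_scores)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```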
### Summary
This step-by-step explanation walks through preparing text data, extracting TF-IDF features, training and evaluating a classification model, and visualizing important words with a word cloud. Together, these steps show which words carry the most weight in the dataset and how well the model distinguishes the resume categories.