https://github.com/relostar-devil/census-income-prediction
An end-to-end ML project using 1994 US Census data to classify income (>50K or <=50K). The Jupyter Notebook covers data preprocessing, EDA, and model evaluation with multiple classifiers.
matplotlib numpy pandas scikitlearn-machine-learning seaborn
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/relostar-devil/census-income-prediction
- Owner: Relostar-Devil
- Created: 2025-02-12T00:42:01.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-02-12T00:46:26.000Z (12 months ago)
- Last Synced: 2025-02-12T01:35:17.674Z (12 months ago)
- Topics: matplotlib, numpy, pandas, scikitlearn-machine-learning, seaborn
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Census Income
A machine learning project that predicts whether an individual earns over \$50K per year, based on the US Census Income dataset from the UCI Machine Learning Repository. The notebook walks through data preprocessing, exploratory data analysis (EDA), and predictive modeling with several standard `scikit-learn` classifiers.
## Project Overview
- **Objective:**
Predict the income level of individuals (<=50K or >50K) by analyzing demographic features from census data.
- **Dataset:**
The dataset contains records for over 48,000 individuals, extracted from the 1994 US Census database. For more details, see its description on the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/census+income
- **Approach:**
  The project involves:
  - Data ingestion and cleaning
  - Exploratory Data Analysis (EDA) with visualizations
  - Feature engineering and scaling using tools like `LabelEncoder` and `StandardScaler`
  - Training and evaluating multiple classification models, including:
    - Logistic Regression
    - Support Vector Classifier (SVC)
    - Random Forest Classifier
    - Gradient Boosting Classifier
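The cleaning, encoding, and scaling steps listed above could look roughly like the following. This is only a minimal sketch, not the notebook's actual code: the toy rows, the column names, the `preprocess` helper, and the choice to drop rows containing `?` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy rows shaped like the census data. The real notebook
# reads CSV files whose names are not given in this README.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "workclass": ["State-gov", "Self-emp", "Private", "?", "Private", "Private"],
    "hours_per_week": [40, 13, 40, 40, 40, 45],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})


def preprocess(df, target="income"):
    """Drop rows with '?' placeholders, encode categoricals, scale numerics."""
    # Assumption: "?" marks a missing value; dropping such rows is one simple choice.
    clean = df[(df != "?").all(axis=1)].reset_index(drop=True)
    features = clean.drop(columns=target)

    X = pd.DataFrame(index=clean.index)
    for col in features.columns:
        if pd.api.types.is_numeric_dtype(features[col]):
            # Standardize numeric columns to zero mean and unit variance.
            X[col] = StandardScaler().fit_transform(features[[col]]).ravel()
        else:
            # Map each category to an integer code.
            X[col] = LabelEncoder().fit_transform(features[col])

    # Encode the target labels as integers (classes are sorted alphabetically).
    y = LabelEncoder().fit_transform(clean[target])
    return X, y


X, y = preprocess(raw)
print(X)
print(y)
```

In a real workflow the scaler would normally be fitted on the training split only, so that statistics from the test set do not leak into training.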
## Technical Highlights
- **Data Handling:**
  - Uses `pandas` and `numpy` for data manipulation.
  - Reads the census data from multiple CSV files.
- **Visualization & EDA:**
  - Uses `matplotlib` and `seaborn` to create visualizations.
  - Generates data summaries and descriptive statistics.
- **Machine Learning Pipeline:**
  - Splits the dataset into training and testing sets using `train_test_split`.
  - Implements several machine learning algorithms from the `scikit-learn` package.
  - Evaluates model performance with `accuracy_score` and `classification_report`.
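The split, train, and evaluate steps described above can be sketched as follows. This is a hedged illustration rather than the notebook's code: it uses synthetic data from `make_classification` as a stand-in for the preprocessed census features, and the test size, random seeds, and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed census features (an assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out 20% of the rows for testing, keeping the class balance (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The four model families named in this README, with illustrative settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: accuracy = {scores[name]:.3f}")
    print(classification_report(y_test, y_pred))
```

Accuracy alone can be misleading when one income class is much more common than the other, which is why the per-class precision and recall from `classification_report` are worth reading alongside it.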
## Project Structure
- `Census-Income.ipynb` – Main Jupyter Notebook containing the full workflow from data ingestion to model evaluation.
- Data files – CSV files containing the census data.
- Additional documentation and visualizations are embedded within the notebook.