https://github.com/sheikh-adeel/hypothesis-testing-education-data

Conduct two-sample t-test on the mean district literacy rates from sample data of two states

hypothesis-testing jupyter-notebook pandas python scipy-stats two-sample-t-test


Host: GitHub
URL: https://github.com/sheikh-adeel/hypothesis-testing-education-data
Owner: sheikh-adeel
Created: 2025-08-10T15:05:17.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-08-10T22:01:22.000Z (10 months ago)
Last Synced: 2025-08-14T00:23:15.176Z (10 months ago)
Topics: hypothesis-testing, jupyter-notebook, pandas, python, scipy-stats, two-sample-t-test
Language: Jupyter Notebook
Homepage:
Size: 31.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# **Hypothesis Test of Sample Mean District Literacy Rates**

## **Overview**
In this project, we conduct a two-sample t-test on the mean district literacy rates of two states: Uttar Pradesh and Bihar. We consider only 20 randomly chosen districts from each state. The goal is to determine whether the difference between the two sample means is statistically significant or could plausibly be due to chance. The result is meant to inform how funding to improve literacy is distributed.

## **Data Dictionary**
This project uses a dataset called `2015_16_Districtwise.csv` from [Kaggle](https://www.kaggle.com/datasets/rajanand/education-in-india?resource=download&select=2015_16_Districtwise.csv). This dataset represents a list of school districts in India. The data includes district and state names, total population, and the literacy rate.

*Note: The dataset has been modified for this project.*

The dataset contains:

**680 rows** – each row is a different school district

**7 columns**

| Column name | Type | Description |
|-------------|------|-------------|
| DISTNAME | str | names of the districts |
| STATNAME | str | names of the states where districts are located |
| BLOCKS | int64 | number of blocks in each district |
| VILLAGES | int64 | number of villages in each district |
| CLUSTERS | int64 | number of clusters in each district |
| TOTPOPULAT | float64 | population in each district |
| OVERALL_LI | float64 | literacy rate in each district |

## **Steps**
### **1. Imports**
- Import libraries and packages
- Load the dataset into a pandas DataFrame
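
The loading step might look roughly like the sketch below. The helper name `load_districts` is hypothetical, and the two CSV rows are made-up placeholders that only mirror the column names from the data dictionary; in the project, the source would be the `2015_16_Districtwise.csv` file instead.

```python
import io

import pandas as pd

# Made-up placeholder rows; only the column names come from the data dictionary.
toy_csv = io.StringIO(
    "DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI\n"
    "DISTRICT A,STATE X,10,300,40,900000.0,61.5\n"
    "DISTRICT B,STATE Y,7,150,25,,72.0\n"
)


def load_districts(source):
    """Read the district-level CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# In the project this would be load_districts("2015_16_Districtwise.csv").
education = load_districts(toy_csv)
print(education.head())
```

Passing a file-like object keeps the sketch runnable without the Kaggle download; the same function accepts a file path.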

### **2. Data cleaning and exploration**
- Assess the size of the dataset
- Determine the shape of the dataset
- Get basic information about the dataset
- Check for duplicate rows and null values in the data
- Determine the total population and number of districts with respect to state
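
As a rough illustration of these checks, the sketch below runs them on a tiny made-up frame. The rows and the `STATE X` / `STATE Y` labels are placeholders, and dropping rows with missing values is an assumption based on the findings, which report that rows with null values were removed.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse a few column names from the data dictionary.
education = pd.DataFrame({
    "DISTNAME": ["D1", "D2", "D3", "D4"],
    "STATNAME": ["STATE X", "STATE X", "STATE Y", "STATE Y"],
    "TOTPOPULAT": [100.0, 200.0, np.nan, 50.0],
    "OVERALL_LI": [60.0, 70.0, 80.0, np.nan],
})

print(education.size)    # total number of cells (rows * columns)
print(education.shape)   # (rows, columns)
education.info()         # dtypes and non-null counts per column

n_duplicates = int(education.duplicated().sum())  # fully duplicated rows
null_counts = education.isna().sum()              # missing values per column

# Assumption: rows with any missing value are dropped before analysis.
cleaned = education.dropna()

# Total population and number of districts per state.
by_state = cleaned.groupby("STATNAME").agg(
    total_population=("TOTPOPULAT", "sum"),
    n_districts=("DISTNAME", "count"),
)
print(by_state.sort_values("total_population", ascending=False))
```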

### **3. Organizing data**
- Filter the data for the two states: Uttar Pradesh and Bihar
- Simulate random sampling with 20 districts from each of the two states
- Choose an arbitrary number for the random seed
- Calculate the sample means
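
The filtering and sampling steps could be sketched as follows. The helper `sample_districts`, the seed value, and the synthetic literacy rates are all illustrative assumptions; in the project, the `state` argument would be the two states' labels exactly as they are spelled in `STATNAME`. This sketch samples without replacement, which may differ from what the notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 30 districts per placeholder state.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "STATNAME": ["STATE A"] * 30 + ["STATE B"] * 30,
    "OVERALL_LI": rng.uniform(50, 90, size=60),
})


def sample_districts(df, state, n=20, seed=31208):
    """Filter to one state and draw n districts; seed is an arbitrary fixed number."""
    state_rows = df[df["STATNAME"] == state]
    return state_rows.sample(n=n, random_state=seed)


sample_a = sample_districts(toy, "STATE A")
sample_b = sample_districts(toy, "STATE B")

mean_a = sample_a["OVERALL_LI"].mean()
mean_b = sample_b["OVERALL_LI"].mean()
print(f"sample means: {mean_a:.1f} vs {mean_b:.1f}")
```

Fixing the seed makes the draw reproducible, so the same 20 districts are selected on every run.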

### **4. Conducting the hypothesis test**
- State the null hypothesis and the alternative hypothesis
  - $H_0$: There is no difference in the mean district literacy rates between Uttar Pradesh and Bihar
  - $H_A$: There is a difference in the mean district literacy rates between Uttar Pradesh and Bihar
- Choose a significance level of 5%
- Find the p-value
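
One way to obtain the p-value is `scipy.stats.ttest_ind`, sketched below on synthetic samples. The sample values are randomly generated placeholders, and `equal_var=False` (Welch's t-test, which does not assume equal variances) is an assumption; the notebook may use a different setting.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder samples standing in for the two sets of 20 district rates.
rng = np.random.default_rng(1)
sample_a = rng.normal(loc=70, scale=8, size=20)
sample_b = rng.normal(loc=64, scale=8, size=20)

# Two-sided two-sample t-test; equal_var=False selects Welch's variant.
result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

# The same statistic computed by hand: difference in means over its standard error.
std_error = np.sqrt(sample_a.var(ddof=1) / sample_a.size
                    + sample_b.var(ddof=1) / sample_b.size)
t_manual = (sample_a.mean() - sample_b.mean()) / std_error

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The two-sided default matches the alternative hypothesis, which states only that the means differ, not which one is larger.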

### **5. Results and evaluation**
- Compare the p-value with the significance level
- Reject or fail to reject the null hypothesis
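
The decision rule can be written as a small function. The name `decide` is hypothetical; the 5% significance level and the p-value of 0.0064 are the figures reported in this README.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Reject H0 only when the p-value is strictly below the significance level.
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.0064))  # p-value reported in the findings (0.64%)
```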

## **Findings**
- There are 680 school districts from India represented in this dataset.
- 46 rows had `Null` values and were removed.
- The states of Uttar Pradesh, Maharashtra and Bihar have the largest populations.
- Kerala has the highest mean literacy rate and Bihar has the lowest mean literacy rate.
- Uttar Pradesh has the highest number of districts.
- From the simulated random samples, Uttar Pradesh has a mean district literacy rate of about 70.8%, while Bihar has a mean district literacy rate of about 64.6%.
- The observed difference between the sample mean district literacy rates of the two states is 6.2 percentage points (70.8% - 64.6%).
- The p-value is about 0.0064 (0.64%), which is less than the significance level of 0.05 (5%).
- We reject the null hypothesis and conclude that there is a statistically significant difference between the mean district literacy rates of the two states: Uttar Pradesh and Bihar.

## **Recommendations**
- Because there is a statistically significant difference in the mean district literacy rates of Uttar Pradesh and Bihar, resources should be allocated accordingly.
- Bihar should receive more resources to improve literacy.
- Other states with low mean district literacy rates and large populations, such as Madhya Pradesh, Rajasthan and Andhra Pradesh, might need more resources than other states to improve literacy rates.

## **Next steps**
- Conduct hypothesis tests for other states to determine which states to focus on to improve literacy rates

## **Possible questions**
- How was the literacy rate calculated?
- How was the total population determined? Does it cover only urban areas? Was any specific age group considered?
- Are all the states and districts covered in the dataset?
- What factors were considered while collecting this data? Is there any bias in the data collection process?
