{"id":18510406,"url":"https://github.com/rakibhhridoy/exploratorydataanalysis-python","last_synced_at":"2026-04-30T08:39:45.076Z","repository":{"id":131513649,"uuid":"281356665","full_name":"rakibhhridoy/ExploratoryDataAnalysis-Python","owner":"rakibhhridoy","description":"Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. ","archived":false,"fork":false,"pushed_at":"2020-08-18T11:11:08.000Z","size":2091,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-12-25T20:25:10.352Z","etag":null,"topics":["ab-testing","chitest","data-science","eda","exploratory-data-analysis","ftest","hypotheses","hypothesis-testing","inferential-statistics","numpy","pandas","python","statistical-analysis","statistics","statsmodels","ttest"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rakibhhridoy.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-21T09:40:25.000Z","updated_at":"2024-02-25T13:44:46.000Z","dependencies_parsed_at":"2023-03-21T15:10:28.107Z","dependency_job_id":null,"html_url":"https://github.com/rakibhhridoy/ExploratoryDataAnalysis-Python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rakibhhridoy%2FExploratoryDataAnalysis-Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rakibhhridoy%2FExploratoryDataAnalysis-Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rakibhhridoy%2FExploratoryDataAnalysis-Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rakibhhridoy%2FExploratoryDataAnalysis-Python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rakibhhridoy","download_url":"https://codeload.github.com/rakibhhridoy/ExploratoryDataAnalysis-Python/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239225764,"owners_count":19603162,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ab-testing","chitest","data-science","eda","exploratory-data-analysis","ftest","hypotheses","hypothesis-testing","inferential-statistics","numpy","pandas","python","statistical-analysis","statistics","statsmodels","ttest"],"created_at":"2024-11-06T15:23:13.491Z","updated_at":"2026-04-30T08:39:45.014Z","avatar_url":"https://github.com/rakibhhridoy.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# *Exploratory Data Analysis in Python \u0026 Hypothesis Testing*\n\u003e1: Defining Exploratory Data Analysis with an overview of the whole project.\n\n\u003e2: Importing libraries and Exploring the Dataset.\n\n\u003e3: Checking missing values and Outliers.\n\n\u003e4: Creating visual methods to analyze the data.\n\n\u003e5: Analyzing trends, patterns, and relationships in the Data. Hypothesis Testing.\n\n### Exploratory Data Analysis\n```\nIn statistics, exploratory data analysis is an approach to analyzing \ndata sets to summarize their main characteristics, often with visual methods. \nA statistical model can be used or not, but primarily EDA is for seeing what \nthe data can tell us beyond the formal modeling or hypothesis testing task. \nExploratory data analysis was promoted by John Tukey to encourage statisticians \nto explore the data, and possibly formulate hypotheses that could lead to new \ndata collection and experiments. EDA is different from initial data analysis (IDA),\nwhich focuses more narrowly on checking assumptions required for model fitting and \nhypothesis testing, and handling missing values and making transformations of variables \nas needed. 
EDA encompasses IDA.\n```\n\n## Importing libraries and Exploring the Dataset\n```python\nimport numpy as np \nimport pandas as pd \nfrom matplotlib import pyplot as plt\nimport seaborn as sns\nimport statsmodels.api as sm\nimport scipy.stats as stats\nfrom sklearn.preprocessing import LabelEncoder\nimport copy\nsns.set() \n```\n```python\ninsurance_df.info()\n```\n\n    \u003cclass 'pandas.core.frame.DataFrame'\u003e\n    RangeIndex: 1338 entries, 0 to 1337\n    Data columns (total 7 columns):\n     #   Column    Non-Null Count  Dtype  \n    ---  ------    --------------  -----  \n     0   age       1338 non-null   int64  \n     1   sex       1338 non-null   object \n     2   bmi       1338 non-null   float64\n     3   children  1338 non-null   int64  \n     4   smoker    1338 non-null   object \n     5   region    1338 non-null   object \n     6   charges   1338 non-null   float64\n    dtypes: float64(2), int64(2), object(3)\n    memory usage: 73.3+ KB\n    \n\nExpected output:\n\n    The data should consist of 1338 instances with 7 attributes. 
2 integer type, 2 float type and 3 object type (Strings in the column)\n\n### Checking missing values and Outliers\n```python\n# Count missing (True) vs. present (False) values per column\ninsurance_df.isna().apply(pd.Series.value_counts)\n```\n\n\n```python\ninsurance_df.describe().T\n```\nOutput should include this Analysis:\n\n- All the statistics seem reasonable.\n\n- Age column: data looks representative of the true age distribution of the adult population, with a mean of about 39.\n\n- Children Column: Few people have more than 2 children (75% of the people have 2 or fewer children).\n\n- The claimed amount is highly skewed, as most people require only basic medical care and only a few suffer from diseases that are more expensive to treat.\n\n\n![png](plots/output_15_0.png)\n\n\nOutput should include this Analysis:\n\n- bmi looks approximately normally distributed.\n\n- Age looks approximately uniformly distributed.\n\n- As seen in the previous step, charges are highly skewed.\n\n\n![png](plots/output_19_0.png)\n\n\nOutput should include this Analysis:\n\n- There are a lot more non-smokers than smokers.\n\n- Instances are distributed evenly across all regions.\n\n- Gender is also distributed evenly.\n\n- Most instances have fewer than 3 children and very few have 4 or 5 children.\n\n\n```python\n# Label encoding the variables before doing a pairplot because pairplot ignores strings\n\ninsurance_df_encoded = copy.deepcopy(insurance_df)\ninsurance_df_encoded.loc[:,['sex', 'smoker', 'region']] = insurance_df_encoded.loc[:,['sex', 'smoker', 'region']].apply(LabelEncoder().fit_transform) \n\nsns.pairplot(insurance_df_encoded)  #pairplot\nplt.show()\n```\n![png](plots/output_21_0.png)\n\n\nOutput should include this Analysis:\n\n- There is an obvious correlation between 'charges' and 'smoker'\n\n- Looks like smokers claimed more money than non-smokers\n\n- There's an interesting pattern between 'age' and 'charges'. 
Notice that older people are charged more than the younger ones\n\n### Analyzing trends, patterns, and relationships in the Data.\n\n\n```python\nprint(\"Do charges of people who smoke differ significantly from the people who don't?\")\ninsurance_df.smoker.value_counts()\n```\n\n    Do charges of people who smoke differ significantly from the people who don't?\n    \n\n\n\n\n    no     1064\n    yes     274\n    Name: smoker, dtype: int64\n\n\n\n![png](plots/output_25_0.png)\n![png](plots/output_26_0.png)\n\nThere is no apparent relation between gender and charges\n\n\n```python\n# T-test to check dependency of smoking on charges\nHo = \"Charges of smoker and non-smoker are same\"   # Stating the Null Hypothesis\nHa = \"Charges of smoker and non-smoker are not the same\"   # Stating the Alternate Hypothesis\n\nx = np.array(insurance_df[insurance_df.smoker == 'yes'].charges)  # Selecting charges corresponding to smokers as an array\ny = np.array(insurance_df[insurance_df.smoker == 'no'].charges) # Selecting charges corresponding to non-smokers as an array\n\nt, p_value  = stats.ttest_ind(x,y, axis = 0)  #Performing an Independent t-test\n\nif p_value \u003c 0.05:  # Setting our significance level at 5%\n    print(f'{Ha} as the p_value ({p_value}) \u003c 0.05')\nelse:\n    print(f'{Ho} as the p_value ({p_value}) \u003e 0.05')\n```\n\n    Charges of smoker and non-smoker are not the same as the p_value (8.271435842177219e-283) \u003c 0.05\n    \n\nThus, Smokers seem to claim significantly more money than non-smokers\n\n\n```python\n#Does bmi of males differ significantly from that of females?\nprint (\"Does bmi of males differ significantly from that of females?\")\ninsurance_df.sex.value_counts()   #Checking the distribution of males and females\n```\n\n    Does bmi of males differ significantly from that of females?\n    \n\n\n\n\n    male      676\n    female    662\n    Name: sex, dtype: int64\n\n\n\n\n```python\n# T-test to check dependency of bmi on gender\nHo = 
\"Gender has no effect on bmi\"   # Stating the Null Hypothesis\nHa = \"Gender has an effect on bmi\"   # Stating the Alternate Hypothesis\n\nx = np.array(insurance_df[insurance_df.sex == 'male'].bmi)  # Selecting bmi values corresponding to males as an array\ny = np.array(insurance_df[insurance_df.sex == 'female'].bmi) # Selecting bmi values corresponding to females as an array\n\nt, p_value  = stats.ttest_ind(x,y, axis = 0)  #Performing an Independent t-test\n\nif p_value \u003c 0.05:  # Setting our significance level at 5%\n    print(f'{Ha} as the p_value ({p_value.round(3)}) \u003c 0.05')\nelse:\n    print(f'{Ho} as the p_value ({p_value.round(3)}) \u003e 0.05')\n```\n\n    Gender has no effect on bmi as the p_value (0.09) \u003e 0.05\n    \n\nThe mean bmi of males and females does not differ significantly at the 5% level\n\n\n\n```python\n#Is the proportion of smokers significantly different in different genders?\n\n\n# Chi_square test to check if smoking habits are different for different genders\nHo = \"Gender has no effect on smoking habits\"   # Stating the Null Hypothesis\nHa = \"Gender has an effect on smoking habits\"   # Stating the Alternate Hypothesis\n\ncrosstab = pd.crosstab(insurance_df['sex'],insurance_df['smoker'])  # Contingency table of sex and smoker attributes\n\nchi, p_value, dof, expected =  stats.chi2_contingency(crosstab)\n\nif p_value \u003c 0.05:  # Setting our significance level at 5%\n    print(f'{Ha} as the p_value ({p_value.round(3)}) \u003c 0.05')\nelse:\n    print(f'{Ho} as the p_value ({p_value.round(3)}) \u003e 0.05')\ncrosstab\n```\n\n    Gender has an effect on smoking habits as the p_value (0.007) \u003c 0.05\n    \n\n\n\nProportion of smokers in males is significantly different from that of the females\n\n```python\n# Chi_square test to check if smoking habits are different for people of different regions\nHo = \"Region has no effect on smoking habits\"   # Stating the Null Hypothesis\nHa = \"Region has an effect on smoking habits\"   # Stating the Alternate 
Hypothesis\n\ncrosstab = pd.crosstab(insurance_df['smoker'], insurance_df['region'])  # Contingency table of smoker and region attributes\n\nchi, p_value, dof, expected =  stats.chi2_contingency(crosstab)\n\nif p_value \u003c 0.05:  # Setting our significance level at 5%\n    print(f'{Ha} as the p_value ({p_value.round(3)}) \u003c 0.05')\nelse:\n    print(f'{Ho} as the p_value ({p_value.round(3)}) \u003e 0.05')\ncrosstab\n```\n\n    Region has no effect on smoking habits as the p_value (0.062) \u003e 0.05\n    \n\n* Smoking habits do not differ significantly across regions\n\n\n\n```python\n# Is the distribution of bmi across women with no children, one child and two children, the same ?\n# One-way ANOVA (F-test) to see if the mean bmi values for females having different numbers of children are significantly different\n\nHo = \"No. of children has no effect on bmi\"   # Stating the Null Hypothesis\nHa = \"No. of children has an effect on bmi\"   # Stating the Alternate Hypothesis\n\n\nfemale_df = copy.deepcopy(insurance_df[insurance_df['sex'] == 'female'])\n\nzero = female_df[female_df.children == 0]['bmi']\none = female_df[female_df.children == 1]['bmi']\ntwo = female_df[female_df.children == 2]['bmi']\n\n\nf_stat, p_value = stats.f_oneway(zero,one,two)\n\n\nif p_value \u003c 0.05:  # Setting our significance level at 5%\n    print(f'{Ha} as the p_value ({p_value.round(3)}) \u003c 0.05')\nelse:\n    print(f'{Ho} as the p_value ({p_value.round(3)}) \u003e 0.05')\n```\n\n    No. 
of children has no effect on bmi as the p_value (0.716) \u003e 0.05\n    \n\n\n### Get in Touch With Me\nConnect- [LinkedIn](https://linkedin.com/in/rakibhhridoy) \u003cbr\u003e\nWebsite- [RakibHHridoy](https://rakibhhridoy.github.io)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frakibhhridoy%2Fexploratorydataanalysis-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frakibhhridoy%2Fexploratorydataanalysis-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frakibhhridoy%2Fexploratorydataanalysis-python/lists"}