{"id":15995758,"url":"https://github.com/jm199504/data-analysis-practice","last_synced_at":"2026-05-02T04:37:30.802Z","repository":{"id":247098320,"uuid":"825008957","full_name":"jm199504/Data-Analysis-Practice","owner":"jm199504","description":"数据分析练习（Titanic / BankCustomers）","archived":false,"fork":false,"pushed_at":"2024-07-07T02:32:22.000Z","size":1272,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-10T09:12:19.993Z","etag":null,"topics":["data-analysis","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jm199504.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-06T14:05:27.000Z","updated_at":"2024-07-07T02:32:25.000Z","dependencies_parsed_at":"2024-10-08T07:21:01.108Z","dependency_job_id":"0e75c545-7d20-40b8-a204-a01a911a201a","html_url":"https://github.com/jm199504/Data-Analysis-Practice","commit_stats":null,"previous_names":["jm199504/data-analysis-practice"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jm199504","download_url":"https://codeload.github.com/jm199504/Data-Analysis-Practice/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247271489,"owners_count":20911586,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","python"],"created_at":"2024-10-08T07:20:52.104Z","updated_at":"2026-05-02T04:37:30.778Z","avatar_url":"https://github.com/jm199504.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Data Analysis Practice\n\n![author](https://img.shields.io/static/v1?label=Author\u0026message=junmingguo\u0026color=green)\n![language](https://img.shields.io/static/v1?label=Language\u0026message=python3\u0026color=orange) ![topics](https://img.shields.io/static/v1?label=Topics\u0026message=data-analysis\u0026color=blue)\n\n\n\n### 1. 
Titanic\n\n#### 1.1 Sample 80% of the data as the training set\n\n- Read the full dataset\n- Use the `sample` method to draw the training set, setting the random-state parameter\n- Use `to_csv` to save the training set to a new CSV file\n\n```python\nimport pandas as pd\n\n# Read Titanic.csv\ndf = pd.read_csv('Titanic.csv')\n\n# Randomly sample 80% of the rows\ntrain = df.sample(frac=0.8, random_state=123)\n\n# Save the sampled rows to train.csv\ntrain.to_csv('train.csv', index=False)\n```\n\n#### 1.2 View the first 5 and last 5 rows of the training data\n\n```python\n# View the first five rows\ntrain.head()\n\n# View the last five rows\ntrain.tail()\n```\n\n#### 1.3 Print the number of missing values per column\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count missing values per column\nmissing_values = df.isnull().sum()\n\n# Print the counts; Age, Cabin, and Embarked contain missing values\nprint(missing_values)\n```\n\nOutput:\n\n```\nPassengerId 0\nSurvived 0\nPclass 0\nName 0\nSex 0\nAge 139\nSibSp 0\nParch 0\nTicket 0\nFare 0\nCabin 555\nEmbarked 1\n```\n\n#### 1.4 Fill missing values\n\n- Use the `fillna` method to fill missing values; its first argument is the replacement value, typically the mean, a fixed value, or the mode\n- `df['Embarked'].mode()[0]` is the mode of the `Embarked` column (its most frequent value)\n\n```python\n# Fill the columns found above to contain missing values.\n# Assigning the result back avoids chained in-place modification,\n# which newer pandas versions warn about.\ndf['Age'] = df['Age'].fillna(df['Age'].mean())\ndf['Cabin'] = df['Cabin'].fillna('Unknown')\ndf['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])\n```\n\n#### 1.5 Detect duplicate rows\n\n```python\n# Detect duplicate rows\nduplicate_rows = df.duplicated()\nduplicate_rows_count = duplicate_rows.sum()\nprint(\"Duplicate rows:\", duplicate_rows_count)\n```\n\n#### 1.6 Remove duplicates\n\n```python\ndf.drop_duplicates(inplace=True)\n```\n\n#### 1.7 Basic descriptive statistics (count, mean, standard deviation, minimum, maximum, etc.)\n```python\nstatistics = df.describe()\nprint(statistics)\n```\n\n| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |\n|-----------------:|------------:|-----------:|-----------:|---------:|-----------:|-----------:|\n| 713.000000 | 713.000000 | 713.000000 | 713.000000 | 0.507714 | 0.360449 | 31.026296 |\n| 451.237027 | 0.366059 | 2.312763 | 29.422613 | 1.086309 | 0.781065 | 47.260244 |\n| 257.904310 | 0.482064 | 0.834015 | 12.728972 | 0.000000 | 0.000000 | 0.000000 |\n| 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |\n| 228.000000 | 0.000000 | 2.000000 | 
22.000000 | 0.000000 | 0.000000 | 7.895800 |\n| 455.000000 | 0.000000 | 3.000000 | 29.422613 | 0.000000 | 0.000000 | 13.862500 |\n| 677.000000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 30.500000 |\n| 891.000000 | 1.000000 | 3.000000 | 74.000000 | 8.000000 | 6.000000 | 512.329200 |\n\n#### 1.8 Analysis 1: Before the disaster, first class had XX passengers, second class XX, and third class XX, accounting for XX%, XX%, and XX% of all passengers\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count passengers in each class before the disaster\nfirst_class_count = df[df['Pclass'] == 1]['PassengerId'].count()\nsecond_class_count = df[df['Pclass'] == 2]['PassengerId'].count()\nthird_class_count = df[df['Pclass'] == 3]['PassengerId'].count()\n\n# Compute each class's share of all passengers, rounded to 2 decimal places\ntotal_passengers = df['PassengerId'].count()\nfirst_class_percent = round((first_class_count / total_passengers) * 100, 2)\nsecond_class_percent = round((second_class_count / total_passengers) * 100, 2)\nthird_class_percent = round((third_class_count / total_passengers) * 100, 2)\n\n# Print the results\nprint(f\"First-class passengers: {first_class_count}\")\nprint(f\"Second-class passengers: {second_class_count}\")\nprint(f\"Third-class passengers: {third_class_count}\")\nprint(f\"First-class share: {first_class_percent}%\")\nprint(f\"Second-class share: {second_class_percent}%\")\nprint(f\"Third-class share: {third_class_percent}%\")\n```\n\nOutput:\n\n```\nFirst-class passengers: 171\nSecond-class passengers: 148\nThird-class passengers: 394\nFirst-class share: 23.98%\nSecond-class share: 20.76%\nThird-class share: 55.26%\n```\n\nAnswer to Analysis 1:\n\n```\n# [Conclusion of Analysis 1]\n# First-class passengers: 171\n# Second-class passengers: 148\n# Third-class passengers: 394\n# First-class share: 23.98%\n# Second-class share: 20.76%\n# Third-class share: 55.26%\n```\n\n#### 1.9 Analysis 2: After the disaster, XX, XX, and XX passengers remained in first, second, and third class, accounting for XX%, XX%, and XX% of all survivors\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count survivors in each class after the disaster\nfirst_class_survived = df[(df['Pclass'] == 1) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nsecond_class_survived = df[(df['Pclass'] == 2) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nthird_class_survived = df[(df['Pclass'] == 3) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\n\n# Compute each class's share of all survivors, rounded to 2 decimal places\ntotal_passengers_survived = 
df[df['Survived'] == 1]['PassengerId'].count()\nfirst_class_percent_survived = round((first_class_survived / total_passengers_survived) * 100, 2)\nsecond_class_percent_survived = round((second_class_survived / total_passengers_survived) * 100, 2)\nthird_class_percent_survived = round((third_class_survived / total_passengers_survived) * 100, 2)\n\n# Print the results\nprint(f\"First-class survivors after the disaster: {first_class_survived}\")\nprint(f\"Second-class survivors after the disaster: {second_class_survived}\")\nprint(f\"Third-class survivors after the disaster: {third_class_survived}\")\nprint(f\"First-class share of survivors: {first_class_percent_survived}%\")\nprint(f\"Second-class share of survivors: {second_class_percent_survived}%\")\nprint(f\"Third-class share of survivors: {third_class_percent_survived}%\")\n```\n\nOutput:\n\n```\nFirst-class survivors after the disaster: 106\nSecond-class survivors after the disaster: 65\nThird-class survivors after the disaster: 90\nFirst-class share of survivors: 40.61%\nSecond-class share of survivors: 24.9%\nThird-class share of survivors: 34.48%\n```\n\nAnswer to Analysis 2:\n\n```\n# [Conclusion of Analysis 2]\n# First-class survivors after the disaster: 106\n# Second-class survivors after the disaster: 65\n# Third-class survivors after the disaster: 90\n# First-class share of survivors: 40.61%\n# Second-class share of survivors: 24.9%\n# Third-class share of survivors: 34.48%\n```\n\n#### 1.10 Analysis 3: The survival rate is XX% in first class, XX% in second class, and XX% in third class.\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count passengers in each class\nfirst_class_total = df[df['Pclass'] == 1]['PassengerId'].count()\nsecond_class_total = df[df['Pclass'] == 2]['PassengerId'].count()\nthird_class_total = df[df['Pclass'] == 3]['PassengerId'].count()\n\n# Count survivors in each class\nfirst_class_survived = df[(df['Pclass'] == 1) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nsecond_class_survived = df[(df['Pclass'] == 2) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nthird_class_survived = df[(df['Pclass'] == 3) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\n\n# Compute the survival rate per class, rounded to two decimal places\nfirst_class_percent_survived = round((first_class_survived / first_class_total) * 100, 2)\nsecond_class_percent_survived = round((second_class_survived / second_class_total) * 100, 2)\nthird_class_percent_survived = round((third_class_survived / third_class_total) * 100, 2)\n\n# 
Print the results\nprint(f\"First-class survival rate: {first_class_percent_survived}%\")\nprint(f\"Second-class survival rate: {second_class_percent_survived}%\")\nprint(f\"Third-class survival rate: {third_class_percent_survived}%\")\n```\n\nOutput:\n\n```\nFirst-class survival rate: 61.99%\nSecond-class survival rate: 43.92%\nThird-class survival rate: 22.84%\n```\n\nAnswer to Analysis 3:\n\n```\n# [Conclusion of Analysis 3]\n# First-class survival rate: 61.99%\n# Second-class survival rate: 43.92%\n# Third-class survival rate: 22.84%\n# The higher the cabin class, the higher the survival rate.\n```\n\n#### 1.11 Analysis 3 visualization: Bar chart of survival rate by passenger class\n\n```python\nimport matplotlib.pyplot as plt\n\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count passengers in each class\nclass_counts = df['Pclass'].value_counts(sort=False)\n\n# Count survivors in each class\nsurvived_counts = df[df['Survived'] == 1]['Pclass'].value_counts(sort=False)\n\n# Compute the survival rate per class, rounded to two decimal places\nsurvival_rates = round((survived_counts / class_counts) * 100, 2)\n\n# Create the bar chart\nplt.bar(survival_rates.index, survival_rates.values)\n\n# Set the chart title and axis labels\nplt.title('Survival Rates by Passenger Class')\nplt.xlabel('Passenger Class')\nplt.ylabel('Survival Rate (%)')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic1_survival_rates_by_passenger_class.png?raw=true)\n\n\n\n#### 1.12 Analysis 4: Relationship between passenger sex and survival rate\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count survivors and all passengers by sex\nsurvived_counts = df[df['Survived'] == 1]['Sex'].value_counts()\ntotal_counts = df['Sex'].value_counts()\n\n# Compute the survival rate by sex\nsurvival_rates = round((survived_counts / total_counts) * 100, 2)\n\nsurvival_rates\n```\n\nOutput:\n\n```\nSex\nfemale 72.13\nmale 18.12\nName: count, dtype: float64\n```\n\nAnswer to Analysis 4:\n\n```\n# [Conclusion of Analysis 4]\n# The male survival rate is 18.12%\n# The female survival rate is 72.13%\n# Female passengers appear more likely to have survived.\n```\n\n#### 1.13 Analysis 4 visualization: Bar chart of survival rate by sex\n\n```python\n# Create the bar chart\nplt.bar(survival_rates.index, survival_rates.values)\n\n# Set the chart title and axis labels\nplt.title('Survival Rates by Gender')\nplt.xlabel('Gender')\nplt.ylabel('Survival Rate (%)')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic2_survival_rates_by_gender.png?raw=true)\n\n#### 1.14 
Analysis 5: Relationship between age and survival rate\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Drop rows with a missing age\ndf.dropna(subset=['Age'], inplace=True)\n\n# Bin ages into age groups (left-closed intervals, e.g. [0, 12))\nbins = [0, 12, 18, 65, 100]\nlabels = ['Children', 'Teenager', 'Adult', 'Elderly']\ndf['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)\n\n# Count survivors and all passengers per age group\nsurvived_counts = df[df['Survived'] == 1]['AgeGroup'].value_counts()\ntotal_counts = df['AgeGroup'].value_counts()\n\n# Compute the survival rate per age group\nsurvival_rates = round((survived_counts / total_counts) * 100, 2)\n\nsurvival_rates\n```\n\nOutput:\n\n```\nAgeGroup\nAdult 36.27\nChildren 55.56\nTeenager 47.22\nElderly 0.00\nName: count, dtype: float64\n```\n\nAnswer to Analysis 5:\n\n```\n# [Conclusion of Analysis 5] Survival rates by age group:\n# Children (0-12): 55.56%\n# Teenagers (12-18): 47.22%\n# Adults (18-65): 36.27%\n# Elderly (65-100): 0.00%\n```\n\n#### 1.15 Analysis 5 visualization: Bar chart of survival rate by age group\n\n```python\n# Create the bar chart\nplt.bar(survival_rates.index, survival_rates.values)\n\n# Set the chart title and axis labels\nplt.title('Survival Rates by Age Group')\nplt.xlabel('Age Group')\nplt.ylabel('Survival Rate (%)')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic3_survival_rates_by_age_group.png?raw=true)\n\n#### 1.16 Analysis 6: Survival by port of embarkation\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Fill missing Embarked values with the mode\ndf['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])\n\n# Count passengers and survivors per port of embarkation\nembarked_count = df.groupby('Embarked')['PassengerId'].count()\nsurvived_count = df.groupby('Embarked')['Survived'].sum()\n\n# Compute the survival rate per port of embarkation\nsurvival_rate = survived_count / embarked_count\n\nsurvival_rate\n```\n\nOutput:\n\n```\nEmbarked\nC 0.561538\nQ 0.333333\nS 0.321293\ndtype: float64\n```\n\n#### 1.17 Analysis 6 visualization: Bar chart of survival by port of embarkation\n\n```python\n# Plot the results\nplt.bar(['C', 'Q', 'S'], survival_rate, color=['#2a9df4', '#f44336', '#ffc107'])\nplt.xlabel('Embarked')\nplt.ylabel('Survival rate')\nplt.title('Survival Rate of Different Embarked Ports')\nplt.ylim(0.0, 
1.0)\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic4_survival_rate_of_different_embarked_ports.png?raw=true)\n\n#### 1.18 Analysis 7: Survival of male and female passengers who embarked at port C\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Keep only passengers who embarked at port C\nembarked_c_df = df[df['Embarked'] == 'C']\n\n# Select the male and female survivors who embarked at port C\nmale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'male') \u0026 (embarked_c_df['Survived'] == 1)]\nfemale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'female') \u0026 (embarked_c_df['Survived'] == 1)]\n\n# Print the results\nprint(\"Male survivors who embarked at C:\", len(male_survived))\nprint(\"Female survivors who embarked at C:\", len(female_survived))\n```\n\nOutput:\n\n```\nMale survivors who embarked at C: 22\nFemale survivors who embarked at C: 51\n```\n\n#### 1.19 Analysis 7 visualization: Bar chart of male and female survivors who embarked at port C\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Keep only passengers who embarked at port C\nembarked_c_df = df[df['Embarked'] == 'C']\n\n# Select the male and female survivors who embarked at port C\nmale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'male') \u0026 (embarked_c_df['Survived'] == 1)]\nfemale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'female') \u0026 (embarked_c_df['Survived'] == 1)]\n\n# Plot the results\nlabels = ['Male', 'Female']\nsurvived_counts = [len(male_survived), len(female_survived)]\n\nplt.bar(labels, survived_counts, color=['#2196f3', '#f44336'])\nplt.xlabel('Gender')\nplt.ylabel('Survived Count')\nplt.title('Survival Count of Male and Female Passengers Embarked at C')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic5_survival_count_of_male_and_female_passengers_embarked_at_C.png?raw=true)\n\n### 2. Bank Customer\n\n| id | age | job | marital | education | default | housing | loan | contact | month | ... 
| campaign | pdays | previous | poutcome | emp_var_rate | cons_price_index | cons_conf_index | lending_rate3m | nr_employed | subscribe |\n|-----:|------:|:-------------|:----------|:-------------------|:----------|:----------|:--------|:-----------|:--------|-------:|-----------:|--------:|-----------:|:-----------|---------------:|-------------------:|------------------:|------------------:|---------------:|:------------|\n| 1 | 51 | admin. | divorced | professional.course | no | yes | yes | cellular | aug | ... | 1 | 112 | 2 | failure | 1.4 | 90.81 | -35.53 | 0.69 | 5219.74 | no |\n| 2 | 50 | services | married | high.school | unknown | yes | no | cellular | may | ... | 1 | 412 | 2 | nonexistent| -1.8 | 96.33 | -40.58 | 4.05 | 4974.79 | yes |\n| 3 | 48 | blue-collar | divorced | basic.9y | no | no | no | cellular | apr | ... | 0 | 1027 | 1 | failure | -1.8 | 96.33 | -44.74 | 1.5 | 5022.61 | no |\n| 4 | 26 | entrepreneur | single | high.school | yes | yes | yes | cellular | aug | ... | 26 | 998 | 0 | nonexistent| 1.4 | 97.08 | -35.55 | 5.11 | 5222.87 | yes |\n| 5 | 45 | admin. | single | university.degree | no | no | no | cellular | nov | ... 
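| 1 | 240 | 4 | success | -3.4 | 89.82 | -33.83 | 1.17 | 4884.7 | no |\n\nField descriptions:\n\n- id: A unique identifier for each customer. This can be a customer number or another unique code used to tell customers apart.\n- age: The customer's age in years.\n- job: The customer's job or occupation. It is an important indicator of income level, financial stability, and risk profile.\n- marital: The customer's marital status, such as married, single, divorced, or widowed. Marital status may relate to the customer's financial decisions and risk profile.\n- education: The customer's education level, which usually correlates with income level and risk profile.\n- default: Whether the customer has a record of default: `yes` if so, otherwise `no`.\n- housing: Whether the customer has a housing loan, which reflects the customer's financial situation.\n- loan: Whether the customer has an outstanding personal loan. This helps the bank understand the customer's liabilities.\n- contact: How the customer was contacted, such as mobile phone, telephone, or email.\n- month: The month in which the data was collected or the related activity took place.\n- campaign: The number of marketing contacts the customer received.\n- pdays: The number of days since the customer was last contacted in a previous marketing campaign.\n- previous: The number of contacts with the customer during the past month.\n- poutcome: The outcome of the previous contact, such as `success`, `failure`, or `nonexistent`.\n\n#### 2.1 View the first 5 and last 5 rows of the training data\n\n```python\n# View the first five rows\ntrain.head()\n\n# View the last five rows\ntrain.tail()\n```\n\n#### 2.2 Print the number of missing values per column\n\n```python\n# Count missing values per column\nmissing_values = df.isnull().sum()\n\n# Display the number of missing values for each column\nmissing_values\n```\n\n#### 2.3 Detect duplicate rows\n\n```python\n# Detect duplicate rows\nduplicate_rows = df.duplicated()\nduplicate_rows_count = duplicate_rows.sum()\nprint(\"Duplicate rows:\", duplicate_rows_count)\n```\n\n#### 2.4 Basic descriptive statistics\n\n```python\n# Basic descriptive statistics (count, mean, standard deviation, minimum, maximum, etc.)\nstatistics = df.describe()\nprint(statistics)\n```\n\n| | id | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_index | cons_conf_index | lending_rate3m | nr_employed |\n|-------|-----------|----------|----------|----------|-------|----------|--------------|------------------|-----------------|----------------|-------------|\n| count | 22500.000 | 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000|\n| mean | 11250.500 | 40.408 | 1146.304 | 3.365 | 773.992| 1.316 | 0.079 | 93.549 | -39.877 | 3.302 | 5137.211|\n| std | 6495.335 | 12.086 | 1432.432 | 7.224 | 326.934| 1.919 | 1.574 | 2.806 | 5.805 | 1.612 | 170.671 |\n| min | 1.000 | 16.000 | 0.000 | 0.000 | 0.000 | 0.000 | -3.400 | 87.640 | -53.280 | 0.600 | 4715.420|\n| 25% | 5625.750 | 32.000 | 143.000 | 1.000 | 557.750| 0.000 | -1.800 | 91.190 | -44.160 | 1.430 | 5008.510|\n| 50% | 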
11250.500 | 38.000 | 353.000 | 1.000 | 964.000| 0.000 | 1.100 | 93.540 | -40.600 | 3.920 | 5133.955|\n| 75% | 16875.250 | 47.000 | 1873.000 | 3.000 | 1005.000| 2.000 | 1.400 | 95.920 | -35.798 | 4.830 | 5267.678|\n| max | 22500.000 | 101.000 | 5149.000 | 57.000 | 1048.000| 6.000 | 1.400 | 99.460 | -25.550 | 5.270 | 5489.500|\n\n#### 2.5 Bar chart: Grouping by education level and marital status\n\n```python\nimport matplotlib.pyplot as plt\n\n# Group by education level and marital status and count the rows in each group\ngrouped_data = df.groupby(['education', 'marital']).size().unstack()\n\n# Draw a stacked bar chart\ngrouped_data.plot(kind='bar', stacked=True)\n\n# Set the figure properties\nplt.xlabel('Education')\nplt.ylabel('Count')\nplt.title('Marital Status by Education')\nplt.xticks(rotation=45)\n\n# Show the figure\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic1_martial_status_by_education.png?raw=true)\n\nGrouped data (`grouped_data`):\n\n| education | marital.divorced | marital.married | marital.single | marital.unknown |\n| ------------------- | ---------------- | --------------- | -------------- | --------------- |\n| basic.4y | 304 | 1724 | 255 | 39 |\n| basic.6y | 145 | 971 | 196 | 37 |\n| basic.9y | 337 | 2185 | 706 | 38 |\n| high.school | 641 | 2641 | 1705 | 44 |\n| illiterate | 29 | 42 | 45 | 45 |\n| professional.course | 375 | 1669 | 772 | 37 |\n| university.degree | 714 | 3379 | 2385 | 46 |\n| unknown | 113 | 567 | 280 | 34 |\n\n#### 2.6 Analysis 1: Marital-status proportions among customers with a high-school education\n\n```python\n# Keep only customers with a high-school education\nhigh_school_data = df[df['education'] == 'high.school']\n\n# Count customers per marital status within the high-school group\ngrouped_data = high_school_data.groupby('marital').size().reset_index(name='count')\n\n# Compute the proportions\ntotal_count = grouped_data['count'].sum()\ngrouped_data['ratio'] = grouped_data['count'] / total_count\n\n# Print the result\nprint(grouped_data[['marital', 'ratio']])\n```\n\nOutput:\n\n```\n marital ratio\n0 divorced 0.127410\n1 married 0.524945\n2 single 0.338899\n3 unknown 0.008746\n```\n\n#### 2.7 Analysis 1 visualization: Pie chart of marital-status proportions among customers with a high-school education\n```python\nplt.figure(figsize=(6, 6))\nlabels = 
grouped_data['marital']\nsizes = grouped_data['count']\nplt.pie(sizes, labels=labels, autopct='%1.1f%%')\nplt.title('Marital Status of High School Graduates')\nplt.show()\n\n# [Conclusion of Analysis 1] 52.5% of customers with a high-school education are married\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic2_marital_status_of_high_school_graduates.png?raw=true)\n\n#### Analysis 2: Distribution of customers by job\n\n```python\n# Count customers per job\njob_counts = df['job'].value_counts()\n\n# Sort by count in ascending order\njob_counts = job_counts.sort_values()\n\njob_counts\n```\n\nOutput:\n\n```\njob\nunknown 274\nstudent 573\nunemployed 647\nhousemaid 657\nself-employed 836\nentrepreneur 863\nretired 1006\nmanagement 1600\nservices 2083\ntechnician 3530\nblue-collar 4874\nadmin. 5557\nName: count, dtype: int64\n```\n\nShare of the most common job:\n\n```python\nprint(f'{round(max(job_counts) / df[\"job\"].count() * 100,2)}%')\n# 24.7%\n```\n\n[Conclusion of Analysis 2] Administrative staff (`admin.`) is the most common job in this dataset, with 5557 customers, or 24.7% of the total.\n\n#### Analysis 2 visualization: Chart of the number of customers per job\n\n```python\nplt.figure(figsize=(10, 6))\njob_counts.plot(kind='barh')\nplt.title('Number of People by Job')\nplt.xlabel('Count')\nplt.ylabel('Job')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic3_number_of_people_by_job.png?raw=true)\n\n#### Analysis 3: Distribution of subscribers aged 20-30\n\n```python\n# Keep only customers aged 20 to 30 (inclusive)\nage_filter = (df['age'] \u003e= 20) \u0026 (df['age'] \u003c= 30)\nsubset_df = df[age_filter]\n\n# Compute each age's share of the customers in this range\n# (no filter on the subscribe column is applied here)\nage_counts = subset_df['age'].value_counts()\nage_proportions = (age_counts / age_counts.sum()) * 100\n\nage_proportions\n```\n\nOutput:\n\n```\nage\n30 21.876399\n29 18.696820\n28 13.949843\n27 11.867443\n26 9.740260\n25 7.120466\n24 6.829378\n23 4.343932\n22 2.642185\n21 1.724138\n20 1.209136\nName: count, dtype: float64\n```\n\n#### Analysis 3 visualization: Pie chart of the distribution of subscribers aged 20-30\n\n```python\nplt.figure(figsize=(8, 6))\nplt.pie(age_proportions, labels=age_proportions.index, autopct='%1.1f%%')\nplt.title('Proportion of Subscribers Aged 
20-30')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic4_proportion_of_subscribers_aged_20_30.png?raw=true)\n\n#### Analysis 3 visualization: Bar chart of the distribution of subscribers aged 20-30\n\n```python\nplt.figure(figsize=(8, 6))\nplt.bar(age_counts.index, age_counts.values)\nplt.xlabel('Age')\nplt.ylabel('Number of Subscribers')\nplt.title('Distribution of Subscribers Aged 20-30')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic5_distribution_of_subscribers_aged_20_30.png?raw=true)\n\n#### Analysis 4: Count customers with a housing loan, a personal loan, and both, and compute each share of all customers\n\n```python\n# Count customers with a housing loan, a personal loan, and both\nhousing_count = len(df[(df['housing'] == 'yes')])\nloan_count = len(df[(df['loan'] == 'yes')])\nboth_loans_count = len(df[(df['housing'] == 'yes') \u0026 (df['loan'] == 'yes')])\n\n# Compute each share of all customers\ntotal_count = len(df)\nhousing_loans_ratio = round(housing_count / total_count * 100, 2)\nloans_ratio = round(loan_count / total_count * 100, 2)\nboth_loans_ratio = round(both_loans_count / total_count * 100, 2)\n\n# Print the results\nprint(f\"Customers with a housing loan: {housing_count}\")\nprint(f\"Customers with a personal loan: {loan_count}\")\nprint(f\"Customers with both a housing loan and a personal loan: {both_loans_count}\")\nprint(f\"Total customers: {total_count}\")\nprint(f\"Share of customers with a housing loan: {housing_loans_ratio}%\")\nprint(f\"Share of customers with a personal loan: {loans_ratio}%\")\nprint(f\"Share of customers with both a housing loan and a personal loan: {both_loans_ratio}%\")\n```\n\nOutput:\n\n```\nCustomers with a housing loan: 11568\nCustomers with a personal loan: 3657\nCustomers with both a housing loan and a personal loan: 2055\nTotal customers: 22500\nShare of customers with a housing loan: 51.41%\nShare of customers with a personal loan: 16.25%\nShare of customers with both a housing loan and a personal loan: 9.13%\n```\n\n#### Analysis 4 visualization: Bar chart of the number of customers with a housing loan, a personal loan, and both\n\n```python\n# Build the bar chart data\nlabels = ['Housing', 'Loan', 'Both']\ncounts = [housing_count, loan_count, both_loans_count]\n\n# Set the bar chart parameters\nx = range(len(labels))\nwidth = 0.5\n\n# Draw the bar chart\nplt.bar(x, counts, width, align='center')\nplt.xticks(x, labels)\nplt.xlabel('Loan Type')\nplt.ylabel('Count')\nplt.title('Count of Individuals with Housing Loan and Personal Loan')\n\n# 
Add data labels\nfor i, count in enumerate(counts):\n    plt.text(x[i], count, str(count), ha='center', va='bottom')\n\n# Show the figure\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic6_count_of_individuals_with_housing_loan_and_personal_loan.png?raw=true)\n\n[Conclusion of Analysis 4]\n\n```\n# Customers with a housing loan: 11568\n# Customers with a personal loan: 3657\n# Customers with both a housing loan and a personal loan: 2055\n# Total customers: 22500\n# Share of customers with a housing loan: 51.41%\n# Share of customers with a personal loan: 16.25%\n# Share of customers with both a housing loan and a personal loan: 9.13%\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjm199504%2Fdata-analysis-practice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjm199504%2Fdata-analysis-practice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjm199504%2Fdata-analysis-practice/lists"}