{"id":15995758,"url":"https://github.com/jm199504/data-analysis-practice","last_synced_at":"2026-05-02T04:37:30.802Z","repository":{"id":247098320,"uuid":"825008957","full_name":"jm199504/Data-Analysis-Practice","owner":"jm199504","description":"数据分析练习（Titanic / BankCustomers）","archived":false,"fork":false,"pushed_at":"2024-07-07T02:32:22.000Z","size":1272,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-10T09:12:19.993Z","etag":null,"topics":["data-analysis","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jm199504.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-06T14:05:27.000Z","updated_at":"2024-07-07T02:32:25.000Z","dependencies_parsed_at":"2024-10-08T07:21:01.108Z","dependency_job_id":"0e75c545-7d20-40b8-a204-a01a911a201a","html_url":"https://github.com/jm199504/Data-Analysis-Practice","commit_stats":null,"previous_names":["jm199504/data-analysis-practice"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jm199504%2FData-Analysis-Practice/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jm199504","download_url":"https://codeload.github.com/jm199504/Data-Analysis-Practice/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247271489,"owners_count":20911586,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","python"],"created_at":"2024-10-08T07:20:52.104Z","updated_at":"2026-05-02T04:37:30.778Z","avatar_url":"https://github.com/jm199504.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Data Analysis Practice\n\n![author](https://img.shields.io/static/v1?label=Author\u0026message=junmingguo\u0026color=green)\n![language](https://img.shields.io/static/v1?label=Language\u0026message=python3\u0026color=orange) ![topics](https://img.shields.io/static/v1?label=Topics\u0026message=data-analysis\u0026color=blue)\n\n\n\n### 1. 
Titanic\n\n#### 1.1 Sample 80% of the data as the training set\n\n- Read the full dataset\n- Draw the training set with the `sample` method, setting the random state so the split is reproducible\n- Save the training set to a new CSV file with `to_csv`\n\n```python\nimport pandas as pd\n\n# Read Titanic.csv\ndf = pd.read_csv('Titanic.csv')\n\n# Randomly sample 80% of the rows\ntrain = df.sample(frac=0.8, random_state=123)\n\n# Save the sampled rows to train.csv\ntrain.to_csv('train.csv', index=False)\n```\n\n#### 1.2 View the first 5 and last 5 rows of the training data\n\n```python\n# View the first five rows\ntrain.head()\n\n# View the last five rows\ntrain.tail()\n```\n\n#### 1.3 Print the number of missing values per field\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count the missing values in each column\nmissing_values = df.isnull().sum()\n\n# Print the counts; Age, Cabin and Embarked have missing values\nprint(missing_values)\n```\n\nOutput:\n\n```\nPassengerId      0\nSurvived         0\nPclass           0\nName             0\nSex              0\nAge            139\nSibSp            0\nParch            0\nTicket           0\nFare             0\nCabin          555\nEmbarked         1\n```\n\n#### 1.4 Fill missing values\n\n- Fill missing values with the `fillna` method; its first argument is the replacement value, typically the mean, a fixed value or the mode\n- `df['Embarked'].mode()[0]` is the mode of the `Embarked` column (its most frequent value)\n\n```python\n# Fill the fields found above to have missing values.\n# The result is assigned back to the column because recent pandas versions\n# warn about calling fillna(inplace=True) on a selected column.\ndf['Age'] = df['Age'].fillna(df['Age'].mean())\ndf['Cabin'] = df['Cabin'].fillna('Unknown')\ndf['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])\n```\n\n#### 1.5 Detect duplicates\n\n```python\n# Detect duplicate rows\nduplicate_rows = df.duplicated()\nduplicate_rows_count = duplicate_rows.sum()\nprint(\"Number of duplicate rows:\", duplicate_rows_count)\n```\n\n#### 1.6 Remove duplicates\n\n```python\ndf.drop_duplicates(inplace=True)\n```\n\n#### 1.7 Basic descriptive statistics (count, mean, standard deviation, min, max, etc.)\n\n```python\nstatistics = df.describe()\nprint(statistics)\n```\n\n|       |      PassengerId |    Survived |     Pclass |        Age |      SibSp |      Parch |       Fare |\n|:------|-----------------:|------------:|-----------:|-----------:|-----------:|-----------:|-----------:|\n| count |       713.000000 |  713.000000 | 713.000000 | 713.000000 | 713.000000 | 713.000000 | 713.000000 |\n| mean  |       451.237027 |    0.366059 |   2.312763 |  29.422613 |   0.507714 |   0.360449 |  31.026296 |\n| std   |       257.904310 |    
0.482064 |   0.834015 |  12.728972 |   1.086309 |   0.781065 |  47.260244 |\n| min   |         1.000000 |    0.000000 |   1.000000 |   0.420000 |   0.000000 |   0.000000 |   0.000000 |\n| 25%   |       228.000000 |    0.000000 |   2.000000 |  22.000000 |   0.000000 |   0.000000 |   7.895800 |\n| 50%   |       455.000000 |    0.000000 |   3.000000 |  29.422613 |   0.000000 |   0.000000 |  13.862500 |\n| 75%   |       677.000000 |    1.000000 |   3.000000 |  35.000000 |   1.000000 |   0.000000 |  30.500000 |\n| max   |       891.000000 |    1.000000 |   3.000000 |  74.000000 |   8.000000 |   6.000000 | 512.329200 |\n\n#### 1.8 [Analysis 1] Before the disaster, first class had XX passengers, second class XX and third class XX, accounting for XX%, XX% and XX% of all passengers\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count the passengers in each class before the disaster\nfirst_class_count = df[df['Pclass'] == 1]['PassengerId'].count()\nsecond_class_count = df[df['Pclass'] == 2]['PassengerId'].count()\nthird_class_count = df[df['Pclass'] == 3]['PassengerId'].count()\n\n# Compute each class's share of all passengers, rounded to 2 decimal places\ntotal_passengers = df['PassengerId'].count()\nfirst_class_percent = round((first_class_count / total_passengers) * 100, 2)\nsecond_class_percent = round((second_class_count / total_passengers) * 100, 2)\nthird_class_percent = round((third_class_count / total_passengers) * 100, 2)\n\n# Print the results\nprint(f\"First-class passengers: {first_class_count}\")\nprint(f\"Second-class passengers: {second_class_count}\")\nprint(f\"Third-class passengers: {third_class_count}\")\nprint(f\"First-class share: {first_class_percent}%\")\nprint(f\"Second-class share: {second_class_percent}%\")\nprint(f\"Third-class share: {third_class_percent}%\")\n```\n\nOutput:\n\n```\nFirst-class passengers: 171\nSecond-class passengers: 148\nThird-class passengers: 394\nFirst-class share: 23.98%\nSecond-class share: 20.76%\nThird-class share: 55.26%\n```\n\nAnswer to Analysis 1:\n\n```\n# [Conclusion of Analysis 1]\n# First-class passengers: 171\n# Second-class passengers: 148\n# Third-class passengers: 394\n# First-class share: 23.98%\n# Second-class share: 20.76%\n# Third-class share: 55.26%\n```\n\n#### 1.9 [Analysis 2] After the disaster, XX, XX and XX passengers remained in first, second and third class, accounting for XX%, XX% and XX% of all survivors\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count the survivors in each class after the disaster\nfirst_class_survived = df[(df['Pclass'] == 1) \u0026 (df['Survived'] == 
1)]['PassengerId'].count()\nsecond_class_survived = df[(df['Pclass'] == 2) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nthird_class_survived = df[(df['Pclass'] == 3) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\n\n# Compute each class's share of all survivors, rounded to 2 decimal places\ntotal_passengers_survived = df[df['Survived'] == 1]['PassengerId'].count()\nfirst_class_percent_survived = round((first_class_survived / total_passengers_survived) * 100, 2)\nsecond_class_percent_survived = round((second_class_survived / total_passengers_survived) * 100, 2)\nthird_class_percent_survived = round((third_class_survived / total_passengers_survived) * 100, 2)\n\n# Print the results\nprint(f\"First-class survivors: {first_class_survived}\")\nprint(f\"Second-class survivors: {second_class_survived}\")\nprint(f\"Third-class survivors: {third_class_survived}\")\nprint(f\"First-class share of survivors: {first_class_percent_survived}%\")\nprint(f\"Second-class share of survivors: {second_class_percent_survived}%\")\nprint(f\"Third-class share of survivors: {third_class_percent_survived}%\")\n```\n\nOutput:\n\n```\nFirst-class survivors: 106\nSecond-class survivors: 65\nThird-class survivors: 90\nFirst-class share of survivors: 40.61%\nSecond-class share of survivors: 24.9%\nThird-class share of survivors: 34.48%\n```\n\nAnswer to Analysis 2:\n\n```\n# [Conclusion of Analysis 2]\n# First-class survivors: 106\n# Second-class survivors: 65\n# Third-class survivors: 90\n# First-class share of survivors: 40.61%\n# Second-class share of survivors: 24.9%\n# Third-class share of survivors: 34.48%\n```\n\n#### 1.10 [Analysis 3] The survival rate is XX% in first class, XX% in second class and XX% in third class\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count the passengers in each class\nfirst_class_total = df[df['Pclass'] == 1]['PassengerId'].count()\nsecond_class_total = df[df['Pclass'] == 2]['PassengerId'].count()\nthird_class_total = df[df['Pclass'] == 3]['PassengerId'].count()\n\n# Count the survivors in each class\nfirst_class_survived = df[(df['Pclass'] == 1) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nsecond_class_survived = df[(df['Pclass'] == 2) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\nthird_class_survived = df[(df['Pclass'] == 3) \u0026 (df['Survived'] == 1)]['PassengerId'].count()\n\n# 
Compute the survival rate of each class, rounded to 2 decimal places\nfirst_class_percent_survived = round((first_class_survived / first_class_total) * 100, 2)\nsecond_class_percent_survived = round((second_class_survived / second_class_total) * 100, 2)\nthird_class_percent_survived = round((third_class_survived / third_class_total) * 100, 2)\n\n# Print the results\nprint(f\"First-class survival rate: {first_class_percent_survived}%\")\nprint(f\"Second-class survival rate: {second_class_percent_survived}%\")\nprint(f\"Third-class survival rate: {third_class_percent_survived}%\")\n```\n\nOutput:\n\n```\nFirst-class survival rate: 61.99%\nSecond-class survival rate: 43.92%\nThird-class survival rate: 22.84%\n```\n\nAnswer to Analysis 3:\n\n```\n# [Conclusion of Analysis 3]\n# First-class survival rate: 61.99%\n# Second-class survival rate: 43.92%\n# Third-class survival rate: 22.84%\n# The higher the cabin class, the higher the survival rate.\n```\n\n#### 1.11 [Visualization of Analysis 3] Bar chart of the survival rate by class\n\n```python\nimport matplotlib.pyplot as plt\n\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count the passengers in each class\nclass_counts = df['Pclass'].value_counts(sort=False)\n\n# Count the survivors in each class\nsurvived_counts = df[df['Survived'] == 1]['Pclass'].value_counts(sort=False)\n\n# Compute the survival rate of each class, rounded to 2 decimal places\nsurvival_rates = round((survived_counts / class_counts) * 100, 2)\n\n# Draw the bar chart\nplt.bar(survival_rates.index, survival_rates.values)\n\n# Set the chart title and axis labels\nplt.title('Survival Rates by Passenger Class')\nplt.xlabel('Passenger Class')\nplt.ylabel('Survival Rate (%)')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic1_survival_rates_by_passenger_class.png?raw=true)\n\n\n\n#### 1.12 [Analysis 4] Relationship between passenger sex and survival rate\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Count survivors and all passengers by sex\nsurvived_counts = df[df['Survived'] == 1]['Sex'].value_counts()\ntotal_counts = df['Sex'].value_counts()\n\n# Compute the survival rate for each sex\nsurvival_rates = round((survived_counts / total_counts) * 100, 2)\n\nsurvival_rates\n```\n\nOutput:\n\n```\nSex\nfemale    72.13\nmale      18.12\nName: count, dtype: float64\n```\n\nAnswer to Analysis 4:\n\n```\n# [Conclusion of Analysis 4]:\n# Male survival rate: 18.12%\n# Female survival rate: 72.13%\n# Female passengers appear more likely to have survived.\n```\n\n#### 1.13 [Visualization of Analysis 4] Bar chart of the survival rate by sex\n\n```python\n# Draw the bar chart\nplt.bar(survival_rates.index, 
survival_rates.values)\n\n# Set the chart title and axis labels\nplt.title('Survival Rates by Gender')\nplt.xlabel('Gender')\nplt.ylabel('Survival Rate (%)')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic2_survival_rates_by_gender.png?raw=true)\n\n#### 1.14 [Analysis 5] Relationship between age and survival rate\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Drop the rows with a missing age\ndf.dropna(subset=['Age'], inplace=True)\n\n# Bin ages into groups; right=False makes each bin left-closed, e.g. [0, 12)\nbins = [0, 12, 18, 65, 100]\nlabels = ['Children', 'Teenager', 'Adult', 'Elderly']\ndf['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)\n\n# Count survivors and all passengers in each age group\nsurvived_counts = df[df['Survived'] == 1]['AgeGroup'].value_counts()\ntotal_counts = df['AgeGroup'].value_counts()\n\n# Compute the survival rate of each age group\nsurvival_rates = round((survived_counts / total_counts) * 100, 2)\n\nsurvival_rates\n```\n\nOutput:\n\n```\nAgeGroup\nAdult       36.27\nChildren    55.56\nTeenager    47.22\nElderly      0.00\nName: count, dtype: float64\n```\n\nAnswer to Analysis 5:\n\n```\n# [Conclusion of Analysis 5]: survival rates by age group are as follows:\n# Children (ages 0-12): 55.56%\n# Teenagers (ages 12-18): 47.22%\n# Adults (ages 18-65): 36.27%\n# Elderly (ages 65-100): 0.00%\n```\n\n#### 1.15 [Visualization of Analysis 5] Bar chart of the survival rate by age group\n\n```python\n# Draw the bar chart\nplt.bar(survival_rates.index, survival_rates.values)\n\n# Set the chart title and axis labels\nplt.title('Survival Rates by Age Group')\nplt.xlabel('Age Group')\nplt.ylabel('Survival Rate (%)')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic3_survival_rates_by_age_group.png?raw=true)\n\n#### 1.16 [Analysis 6] Survival by port of embarkation\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Fill the missing Embarked values with the mode (assigned back to the column)\ndf['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])\n\n# Count the passengers and the survivors for each port of embarkation\nembarked_count = df.groupby('Embarked')['PassengerId'].count()\nsurvived_count = df.groupby('Embarked')['Survived'].sum()\n\n# Compute the survival rate for each port of embarkation\nsurvival_rate = survived_count / embarked_count\n\nsurvival_rate\n```\n\nOutput:\n\n```\nEmbarked\nC    0.561538\nQ    
0.333333\nS    0.321293\ndtype: float64\n```\n\n#### 1.17 [Visualization of Analysis 6] Bar chart of the survival rate by port of embarkation\n\n```python\n# Plot the results\nplt.bar(['C', 'Q', 'S'], survival_rate, color=['#2a9df4', '#f44336', '#ffc107'])\nplt.xlabel('Embarked')\nplt.ylabel('Survival rate')\nplt.title('Survival Rate of Different Embarked Ports')\nplt.ylim(0.0, 1.0)\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic4_survival_rate_of_different_embarked_ports.png?raw=true)\n\n#### 1.18 [Analysis 7] Survival of male and female passengers who embarked at port C\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Select the passengers who embarked at port C\nembarked_c_df = df[df['Embarked'] == 'C']\n\n# Select the male and female survivors among them\nmale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'male') \u0026 (embarked_c_df['Survived'] == 1)]\nfemale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'female') \u0026 (embarked_c_df['Survived'] == 1)]\n\n# Print the results\nprint(\"Male survivors who embarked at C:\", len(male_survived))\nprint(\"Female survivors who embarked at C:\", len(female_survived))\n```\n\nOutput:\n\n```\nMale survivors who embarked at C: 22\nFemale survivors who embarked at C: 51\n```\n\n#### 1.19 [Visualization of Analysis 7] Bar chart of male and female survivors who embarked at port C\n\n```python\n# Read train.csv\ndf = pd.read_csv('train.csv')\n\n# Select the passengers who embarked at port C\nembarked_c_df = df[df['Embarked'] == 'C']\n\n# Select the male and female survivors among them\nmale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'male') \u0026 (embarked_c_df['Survived'] == 1)]\nfemale_survived = embarked_c_df[(embarked_c_df['Sex'] == 'female') \u0026 (embarked_c_df['Survived'] == 1)]\n\n# Plot the results\nlabels = ['Male', 'Female']\nsurvived_counts = [len(male_survived), len(female_survived)]\n\nplt.bar(labels, survived_counts, color=['#2196f3', '#f44336'])\nplt.xlabel('Gender')\nplt.ylabel('Survived Count')\nplt.title('Survival Count of Male and Female Passengers Embarked at C')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/titanic/pic5_survival_count_of_male_and_female_passengers_embarked_at_C.png?raw=true)\n\n### 2. Bank Customer\n\n|   id |   age | job          
| marital   | education          | default   | housing   | loan   | contact   | month   |   ... |   campaign |   pdays |   previous | poutcome   |   emp_var_rate |   cons_price_index |   cons_conf_index |   lending_rate3m |   nr_employed | subscribe   |\n|-----:|------:|:-------------|:----------|:-------------------|:----------|:----------|:--------|:-----------|:--------|-------:|-----------:|--------:|-----------:|:-----------|---------------:|-------------------:|------------------:|------------------:|---------------:|:------------|\n|    1 |    51 | admin.       | divorced  | professional.course | no        | yes       | yes     | cellular   | aug     |   ... |          1 |     112 |          2 | failure    |            1.4 |              90.81 |            -35.53 |              0.69 |        5219.74 | no          |\n|    2 |    50 | services     | married   | high.school        | unknown   | yes       | no      | cellular   | may     |   ... |          1 |     412 |          2 | nonexistent|           -1.8 |              96.33 |            -40.58 |              4.05 |        4974.79 | yes         |\n|    3 |    48 | blue-collar  | divorced  | basic.9y           | no        | no        | no      | cellular   | apr     |   ... |          0 |    1027 |          1 | failure    |           -1.8 |              96.33 |            -44.74 |              1.5  |        5022.61 | no          |\n|    4 |    26 | entrepreneur | single    | high.school        | yes       | yes       | yes     | cellular   | aug     |   ... |         26 |     998 |          0 | nonexistent|            1.4 |              97.08 |            -35.55 |              5.11 |        5222.87 | yes         |\n|    5 |    45 | admin.       | single    | university.degree  | no        | no        | no      | cellular   | nov     |   ... 
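|          1 |     240 |          4 | success    |           -3.4 |              89.82 |            -33.83 |              1.17 |        4884.7  | no          |\n\nField descriptions:\n\n- id: A unique identifier for each customer. This can be a customer number or another unique code used to tell customers apart.\n- age: The customer's age in years.\n- job: The customer's job or occupation, an important indicator of income level, financial stability and risk profile.\n- marital: The customer's marital status, such as married, single, divorced or widowed. Marital status may be related to the customer's financial decisions and risk profile.\n- education: The customer's education level, which is usually correlated with income level and risk profile.\n- default: Whether the customer has a record of default: 'yes' if so, otherwise 'no'.\n- housing: Whether the customer has a housing loan (this is how the field is used in Analysis 4), which reflects the customer's financial situation.\n- loan: Whether the customer has an outstanding personal loan, which helps the bank understand the customer's liabilities.\n- contact: How the customer was contacted, e.g. mobile phone, landline or email.\n- month: The month in which the data was collected or the related activity took place.\n- campaign: The number of marketing campaigns the customer received.\n- pdays: The number of days since the customer was last contacted in the previous marketing campaign.\n- previous: The number of contacts with the customer in the past month.\n- poutcome: The outcome of the previous contact, e.g. 'success', 'failure' or 'nonexistent'.\n\n#### 2.1 View the first 5 and last 5 rows of the training data\n\n```python\n# View the first five rows\ntrain.head()\n\n# View the last five rows\ntrain.tail()\n```\n\n#### 2.2 Print the number of missing values per field\n\n```python\n# Count the missing values in each column\nmissing_values = df.isnull().sum()\n\n# Show the number of missing values per field\nmissing_values\n```\n\n#### 2.3 Detect duplicates\n\n```python\n# Detect duplicate rows\nduplicate_rows = df.duplicated()\nduplicate_rows_count = duplicate_rows.sum()\nprint(\"Number of duplicate rows:\", duplicate_rows_count)\n```\n\n#### 2.4 Basic descriptive statistics\n\n```python\n# Basic descriptive statistics (count, mean, standard deviation, min, max, etc.)\nstatistics = df.describe()\nprint(statistics)\n```\n\n|       | id        | age      | duration | campaign | pdays | previous | emp_var_rate | cons_price_index | cons_conf_index | lending_rate3m | nr_employed |\n|-------|-----------|----------|----------|----------|-------|----------|--------------|------------------|-----------------|----------------|-------------|\n| count | 22500.000 | 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000| 22500.000|\n| mean  | 11250.500 | 40.408   | 1146.304 | 3.365    | 773.992| 1.316    | 0.079    | 93.549    | -39.877   | 3.302  | 5137.211|\n| std   | 6495.335  | 12.086   | 1432.432 | 7.224    | 326.934| 1.919    | 1.574    | 2.806     | 5.805     | 1.612  | 170.671 |\n| min   | 1.000     | 16.000   | 0.000    | 0.000    | 0.000  | 0.000    | 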
-3.400   | 87.640    | -53.280   | 0.600  | 4715.420|\n| 25%   | 5625.750  | 32.000   | 143.000  | 1.000    | 557.750| 0.000    | -1.800   | 91.190    | -44.160   | 1.430  | 5008.510|\n| 50%   | 11250.500 | 38.000   | 353.000  | 1.000    | 964.000| 0.000    | 1.100    | 93.540    | -40.600   | 3.920  | 5133.955|\n| 75%   | 16875.250 | 47.000   | 1873.000 | 3.000    | 1005.000| 2.000   | 1.400    | 95.920    | -35.798   | 4.830  | 5267.678|\n| max   | 22500.000 | 101.000  | 5149.000 | 57.000   | 1048.000| 6.000   | 1.400    | 99.460    | -25.550   | 5.270  | 5489.500|\n\n#### 2.5 Bar chart: grouping by education level and marital status\n\n```python\nimport matplotlib.pyplot as plt\n\n# Group by education level and marital status and count the rows in each group\ngrouped_data = df.groupby(['education', 'marital']).size().unstack()\n\n# Draw a stacked bar chart\ngrouped_data.plot(kind='bar', stacked=True)\n\n# Set the axis labels, title and tick rotation\nplt.xlabel('Education')\nplt.ylabel('Count')\nplt.title('Marital Status by Education')\nplt.xticks(rotation=45)\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic1_martial_status_by_education.png?raw=true)\n\nGrouped data (`grouped_data`):\n\n| education           | marital.divorced | marital.married | marital.single | marital.unknown |\n| ------------------- | ---------------- | --------------- | -------------- | --------------- |\n| basic.4y            | 304              | 1724            | 255            | 39              |\n| basic.6y            | 145              | 971             | 196            | 37              |\n| basic.9y            | 337              | 2185            | 706            | 38              |\n| high.school         | 641              | 2641            | 1705           | 44              |\n| illiterate          | 29               | 42              | 45             | 45              |\n| professional.course | 375              | 1669            | 772            | 37              |\n| university.degree   | 714              | 3379            | 2385           | 46              |\n| 
unknown             | 113              | 567             | 280            | 34              |\n\n#### 2.6 [Analysis 1] Marital status proportions among customers with a high-school education\n\n```python\n# Select the customers with a high-school education\nhigh_school_data = df[df['education'] == 'high.school']\n\n# Count the customers in each marital status within that group\ngrouped_data = high_school_data.groupby('marital').size().reset_index(name='count')\n\n# Compute the proportions\ntotal_count = grouped_data['count'].sum()\ngrouped_data['ratio'] = grouped_data['count'] / total_count\n\n# Print the result\nprint(grouped_data[['marital', 'ratio']])\n```\n\nOutput:\n\n```\n    marital     ratio\n0  divorced  0.127410\n1   married  0.524945\n2    single  0.338899\n3   unknown  0.008746\n```\n\n#### 2.7 [Visualization of Analysis 1] Pie chart of marital status among customers with a high-school education\n\n```python\nplt.figure(figsize=(6, 6))\nlabels = grouped_data['marital']\nsizes = grouped_data['count']\nplt.pie(sizes, labels=labels, autopct='%1.1f%%')\nplt.title('Marital Status of High School Graduates')\nplt.show()\n\n# [Conclusion of Analysis 1] 52.5% of the customers with a high-school education are married\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic2_marital_status_of_high_school_graduates.png?raw=true)\n\n#### [Analysis 2] Distribution of customers across occupations\n\n```python\n# Count the customers in each occupation\njob_counts = df['job'].value_counts()\n\n# Sort by count in ascending order\njob_counts = job_counts.sort_values()\n\njob_counts\n```\n\nOutput:\n\n```\njob\nunknown           274\nstudent           573\nunemployed        647\nhousemaid         657\nself-employed     836\nentrepreneur      863\nretired          1006\nmanagement       1600\nservices         2083\ntechnician       3530\nblue-collar      4874\nadmin.           
5557\nName: count, dtype: int64\n```\n\nShare of the most common occupation:\n\n```python\nprint(f'{round(max(job_counts) / df[\"job\"].count() * 100, 2)}%')\n# 24.7%\n```\n\n[Conclusion of Analysis 2] Administrative staff (`admin.`) is the most common occupation in this dataset, with 5557 customers, or 24.7% of the total.\n\n#### [Visualization of Analysis 2] Horizontal bar chart of the number of customers per occupation\n\n```python\nplt.figure(figsize=(10, 6))\njob_counts.plot(kind='barh')\nplt.title('Number of People by Job')\nplt.xlabel('Count')\nplt.ylabel('Job')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic3_number_of_people_by_job.png?raw=true)\n\n#### [Analysis 3] Proportion distribution of product subscriptions among users aged 20-30\n\n```python\n# Select the rows with an age between 20 and 30 (inclusive)\nage_filter = (df['age'] \u003e= 20) \u0026 (df['age'] \u003c= 30)\nsubset_df = df[age_filter]\n\n# Compute each age's share of the 20-30 group, in percent\n# (no filter on the subscribe column is applied here, so all users in the range are counted)\nage_counts = subset_df['age'].value_counts()\nage_proportions = (age_counts / age_counts.sum()) * 100\n\nage_proportions\n```\n\nOutput:\n\n```\nage\n30    21.876399\n29    18.696820\n28    13.949843\n27    11.867443\n26     9.740260\n25     7.120466\n24     6.829378\n23     4.343932\n22     2.642185\n21     1.724138\n20     1.209136\nName: count, dtype: float64\n```\n\n#### [Visualization of Analysis 3] Pie chart of the proportion distribution for users aged 20-30\n\n```python\nplt.figure(figsize=(8, 6))\nplt.pie(age_proportions, labels=age_proportions.index, autopct='%1.1f%%')\nplt.title('Proportion of Subscribers Aged 20-30')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic4_proportion_of_subscribers_aged_20_30.png?raw=true)\n\n#### [Visualization of Analysis 3] Bar chart of the distribution for users aged 20-30\n\n```python\nplt.figure(figsize=(8, 6))\nplt.bar(age_counts.index, age_counts.values)\nplt.xlabel('Age')\nplt.ylabel('Number of Subscribers')\nplt.title('Distribution of Subscribers Aged 20-30')\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic5_distribution_of_subscribers_aged_20_30.png?raw=true)\n\n#### [Analysis 4] Count the customers with a housing loan, a personal loan, or both, and compute each group's share of all customers\n\n```python\n# Count the customers with a housing loan, with a personal loan, and with both\nhousing_count = len(df[(df['housing'] == 
'yes')])\nloan_count = len(df[(df['loan'] == 'yes')])\nboth_loans_count = len(df[(df['housing'] == 'yes') \u0026 (df['loan'] == 'yes')])\n\n# Compute each group's share of all customers, in percent\ntotal_count = len(df)\nhousing_loans_ratio = round(housing_count / total_count * 100, 2)\nloans_ratio = round(loan_count / total_count * 100, 2)\nboth_loans_ratio = round(both_loans_count / total_count * 100, 2)\n\n# Print the results\nprint(f\"Customers with a housing loan: {housing_count}\")\nprint(f\"Customers with a personal loan: {loan_count}\")\nprint(f\"Customers with both loans: {both_loans_count}\")\nprint(f\"Total customers: {total_count}\")\nprint(f\"Share with a housing loan: {housing_loans_ratio}%\")\nprint(f\"Share with a personal loan: {loans_ratio}%\")\nprint(f\"Share with both loans: {both_loans_ratio}%\")\n```\n\nOutput:\n\n```\nCustomers with a housing loan: 11568\nCustomers with a personal loan: 3657\nCustomers with both loans: 2055\nTotal customers: 22500\nShare with a housing loan: 51.41%\nShare with a personal loan: 16.25%\nShare with both loans: 9.13%\n```\n\n#### [Visualization of Analysis 4] Bar chart of the number of customers with a housing loan, a personal loan, or both\n\n```python\n# Data for the bar chart\nlabels = ['Housing', 'Loan', 'Both']\ncounts = [housing_count, loan_count, both_loans_count]\n\n# Bar positions and width\nx = range(len(labels))\nwidth = 0.5\n\n# Draw the bar chart\nplt.bar(x, counts, width, align='center')\nplt.xticks(x, labels)\nplt.xlabel('Loan Type')\nplt.ylabel('Count')\nplt.title('Count of Individuals with Housing Loan and Personal Loan')\n\n# Add a data label above each bar\nfor i, count in enumerate(counts):\n    plt.text(x[i], count, str(count), ha='center', va='bottom')\n\n# Show the chart\nplt.show()\n```\n\n![](https://github.com/jm199504/Data-Analysis-Practice/blob/main/images/bank_customer/pic6_count_of_individuals_with_housing_loan_and_personal_loan.png?raw=true)\n\n[Conclusion of Analysis 4]\n\n```\n# Customers with a housing loan: 11568\n# Customers with a personal loan: 3657\n# Customers with both loans: 2055\n# Total customers: 22500\n# Share with a housing loan: 51.41%\n# Share with a personal loan: 16.25%\n# Share with both loans: 
9.13%\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjm199504%2Fdata-analysis-practice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjm199504%2Fdata-analysis-practice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjm199504%2Fdata-analysis-practice/lists"}