{"id":28002930,"url":"https://github.com/eryks1999/titanic-project","last_synced_at":"2026-04-30T22:38:19.538Z","repository":{"id":291384985,"uuid":"977463897","full_name":"ErykS1999/Titanic-Project","owner":"ErykS1999","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-06T09:51:37.000Z","size":5093,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-09T01:44:50.087Z","etag":null,"topics":["jupyter","matplotlib","numpy","pandas","python","seaborn","titanic-kaggle"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ErykS1999.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-04T09:15:32.000Z","updated_at":"2025-05-06T09:51:41.000Z","dependencies_parsed_at":"2025-05-04T10:35:47.654Z","dependency_job_id":null,"html_url":"https://github.com/ErykS1999/Titanic-Project","commit_stats":null,"previous_names":["eryks1999/titanic-project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ErykS1999%2FTitanic-Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ErykS1999%2FTitanic-Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ErykS1999%2FTitanic-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ErykS1999%2FTitanic-Project/manifests","owner_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/owners/ErykS1999","download_url":"https://codeload.github.com/ErykS1999/Titanic-Project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253176444,"owners_count":21866142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jupyter","matplotlib","numpy","pandas","python","seaborn","titanic-kaggle"],"created_at":"2025-05-09T01:44:53.386Z","updated_at":"2026-04-30T22:38:14.515Z","avatar_url":"https://github.com/ErykS1999.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Creating the Titanic project using Pandas, NumPy, Seaborn, Matplotlib\n\n\n## In this GitHub repository I showcase the step-by-step process I followed to create the project in a Jupyter notebook:\n\n1- The first step is to import the libraries (Pandas, NumPy, Seaborn and Matplotlib in this case) and to read both CSV files. \n  - I have also used the .head() method to display the first five rows of both CSV files.\n\n  ```\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n%matplotlib inline\n\ndata_test = pd.read_csv('titanic/test.csv')\ndata_train = pd.read_csv('titanic/train.csv')\n\n  ```\n2- The second step was to rename two columns of the train CSV file to make them more readable. \n\n  ```\ndata_train = data_train.rename(columns={'Name':'Full Name','Pclass':'Class'})\ndata_train.head()\n\n  ```\n\n3- The next step was to find any missing values in the dataset and, with that, create a heatmap. 
\n\n\n  ```\ndata_train.isnull().sum()\n\nplt.figure(figsize=(10,6))\nsns.heatmap(data_train.isnull().astype(int), \n            cbar=False, \n            cmap=sns.color_palette(['deepskyblue', 'palegreen'])) \nplt.title('Heatmap of missing values')\nplt.show()\n\n\n  ```\n\u003cimg width=\"659\" alt=\"Screenshot 2025-05-04 at 11 24 10\" src=\"https://github.com/user-attachments/assets/a5f75ab4-339d-4c98-a0ea-ca654b24c236\" /\u003e\n\n\n4- After creating the heatmap, I went on to create a pie chart comparing the percentages of male and female passengers on the ship:\n\n  ```\n# Assumed definitions (not shown in the original excerpt): passenger counts by sex\nmale = (data_train.Sex == 'male').sum()\nfemale = (data_train.Sex == 'female').sum()\n\nplt.pie([male,female],labels=['Male','Female'],autopct='%1.1f%%')\nplt.title('Total Male \u0026 Female Passengers')\nplt.show()\n\n  ```\n\u003cimg width=\"313\" alt=\"Screenshot 2025-05-04 at 11 26 41\" src=\"https://github.com/user-attachments/assets/7b85bfaa-1e75-4b27-b73c-a9185176e5b1\" /\u003e\n\n\n5- The following step was to continue with pie charts to show the survival percentage of women and men, as follows:\n\n ```\n# Assumed definition, mirroring the men's calculation below\nwomen_data = data_train.loc[data_train.Sex =='female']['Survived']\npercentage2 = int(sum(women_data)/len(women_data) * 100)\n\nplt.pie([percentage2,100-percentage2],labels=['Survived','Not Survived'],autopct='%1.1f%%',colors= ['green','red'])\nplt.title('Women Survived')\nplt.show()\n\n  ```\n\n\u003cimg width=\"325\" alt=\"Screenshot 2025-05-04 at 11 27 58\" src=\"https://github.com/user-attachments/assets/8422de00-00f6-46f4-86d0-fabb899c534e\" /\u003e\n\n ```\nmen_data = data_train.loc[data_train.Sex =='male']['Survived']\n\nrate_men = sum(men_data)/len(men_data)\n\n\npercentage1 = int(rate_men * 100)\nprint(percentage1)\n\nplt.pie([percentage1,100-percentage1],labels=['Survived','Not Survived'],autopct='%1.1f%%',colors= ['green','red'])\nplt.title('Men Survived')\nplt.show()\n\n  ```\n\u003cimg width=\"357\" alt=\"Screenshot 2025-05-04 at 11 28 33\" src=\"https://github.com/user-attachments/assets/d56583d1-c57a-4360-a4df-06f1265dc008\" /\u003e\n\n\n\n6- For presentation's sake, a bar chart comparing the survival percentages of women and men was also created:\n\n\n 
```\nplt.figure(figsize=(6, 6))\nplt.title('Women vs Men Survived')\nplt.bar(['Women','Men'],[percentage2,percentage1], color =['red','blue'])\nplt.ylabel('Percentage Survived')\nplt.xlabel('Gender')\nplt.grid(False)\nplt.show()\n\n  ```\n\n\u003cimg width=\"423\" alt=\"Screenshot 2025-05-04 at 11 29 44\" src=\"https://github.com/user-attachments/assets/f35c1b50-f113-4ff7-a19d-7659d4939270\" /\u003e\n\n\n\n7 - After counting the men and women on the ship, I also created a pie chart comparing the number of children on the ship with the number of adults (passengers with a missing age are not counted):\n\n ```\nadults = data_train.loc[data_train.Age \u003e= 18]['Age'].count()\nprint(adults)\n\nchildren = data_train.loc[data_train.Age \u003c18]['Age'].count()\nprint(children)\n\nplt.pie([adults,children],labels=['Adults','Children'],autopct='%1.1f%%',colors= ['Orange','Red'])\nplt.title('Adults vs Children')\nplt.show()\n\n  ```\n\n\u003cimg width=\"323\" alt=\"Screenshot 2025-05-04 at 11 31 21\" src=\"https://github.com/user-attachments/assets/dce03466-52a0-4b25-b36b-8ac86fb3f9cd\" /\u003e\n\n8 - For a clearer view, a bar graph has also been created to show the total number of survivors, split into their own categories:\n\n ```\ntotal_amount_ppl = data_train.loc[data_train.Survived == 1]['Survived'].count()\nprint(total_amount_ppl)\n\nsurvivors = [survived_women,survived_men,survived_children,total_amount_ppl]\n\ndata_series = pd.Series(survivors,index=['Women','Men','Children','Total Number'])\n\ndata_graph = data_series.plot(kind='bar',title='Number of passengers who survived',xlabel='Category',ylabel='Amount', x='data',y='1000',color = ['violet','blue','gold','black'])\n\nfor i, value in enumerate(survivors):\n    plt.text(i, value + 10, str(value), ha='center', fontsize=10)  # Adjust +10 if needed for spacing\n\nplt.ylim(0, max(survivors) + 50)  # Optional: add space above bars\nplt.show()\n\n\n  ```\n\n\u003cimg width=\"462\" alt=\"Screenshot 2025-05-04 at 12 52 59\" 
src=\"https://github.com/user-attachments/assets/7562cf1c-a191-4805-9923-762955eab08d\" /\u003e\n\n\n9 - The following step was to create a line graph showing the survival rate, as a percentage, for each class. The steps below were taken to achieve this:\n\n\n ```\nsurvived_1st = data_train.loc[(data_train.Survived == 1) \u0026 (data_train.Class == 1),'Class'].count()\n\ntotal_1st = data_train.loc[data_train.Class == 1]['Class'].count()\n\npercentage_1st = int((survived_1st/total_1st *100))\nprint(f\"{percentage_1st}% survived in first class\")\n\n  ```\n\n\n\n ```\nsurvived_2nd = data_train.loc[(data_train.Survived == 1) \u0026 (data_train.Class == 2), 'Class'].count()\n\ntotal_2nd = data_train.loc[data_train.Class == 2]['Class'].count()\npercentage_2nd = int((survived_2nd/total_2nd *100))\nprint(f\"{percentage_2nd}% survived in second class\")\n\n  ```\n\n ```\nsurvived_3rd = data_train.loc[(data_train.Survived == 1)\u0026 (data_train.Class == 3), 'Class'].count()\ntotal_3rd = data_train.loc[data_train.Class == 3]['Class'].count()\n\npercentage_3rd = int((survived_3rd/total_3rd*100))\nprint(f\"{percentage_3rd}% survived in third class\")\n  ```\n\n ```\nx = [1, 2, 3]\n\ny = [percentage_1st, percentage_2nd, percentage_3rd]\n\nplt.figure(figsize=(10, 6))\nplt.plot(x, y, marker='o')\nplt.title('Percentage of People Survived in Each Class')\nplt.xlabel('Class')\nplt.ylabel('Percentage')\nplt.xticks([1, 2, 3])\nplt.ylim(0, 100)\nplt.grid(True)\nplt.show()\n  ```\n\n\u003cimg width=\"683\" alt=\"Screenshot 2025-05-04 at 11 36 04\" src=\"https://github.com/user-attachments/assets/0b124987-ad1e-4c49-b44b-2a2231c9e96d\" /\u003e\n\n\n\n10 - As with my previous graphs, after plotting the percentages, I also created a graph of the absolute numbers of survivors:\n\n\n ```\nclasses = [1, 2, 3]\nvalues = [survived_1st, survived_2nd, survived_3rd]\ncolors = ['gold', 'green', 'brown']\n\n# Create the bar chart\nplt.bar(classes, values, color=colors)\nplt.title('Amount of People That 
Survived per Class')\nplt.xlabel('Class')\nplt.ylabel('Amount')\nplt.xticks(classes)\n\n# Add labels above bars\nfor i, val in enumerate(values):\n    plt.text(classes[i], val + 5, str(val), ha='center', fontsize=10)  # Adjust +5 for spacing\n\nplt.ylim(0, max(values) + 50)  # Optional: ensure space above bars\nplt.show()\n  ```\n\n\u003cimg width=\"451\" alt=\"Screenshot 2025-05-04 at 11 38 55\" src=\"https://github.com/user-attachments/assets/92f8c86b-a323-4296-8f11-a737479e7755\" /\u003e\n\n\n\n11 - Moving on to a more technical approach, I used double (side-by-side) bars to make the outcome more varied and easier to understand:\n\n\n ```\nclass_category = ['First Class','Second Class','Third Class']\n\nclass_full = [first_class_full,second_class_full,third_class_full]\n\nclass_died = [first_class_died,second_class_died,third_class_died]\n\nx = range(len(class_category))\nbar_width = 0.3\n\nplt.bar(x,class_full,width=bar_width,label='Total Amount',color='skyblue')\nplt.bar([p + bar_width for p in x], class_died, width=bar_width, label='Non Survivals', color='green')\nplt.title('Amount of people in each class')\nplt.xlabel('Class')\nplt.ylabel('Amount')\nplt.xticks([p + bar_width/2 for p in x], class_category)\nplt.legend()\nplt.show() \n  ```\n\n\u003cimg width=\"457\" alt=\"Screenshot 2025-05-04 at 11 40 20\" src=\"https://github.com/user-attachments/assets/5252ca9c-b464-4fef-8a4d-43fecf1dc6f8\" /\u003e\n\n12 - Creating a pie chart to show the percentage of people in each class also helped to make everything clearer.\n\n\n\n ```\nfirst_class = data_train.loc[data_train.Class == 1]['Class'].count()\nsecond_class = data_train.loc[data_train.Class == 2]['Class'].count()\nthird_class = data_train.loc[data_train.Class == 3]['Class'].count()\n\nfirst_percentage = int((first_class/total_amount_ppl *100))\nsecond_percentage = int((second_class/total_amount_ppl *100 +1))\nthird_percentage = int((third_class/total_amount_ppl 
*100))\n\ntotal_percentage = first_percentage,second_percentage,third_percentage\n\nplt.pie(total_percentage,labels=['First Class','Second Class','Third Class'],autopct='%1.1f%%',colors= ['gold','green','brown'])\nplt.title('Class Distribution')\nplt.show()\n  ```\n\u003cimg width=\"310\" alt=\"Screenshot 2025-05-04 at 11 45 57\" src=\"https://github.com/user-attachments/assets/16d789ea-5d07-47a3-8bf9-42e16a642ca0\" /\u003e\n\n\n13 - The next step was to create a scatter plot of age against fare, coloured by survival status and sized by class. This was my first time creating a scatter plot. \n\n\n ```\nplt.figure(figsize=(10, 8))\nsns.scatterplot(x='Age', y='Fare', hue='Survived', data=data_train, palette='viridis', size='Class', sizes=(50, 200), alpha=0.7)\nplt.title('Age vs Fare by Survival Status')\nplt.xlabel('Age')\nplt.ylabel('Fare')\nplt.legend(title='Survived')\nplt.savefig('age_fare_scatter.png')\nplt.show()\n\n  ```\n\u003cimg width=\"677\" alt=\"Screenshot 2025-05-04 at 11 49 51\" src=\"https://github.com/user-attachments/assets/f22c5137-3352-4422-8d53-f91c218944e2\" /\u003e\n\n14 - Creating the graph of the number of people who embarked at each port taught me how to add value labels above the bars, as follows:\n\n ```\nembarks = [count_Cherbourg,count_Southampton,count_Queenstown]\ndata_series = pd.Series(embarks,index=['Cherbourg','Southampton','Queenstown'])\n\ndata_graph = data_series.plot(kind='bar',title='Amount of people that embarked on each station',xlabel='Harbour',ylabel='Amount', x='data',y='1000',color = ['orange','blue','lime'])\n\n\nfor i, value in enumerate(embarks):\n    plt.text(i, value + 10, str(value), ha='center', fontsize=10)  # Adjust +10 if needed for spacing\n\nplt.ylim(0, max(embarks) + 50)  # Optional: add space above bars\nplt.show()\n\n  ```\n\u003cimg width=\"453\" alt=\"Screenshot 2025-05-04 at 11 52 12\" src=\"https://github.com/user-attachments/assets/16baa86c-d151-45e1-a428-03a40219b113\" /\u003e\n\n\n\n15 - As before, 
two bar charts are placed next to each other. Managing the bar width is very important so that the bars do not overlap.\n\n ```\ncategories = ['Cherbourg','Southampton','Queenstown']\ntotal_counts = [count_Cherbourg,count_Southampton,count_Queenstown]\nembark_survived = [surived_Cherbourg,survived_Southampton,survived_Queenstown]\n\nbar_width = 0.2 \nx = range(len(categories))\n\nplt.bar(x, total_counts, width=bar_width, label='Total', color='skyblue')\nplt.bar([p + bar_width for p in x], embark_survived, width=bar_width, label='Survived', color='green')\n\nplt.xlabel('Embark Harbour')\nplt.ylabel('Count')\nplt.title('Total vs Embark Harbour')\nplt.xticks([p + bar_width/2 for p in x], categories)\nplt.legend()\n\nplt.tight_layout()\nplt.show()\n\n  ```\n\u003cimg width=\"502\" alt=\"Screenshot 2025-05-04 at 11 58 59\" src=\"https://github.com/user-attachments/assets/1fba2269-4308-4378-8bfd-3c3df5422df1\" /\u003e\n\n\n\n16 - One of the most interesting things I have learnt is the range of palette colours available in seaborn and matplotlib. Using them effectively clearly helps make the visualisation more eye-appealing. 
\n\n ```\nvalues = [one_person_family,two_people_family,three_people_family,four_people_family,five_people_family]\n\nsiblings = ['0 siblings','1 sibling','2 siblings','3 siblings','4 siblings']\n\ncolors = ['mediumspringgreen','springgreen','limegreen','green','darkgreen']\n\nplt.figure(figsize=(10, 6))\nbars = plt.bar(siblings,values,color = colors)\nplt.title('Amount of people with different number of siblings')\nplt.xlabel('Number of siblings')\nplt.ylabel('Amount')\nplt.show()\n\n  ```\n\u003cimg width=\"678\" alt=\"Screenshot 2025-05-04 at 12 00 45\" src=\"https://github.com/user-attachments/assets/5e1d3787-c4ad-428b-831b-5be5f35eda50\" /\u003e\n\n17 - Below is another example of double bars, this time with a different bar width because there are more x-axis categories.\n\n ```\nsiblings = ['0 siblings','1 sibling','2 siblings','3 siblings','4 siblings']\n\nvalues = [one_person_family,two_people_family,three_people_family,four_people_family,five_people_family]\n\ndied_values = [one_person_family_survived,two_people_family_survived,three_people_family_survived,four_people_family_survived,five_people_family_survived]\n\nbar_width = 0.3 \nx = range(len(siblings))\n\nplt.bar(x,values,width=bar_width,label='Total Amount',color='skyblue')\nplt.bar([p + bar_width for p in x], died_values, width=bar_width, label='Non Survivals', color='green')\nplt.title('Amount of people with different number of siblings')\nplt.xlabel('Number of siblings')\nplt.ylabel('Amount')\nplt.xticks([p + bar_width/2 for p in x], siblings)\nplt.legend()\nplt.show()\n\n\n  ```\n\u003cimg width=\"457\" alt=\"Screenshot 2025-05-04 at 12 04 01\" src=\"https://github.com/user-attachments/assets/8d56cab8-dbd9-4bd4-83bc-f84fac6afb74\" /\u003e\n\n18 - The following boxplot was the first one I created in seaborn. It is a very effective way to present the age distribution (median and quartiles) of each gender clearly. 
\n\n\n ```\nplt.figure(figsize=(10,6))\nsns.boxplot(x='Sex', y='Age', data=data_train, palette = ['turquoise','hotpink'])\nplt.xlabel('Gender')\nplt.ylabel('Age')\nplt.title('Age Distribution by Sex')\n\n\n  ```\n\u003cimg width=\"662\" alt=\"Screenshot 2025-05-04 at 12 05 42\" src=\"https://github.com/user-attachments/assets/63f6befc-97a9-42a7-a346-339e96f8a49a\" /\u003e\n\n19 - To practise even more, I plotted age against class in order to compare the age distribution of passengers per class.\n\n\n ```\nplt.figure(figsize=(10,6))\nsns.boxplot(x='Class', y='Age', data=data_train,palette = ['bisque','goldenrod','darkorange'])\nplt.xlabel('Class')\nplt.ylabel('Age')\nplt.title('Age Distribution by Class')\n\n\n  ```\n\n\u003cimg width=\"669\" alt=\"Screenshot 2025-05-04 at 12 08 31\" src=\"https://github.com/user-attachments/assets/2ad9634b-242e-4b9f-a35b-4bd6fab375fd\" /\u003e\n\n\n20 - One of the most detailed graphs in this project, and the one I am proudest of, is the following:\n\n ```\nage_groups = ['Below 12','13 - 17','18-30','31-49','50+']\nage_survivors = [child,teen,young_adult,adults,senior]\noriginal = [child_original,teen_original,young_adult_original,adults_original,senior_original]\n\n\nbar_width = 0.3\nx = range(len(age_groups))\n\nplt.bar(x,original,width=bar_width,label = 'Original Amount',color = 'purple')\nplt.bar([p + bar_width for p in x], age_survivors, width=bar_width, label='Survivors', color='green')\nplt.title('Amount of people with different age groups')\nplt.xlabel('Age Group')\nplt.ylabel('Amount')\nplt.xticks([p + bar_width/2 for p in x], age_groups)\nplt.legend()\nplt.show()\n\n  ```\n\n\u003cimg width=\"448\" alt=\"Screenshot 2025-05-04 at 12 11 23\" src=\"https://github.com/user-attachments/assets/d3ffd2c0-777a-4d02-93a9-3f3fb17c902a\" /\u003e\n\n\n21 - The final step was to use a countplot as a bar chart to show how many passengers travelled with each number of parents/children (the Parch column).\n\n ```\nax = sns.countplot(x='Parch', data = 
data_train, palette = ['lawngreen','darkgreen'])\n\nax.set_title('Parent/Child Amount Vs Passenger Count')\n\n\nax.set_xlabel('Parent/Child')\nax.set_ylabel('Passenger Count')\n\n\n  ```\n\u003cimg width=\"453\" alt=\"Screenshot 2025-05-04 at 12 13 27\" src=\"https://github.com/user-attachments/assets/bc6f4fda-fb93-4aa5-a616-8cda19eafbe8\" /\u003e\n\n\n## 22- In total, the project consists of 20 graphs, to analyse as much as possible. I am very proud to showcase this project to you. The skills used in this project have improved compared to the previous Student analysis project. For any recommendations, please let me know! Thank you for reading.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feryks1999%2Ftitanic-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feryks1999%2Ftitanic-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feryks1999%2Ftitanic-project/lists"}