{"id":20766637,"url":"https://github.com/elijah-1994/pre-process-e-commerce-dataset","last_synced_at":"2025-03-11T18:49:48.797Z","repository":{"id":159281862,"uuid":"634563104","full_name":"Elijah-1994/Pre-Process-E-Commerce-Dataset","owner":"Elijah-1994","description":"Importing, Cleaning, and Pre-Processing E-Commerce Data for Analysis Using MySQL.","archived":false,"fork":false,"pushed_at":"2023-04-30T14:45:51.000Z","size":267,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-18T06:27:23.623Z","etag":null,"topics":["analytics","data","dataanalytics","datacleaning","dataprocessing","mysql","mysql-database","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Elijah-1994.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-30T14:42:12.000Z","updated_at":"2023-04-30T14:47:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"faae156a-ae55-4aef-a9a2-394fd52b73a3","html_url":"https://github.com/Elijah-1994/Pre-Process-E-Commerce-Dataset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2FPre-Process-E-Commerce-Dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2FPre-Process-E-Commerce-Dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2FPre-Process-E-Commerce-Dataset/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2FPre-Process-E-Commerce-Dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Elijah-1994","download_url":"https://codeload.github.com/Elijah-1994/Pre-Process-E-Commerce-Dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243094246,"owners_count":20235478,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","data","dataanalytics","datacleaning","dataprocessing","mysql","mysql-database","sql"],"created_at":"2024-11-17T11:25:12.468Z","updated_at":"2025-03-11T18:49:48.766Z","avatar_url":"https://github.com/Elijah-1994.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"## Pre Process E-Commerce Dataset Project\n\u0026nbsp;\n\nThe aim of this project is to import, clean, and pre-process an E-commerce dataset into a MySQL database, ready for data analytics.\u003cbr /\u003e\n\n\u0026nbsp;\n\n\n## Milestone 1 - Import Raw Dataset\n\n\u0026nbsp;\n\n__Raw Data__ \n\nThe first step is to review the raw data located within OnlineRetail.csv.The raw data contains the following columns:\n\n* Invoice No: The invoice number of the items.\n* StockCode: The stock number of the items.\n* Description: The description of the items.\n* Quantity: The quantity of the items.\n* UnitPrice: The unit price of the items.\n* CustomerID: The customer ID.\n* Country: The country of the customer.\n\n\u0026nbsp;\n\n\u003cins\u003e__Connect to the MySQL 
Database__\u003c/ins\u003e\n\nIn order to connect to the MySQL Database the following code is written as shown in Figure 1 below:\n\n\u003ckbd\u003e![Alt text](Figures/Figure_1.PNG)\u003ckbd\u003e\n\n*Figure 1 - Code to connect to MySQL Database*\n\n\u0026nbsp;\n\n\u003cins\u003e__Create Orders M Table__\u003c/ins\u003e\n\nThe next step is to create a table called 'orders' with data types appropriate for each field using MySQL query. The table is shown in Figure 2 below.\n\n\u003ckbd\u003e![Alt text](Figures/Figure_2.PNG)\u003ckbd\u003e \n\n*Figure 2 - Query to create table*\n\n\u0026nbsp;\n\n\u003cins\u003e__Import the raw data into MySQL Table__\u003c/ins\u003e\n\nNow that the table has been created in MySQL, the raw data can now be imported into the MySQL table as shown in Figure 3 below. The __STR_TO_Date__ function is called to convert the invoice date into the correct format. The appropriate arguments are also called for in the stated CSV format.\n\n\u003ckbd\u003e![Alt text](Figures/Figure_3.PNG)\u003ckbd\u003e\n\n*Figure 3 - Query to import raw csv data into MySQL table*\n\n\u0026nbsp;\n\n\u003cins\u003e__Review Table__\u003c/ins\u003e\n\nThe next step is to review created table to ensure that the raw data has correctly been formatted and inputted into the correct columns, which is shown in Figure 4 below.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_4.PNG)\u003ckbd\u003e\n\n*Figure 4 - Query to review orders table*\n\n\u0026nbsp;\n\n## Milestone 2 - Analyse the Missing Customer IDs and devise plan to impute them\n\n\u0026nbsp;\n\n__Raw Data__ \n\nOn inspection, the raw data contains many missing rows of customerID which are imported with a 0 value. To conduct a thorough analysis of the data each CustomerID must have a unique identifier. 
Hence the first step is to confirm how many missing rows are within the table to gauge what needs to be done with these missing rows, for example, if the missing rows account for more than 5% then these rows do not need to be dropped but instead be inputted with artificial Customer IDs instead.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_5.PNG)\u003ckbd\u003e\n\n*Figure 5 - Query to count the number of rows in the table and the number of rows with missing customer id data*\n\n\u0026nbsp;\n\nFigure 5 shows that the rows with the missing customer IDs account for a significant proportion of the rows (25%) and therefore it was decided to fill the missing rows with artificial customer IDs. The next step is to explore some rows to ensure that the rows with missing customer IDs have valid invoice numbers associated with them. (Figure 6).\n\n\n\u003ckbd\u003e![Alt text](Figures/Figure_6.PNG)\u003ckbd\u003e \n\n*Figure 6 - Query to review orders table*\n\n\u0026nbsp;\n\nThe next step is to determine how many imputed customer IDs we will need to be generated, by computing the number of distinct invoice numbers that have missing customer IDs as shown in Figure 7 below.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_7.PNG)\u003ckbd\u003e\n\n*Figure 7 - Query to count missing customer IDs based on unique invoice numbers*\n\n\u0026nbsp;\n\n\nThe final step of milestone 2 is to find the smallest valid Customer ID in the data set, to ensure that an auto-incremented imputed customer ID will not overlap with actual values (Figure 8).\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_8.PNG)\u003ckbd\u003e\n\n*Figure 8 - Query to the find smallest valid Customer ID*\n\n\u0026nbsp;\n\n## Milestone 3 - Impute New Customer IDs for rows without them\n\n\u0026nbsp;\n\n__New Table__ \n\nAs mentioned in Milestone 2 above it was decided to impute artificial customer IDs for rows that are missing them. 
The strategy is to generate an auto-incremented customer ID, starting at 1, associated with each unique invoice that is missing a customer ID (Figure 9). The next step is to make sure that the auto-incremented imputed_id values start at 1 as shown in Figure 10 below and preview the table (11). \n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_9.PNG)\u003ckbd\u003e\n\n*Figure 9 - Query to create null_customer_ids table *\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_10.PNG)\u003ckbd\u003e\n\n*Figure 10 - Query to auto increment imputed id values*\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_11.PNG)\u003ckbd\u003e\n\n*Figure 11 - Query to review null_customer_ids table *\n\n\u0026nbsp;\n\nThe next step is to confirm the imputed IDs remain within the expected range, that does not overlap with the real IDs (Figure 12):\n\n\u003ckbd\u003e![Alt text](Figures/Figure_12.PNG)\u003ckbd\u003e \n\n*Figure 12 - Query to confirm imputed IDs expected range*\n\n\u0026nbsp;\n\n__Join Table__ \n\nTo ensure the join of the null_customer_ids with orders tables is performed efficiently, indexes on the InvoiceNo columns of both tables are created as shown in Figure 13 and Figure 14 below.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_13.PNG)\u003ckbd\u003e\n\n*Figure 13 - Query to crete index of the InvoiceNo column for the null_customer_ids table*\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_14.PNG)\u003ckbd\u003e\n\n*Figure 14 - Query to crete index of the InvoiceNo column for the orders table*\n\n\u0026nbsp;\n\n__Inner join__ \n\nNow that the indexes are created, an inner join is conducted to merge the imputed IDs with rows with missing customer IDs, joined by the invoice number as shown in Figure 15, and the confirmation that there are now no missing Customer IDs is shown in Figure 16 below.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_15.PNG)\u003ckbd\u003e\n\n*Figure 15 - Query to create inner join 
for both tables*\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_16.PNG)\u003ckbd\u003e\n\n*Figure 16 - Query to confirm the number of missing rows*\n\nThe final step is to inspect some of the rows with imputed IDs (these may be identified as being less than 12346) to ensure they appear as expected (Figure 17).\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_17.PNG)\u003ckbd\u003e\n\n*Figure 17 - Query to review orders table*\n\n\u0026nbsp;\n\n## Milestone 4 - Identify rows that do not represent customer orders\n\n\u0026nbsp;\n\n\nOn inspection, the data contains rows for accounting corrections, fees, and other expenses that do not represent real customer behavior, and this data must be cleaned out prior to analysing it.\n\n\u0026nbsp;\n\n__Unit Price__ \n\nThese data entries are often associated with high UnitPrice values or zero or negative UnitPrice values. Therefore the first step is to inspect rows with high UnitPrice values(Figure 18) and see if there is a better way to identify these expense-related entries. Also, inspect rows with negative or zero UnitPrice, and see if there are any hints as to what these rows represent and if they may be dropped.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_18.PNG)\u003ckbd\u003e\n\n*Figure 18 - Query to inspect UnitPrice rows*\n\n\u0026nbsp;\n\n__StockCode__ \n\nOn inspection, the StockCode field contains information identifying rows that do not reflect actual purchases of items. Hence the next step is to \nlist the unique StockCode values associated with these outliers (Figure 19).\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_19.PNG)\u003ckbd\u003e\n\n*Figure 19 - Query to inspect StockCode field*\n\n\u0026nbsp;\n\nThe final step OF milestone 4 is to inspect rows with zero or negative UnitPrice fields. On inspection, It is not clear what these rows represent, but they do seem to be missing valid item description data and picked up imputed customer IDs. 
As they are missing significant data, dropping these rows seems justifiable. \n\n\u003ckbd\u003e![Alt text](Figures/Figure_20.PNG)\u003ckbd\u003e\n\n*Figure 20 - Query to inspect rows with zero or negative UnitPrice fields*\n\n\u0026nbsp;\n\n## Milestone 5 - Drop rows that do not represent customer orders\n\n\u0026nbsp;\n\n__Drop rows__ \n\nThe next step is to drop rows from the orders table that have zero or negative UnitPrice values. Rows that do not reflect the purchase of real items also need to be dropped (Figure 21). These may be identified by a StockCode of DOT, M, D, S, POST, BANK CHARGES, C2, AMAZONFEE, CRUK, or B (Figure 22). The resulting orders table will be clean and ready for further analysis.\n\n\u0026nbsp;\n\n\u003ckbd\u003e![Alt text](Figures/Figure_21.PNG)\u003ckbd\u003e\n\n*Figure 21 - Query to drop rows*\n\n\u0026nbsp;\n\n\n\u003ckbd\u003e![Alt text](Figures/Figure_22.PNG)\u003ckbd\u003e\n\n*Figure 22 - Additional query to drop rows*\n\n\u0026nbsp;\n\nThe final step of the data cleaning process is to confirm that the highest remaining UnitPrice values in the table reflect actual customer behavior, as shown in Figure 23 below.\n\n\u003ckbd\u003e![Alt text](Figures/Figure_23.PNG)\u003ckbd\u003e\n\n*Figure 23 - Query to confirm results*\n\n\u0026nbsp;\n\nThe table is now ready to be used for data analytics.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felijah-1994%2Fpre-process-e-commerce-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felijah-1994%2Fpre-process-e-commerce-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felijah-1994%2Fpre-process-e-commerce-dataset/lists"}