
# Google Data Analytics Professional Certificate - Capstone Project

### By: Gino Freud D. Hobayan (https://gino-freud-hobayan.github.io/)




# Descriptive and Diagnostic Analytics on Bike-share data
- **Descriptive Analytics** tells you **WHAT** happened in the past.
- **Diagnostic Analytics** helps you understand **WHY** something happened in the past.




## **Case Study: How Does a Bike-Share Navigate Speedy Success?**


This is my Capstone Project for the [Google Data Analytics Professional Certificate](https://www.coursera.org/professional-certificates/google-data-analytics)



### **In this Scenario:**

I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.

Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently.

From these insights, my team will design a new marketing strategy to convert casual riders into annual members.
But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data
visualizations.


The six phases of the data analysis process I learned from the course:
![6 phases - Google DA](https://github.com/Gino-Freud-Hobayan/Google-Data-Analytics-Capstone_Gino/assets/117270964/f9c7d03b-f6d0-408f-bf37-07c007ef790d)



### **It's very important to make sure the dataset is credible and that we perform proper data cleaning, so that we get useful and accurate insights.**
### **Otherwise, we'll only run into headaches and confusion once we get to the analysis.**

### **Just as Stephen R. Covey said:**

### **"If the ladder is not leaning against the right wall, every step we take just gets us to the wrong place faster."**






# 1. ASK

#### In the Ask step, we define the problem we're solving and make sure that we fully understand stakeholder expectations.

- Define the problem we’re trying to solve
- Make sure that we fully understand the stakeholder’s expectations
- Take a step back and see the whole situation in context



## **About the company**

In 2016, Cyclistic launched a successful bike-share offering.
Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago.
The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.
One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes,
and annual memberships.
- Customers who purchase single-ride or full-day passes are referred to as **casual riders.**
- Customers who purchase annual memberships are **Cyclistic members.**

**Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.**
Although the pricing flexibility helps Cyclistic attract more customers, Moreno (the director of marketing) believes that maximizing the number of annual members will
be key to future growth.

**Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a
very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic
program and have chosen Cyclistic for their mobility needs.**

Moreno has set a clear goal: **Design marketing strategies aimed at converting casual riders into annual members.**

In order to do that, however, the marketing analyst team needs to better understand:
- how annual members and casual riders differ
- why casual riders would buy a membership
- how digital media could affect their marketing tactics.

Moreno and her team are **interested in analyzing the Cyclistic historical bike trip data to identify trends.**


## Stakeholders:

**1. Lily Moreno:**
The director of marketing and my manager in this scenario. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
These may include email, social media, and other channels.

**2. Cyclistic marketing analytics team:**
A team of data analysts who are responsible for collecting, analyzing, and
reporting data that helps guide Cyclistic marketing strategy. I joined this team six months ago and have been busy
learning about Cyclistic’s mission and business goals — as well as how I, a junior data analyst, can help Cyclistic achieve them.

**3. Cyclistic executive team:**
The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.


#### What are my stakeholders saying their problems are?

#### Now that I’ve identified the issues, how can I help the stakeholders resolve their questions?




**How do annual members and casual riders use Cyclistic bikes differently?**

**What makes casual riders purchase a membership?**

**How could digital media affect the company's marketing tactics, with the goal of converting casual riders into members?**

![SMART questions](https://github.com/Gino-Freud-Hobayan/Google-Data-Analytics-Capstone-Project./assets/117270964/666be9a3-d120-4c8c-b89a-1f5b4daa1f9d)


## **Business Task:**

### **Analyze the usage patterns and motivations of annual members and casual riders of Cyclistic bike-share, and develop digital marketing strategies to convert more casual riders into annual members.**


## **Deliverables:**

**1. A clear statement of the business task**

**2. A description of all data sources used**

**3. Documentation of any cleaning or manipulation of data**

**4. A summary of my analysis**

**5. Supporting visualizations and key findings**

**6. My top recommendations based on my analysis**






# 2. PREPARE

In this step, I will be preparing the data and checking its reliability using the ROCCC analysis.


## **Prepare the data:**

1. The data is stored in cloud storage as zipped CSV files. It is publicly available; I accessed it through the link provided in the course's case study PDF.
2. Divvy system data (first-party data) is owned by the City of Chicago and released on a monthly schedule.
3. The data is also processed to remove trips that are taken by staff as they service and inspect the system.
4. All ride data is anonymized.
5. Are there issues with bias or credibility in this data? None; I checked using the ROCCC analysis below.


### ROCCC for the Reliability of the Dataset
The dataset follows the ROCCC Analysis as described below:

- Reliable - yes, not biased
- Original - yes, can locate the original public data
- Comprehensive - yes, not missing important information
- Current - yes, updated monthly
- Cited - yes



6. We are dealing with structured data.

Each CSV file consists of the same 13 columns:

- ride_id
- rideable_type
- started_at
- ended_at
- start_station_name
- start_station_id
- end_station_name
- end_station_id
- start_lat
- start_lng
- end_lat
- end_lng
- member_casual



## **Limitations:**

#### 1. The analysis is based on the available dataset from Cyclistic.

#### 2. Most of the data is anonymized (due to data privacy).

#### 3. Additionally, the dataset contains many null values, which affects the accuracy of the analysis.

#### 4. For this analysis, I will only use the most recent 12 months of historical trip data: July 2022 to June 2023 (12 CSV files).



## CREATE DATABASE and CREATE TABLE (SQL)
- DATABASE NAME = Bikeshare_database
- TABLE NAME = BikeShare_table

SQL Query:

```sql
CREATE DATABASE Bikeshare_database;

-- Specifying the database we want to use
USE Bikeshare_database;

-- Create a table that will hold all of the data from the CSV files.
-- six decimal places for longitude and latitude for accuracy.
CREATE TABLE BikeShare_table
(
pk_ride_id VARCHAR(16) PRIMARY KEY NOT NULL,
rideable_type VARCHAR(13),
started_at DATETIME,
ended_at DATETIME,
start_station_name VARCHAR(100),
start_station_id VARCHAR(50),
end_station_name VARCHAR(100),
end_station_id VARCHAR(50),
start_lat DECIMAL(8,6),
start_lng DECIMAL(9,6),
end_lat DECIMAL(8,6),
end_lng DECIMAL(9,6),
member_casual VARCHAR(6)
);

-- Insert the data from each monthly CSV file into our database table 'BikeShare_table'

-- July 2022
BULK INSERT BikeShare_table
FROM "C:\Users\GINO\Desktop\Google Capstone\Cyclistic - bike share data\202207-divvy-tripdata.csv"
WITH (
FORMAT = 'CSV',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
);

-- REPEATED THIS SAME PROCESS UNTIL ALL 12 CSV FILES HAVE BEEN UPLOADED.
```


## Count the total number of records that were uploaded

```sql
SELECT
COUNT(*) AS total_num_of_records
FROM
BikeShare_table;
```

*(Screenshot: query result, total_num_of_records = 5,779,444)*

### Inference:
- We will be working with 5,779,444 records (more than 5 million rides covering 12 months of data).



## Get the Head(50) and a general view of the data

```sql
SELECT TOP 50
*
FROM
BikeShare_table;
```

*(Screenshots: first 50 rows of BikeShare_table)*

### Inference:
- We can clearly see that there are null values in this dataset; we will have to deal with them to ensure our analysis is accurate.




## CREATE A BACKUP DATABASE

*(Screenshot: BACKUP DATABASE completed successfully)*

- I created a backup of the original data first. (I also do this in Python and Excel in some of my projects.)
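
The backup statement itself isn't shown above; as a minimal T-SQL sketch (the target path below is illustrative, not the one actually used), a full database backup looks like this:

```sql
-- Minimal sketch of a full database backup (illustrative file path).
BACKUP DATABASE Bikeshare_database
TO DISK = 'C:\Backups\Bikeshare_database.bak'
WITH FORMAT,
     NAME = 'Bikeshare_database - full backup';
```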










# 3. PROCESS

In this step, I will be processing and cleaning the data for exploratory data analysis.

**I spent a lot of time on this step, especially in data cleaning in order to ensure data integrity prior to analysis.**

Working with clean data is essential for getting valuable insights from our analysis; dirty data only leads to headaches and inaccurate results.


### **"If the ladder is not leaning against the right wall, every step we take just gets us to the wrong place faster."**
### **- Stephen R. Covey**





# DATA CLEANING:

### **This is done in order to ensure data integrity**





## Check for Duplicates

SQL Query:

```sql
SELECT
pk_ride_id,
COUNT(pk_ride_id) AS count_of_ride_id
FROM
BikeShare_table
GROUP BY
pk_ride_id
HAVING COUNT(pk_ride_id) > 1;

-- We use HAVING because WHERE cannot filter on aggregates such as COUNT(column_name)

-- ANSWER: No duplicates found on pk_ride_id

```

*(Screenshot: the query returned no rows, so no duplicate pk_ride_id values were found)*



## Check for null values

SQL Query:

```sql

/*
WHEN the column is:
- IS NULL
- OR ('NA')
- OR ('N/A')
- OR LEN(column) = zero
- ISDATE(started_at/ended_at) = 0

When one of these conditions is met, the CASE expression returns the value "1"
(so COUNT acts as a counter of the matching rows).

*/

-- Check all our columns for N/A or NULL values.
-- We'll be using CASE Statements to act as a counter
SELECT
COUNT (*) AS TOTAL_NUM_OF_RECORDS,
COUNT (CASE WHEN pk_ride_id IS NULL OR pk_ride_id = 'NA' OR pk_ride_id = 'N/A' OR LEN(pk_ride_id) = 0 THEN 1 END) AS pk_ride_id,
COUNT (CASE WHEN rideable_type IS NULL OR rideable_type = 'NA' OR rideable_type = 'N/A' OR LEN(rideable_type) = 0 THEN 1 END) AS rideable_type,
COUNT (CASE WHEN started_at IS NULL OR ISDATE(started_at) = 0 THEN 1 END) AS started_at,
COUNT (CASE WHEN ended_at IS NULL OR ISDATE(ended_at) = 0 THEN 1 END) AS ended_at,
COUNT (CASE WHEN start_station_name IS NULL OR start_station_name = 'NA' OR start_station_name = 'N/A' OR LEN(start_station_name) = 0 THEN 1 END) AS start_station_name,
COUNT (CASE WHEN start_station_id IS NULL OR start_station_id = 'NA' OR start_station_id = 'N/A' OR LEN(start_station_id) = 0 THEN 1 END) AS start_station_id,
COUNT (CASE WHEN end_station_name IS NULL OR end_station_name = 'NA' OR end_station_name = 'N/A' OR LEN(end_station_name) = 0 THEN 1 END) AS end_station_name,
COUNT (CASE WHEN end_station_id IS NULL OR end_station_id = 'NA' OR end_station_id = 'N/A' OR LEN(end_station_id) = 0 THEN 1 END) AS end_station_id,
COUNT (CASE WHEN start_lat IS NULL THEN 1 END) AS start_lat,
COUNT (CASE WHEN start_lng IS NULL THEN 1 END) AS start_lng,
COUNT (CASE WHEN end_lat IS NULL THEN 1 END) AS end_lat,
COUNT (CASE WHEN end_lng IS NULL THEN 1 END) AS end_lng,
COUNT (CASE WHEN member_casual IS NULL OR member_casual = 'NA' OR member_casual = 'N/A' OR LEN(member_casual) = 0 THEN 1 END) AS member_casual
FROM
BikeShare_table;

```
*(Screenshot: per-column null/missing counts alongside the total number of records)*

#### Inference:
- Some columns have null values: start_station_name, start_station_id, end_station_name, end_station_id, end_lat, end_lng
- We have properly identified which columns have null values in them and how many null values are in total.

That's the first step: identify the problem. Now we can deal with the columns that have null values.



## Deal with LEADING and TRAILING spaces

SQL Query:

```sql
-- REMOVE THE TRAILING AND LEADING SPACES for columns with STRING dtypes only.

-- LTRIM (removes leading spaces)
-- RTRIM (removes trailing spaces)
-- TRIM (removes both leading and trailing spaces)

UPDATE BikeShare_table
SET pk_ride_id = TRIM(pk_ride_id);

UPDATE BikeShare_table
SET rideable_type = TRIM(rideable_type);

UPDATE BikeShare_table
SET start_station_name = TRIM(start_station_name)
WHERE start_station_name IS NOT NULL;

UPDATE BikeShare_table
SET start_station_id = TRIM(start_station_id)
WHERE start_station_id IS NOT NULL;

UPDATE BikeShare_table
SET end_station_name = TRIM(end_station_name)
WHERE end_station_name IS NOT NULL;

UPDATE BikeShare_table
SET end_station_id = TRIM(end_station_id)
WHERE end_station_id IS NOT NULL;

UPDATE BikeShare_table
SET member_casual = TRIM(member_casual);
```

*(Screenshot: the UPDATE ... SET ... TRIM statements executed successfully, removing leading and trailing spaces)*


(This is similar to preprocessing in Python, e.g., stripping whitespace from string columns with pandas.)






## Fill the null/missing values

SQL Query:

```sql
SELECT
*
FROM
______

```
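
The query above is left as a placeholder. As a minimal sketch of one possible approach (these handling rules are my assumptions, not the documented method), missing station names can be labeled 'Unknown' so the rides remain usable for time-based analysis, while rides with no end coordinates at all are dropped:

```sql
-- Sketch only: assumed null-handling rules, not the original query.
-- Keep rides with missing station info by labeling it 'Unknown'.
UPDATE BikeShare_table
SET start_station_name = 'Unknown'
WHERE start_station_name IS NULL OR LEN(start_station_name) = 0;

UPDATE BikeShare_table
SET end_station_name = 'Unknown'
WHERE end_station_name IS NULL OR LEN(end_station_name) = 0;

-- Drop rides that have no end coordinates at all.
DELETE FROM BikeShare_table
WHERE end_lat IS NULL OR end_lng IS NULL;
```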



## Addressing Outliers and Anomalies

SQL Query:

```sql

-- EXAMPLE: Age of people who go to the gym to workout
-- Mean age = 25 yrs old
-- Outlier = 8 yrs old or 80 yrs old

SELECT
*
FROM
______

```
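
The query above is also a placeholder. A reasonable (assumed) check for this dataset is to flag rides whose duration is negative or implausibly long, for example more than 24 hours:

```sql
-- Sketch only: flag rides with negative or implausibly long durations.
SELECT
pk_ride_id,
started_at,
ended_at,
DATEDIFF(MINUTE, started_at, ended_at) AS ride_length_min
FROM
BikeShare_table
WHERE
DATEDIFF(MINUTE, started_at, ended_at) <= 0
OR DATEDIFF(MINUTE, started_at, ended_at) > 24 * 60;
```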








# 4. ANALYZE

In this step, I will be **analyzing the data to find patterns, trends, and insights.**

We will explore the data, and perhaps look at **the total number of rows, distinct values, maximum, minimum, or mean values.**


## Descriptive Statistics:
- Categorical data
- Numerical data




# Descriptive Statistics (Numerical data)
This is similar to **.describe()** in Python

This includes the **Five-number summary: min, 25% (Q1), 50% (Q2/median), 75% (Q3), and max**

![five num summary + boxplot](https://github.com/Gino-Freud-Hobayan/Google-Data-Analytics-Capstone-Project./assets/117270964/bd19041a-65b1-4e6e-959b-e984e74a26d4)


SQL Query:

```sql

/*
DESCRIPTIVE STATISTICS
We will first check the Numerical Descriptive Statistics of our rides

- Count: It shows the number of values that are not missing in each of the columns.
- Mean: It shows the average of the values in each of the columns.
- Std: It shows the standard deviation of the values in each of the columns.

- Min: It shows the smallest value in each of the columns.
- 25% (Q1): Median of the first half of the data
- 50% (Q2): Median
- 75% (Q3): Median of the second half of the data
- Max: It shows the maximum value in each of the columns.

*/

SELECT
*
FROM
______

```
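
The query above is a placeholder as well. One way the five-number summary (plus the mean) of ride length in minutes could be computed in SQL Server is sketched below, using PERCENTILE_CONT window functions; the ride-length definition here is my assumption:

```sql
-- Sketch only: five-number summary (plus mean) of ride length in minutes.
SELECT DISTINCT
MIN(DATEDIFF(MINUTE, started_at, ended_at)) OVER () AS min_ride_min,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY DATEDIFF(MINUTE, started_at, ended_at)) OVER () AS q1_ride_min,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY DATEDIFF(MINUTE, started_at, ended_at)) OVER () AS median_ride_min,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY DATEDIFF(MINUTE, started_at, ended_at)) OVER () AS q3_ride_min,
MAX(DATEDIFF(MINUTE, started_at, ended_at)) OVER () AS max_ride_min,
AVG(CAST(DATEDIFF(MINUTE, started_at, ended_at) AS FLOAT)) OVER () AS mean_ride_min
FROM
BikeShare_table;
```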

### Inference:
- .....
- ....
- ......



## Sample

SQL Query:

```sql
SELECT
*
FROM
______

```

### Inference:
- .....



## Sample

SQL Query:

```sql
SELECT
*
FROM
______

```

### Inference:
- .....



## Sample

SQL Query:

```sql
SELECT
*
FROM
______

```

### Inference:
- .....



## Sample

SQL Query:

```sql
SELECT
*
FROM
______

```

### Inference:
- .....












# 5. SHARE

In this step, I made visualizations of the data and shared my insights about it.

Tableau link for the Data visualization:


### Insights/Key Findings:
1. .......
2. ...
3. .......
4. ....
5. .......
6. ....






# 6. ACT

In this final step, I will present to the stakeholders a summary of my insights and key findings, backed up with compelling data insights and professional data visualizations.


### Conclusion:



### Recommendations:

1. .......
2. .......
3. .......
4. .......
5. .......
6. .......








![Thank you wordcloud1](https://github.com/Gino-Freud-Hobayan/Google-DA-Capstone_Gino/assets/117270964/ba5b72d0-12b7-48ec-a36f-0b1fe218665a)