https://github.com/joshuathadi/data-science

Assignments and notes from the IBM Data Science Professional Certificate. Extracting insights from large datasets to support strategic decision-making.

coursera-assignment data-science-notes data-science-projects ibm ibm-data-science-professional-certificate ibm-data-science-projects


Data Science is an interdisciplinary field that uses statistical techniques, programming, data analysis, and machine learning to extract insights and knowledge from structured and unstructured data. It lies at the intersection of mathematics, computer science, and domain expertise.



>[!IMPORTANT]
>## Data Science Assignment
>Welcome to the Data Science assignment repository! This assignment, developed as part of a Coursera course, covers key data science concepts and practical coding exercises in Jupyter Notebook. Below is a summary of what you will find in this repository.




### Objectives

- Understand the role of a Data Scientist and the data science lifecycle
- Learn Python, SQL, and data science tools such as Jupyter Notebooks, Git, and Watson Studio
- Perform data collection, cleaning, and preparation for analysis
- Conduct Exploratory Data Analysis to uncover trends and insights
- Visualize data using Matplotlib, Seaborn, and interactive dashboards
- Apply basic machine learning techniques for prediction and classification
- Evaluate model performance and interpret results
- Complete hands-on projects and a capstone to build a job-ready portfolio

---

## Data Science and Data Analysis Projects

### 1] EDA - Exploratory Data Analysis


>Exploratory Data Analysis (EDA) is a crucial step in the data science lifecycle where raw data is explored, summarized, and visualized
>to understand its structure and characteristics before applying any machine learning or statistical models.


Exploratory Data Analysis on Olympics

This project involves performing EDA on a dataset containing information about Olympic athletes, events, and medal counts.
The goal is to uncover insights about athlete performance, country participation, and trends over time.

* [Kaggle - Olympic_dataset](https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data)
* [Python source code for EDA-olympic program](https://github.com/JoshuaThadi/Data-Science/tree/main/EDA)

## Project Overview

This project focuses on **Exploratory Data Analysis (EDA)** of the **Olympics dataset** to uncover meaningful patterns, trends, and insights
from historical Olympic data. By applying data analysis and visualization techniques, this project aims to better understand athlete performance,
country-wise dominance, medal distributions, and the evolution of the Olympic Games over time.

The analysis is performed using Python-based data science tools and follows a structured, professional EDA workflow.


>## About the Olympics Dataset
>
>The Olympics dataset contains historical records of Olympic Games, including:
>
>* Athlete details (name, gender, age)
>* Country / National Olympic Committee (NOC)
>* Sport and event categories
>* Medal counts (Gold, Silver, Bronze)
>* Year, season, and host city
>
>This dataset provides rich opportunities to analyze sports trends across decades.


## Key Objectives of This Project

* Analyze medal distribution across countries
* Identify top-performing nations and athletes
* Study gender participation trends over time
* Compare performance across different sports
* Explore the evolution of the Olympics across years
* Detect missing values, duplicates, and inconsistencies


## Tools & Technologies Used

* **Python** – high-level programming language
* **Pandas** – data manipulation and cleaning
* **NumPy** – numerical operations
* **Matplotlib** – data visualization
* **Seaborn** – advanced statistical plots
* **Jupyter Notebook** – interactive analysis


## EDA Workflow Followed

1. **Data Loading & Inspection**
   * Understanding shape, columns, and data types

2. **Data Cleaning**
   * Handling missing values
   * Removing duplicates
   * Fixing inconsistencies

3. **Univariate Analysis**
   * Distribution of medals, athletes, and events

4. **Bivariate & Multivariate Analysis**
   * Country vs medals
   * Gender vs participation
   * Sports vs medal counts

5. **Data Visualization**
   * Bar charts, histograms, heatmaps, line plots

6. **Insights & Conclusions**
   * Key findings and observations
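
The first four workflow steps can be sketched with Pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, and the column names (`Name`, `Sex`, `Age`, `NOC`, `Year`, `Medal`) are assumptions about the dataset's schema that should be checked against `dataset_olympics.csv`.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed, not taken from the real file.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C", "D"],
        "Sex": ["M", "F", "F", "F", "M"],
        "Age": [24.0, None, None, 31.0, 28.0],
        "NOC": ["USA", "IND", "IND", "USA", "KEN"],
        "Year": [2012, 2016, 2016, 2016, 2012],
        "Medal": ["Gold", None, None, "Silver", "Bronze"],
    }
)

# 1. Inspection: shape, data types, and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# 2. Cleaning: drop exact duplicate rows, fill missing ages with the median,
#    and label rows without a medal explicitly.
clean = df.drop_duplicates().copy()
clean["Age"] = clean["Age"].fillna(clean["Age"].median())
clean["Medal"] = clean["Medal"].fillna("No Medal")

# 3. Univariate analysis: how often each medal value occurs.
medal_counts = clean["Medal"].value_counts()

# 4. Bivariate analysis: medals per country, and distinct female athletes per year.
medals_by_noc = clean[clean["Medal"] != "No Medal"].groupby("NOC")["Medal"].count()
female_by_year = clean[clean["Sex"] == "F"].groupby("Year")["Name"].nunique()

print(medal_counts)
print(medals_by_noc)
print(female_by_year)
```

Filling missing ages with the median is only one possible choice; dropping those rows or imputing per sport may suit the real data better.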


## Key Insights (Sample)

* Certain countries consistently dominate specific sports
* Male participation was higher historically, with a steady rise in female participation
* Medal distribution is highly skewed toward a few top-performing nations
* Some sports contribute disproportionately to total medal counts

> Detailed insights are available inside the notebook.


## Future Improvements

* Apply **statistical analysis** for deeper insights
* Perform **time-series analysis** on medal trends
* Build **machine learning models** for medal prediction
* Create **interactive dashboards** using Plotly or Power BI


## Project Structure

```
├── EDA/
│   └── EDA-olympics/
│       ├── EDA-olympic.ipynb
│       └── dataset_olympics.csv
```


## Author

**Joshua Thadi**
AI/ML & Data Science Enthusiast
Founder & CEO – Yehoarc


## Conclusion

This project demonstrates how **Exploratory Data Analysis** transforms raw Olympic data into meaningful insights.
EDA is not just a step; it is a mindset that helps analysts and data scientists ask the right questions and build reliable, high-impact solutions.

If you find this project useful, feel free to star the repository and explore further!

---

> [!NOTE]
>### Data Science resources and information
> * Topics and subjects to learn for data science and data analysis

Data science - details

☆ Key Components of Data Science

1] Data Collection: Gathering data from various sources: databases, APIs, sensors, web scraping, etc.

2] Data Cleaning and Preprocessing: Handling missing data, removing duplicates, fixing errors, normalizing formats.

3] Exploratory Data Analysis (EDA): Using statistics and visualization to understand patterns, trends, and anomalies.

4] Feature Engineering: Creating meaningful variables from raw data to improve model performance.

5] Model Building: Applying machine learning algorithms (e.g., regression, classification, clustering).

6] Model Evaluation: Testing model accuracy using metrics like precision, recall, F1-score, RMSE, etc.

7] Deployment: Integrating the model into a real-world application using tools like Flask, Docker, or cloud services.

8] Monitoring and Maintenance: Tracking model performance over time and retraining when necessary.
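
To make the metrics named under Model Evaluation concrete, here is a small from-scratch sketch in plain Python. The function names `classification_metrics` and `rmse` and the sample labels are illustrative assumptions; in practice a library such as scikit-learn would normally be used for this.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error for regression predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up labels and predictions, for illustration only.
scores = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(scores)
print(round(error, 3))
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean. RMSE applies to regression rather than classification.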



✪ Core Python Libraries / Modules

Data Manipulation & Analysis – NumPy, Pandas, Dask

Data Visualization – Matplotlib, Seaborn, Plotly, Altair

Machine Learning – scikit-learn, XGBoost, LightGBM, CatBoost, Hugging Face Transformers, TensorFlow, PyTorch

Deep Learning – Keras, PyTorch Lightning, ONNX

Model Deployment – Flask, FastAPI, Streamlit, Gradio, Docker


Tools & Platforms – Pandas, NumPy, Matplotlib, Python, R, SQL, Azure, Tableau, Power BI, Seaborn, Scikit-learn, TensorFlow, PyTorch, Jupyter Notebooks, Google Colab, AWS

📚 Core Subjects in Data Science

1] Statistics & Probability – Foundational math for inference and predictions

2] Linear Algebra – Vectors and matrices, the core of ML models

3] Calculus – Gradient descent, optimization

4] Machine Learning – Algorithms to learn from data

5] Deep Learning – Neural networks and deep architectures

6] NLP (Natural Language Processing) – Working with text and language

7] Computer Vision – Image and video analysis

8] Big Data – Working with large-scale data

9] Data Engineering – Pipelines, ETL, data storage

10] Model Deployment – Turning models into APIs/apps

11] MLOps – Production lifecycle of ML models

12] Data Visualization – Communicating insights effectively

13] Cloud & DevOps – Using AWS, Azure, GCP for scalable data solutions
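
As a small illustration of how calculus shows up in practice (subject 3 above), the following sketch runs gradient descent on a one-variable function. The function `gradient_descent`, its parameters, and the example objective f(x) = (x - 3)^2 are all illustrative assumptions, not taken from the course material.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - learning_rate * grad(x)."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Example objective: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
# Its minimum is at x = 3, so the result should end up very close to 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

With this learning rate the distance to the minimum shrinks by a constant factor each step. A rate that is too large would make the iterates diverge instead.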



📌 Topics to Cover

1] Exploratory Data Analysis (EDA) – Missing data, outliers, visualization

2] Feature Engineering – Encoding, scaling, transformations

3] Model Evaluation – Accuracy, precision, recall, ROC, AUC

4] Hyperparameter Tuning – GridSearch, RandomSearch, Optuna

5] Dimensionality Reduction – PCA, t-SNE, UMAP

6] Time Series Analysis – ARIMA, LSTM, Prophet

7] Unsupervised Learning – Clustering (KMeans, DBSCAN), PCA

8] Supervised Learning – Regression, classification

9] Neural Networks – CNN, RNN, GAN, transformers

10] Recommendation Systems – Collaborative filtering, content-based

11] Data Cleaning & Wrangling – Imputation, normalization, data types
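
The encoding and scaling steps listed under Feature Engineering can be sketched in plain Python as below. The helper names `one_hot` and `standardize` and the sample values are assumptions for illustration; Pandas and scikit-learn provide ready-made equivalents.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def standardize(values):
    """Scale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up sample data, for illustration only.
encoded, categories = one_hot(["Gold", "Silver", "Gold", "Bronze"])
scaled = standardize([10.0, 20.0, 30.0])
print(categories)
print(encoded)
print(scaled)
```

One-hot encoding avoids implying an order between categories, while standardization keeps features with large numeric ranges from dominating distance-based or gradient-based models.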




★ Why is Data Science Important?



Data Science enables organizations to:

1] Make data-driven decisions

2] Predict future trends

3] Automate processes using machine learning

4] Improve customer experiences and optimize operations

🌐 Datasets & Practice


1] Kaggle Datasets

2] UCI Machine Learning Repository

3] Google Dataset Search

4] Data.gov



📖 Learning Resources


1] Python for Data Science – freeCodeCamp

2] Coursera Data Science Specialization (Johns Hopkins University)

3] Fast.ai Courses

4] Harvard CS109 – Data Science



✫ Applications of Data Science

1] Drug Discovery & Personalized Medicine

Use Case: Analyzing genetic data and molecular structures to discover new drugs faster and more effectively.

How: Machine learning models predict how a drug will interact with human proteins, reducing the need for trial-and-error in labs.



2] Satellite Image Analysis & Earth Observation

Use Case: Monitoring deforestation, urban expansion, and climate change from space.

How: Computer vision is applied to satellite imagery to track environmental changes in near real time.


3] Neuroinformatics & Brain-Computer Interfaces (BCIs)

Use Case: Interpreting brain signals to control external devices or assist people with disabilities.

How: ML models decode EEG/fMRI data to enable mind-controlled prosthetics or communication devices.



4] Legal Analytics & Predictive Judging

Use Case: Predicting the outcome of legal cases or analyzing judge rulings.

How: NLP and ML models analyze vast amounts of case law and court data to assist legal research and strategy.



5] Content Generation & Scriptwriting

Use Case: Assisting in writing movie scripts or generating realistic dialogue.

How: NLP and generative models are trained on film scripts, books, or dialogues to suggest or generate creative writing.



6] Game Analytics & Dynamic Difficulty Adjustment

Use Case: Making video games adapt to player skill in real time for better engagement.

How: Analyzing gameplay data to adjust difficulty, recommend challenges, or predict player churn.



7] Smart City Optimization

Use Case: Managing traffic, energy consumption, and emergency response in real time.

How: Integrating IoT sensor data with predictive analytics to optimize urban infrastructure.



8] Synthetic Biology & Genomic Sequencing

Use Case: Designing synthetic organisms or editing genes more efficiently.

How: Data science models help map and understand genetic patterns to identify gene targets for editing (e.g., with CRISPR).



9] Adaptive Learning Systems in EdTech

Use Case: Personalizing learning paths for students.

How: Tracking student performance data and recommending content or pace adjustment using ML.



10] Social Good & Policy Simulation

Use Case: Simulating the outcome of policy changes (e.g., taxation, healthcare).

How: Data models are trained on socio-economic datasets to project the real-world impact of policies.



---


⚠️ This repository is uniquely designed by @JoshuaThadi.