{"id":15158058,"url":"https://github.com/sumeetgedam/data_analysis","last_synced_at":"2026-02-02T22:51:00.328Z","repository":{"id":251359181,"uuid":"836567827","full_name":"sumeetgedam/Data_Analysis","owner":"sumeetgedam","description":"Repository to track Data Analysis done on various datasets available online","archived":false,"fork":false,"pushed_at":"2024-09-15T14:07:28.000Z","size":22728,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-21T08:48:41.537Z","etag":null,"topics":["algorithms","arima-forecasting","data-visualization","pandas","plotly-express","plotly-graph-objects","prophet-facebook","python","r2-score","random-forest","seaborn","sklearn","timeseries-forecasting","trends","visualization"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sumeetgedam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-01T05:59:55.000Z","updated_at":"2024-09-15T14:08:31.000Z","dependencies_parsed_at":"2024-08-25T10:26:03.008Z","dependency_job_id":"0fdcdebf-0406-4235-833b-6d62d24012ab","html_url":"https://github.com/sumeetgedam/Data_Analysis","commit_stats":{"total_commits":17,"total_committers":2,"mean_commits":8.5,"dds":0.05882352941176472,"last_synced_commit":"de08471bee8e735a5e4d90b92417c944ebb0e3a1"},"previous_names":["sumeetgedam/data_analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sumeetgedam/Data_Analysis","repository_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/sumeetgedam%2FData_Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sumeetgedam%2FData_Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sumeetgedam%2FData_Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sumeetgedam%2FData_Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sumeetgedam","download_url":"https://codeload.github.com/sumeetgedam/Data_Analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sumeetgedam%2FData_Analysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263674272,"owners_count":23494545,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms","arima-forecasting","data-visualization","pandas","plotly-express","plotly-graph-objects","prophet-facebook","python","r2-score","random-forest","seaborn","sklearn","timeseries-forecasting","trends","visualization"],"created_at":"2024-09-26T20:22:12.301Z","updated_at":"2026-02-02T22:51:00.293Z","avatar_url":"https://github.com/sumeetgedam.png","language":"Jupyter Notebook","readme":"# Data_Analysis\n\nData Analysis on some famous datasets available online.\n\n![](./Assets/dataset-cover.png)\n\n## Content :clipboard:\n\n1. 
[Kaggle](#kaggle)\n\n    - [S\u0026P 500 Analysis and Prediction](#sp-500-analysis-and-prediction)\n    - [Stock Market Analysis and EDA](#stock-market-analysis-and-eda)\n    - [Stock Market Analysis](#stock-market-analysis)\n    - [Time Series EDA on World War II](#time-series-eda-on-world-war-ii)\n    - [Time Series Basics](#time-series-basics)\n    - [Spotify Analysis](#spotify-analysis)\n    - [Airbnb Analysis](#airbnb-analysis)\n    - [E-Commerce Analysis](#e-commerce-analysis)\n    - [NBA Data Analysis](#nba-data-analysis)\n    - [Premier League Data Analysis](#premier-league-data-analysis)\n    - [Diamond Prices Data Analysis](#diamond-prices-data-analysis)\n    - [Titanic Survival Data Analysis](#titanic-data-analysis)\n\n2. [Reads](#reads)\n\n## Kaggle\n\n### S\u0026P 500 Analysis and Prediction\n\n  - Dataset\n    - [S\u0026P 500 stock data](https://www.kaggle.com/datasets/camnugent/sandp500/data?select=getSandP.py)\n\n  - Notebook\n    - [S\u0026P 500 Analysis and Prediction](https://www.kaggle.com/code/sumeetgedam/s-p-500-analysis-and-prediction/)\n\n  - Implementation Points\n    - Explored the dataset to understand the data\n    - The analysis focused on Amazon (AMZN) stock data\n    - Visualized the variation in Close, High, Low, and Open prices using matplotlib\n    - Forecasting was done using Prophet, Facebook's library for time series forecasting\n    - Prophet's plot and components showed the upward trend in both yearly and monthly AMZN data\n    - Used plotly's graph_objects to create OHLC (Open, High, Low, Close) and candlestick charts\n    - Analyzed American Airlines stock to understand seasonality trends\n    - Plotted monthly forecasted data to see the seasonal trends in each year\n\n### Stock Market Analysis and EDA\n\n  - Dataset\n    - [Stock Market](https://www.kaggle.com/datasets/mahnazarjmand/stock-market)\n\n  - Notebook\n    - [Stock Market Analysis and 
EDA](https://www.kaggle.com/code/sumeetgedam/stock-market-analysis-and-eda)\n\n  - Implementation Points\n    - Explored the dataset to understand its structure\n    - The NYSE Composite (NYA) was the focus of analysis among the other available indices\n    - Cleaned the data by dropping NA values and filled others using the ffill method\n    - Converted the *Date* attribute to datetime in pandas to track price fluctuations over time\n    - Removed outliers and visualized Adjusted Close over time (Date)\n    - Used matplotlib and plotly to create pie charts and candlestick charts on the refined data\n    - Visualized the correlation between attributes and the 100-day and 200-day simple moving averages with matplotlib\n\n### Stock Market Analysis\n\n- Dataset\n  - Yahoo Finance via Python [yfinance](https://github.com/ranaroussi/yfinance)\n\n- Notebook\n  - [Market Analysis Basics](https://www.kaggle.com/code/sumeetgedam/stock-market-analysis-basics)\n\n- Implementation Points\n  - Downloaded stock market data from the Yahoo Finance website using yfinance\n  - Explored and visualized time-series data using pandas, matplotlib, and seaborn\n  - Measured the correlation between stocks\n  - Measured the risk of investing in them and plotted expected risk vs return\n  - Predicted the closing price for Nvidia (NVDA) using an LSTM\n\n\n### Time Series EDA on World War II\n\n- Dataset\n  - [Aerial Bombing Operations in World War II](https://www.kaggle.com/datasets/usaf/world-war-ii)\n  - [Weather Conditions in World War Two](https://www.kaggle.com/datasets/smid80/weatherww2)\n\n- Notebook\n  - [Time Series EDA World War II](https://www.kaggle.com/code/sumeetgedam/time-series-eda-world-war-ii)\n\n- Implementation Points\n  - Cleaned the data to remove uncertainty and ease visualization\n  - Used Scattergeo to map the bombing paths and weather station locations\n  - The weather data was not stationary per the first Dickey-Fuller test\n  - Used popular methods to obtain a constant mean\n    - Moving average\n    - Differencing method\n  - 
Checks suggesting the data is now stationary (at the 1% level)\n    - The mean looks constant in the plot\n    - The variance also looks constant\n    - The Dickey-Fuller test statistic is smaller than the 1% critical value\n  - Forecasted the time series\n    - Used the output of the differencing method\n    - Used ARIMA for prediction after choosing its orders via ACF and PACF\n    - Visualized the ARIMA model prediction and computed the mean squared error\n\n\n\n### Time Series Basics\n\n- Dataset\n   - [Air Passengers](https://www.kaggle.com/datasets/rakannimer/air-passengers/data)\n   - [Shampoo Sales Dataset](https://www.kaggle.com/datasets/redwankarimsony/shampoo-saled-dataset)\n   - [Time Series Data](https://www.kaggle.com/datasets/saurav9786/time-series-data)\n\n- Notebook\n   - [Time Series Basics](https://www.kaggle.com/code/sumeetgedam/time-series-basics)\n\n- Implementation Points\n    - Explored univariate and multivariate time series\n    - Visualized datasets to understand the components of a time series\n      - Trend\n        - Deterministic trends\n        - Stochastic trends\n      - Seasonality\n      - Cyclic patterns\n      - Noise\n    - Models for decomposition of a time series\n      - Additive model\n        - Additive decomposition\n      - Multiplicative model\n        - Multiplicative decomposition\n    - Visualized the seasonality in the datasets\n    - Time series forecasting techniques\n      - Moving average\n        - Centred moving average\n        - Trailing moving average\n    - Handling missing values\n    - Forecasting requirements\n      - Outliers\n      - Resampling\n      - Up-sampling\n      - Down-sampling\n    - Measuring accuracy\n      - Mean Absolute Error\n      - Mean Absolute Percentage Error\n      - Mean Squared Error\n      - Root Mean Square Error\n    - ETS (Error, Trend, Seasonality) models\n      - SES\n        - Simple smoothing with additive errors\n      - Holt\n        - Holt's linear method with additive errors\n          - 
Double Exponential Smoothing\n      - Holt-Winters\n        - Holt-Winters' linear method with additive errors\n          - Multi-step forecast\n    - Auto-regressive models\n      - Auto-Correlation Function (ACF), Partial Auto-Correlation Function (PACF)\n      - Stationarity check using the Dickey-Fuller test\n      - ARIMA model (AutoRegressive Integrated Moving Average)\n      - Auto ARIMA\n        - Using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) for model selection\n\n\n### Spotify Analysis\n\n  - Dataset\n    - [Spotify Datasets](https://www.kaggle.com/datasets/lehaknarnauli/spotify-datasets)\n\n  - Notebook\n    - [Spotify Analysis](https://www.kaggle.com/code/sumeetgedam/spotify-dataanalysis-and-predictions)\n\n  - Implementation Points\n    - The dataset provides audio information for almost 600k Spotify tracks with 20 features\n    - Visualizations including word clouds and bar plots surfaced the most popular artists, the number of songs per year, the most popular songs, etc.\n    - Plotting histograms and boxplots showed the skewness of features in the dataset\n    - Added a new feature marking a song as highly popular if its popularity is greater than 50, then resampled using RandomOverSampler\n    - Built a pipeline for columns\n      - duration: SimpleImputer, FunctionTransformer, RobustScaler\n      - categorical: SimpleImputer, OneHotEncoder\n      - numerical columns: SimpleImputer, RobustScaler\n    - Used this pipeline with LogisticRegression, RandomForestClassifier, and XGBClassifier, and visualized the confusion matrix for each as a heatmap\n    - The most important features in each were found to be the explicit and loudness columns of the dataset. 
\n\n\n### Airbnb Analysis\n\n  - Dataset\n    - [NYC Airbnb Dataset](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)\n\n  - Notebook\n    - [Airbnb Data Analysis and Prediction](https://www.kaggle.com/code/sumeetgedam/airbnb-dataanalysis-and-prediction)\n\n  - Implementation Points\n    - The dataset describes the listing activity and metrics in NYC for 2019\n    - Used folium to visualize geographic locations on an interactive map\n    - Filled NaN values using KNNImputer\n    - Analyzed categorical and numerical variables by plotting countplots of them\n    - Visualized the distribution by neighbourhood group, which showed Manhattan to be the priciest neighbourhood_group for Entire home / apt listings\n    - Analyzed outliers and replaced them with thresholds calculated from the first and third quartiles\n    - Added features and predicted prices using different models\n    - The R2 score, mean absolute error, mean squared error, and root mean squared error values showed CatBoostRegressor performing best, even after hyperparameter optimization\n    - Visualizing the feature importance showed minimum_nights, annual_income, and total_cost as the top three drivers of the model output\n\n### E-Commerce Analysis\n\n  - Dataset\n    - [E-Commerce Data](https://www.kaggle.com/datasets/carrie1/ecommerce-data)\n\n  - Notebook\n    - [E-Commerce Data Analysis](https://www.kaggle.com/code/sumeetgedam/ecommerce-data-analysis)\n\n  - Implementation Points\n    - Explored the dataset and visualized it column by column\n    - Analyzed negative values to understand the dataset and added features to it\n    - Detected outliers using scatterplots and quantiles\n    - Visualized UnitPrice, Quantity, and Sales (an engineered feature)\n    - Cleaned the data for modeling and bucketized UnitPrice, Quantity, and dates\n    - Scaled features and tested the data on different models:\n      - Linear Regression\n      - DecisionTree Regressor\n      - Random Forest Regression\n    - Calculated Mean Absolute 
Error, Mean Squared Error, and R2 Score for each of them.\n\n\n### NBA Data Analysis\n  - Dataset\n    - [NBA Players stats (2023 season)](https://www.kaggle.com/datasets/amirhosseinmirzaie/nba-players-stats2023-season)\n\n  - Notebook\n    - [NBA Data Analysis](https://www.kaggle.com/code/sumeetgedam/nba-data-analysis)\n\n  - Implementation Points\n    - Explored the dataset and cleaned it for ease of use\n    - Used plotly.express and plotly.graph_objects to visualize player positions with respect to various attributes\n    - Dropped highly correlated columns to improve the performance of the analysis\n    - Modeled the data using:\n      - Linear Regression\n      - K-Nearest Neighbors (KNN)\n      - Decision Tree Regressor\n      - RandomForest Regressor\n    - Calculated the r2 score, which showed Linear Regression giving the best value of all\n    - Visually compared predicted vs actual points\n\n\n### Premier League Data Analysis\n\n- Dataset\n   - [Premier League Player Statistics](https://www.kaggle.com/datasets/rishikeshkanabar/premier-league-player-statistics-updated-daily/data)\n\n- Notebook\n   - [Premier League Data Analysis](https://www.kaggle.com/code/sumeetgedam/premier-league-player-s-analysis)\n\n- Implementation Points\n    - Explored the dataset and visualized missing values\n    - Used plotly.express to visualize:\n      - Countries most represented\n      - Players' appearances\n      - Players' ages\n    - Used ***plotly.subplots*** to visualize player stats by playing position\n    - Used ***plotly.graph_objects*** to plot graphs of how goals were scored and the mistakes made by players\n\n### Diamond Prices Data Analysis\n\n- Dataset\n   - [Diamonds](https://www.kaggle.com/datasets/shivam2503/diamonds)\n\n- Notebook\n   - [Diamond Prices Data Analysis](https://www.kaggle.com/code/sumeetgedam/data-analysis-on-daimond-prices)\n\n- Implementation Points\n   - We started by exploring the dataset, which gave us an idea of the 
attributes present in it.\n   - We later made some changes and visualized the following:\n     - Carat\n     - Cut\n     - Color\n     - Clarity\n     - Depth\n     - Dimensions\n     - along with their comparison with Price\n   - Introduced a new feature, Volume, to see the relationship between a diamond's volume and its price\n   - Divided the dataset into train and test sets to evaluate different algorithms, including:\n     - Linear Regression\n     - Lasso Regression\n     - AdaBoost Regression\n     - Ridge Regression\n     - GradientBoosting Regression\n     - RandomForest Regression\n     - KNeighbors Regression\n   - This gave us the ***r2 values*** (coefficient of determination); visualizing them showed RandomForest Regressor giving the highest r2 value\n\n\n### Titanic Data Analysis\n\n- Dataset\n   - [Titanic](https://www.kaggle.com/competitions/titanic)\n\n- Notebook\n   - [Titanic Data Analysis](https://www.kaggle.com/code/sumeetgedam/smg-titanic)\n\n- Implementation Points\n   - Reviewed the train and test data provided\n   - Calculated the survival rate of men and women\n   - Predicted the survival rate using ***RandomForestClassifier***\n\n\n# Reads\n\n- [Correlation](https://www.investopedia.com/terms/c/correlation.asp)\n- [Prophet by Facebook for Forecasting](https://facebook.github.io/prophet/docs/quick_start.html)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsumeetgedam%2Fdata_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsumeetgedam%2Fdata_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsumeetgedam%2Fdata_analysis/lists"}