An open API service indexing awesome lists of open source software.

https://github.com/feder-cr/datascienceproject


https://github.com/feder-cr/datascienceproject

Last synced: 6 months ago
JSON representation

Awesome Lists containing this project

README

          

University project of the course "INTRODUZIONE ALLA DATA SCIENCE" of the computer science university of Genoa
# Streaming Platform Data Analysis

## Key Features
### Objectives
- Perform comparative analysis between Netflix and Disney+ platforms
- Apply full data science pipeline from data collection to insights
- Hands-on practice of data science concepts and techniques

### Datasets
- Disney and Netflix title datasets
- Metadata like title, type, date added, genre etc.
- Enrichment data like IMDB, TMDB scores
- 4 datasets integrated provide multifaceted view

### Data Loading and Preparation
- Importing CSV datasets into dataframes
- Combining titles, enrichment and country datasets
- Handling missing values and duplicate rows
- Data transformations for analysis suitability
- Extracting year/month from dates
- Determining number of genres

### Exploratory Data Analysis
- Statistical summaries of key variables
- Analysis across dimensions like certification, type etc.
- Comparative analysis across the platforms
- Hypothesis testing for distribution differences

### Multidimensional Analysis
- Constructed OLAP cube with dimensions:
- Month
- Content type
- Production country
- Slicing, filtering and aggregation capabilities

### Predictive Modeling
- Developed classification model using Logistic Regression
- Predict content type - movie or TV show
- Features: popularity, ratings, metadata
- 85% accuracy on test data

### Interactive Visualizations
- Graphics for distributions, comparisons and trends
- Dashboards for slicing OLAP cube on multiple axes

## Key Technologies
- Python
- Jupyter Notebook
- Pandas
- Numpy
- Scikit-Learn
- Matplotlib
- Seaborn

## What I Learned
Importing, cleaning and transforming medium-size datasets
Identifying optimal data formats and structures
Applying multivariate analysis techniques
Training and evaluating classification models
Using visualizations to extract insights