https://github.com/ishansarvaiya/hotstarscraper
This project is a C# application designed to scrape movie and show data from Hotstar and store it in a SQL Server database.
https://github.com/ishansarvaiya/hotstarscraper
csharp dotnet git hotstar scraper selenium sql ssms
Last synced: about 2 months ago
JSON representation
This project is a C# application designed to scrape movie and show data from Hotstar and store it in a SQL Server database.
- Host: GitHub
- URL: https://github.com/ishansarvaiya/hotstarscraper
- Owner: ishansarvaiya
- Created: 2025-07-10T08:36:26.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-16T08:58:28.000Z (11 months ago)
- Last Synced: 2025-07-23T07:10:30.449Z (11 months ago)
- Topics: csharp, dotnet, git, hotstar, scraper, selenium, sql, ssms
- Language: C#
- Homepage:
- Size: 31.3 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# HotstarScraper
This project is a C# application designed to scrape movie and show data from Hotstar and store it in a SQL Server database.
## Demo
https://github.com/user-attachments/assets/b2e8e62e-7249-44c9-9bc3-f5e4bd86f616
## Run Locally
Clone the project
```bash
git clone https://github.com/ishansarvaiya/HotstarScraper.git
```
Go to the project directory
```bash
cd HotstarScraper/HotstarScraper
```
Install dependencies
```bash
dotnet restore
```
Start the server
```bash
dotnet run
```
## Features
- Web Scraping: Utilizes Selenium with ChromeDriver to navigate Hotstar, scroll through content, click on items, and extract details like titles, descriptions, release years, ratings, durations/seasons, languages, genres, and image URLs for both movies and shows.
- Data Persistence: Employs Entity Framework Core to interact with a SQL Server database.
- Database Schema: Defines models for Movie, Show, Genre, Language, and their many-to-many relationships (MovieGenre, ShowGenre, MovieLanguage).
- Data Handling: Includes a DataService responsible for saving scraped data, checking for existing entries, and managing relationships with genres and languages, adding new ones if they don't exist.
- Logging: Integrates Serilog for comprehensive logging of application activities, including information messages, warnings, and errors, with output directed to both console and a daily rolling file.
- Configuration: Reads database connection strings and Serilog settings from an appsettings.json file.
- Headless Browse: Configures ChromeDriver to run in headless mode for efficient scraping without a visible browser UI.
## Structure
- Program.cs: The main entry point of the application, handling setup, database migrations, and orchestrating the scraping and data saving processes.
- HotstarDbContext.cs: The Entity Framework Core DbContext for interacting with the database, defining DbSet properties for all models.
- Data/Configurations: Contains EF Core fluent API configurations for each model, defining primary keys, required properties, maximum lengths, and unique indexes.
- Interfaces: Defines IDataService and IScraperService interfaces for abstraction.
- Models: Contains the POCO classes representing the data entities (Movie, Show, Genre, Language) and data transfer objects for scraped data (MovieScrapeData, ShowScrapeData).
- Services/DataService.cs: Implements IDataService, responsible for saving scraped movie and show data to the database, including handling associated genres and languages.
- Services/ScraperService.cs: Implements IScraperService, containing the core logic for configuring the web driver, navigating Hotstar, and extracting data for movies and shows.
- appsettings.json: Configuration file for database connection strings and Serilog settings.
## Database Schema
- Movies (Id: INT (Primary Key), Title: NVARCHAR(255) (Required), Description: NVARCHAR(4000), ReleaseYear: NVARCHAR(10), Rating: NVARCHAR(10), Duration: NVARCHAR(50), ImageUrl: NVARCHAR(MAX))
- Shows (Id: INT (Primary Key), Title: NVARCHAR(255) (Required), Description: NVARCHAR(4000), ReleaseYear: NVARCHAR(10), Rating: NVARCHAR(10), Season: NVARCHAR(50), ImageUrl: NVARCHAR(MAX))
- Genres (Id: INT (Primary Key), Name: NVARCHAR(255) (Required))
- Languages (Id: INT (Primary Key), Name: NVARCHAR(255) (Required))
- MovieGenres (MovieId: INT (Primary Key, Foreign Key to Movies), GenreId (Primary Key, Foreign Key to Genres))
- MovieLanguages (MovieId: INT (Primary Key, Foreign Key to Movies), LanguageId (Primary Key, Foreign Key to Languages))
- ShowGenres (ShowId: INT (Primary Key, Foreign Key to Shows), GenreId (Primary Key, Foreign Key to Genres))
## NuGet Packages
- Microsoft.EntityFrameworkCore.SqlServer
- Microsoft.EntityFrameworkCore.Tools
- Microsoft.Extensions.Configuration.Json
- Selenium.WebDriver
- Selenium.WebDriver.ChromeDriver
- Serilog.Enrichers.Environment
- Serilog.Enrichers.Process
- Serilog.Enrichers.Thread
- Serilog.Settings.Configuration
- Serilog.Sinks.Console
- Serilog.Sinks.File