{"id":18603965,"url":"https://github.com/farzeennimran/web-scraping-python","last_synced_at":"2026-04-17T05:03:36.775Z","repository":{"id":215823230,"uuid":"739862401","full_name":"farzeennimran/Web-Scraping-python","owner":"farzeennimran","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-23T06:17:47.000Z","size":130,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-16T18:43:43.581Z","etag":null,"topics":["beautifulsoup","beautifulsoup4","data","data-science","datascraping","python","selenium","selenium-webdriver","webscraping","website"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/farzeennimran.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-06T19:14:00.000Z","updated_at":"2024-06-23T06:17:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"759ab8fa-d931-433f-bbdb-c33766fce3e3","html_url":"https://github.com/farzeennimran/Web-Scraping-python","commit_stats":null,"previous_names":["farzeennimran/web-scraping-python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/farzeennimran/Web-Scraping-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farzeennimran%2FWeb-Scraping-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farzeennimran%2FWeb-Scraping-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farzeennimran%2FWeb-Scraping
-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farzeennimran%2FWeb-Scraping-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/farzeennimran","download_url":"https://codeload.github.com/farzeennimran/Web-Scraping-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farzeennimran%2FWeb-Scraping-python/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259804839,"owners_count":22913901,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","beautifulsoup4","data","data-science","datascraping","python","selenium","selenium-webdriver","webscraping","website"],"created_at":"2024-11-07T02:16:06.196Z","updated_at":"2026-04-17T05:03:36.728Z","avatar_url":"https://github.com/farzeennimran.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping with Python\n\n## Introduction\n\nThis project demonstrates web scraping using Python, focusing on two distinct sources: YouTube and IMDb. Using Selenium and BeautifulSoup, data is extracted from these websites and saved into separate CSV files for further analysis.\n\n## Libraries Used\n\n### Selenium\n\nSelenium is a powerful tool for controlling web browsers through programs and performing browser automation. 
It is widely used for testing web applications but also serves as an excellent tool for web scraping, especially when dealing with dynamic content.\n\n### BeautifulSoup\n\nBeautifulSoup is a Python library for parsing HTML and XML documents. It creates parse trees from page source code that can be used to extract data easily. It is particularly useful for web scraping static content.\n\n## Installation\n\nTo get started, install the necessary libraries:\n\n```bash\n!pip install bs4 selenium requests pandas\n!apt install chromium-chromedriver\n```\n\n## YouTube Scraping with Selenium\n\nThe YouTube scraping script extracts video details from the [Unfold Data Science YouTube channel](https://www.youtube.com/@UnfoldDataScience/videos).\n\n### Code Explanation\n\n1. **Setup Selenium and Chrome WebDriver**:\n   ```python\n   from selenium import webdriver\n   from selenium.webdriver.common.by import By\n   from selenium.webdriver.support.ui import WebDriverWait\n   from selenium.webdriver.support import expected_conditions as EC\n   import pandas as pd\n\n   options = webdriver.ChromeOptions()\n   options.add_argument('--ignore-certificate-errors')\n   options.add_argument('--incognito')\n   options.add_argument('--headless')\n   options.add_argument('--no-sandbox')\n   options.add_argument('--disable-dev-shm-usage')\n\n   driver = webdriver.Chrome(options=options)\n   ```\n\n2. **Navigate to the YouTube Channel**:\n   ```python\n   urlPath = 'https://www.youtube.com/@UnfoldDataScience/videos'\n   driver.get(urlPath)\n   ```\n\n3. 
**Extract Video Information**:\n   ```python\n   videos = driver.find_elements(By.CLASS_NAME, \"style-scope ytd-rich-item-renderer\")\n\n   titles, views, dates, likesData, commentsData = [], [], [], [], []\n   wait = WebDriverWait(driver, 15)\n\n   for video in videos:\n       titles.append(video.find_element(By.XPATH, './/*[@id=\"video-title\"]').text)\n       views.append(video.find_element(By.XPATH, './/*[@id=\"metadata-line\"]/span[1]').text)\n       dates.append(video.find_element(By.XPATH, './/*[@id=\"metadata-line\"]/span[2]').text)\n       \n       # Open this video's page (relative XPath, so the link belongs to the current video)\n       video.find_element(By.XPATH, './/*[@id=\"video-title-link\"]').click()\n       try:\n           likes = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id=\"segmented-like-button\"]/ytd-toggle-button-renderer/yt-button-shape/button'))).text\n           comments = wait.until(EC.presence_of_element_located((By.XPATH, '//div[@id=\"title\"]//*[@id=\"count\"]//span[1]'))).text\n       except Exception:\n           # Fall back to empty values if either element is missing or times out\n           likes, comments = '', ''\n       # Append exactly once per video so all lists stay the same length\n       likesData.append(likes)\n       commentsData.append(comments)\n       driver.back()\n\n   driver.quit()\n   ```\n\n4. 
**Save Data to CSV**:\n   ```python\n   df = pd.DataFrame({\n       'title': titles,\n       'views': views,\n       'dates': dates,\n       'likes': likesData,\n       'comments': commentsData\n   })\n   df.to_csv('Youtube.csv', index=False)\n   ```\n\n### Analysis\n\nLoad the CSV file and perform various analyses, such as calculating average views, finding the highest likes-to-views ratio, and plotting the correlation between likes and comments.\n\n```python\ndf = pd.read_csv('Youtube.csv')\n# Perform analyses here...\n```\n\n## IMDb Scraping with BeautifulSoup\n\nThe IMDb scraping script extracts details about the top-rated movies from the [IMDb Top 1000](https://www.imdb.com/search/title/?groups=top_1000\u0026sort=user_rating,desc\u0026count=100\u0026start=1\u0026ref_=adv_nxt).\n\n### Code Explanation\n\n1. **Setup and Send Request**:\n   ```python\n   import requests\n   from bs4 import BeautifulSoup\n   import pandas as pd\n\n   URL = 'https://www.imdb.com/search/title/?groups=top_1000\u0026sort=user_rating,desc\u0026count=100\u0026start=1\u0026ref_=adv_nxt'\n   headers = {\"Accept-Language\": \"en-US,en;q=0.8\"}\n   response = requests.get(URL, headers=headers)\n   soup = BeautifulSoup(response.content, 'html.parser')\n   ```\n\n2. 
**Extract Movie Information**:\n   ```python\n   MovieTitle, ReleaseYear, IMDbRating, Genre, Director = [], [], [], [], []\n\n   movies = soup.find_all('div', class_='lister-item mode-advanced')\n\n   for movie in movies:\n       title = movie.find('h3', class_='lister-item-header').find('a').text.strip()\n       year = movie.find('span', class_='lister-item-year').text.strip('()')\n       genre = movie.find('span', class_='genre').text.strip()\n       rating = movie.find('div', class_='inline-block ratings-imdb-rating').text.strip()\n       director = movie.find('p', class_='').find('a').text\n\n       MovieTitle.append(title)\n       ReleaseYear.append(year)\n       IMDbRating.append(rating)\n       Genre.append(genre)\n       Director.append(director)\n   ```\n\n3. **Save Data to CSV**:\n   ```python\n   df = pd.DataFrame({\n       'Movie Titles': MovieTitle,\n       'Release Year': ReleaseYear,\n       'IMDb Rating': IMDbRating,\n       'Directors': Director,\n       'Genre': Genre\n   })\n   df.to_csv('IMDB.csv', index=False)\n   ```\n\n### Analysis\n\nLoad the CSV file and perform various analyses, such as calculating average IMDb rating, finding the most common genre, and identifying the director with the highest average rating.\n\n```python\ndf = pd.read_csv('IMDB.csv')\n# Perform analyses here...\n```\n\n## Output Files\n\n1. **YouTube Data**: [Youtube.csv](Youtube.csv)\n2. **IMDb Data**: [IMDB.csv](IMDB.csv)\n\n## Conclusion\n\nThis project showcases the use of Selenium for dynamic content scraping and BeautifulSoup for static content scraping, providing a comprehensive guide to web scraping in Python. 
The collected data is stored in CSV files and can be further analyzed to extract meaningful insights.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarzeennimran%2Fweb-scraping-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffarzeennimran%2Fweb-scraping-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarzeennimran%2Fweb-scraping-python/lists"}