https://github.com/prajjwol09/web-scraping
This project is a Python-based web scraper that extracts data on the largest public companies in the US by revenue from Wikipedia.
beautifulsoup4 csv dataframe pandas requests webscraping
- Host: GitHub
- URL: https://github.com/prajjwol09/web-scraping
- Owner: Prajjwol09
- Created: 2024-09-10T08:55:54.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-10T08:57:55.000Z (almost 2 years ago)
- Last Synced: 2025-12-12T21:35:56.192Z (7 months ago)
- Topics: beautifulsoup4, csv, dataframe, pandas, requests, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Largest Public Companies in the US Web Scraping Project
Overview:
This project is a Python-based web scraper that extracts the list of the largest public companies in the United States by revenue from Wikipedia. It fetches the page with requests, parses the HTML with BeautifulSoup, structures the extracted data in a pandas DataFrame, and exports the result to a CSV file for further analysis.
Features:
- Scrapes data from a Wikipedia page containing a table of the largest public companies in the US.
- Extracts company information such as rank, name, revenue, and other details from the table.
- Stores the scraped data in a pandas DataFrame.
- Exports the data to a CSV file.
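The features above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the URL, the User-Agent string, the output file name, the choice of the first table on the page, and the function names `fetch_html` and `table_to_dataframe` are all assumptions made for the example.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed URL: the README does not state the exact Wikipedia page.
URL = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"


def fetch_html(url: str = URL) -> str:
    """Download the page and return its HTML as text."""
    # Hypothetical User-Agent string, included to identify the client.
    headers = {"User-Agent": "largest-companies-scraper-example/0.1"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Turn the first HTML table in the page into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the first table is the one we want
    if table is None:
        raise ValueError("no <table> element found in the HTML")

    rows = table.find_all("tr")
    # The first row is assumed to hold the column headers.
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        # Skip rows whose cell count does not match the header row.
        if len(cells) == len(columns):
            records.append(cells)
    return pd.DataFrame(records, columns=columns)


if __name__ == "__main__":
    df = table_to_dataframe(fetch_html())
    # Hypothetical output file name.
    df.to_csv("largest_companies.csv", index=False)
    print(df.head())
```

All values come back as strings, so numeric columns such as revenue would still need cleaning (for example, removing thousands separators) before any arithmetic.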
Technologies Used:
- Python: The core programming language used to write the script.
- Requests: To fetch the HTML content of the Wikipedia page.
- BeautifulSoup: For parsing and navigating the HTML content to extract data.
- Pandas: For data manipulation and exporting the scraped data to a CSV file.
- Jupyter Notebook (Optional): For testing and experimenting with the code interactively.
Prerequisites:
Ensure you have the following libraries installed:
- requests: For making HTTP requests to fetch the webpage.
- BeautifulSoup (distributed on PyPI as beautifulsoup4 and imported as bs4): For parsing the HTML page.
- pandas: For data manipulation and CSV export.
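One quick way to confirm the prerequisites are available before running the notebook is a small check like the one below. It is only a sketch: the helper name `missing_modules` is hypothetical, and the names listed are the import names (BeautifulSoup is imported as `bs4`), which can differ from the package names used when installing.

```python
from importlib.util import find_spec

# Import names of the three prerequisites listed above.
REQUIRED_MODULES = ("requests", "bs4", "pandas")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisites are installed.")
```

Any module reported as missing can then be installed with pip, using the package names requests, beautifulsoup4, and pandas.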