https://github.com/prajjwol09/web-scraping

This project is a Python-based web scraper that extracts data on the largest public companies in the US by revenue from Wikipedia.

beautifulsoup4 csv dataframe pandas requests webscraping

Largest Public Companies in the US Web Scraping Project

Overview:

This project is a Python-based web scraper that extracts the list of the largest public companies in the United States by revenue from Wikipedia. Using the BeautifulSoup library for parsing and requests for fetching the webpage, it scrapes relevant data, structures it in a DataFrame using pandas, and exports the result to a CSV file for further analysis.

Features:

Scrapes data from a Wikipedia page containing a table of the largest public companies in the US.

Extracts company information such as ranking, name, revenue, and other details from the table.

Stores the scraped data in a pandas DataFrame.

Exports the data to a CSV file.
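The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the URL, the User-Agent string, the function names `fetch_html` and `table_to_dataframe`, and the sample table are all assumptions made for the example. The parsing step runs on a small made-up HTML snippet so the sketch works without network access.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumption: the README does not give the exact page address; this URL is a guess.
URL = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"


def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    # Assumption: a descriptive User-Agent; some sites reject the library default.
    headers = {"User-Agent": "web-scraping-example/0.1"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Turn the first <table> in the HTML into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    rows = table.find_all("tr")
    if not rows:
        raise ValueError("table has no rows")
    # Header cells are the <th> elements of the first row.
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    # Keep only data rows whose cell count matches the header width.
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if len(cells) == len(columns):
            records.append(cells)
    return pd.DataFrame(records, columns=columns)


# Made-up stand-in for the fetched page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Example Corp</td><td>100</td></tr>
  <tr><td>2</td><td>Sample Inc</td><td>90</td></tr>
</table>
"""

df = table_to_dataframe(SAMPLE_HTML)
csv_text = df.to_csv(index=False)  # pass a file path instead to write a CSV file
print(csv_text)
```

Against the live page, the same flow would be `table_to_dataframe(fetch_html(URL))`, although a real Wikipedia page may contain several tables and need a more specific selector than the first `<table>`.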

Technologies Used:

Python: The core programming language used to write the script.

Requests: To fetch the HTML content of the Wikipedia page.

BeautifulSoup: For parsing and navigating the HTML content to extract data.

Pandas: For data manipulation and exporting the scraped data to a CSV file.

Jupyter Notebook (Optional): For testing and experimenting with the code interactively.

Prerequisites:

Ensure you have the following libraries installed:

requests: For making HTTP requests to fetch the webpage.

BeautifulSoup (installed as the beautifulsoup4 package): For parsing the HTML page.

pandas: For data manipulation and CSV export.
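One way to confirm the prerequisites are in place is a short check like the one below. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here, not part of the project. Note that BeautifulSoup is imported as `bs4` but installed as `beautifulsoup4`.

```python
from importlib.util import find_spec

# Maps each import name to the name used with pip; the two differ for BeautifulSoup.
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4", "pandas": "pandas"}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items() if find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```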