Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/t-mohamed-shafeek/web-scraping

Web scraping wikipedia.
https://github.com/t-mohamed-shafeek/web-scraping

beautifulsoup4 extraction loading python requests transformation website wikipedia

Last synced: 25 days ago
JSON representation

Web scraping wikipedia.

Awesome Lists containing this project

README

        

**Overview of Web Scraping Project using Python BeautifulSoup Library** :

**Project Overview:**
The web scraping project utilizes Python's BeautifulSoup library to extract data from the Wikipedia website. Specifically, it scrapes a table containing a list of the largest companies in the USA. The project's primary objective is to retrieve and organize the data for further analysis or display.

**Technologies Used:**
- Python programming language
- BeautifulSoup library for web scraping
- Jupyter Notebook for development and presentation

**Data Source:**
The data is sourced from the Wikipedia page that lists the largest companies in the USA. The specific URL of the Wikipedia page is provided in the Jupyter Notebook.

**Web Scraping Methodology:**
The web scraping methodology involves the following steps:
1. Sending an HTTP request to the Wikipedia page using Python's requests library.
2. Parsing the HTML content of the page using BeautifulSoup to extract relevant data.
3. Locating the table containing the list of largest companies and extracting the necessary information such as company names, rankings, and financial data.

**Code Structure:**
The code is organized into different sections or cells within the Jupyter Notebook. Each cell performs a specific task, such as importing libraries, sending HTTP requests, parsing HTML, and extracting data. The code is properly commented to explain each step's functionality.

**Data Output:**
The extracted data from the table is stored in a structured format, such as a pandas DataFrame, which allows easy manipulation and analysis of the data. The DataFrame might include columns like "Rank", "Company Name", "Industry", "Revenue (USD)", and "Employees".

**User Instructions:**
The Jupyter Notebook provides clear instructions on how to run the web scraping script successfully. This includes installing the required libraries, importing necessary modules, and executing the code cells in the correct order.

**Ethical Considerations:**
The project ensures ethical web scraping practices by respecting the Wikipedia website's terms of service and not overloading their servers with excessive requests. It also adheres to robots.txt rules and avoids scraping sensitive or copyrighted information.

**Sample Output:**
A few rows of the DataFrame displaying the scraped data are included in the Jupyter Notebook to give users an idea of the extracted information.

**Limitations:**
The project may have some limitations, such as occasional changes in the website's structure or format, which could impact the scraping process. Additionally, web scraping may not always be the most efficient method to obtain data, and API access could be a more reliable alternative.

**Future Plans:**
The project's future plans may include adding more functionalities, such as visualizing the data with graphs, comparing different years' data, or updating the scraped information periodically.

**Conclusion:**
Overall, the web scraping project successfully extracts data from the Wikipedia page, presenting the list of largest companies in the USA. It demonstrates effective use of Python's BeautifulSoup library and showcases the potential of web scraping for data collection and analysis tasks.