https://github.com/jatindhiman05/Web-Scraping
- Host: GitHub
- URL: https://github.com/jatindhiman05/Web-Scraping
- Owner: Jatindhiman05
- Created: 2024-07-05T03:10:12.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-10T16:28:35.000Z (about 1 year ago)
- Last Synced: 2025-01-12T19:19:03.379Z (9 months ago)
- Topics: python, scraping-websites, web-scraping, web-scraping-python
- Language: Jupyter Notebook
- Homepage:
- Size: 56.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# GitHub Repositories Scraper
This Python script scrapes data from GitHub repositories and saves the information to a CSV file.
## Features
- Retrieves the repository name, number of stars, number of forks, and the repository URL.
- Supports scraping from a single repository URL or a search page with multiple repositories.
- Saves the scraped data to a CSV file named `repository_info.csv`.

## Dependencies
The script requires the following Python libraries:
- `requests`: for making HTTP requests to GitHub.
- `BeautifulSoup`: for parsing the HTML content of the GitHub pages.
- `pandas`: for storing and saving the scraped data in CSV format. The corresponding imports are shown below.
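Assuming the standard package names on PyPI (`requests`, `beautifulsoup4`, `pandas`), the script's imports would look like this:

```python
# Install first, e.g.: pip install requests beautifulsoup4 pandas
import requests                # HTTP requests to GitHub
from bs4 import BeautifulSoup  # HTML parsing
import pandas as pd            # tabular storage and CSV export
```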
## Functions

- `get_repo_name(doc)`: Extracts the repository names from the HTML document.
- `get_stars(doc)`: Extracts the number of stars for each repository.
- `get_forks(doc)`: Extracts the number of forks for each repository.
- `get_repo_url(doc, base_url)`: Constructs the full repository URLs based on the base URL.
- `scrape_github_id(repo_url)`: Scrapes data for a single repository URL and returns a pandas DataFrame (see the sketch after this list).
- `Mega_scrape(repo_url)`: Scrapes data for a search page with multiple repositories and saves the results to a CSV file.
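As a rough illustration of how `scrape_github_id` might be implemented, here is a minimal sketch. The CSS selectors (`#repo-stars-counter-star`, `#repo-network-counter`) are assumptions about GitHub's repository-page markup, not the repository's actual code, and may break if GitHub's layout changes:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_github_id(repo_url):
    """Fetch one repository page and return its name, stars, forks, and URL."""
    response = requests.get(repo_url)
    response.raise_for_status()
    doc = BeautifulSoup(response.text, "html.parser")

    # The repo name is the last path segment of the URL.
    name = repo_url.rstrip("/").split("/")[-1]
    # Hypothetical element ids for the star and fork counters.
    stars = doc.select_one("#repo-stars-counter-star")
    forks = doc.select_one("#repo-network-counter")
    return pd.DataFrame({
        "Repo Name": [name],
        "Stars": [stars.get_text(strip=True) if stars else None],
        "Forks": [forks.get_text(strip=True) if forks else None],
        "Repo URL": [repo_url],
    })
```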
## Run Code

```python
Mega_scrape('https://github.com/search?q=topic%3Apython&type=Repositories')
```
This will scrape data for all the repositories displayed on the search page and save the results to a CSV file named `repository_info.csv`.

```python
df = scrape_github_id('https://github.com/pandas-dev/pandas')
```
This will scrape data for the specified repository and return a pandas DataFrame.

## Notes
- The script assumes that the GitHub page structure remains consistent. If the HTML layout changes, the script may need to be updated accordingly.
- The script scrapes up to 30 repositories per page. If the search results span more than one page, the script automatically navigates to the next page and continues scraping (a pagination sketch follows these notes).
- The scraped data is saved to a CSV file named `repository_info.csv` in the same directory as the script.
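A minimal sketch of the pagination pattern described above, assuming GitHub search URLs accept a `p` page parameter; the stopping conditions here (a page cap and a non-200 response) are illustrative choices, not the script's actual logic:

```python
import requests
from bs4 import BeautifulSoup

def fetch_search_pages(search_url, max_pages=5):
    """Fetch successive GitHub search result pages as parsed documents."""
    docs = []
    for page in range(1, max_pages + 1):
        separator = "&" if "?" in search_url else "?"
        response = requests.get(f"{search_url}{separator}p={page}")
        if response.status_code != 200:  # stop on rate limits or missing pages
            break
        docs.append(BeautifulSoup(response.text, "html.parser"))
    return docs
```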
# GitHub Topics Scraper

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format (a scraper sketch follows this list):

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
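A hedged sketch of the per-topic step. The selector below (repository headings rendered as `h3` elements containing two links, username then repo name) is an assumption about the topic-page markup, and the sketch omits star counts for brevity:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"

def scrape_topic_repos(topic_url, csv_path):
    """Write the repositories listed on one topic page to a CSV file."""
    doc = BeautifulSoup(requests.get(topic_url).text, "html.parser")
    rows = []
    for heading in doc.select("h3"):
        links = heading.select("a")
        if len(links) != 2:  # hypothetical: repo headings hold exactly two links
            continue
        username = links[0].get_text(strip=True)
        repo_name = links[1].get_text(strip=True)
        rows.append({
            "Repo Name": repo_name,
            "Username": username,
            "Repo URL": f"{BASE_URL}/{username}/{repo_name}",
        })
    pd.DataFrame(rows).to_csv(csv_path, index=None)

# Example: scrape_topic_repos("https://github.com/topics/3d", "3d.csv")
```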