https://github.com/narayan954/github-web-scraping


# github-web-scraping
## Using Node.js

### About
This is a web scraping project that scrapes the [GitHub topics page](https://github.com/topics) and collects all issues from the top 8 repositories under each of the 3 topics featured on that page. The 3 featured topics are chosen at random each time the page is refreshed.
Running the project does the following:
1. A separate directory is created for each of the 3 topics.
2. Inside each topic directory, one PDF file is created per repository.
3. Each PDF file lists the links to all issues of its repository.
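
The directory and file layout described above can be sketched with Node's built-in `fs`, `os` and `path` modules. This is only an illustration, not the project's actual code: the helper name `issuePdfPath` and the sample topic and repository names are hypothetical, and writing the PDF contents (which needs a PDF library) is left out.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: make sure <root>/<topic>/ exists and return the
// path of the PDF that would hold the issue links for one repository.
function issuePdfPath(root, topic, repo) {
  const topicDir = path.join(root, topic);
  // recursive: true creates missing parents and does not throw
  // if the directory already exists
  fs.mkdirSync(topicDir, { recursive: true });
  return path.join(topicDir, `${repo}.pdf`);
}

// Example with made-up names, written under a temporary directory
const root = fs.mkdtempSync(path.join(os.tmpdir(), "results-"));
const pdfPath = issuePdfPath(root, "javascript", "react");
console.log(pdfPath);
```

Calling the helper once per repository gives one directory per topic and one PDF path per repository inside it.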

### How to run this project
1. Clone this repository to your local machine.
2. Run `npm install` to install all the required packages.
3. Run `npm start` to run the project.
4. When the run finishes, the `results` directory contains 3 directories, one for each topic.

### Insights

1. The cheerio module is used for web scraping.
2. Limitation: cheerio only parses the HTML returned by the initial page load and does not run the page's JavaScript. Because the topic page does not load all repositories at once, issues are extracted only from the top 8 repositories.
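
The top-8 limit, and the step from a repository link to its issues page, can be sketched as follows. The sketch assumes the repository links have already been extracted (for example with cheerio) as relative hrefs such as `/owner/repo`. The helper name `issuePageUrls` and the sample hrefs are hypothetical, and the `/issues` suffix is GitHub's usual issues path, not something confirmed from this project's code.

```javascript
const BASE_URL = "https://github.com";
const REPO_LIMIT = 8; // only repositories present in the initial HTML are used

// Hypothetical helper: keep the first `limit` repository hrefs and turn
// each one into the absolute URL of that repository's issues page.
function issuePageUrls(repoHrefs, limit = REPO_LIMIT) {
  return repoHrefs
    .slice(0, limit)
    .map((href) => new URL(`${href}/issues`, BASE_URL).href);
}

// Made-up sample input; real hrefs would come from the scraped topic page
const sampleHrefs = ["/facebook/react", "/vuejs/vue", "/angular/angular"];
console.log(issuePageUrls(sampleHrefs, 2));
```

Keeping the limit as a parameter makes it easy to change if a later version loads more repositories than the initial HTML contains.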