Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/chaudharypraveen98/stackoverflowscraper

This project scrapes Stack Overflow questions based on the field (tag), the number of pages, and the number of questions per page. It saves the results to a file named after the chosen tag. You can easily change the filename.

pandas requests requests-html scraping scripting



Host: GitHub
URL: https://github.com/chaudharypraveen98/stackoverflowscraper
Owner: chaudharypraveen98
Created: 2020-07-27T12:53:39.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2021-07-08T16:33:49.000Z (over 3 years ago)
Last Synced: 2023-03-06T20:48:52.938Z (almost 2 years ago)
Topics: pandas, requests, requests-html, scraping, scripting
Language: Python
Homepage:
Size: 210 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md


README

## **Stack Overflow Question Scraper**
This scraper collects questions from Stack Overflow. You can choose the sort order (votes, newest, or active), the number of questions per page, the number of pages to search, and the field (topic) in which to search.

##### Level: Beginner

Topics -> requests, requests-html, pandas, scraping, scripting, StackOverflowScraper

Preview Link -> StackOverflowScraper

Source Code Link -> GitHub

What are we going to do?

First, we make a request to fetch the HTML page using the requests library.

If the response is OK, we feed it into the HTML parser from requests-html.

We then use selectors to get the required fields, such as the question title, tags, votes, and answered status.

## Libraries required

requests-html

pandas

requests

## Prerequisites

What are selectors/locators?
A CSS selector is a combination of an element selector and a value that identifies a web element within a web page.

The choice of locator depends largely on your application under test.

Id
An element’s id is selected in XPath using “[@id='example']” and in CSS using “#”. IDs must be unique within the DOM.
Examples:
```
XPath: //div[@id='example']
CSS: #example
```

Element Type
The previous example showed //div in the XPath. That is the element type, which could be input for a text box or button, img for an image, or "a" for a link.

```
XPath: //input
CSS: input
```

Direct Child
HTML pages are structured like XML, with children nested inside of parents. If you can locate, for example, the first link within a div, you can construct a string to reach it. A direct child is selected in XPath with “/”, while in CSS it is selected with “>”.
Examples:
```
XPath: //div/a
CSS: div > a
```

Child or Sub-Child
Writing nested divs can get tiring - and result in code that is brittle. Sometimes you expect the code to change, or want to skip layers. If an element could be inside another or one of its children, it’s defined in XPATH using “//” and in CSS just by a whitespace.
Examples:
```
XPath: //div//a
CSS: div a
```

Class

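For classes, things are pretty similar. In XPath it’s “[@class='example']”, while in CSS it’s just “.”. Note that the XPath form matches only when the class attribute is exactly 'example', whereas the CSS form also matches elements that carry additional classes.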
Examples:
```
XPath: //div[@class='example']
CSS: .example
```
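
The XPath patterns above can be tried out without any third-party package. The following is a minimal sketch that uses the limited XPath subset of Python's standard-library `xml.etree.ElementTree`; the `SAMPLE` markup is a hypothetical, well-formed snippet invented for this illustration, and the scraper itself uses requests-html rather than ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet used only for this illustration.
SAMPLE = """
<html>
  <body>
    <div id="example">
      <a href="/direct">direct child link</a>
      <p><a href="/nested">nested link</a></p>
    </div>
    <div class="example">
      <input name="q" />
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# Id: //div[@id='example']
by_id = root.findall(".//div[@id='example']")

# Element type: //input
inputs = root.findall(".//input")

# Direct child: //div/a matches only links directly under a div
direct_links = root.findall(".//div/a")

# Child or sub-child: //div//a matches links at any depth under a div
all_links = root.findall(".//div//a")

# Class: //div[@class='example']
by_class = root.findall(".//div[@class='example']")

print(len(by_id), len(inputs), len(direct_links), len(all_links), len(by_class))
```

ElementTree paths start with `.` because they are evaluated relative to the root element. Real web pages are often not well-formed XML, so this approach is only suitable for experimenting with the selector syntax.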

## Understanding the code
## Requesting the HTML webpage

We will use the requests library to fetch the HTML code.
```
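import requests


def extract_from_url(url):
    r = requests.get(url)
    # Any 2xx status counts as success; range(200, 300) also covers 299.
    if r.status_code not in range(200, 300):
        print("error while finding the data")
        return []  # an empty list keeps `all_page_data += ...` in the caller safe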
```
`r.status_code` holds the HTTP status code of the response. If it indicates success, we proceed to the next part.
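
The success check can also be written as an explicit comparison. Python's `range` excludes its upper bound, so a check written as `range(200, 299)` would reject a 299 response; the sketch below avoids that off-by-one. The helper name `is_success` is hypothetical and not part of the project.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code."""
    return 200 <= status_code < 300


# range(200, 299) stops at 298, so 299 is not contained in it
print(299 in range(200, 299))
print(is_success(299))
print(is_success(404))
```

With the requests library, `r.ok` (true for any status below 400) or `r.raise_for_status()` are alternatives if a looser check or an exception is preferred.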
## Parsing the HTML code using HTML from requests-html
```
    # Still inside extract_from_url; needs `from requests_html import HTML` at the top of the file.
    html_text = r.text
    formatted_html = HTML(html=html_text)
```

## Scraping using the parsed HTML code
```
    # Still inside extract_from_url: each .question-summary element holds one question.
    data_summary = formatted_html.find(".question-summary")
    final_data = []
    for question in data_summary:
        question_votes = question.find('.vote-count-post', first=True).text
        question_data = question.find('.question-hyperlink', first=True).text
        question_tags = question.find('.tags', first=True).text
        data = {}
        data["question"] = question_data
        data["votes"] = question_votes
        data["tags"] = question_tags
        final_data.append(data)
    return final_data
```
First, we find the question containers, each of which holds all the information for one question, using the class CSS selector (.question-summary).
Then we loop through all the question containers. We can easily extract the other details using CSS selectors such as:

('.vote-count-post') selector for the votes

('.question-hyperlink') selector for the question title (the text of the question link)

('.tags') selector for all the tags of the question
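
One thing to watch for: in requests-html, `find(..., first=True)` returns `None` when the selector matches nothing, so calling `.text` on the result would raise an `AttributeError` (for example, if the page's class names change). A minimal sketch of a guard is shown below; `text_or_default` and `FakeElement` are hypothetical names used only for this illustration, with `FakeElement` standing in for a parsed element.

```python
def text_or_default(element, default=""):
    """Return element.text, or `default` when the selector matched nothing."""
    return element.text if element is not None else default


class FakeElement:
    """Stand-in for a parsed element; only the .text attribute is used here."""

    def __init__(self, text):
        self.text = text


# A matched element yields its text; a missing one yields the fallback.
print(text_or_default(FakeElement("How do I sort a dict?")))
print(text_or_default(None, default="0"))
```

In the scraper this could be applied as `text_or_default(question.find('.vote-count-post', first=True), default="0")`.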

## Starting Scraper and Saving data into CSV format

```
import pandas as pd


def scrape_stack(tag="python", page=1, pagesize="20", sortby="votes"):
    base_url = "https://stackoverflow.com/questions/tagged/"
    all_page_data = []
    # Iterate through each page
    for i in range(1, page + 1):
        url = f"{base_url}{tag}?tab={sortby}&page={i}&pagesize={pagesize}"
        all_page_data += extract_from_url(url)
    # Load the rows into a DataFrame and export them as <tag>.csv
    df = pd.DataFrame(all_page_data)
    df.to_csv(f"{tag}.csv", index=False)
```

To scrape Stack Overflow questions, we have four keyword arguments:
`scrape_stack(tag="python", page=1, pagesize="20", sortby="votes")`
where

tag: The field you want to search, such as c, javascript, or html.

page: How many pages you want to search.

pagesize: How many questions (threads) each page contains.

sortby: How to sort the questions: by votes, newest, active, or unanswered.

If arguments are passed, we build the URL from them; otherwise we use the default arguments.
Once the scraping is done, we load the data into a pandas DataFrame, which we can then easily export to a .csv file.
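
The URL construction described above can be sketched in isolation. This version is an assumption-laden variant, not the project's code: it uses `urllib.parse` from the standard library so that tags containing special characters (such as `c++` or `c#`) are escaped, and the names `build_url` and `BASE_URL` are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://stackoverflow.com/questions/tagged/"


def build_url(tag="python", page=1, pagesize="20", sortby="votes"):
    """Build the listing URL for one page of questions under a tag."""
    # quote() escapes characters such as '+' and '#' that appear in some tags
    path = quote(tag, safe="")
    query = urlencode({"tab": sortby, "page": page, "pagesize": pagesize})
    return f"{BASE_URL}{path}?{query}"


print(build_url())
print(build_url(tag="c++", page=2))
```

Calling `build_url()` with no arguments falls back to the same defaults as `scrape_stack`.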

## How to setup/run on local machine

First, clone the repo with the following command: `git clone https://github.com/chaudharypraveen98/StackOverflowScraper.git`

Then install all the required dependencies with the following command: `pip3 install -r requirements.txt`

Run the file in Python interactive mode. Now you are ready to go. To scrape Stack Overflow questions, type:
`scrape_stack(tag="python", page=1, pagesize="20", sortby="votes")`

## Deployment

For deployment, we use Repl or Heroku to make the app running on localhost available on the web.
## Web Preview / Output

Web preview on deployment

Placeholder text by Praveen Chaudhary · Images by Binary Beast

_**Note**_: Any changes are most welcome. By default, the output file is named after the tag you used for scraping, with the .csv extension.
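
If you want a different output filename, one option is to pass it in explicitly. The sketch below is only an illustration: it swaps in the standard library's `csv` module instead of pandas so that it runs without extra dependencies, and `save_rows`, `sample_rows`, and the file name are hypothetical.

```python
import csv
import os
import tempfile


def save_rows(rows, filename):
    """Write a list of dicts with "question", "votes" and "tags" keys to a CSV file."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "votes", "tags"])
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical row in the same shape the scraper produces
sample_rows = [{"question": "Example question title", "votes": "10", "tags": "python list"}]

# Written to the system temp directory here; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "python_top_questions.csv")
save_rows(sample_rows, out_path)
print(out_path)
```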