Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chaudharypraveen98/stackoverflowscraper
This project scrapes Stack Overflow questions based on the field (tag), the number of pages, and the page size. It writes the results to a file named after the tag; the filename is easy to change.
pandas requests requests-html scraping scripting
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/chaudharypraveen98/stackoverflowscraper
- Owner: chaudharypraveen98
- Created: 2020-07-27T12:53:39.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-08T16:33:49.000Z (over 3 years ago)
- Last Synced: 2023-03-06T20:48:52.938Z (almost 2 years ago)
- Topics: pandas, requests, requests-html, scraping, scripting
- Language: Python
- Homepage:
- Size: 210 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
README
## **Stack Overflow Question Scraper**
This scraper scrapes questions from Stack Overflow based on the sort order (votes, newest, active), the number of questions per page, the number of pages to search, and the field (topic) in which you want to search.

##### Level: Beginner
Topics -> requests, requests-html, pandas, scraping, scripting, StackOverflowScraper
Preview Link -> StackOverflowScraper
Source Code Link -> GitHub
## What are we going to do?
- First, we make a request to fetch the HTML page using the requests library.
- If the response is OK, we feed it into the HTML parser from requests-html.
- We then use CSS selectors to get the required fields: question title, tags, votes, and answered status (see the sketch below).
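In miniature, the whole pipeline looks like this. This is only a sketch; the full code is walked through step by step below:

```
import requests
from requests_html import HTML

r = requests.get("https://stackoverflow.com/questions/tagged/python")
if r.ok:                                    # only parse successful responses
    doc = HTML(html=r.text)                 # feed the page into requests-html's parser
    card = doc.find(".question-summary", first=True)  # select the first question card
    print(card.text if card else "no questions found")
```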
## Libraries Required
- requests-html
- pandas
- requests
## Prerequisites

**What are selectors/locators?**

A CSS selector is a combination of an element selector and a value which identifies a web element within a web page. The choice of locator depends largely on your application under test.
**Id**
An element's id in XPath is defined using `[@id='example']` and in CSS using `#`. IDs must be unique within the DOM.
Examples:
```
XPath: //div[@id='example']
CSS: #example
```

**Element Type**
The previous example showed //div in the XPath. That is the element type, which could be input for a text box or button, img for an image, or a for a link.
```
XPath: //input
CSS: input
```

**Direct Child**
HTML pages are structured like XML, with children nested inside of parents. If you can locate, for example, the first link within a div, you can construct a string to reach it. A direct child in XPath is defined using "/", while in CSS it's defined using ">".
Examples:
```
XPath: //div/a
CSS: div > a
```

**Child or Sub-Child**
Writing nested divs can get tiring - and result in code that is brittle. Sometimes you expect the code to change, or want to skip layers. If an element could be inside another or one of its children, it's defined in XPath using "//" and in CSS just by a whitespace.
Examples:
```
XPath: //div//a
CSS: div a
```

**Class**
For classes, things are pretty similar in XPath: `[@class='example']`, while in CSS it's just `.`
Examples:
```
XPath: //div[@class='example']
CSS: .example
```
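To make the four forms above concrete, here is a minimal sketch using requests-html (already a dependency of this project); the HTML snippet is invented purely for illustration:

```
from requests_html import HTML

# A tiny, made-up document exercising each selector form shown above.
snippet = """
<div id="example" class="example">
    <a href="/direct">direct child</a>
    <span><a href="/nested">nested child</a></span>
</div>
"""
doc = HTML(html=snippet)

print(doc.find("#example", first=True).attrs["id"])          # id selector
print(doc.find(".example", first=True).attrs["class"])       # class selector
print(doc.find("div > a", first=True).text)                  # direct child
print([a.text for a in doc.find("div a")])                   # child or sub-child
print(doc.xpath("//div[@id='example']/a", first=True).text)  # XPath equivalent
```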
## Understanding the Code
## Requesting the HTML webpage
We will use the requests library to fetch the HTML code:
```
import requests
from requests_html import HTML
import pandas as pd

def extract_from_url(url):
    r = requests.get(url)
    # treat anything outside 2xx as a failure (range end is exclusive, so 300)
    if r.status_code not in range(200, 300):
        print("error")
        return []  # an empty list keeps the caller's list concatenation safe
```
`r.status_code` holds the HTTP response status code. If it indicates success (a 2xx code), we proceed to parsing; otherwise we return an empty list so the caller can still safely concatenate the result.
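As an aside, requests ships its own helpers for this check; two equivalent guards, sketched:

```
# Equivalent guard using requests' built-in helper:
if not r.ok:            # r.ok is False for any 4xx/5xx status
    return []

# Or raise instead of returning:
r.raise_for_status()    # raises requests.HTTPError on 4xx/5xx
```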
## Parsing the HTML code using HTML from requests-html
```
# r.text is the raw HTML; HTML() parses it into a queryable document
html_text = r.text
formatted_html = HTML(html=html_text)
```
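You can sanity-check the parser without any network call by handing HTML() a static string (the snippet here is invented):

```
from requests_html import HTML

# Parse a static snippet; no HTTP request needed.
formatted_html = HTML(html="<html><body><h1>Hello</h1></body></html>")
print(formatted_html.find("h1", first=True).text)  # Hello
```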
## Scraping using the parsed HTML code
```
# (continuing inside extract_from_url, on a successful response)
data_summary = formatted_html.find(".question-summary")
final_data = []
for question in data_summary:
    # first=True returns the first matching element rather than a list
    question_votes = question.find('.vote-count-post', first=True).text
    question_data = question.find('.question-hyperlink', first=True).text
    question_tags = question.find('.tags', first=True).text
    data = {
        "question": question_data,
        "votes": question_votes,
        "tags": question_tags,
    }
    final_data.append(data)
return final_data
```
First, we find the question containers that hold all the information, using the class CSS selector (.question-summary). Then we loop through each question container and extract the remaining details with CSS selectors (one sample row is sketched below):
- `.vote-count-post` for the votes
- `.question-hyperlink` for the question title and link
- `.tags` for all of the question's tags
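Each loop iteration therefore produces one dictionary per question; the values below are hypothetical, shown only to illustrate the shape of final_data:

```
# One hypothetical entry in final_data:
row = {
    "question": "How do I merge two dictionaries in a single expression?",
    "votes": "7500",
    "tags": "python dictionary merge",
}
```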
## Starting the Scraper and Saving Data in CSV Format
```
def scrape_stack(tag="python", page=1, pagesize="20", sortby="votes"):
    base_url = "https://stackoverflow.com/questions/tagged/"
    all_page_data = []
    # iterate through each page of results
    for i in range(1, page + 1):
        url = f"{base_url}{tag}?tab={sortby}&page={i}&pagesize={pagesize}"
        all_page_data += extract_from_url(url)
    # collect every row into a DataFrame and export it as <tag>.csv
    df = pd.DataFrame(all_page_data)
    df.to_csv(f"{tag}.csv", index=False)
```
To scrape Stack Overflow questions, we have four keyword arguments:

`scrape_stack(tag="python", page=1, pagesize="20", sortby="votes")`

where
- tag: the field you want to search, e.g. c, javascript, html
- page: how many pages you want to scrape
- pagesize: how many questions each page contains
- sortby: the question sort order: votes, newest, active, or unanswered

If arguments are passed, the URL is built from them; otherwise the defaults are used.
Once the scraping is done, we load the data into a pandas DataFrame, from which we can easily export it to a .csv file.
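As a quick sanity check, the exported file can be read straight back with pandas (assuming the default tag, which produces python.csv):

```
import pandas as pd

# Load the file written by scrape_stack(tag="python", ...)
df = pd.read_csv("python.csv")
print(df.columns.tolist())  # expected: ['question', 'votes', 'tags']
print(df.head())
```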
## How to setup/run on local machine
- First, clone the repo: `git clone https://github.com/chaudharypraveen98/StackOverflowScraper.git`
- Then install all the required dependencies: `pip3 install -r requirements.txt`
- Run the file in Python interactive mode. Now you are ready to go. To scrape Stack Overflow questions, type:
`scrape_stack(tag="python", page=1, pagesize="20", sortby="votes")`
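For example, a session might look like this (the script filename is a placeholder; use the repo's actual file):

```
# Launched with: python3 -i <script_name>.py  (placeholder filename)
scrape_stack(tag="html", page=2, pagesize="50", sortby="newest")
# Writes html.csv (2 pages x 50 questions each) to the current directory
```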
## Deployment
For deployment, we use Repl.it or Heroku to take the local script to the web.
## Web Preview / Output
Web preview on deployment
Placeholder text by Praveen Chaudhary · Images by Binary Beast
_**Note**_: Any changes are most welcome. By default, the output file is a .csv named after the tag you used for scraping.