https://github.com/kadnan/scrapegen
A simple python tool that generates a requests/bs4 based web scraper
- Host: GitHub
- URL: https://github.com/kadnan/scrapegen
- Owner: kadnan
- License: mit
- Created: 2019-11-08T19:52:26.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-06-08T07:28:14.000Z (over 3 years ago)
- Last Synced: 2025-04-03T10:51:20.414Z (6 months ago)
- Topics: beautiful, bs4, python, requests, scraper
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 26
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ScrapeGen
_ScrapeGen_ is a simple tool written in Python that generates the code of a web scraper based on rules given in a file.
## Why was it created?
No particular reason other than that I was bored and had come across Simone Giertz's TED talk [Why you should make useless things](https://www.ted.com/talks/simone_giertz_why_you_should_make_useless_things?language=en), so I thought I would create something useless as well.

## How does it work?
This tool generates a parser that relies on Python `requests` and `BeautifulSoup`. If I get bored again I might add other libraries too, who knows?

Anyway, you first create a YAML file that contains all the information about the parser. A typical YAML file that generates a parser looks like this:
```
script_name: olx_indi_test.py # Name of the scraper file
main: # Code under _main_ function
  entry_url: https://www.olx.com.pk/item/1-kanal-brand-bew-banglow-available-for-sale-in-wapda-town-iid-1009971253 # URL to be parsed
  entry_function: parse # The function that uses the requests library to fetch the data and call Bs4
rules: # Each selector will be a separate rule that itself will be a separate method
  - name: price
    type: single # Valid types: array, single
    selector: '#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W._2r-Wm > div > section > span._2xKfz'
    extract: # Either an attribute value or just text
      what: text
  - name: seller
    type: single # Valid types: array, single
    selector: '#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W.YpyR- > div > div > div._1oSdP > div > a > div'
    extract:
      what: text
```
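To give a sense of what such a generator involves, here is a minimal sketch that loads a rules file with PyYAML and emits one `get_<name>` function per rule. This is only an illustration under assumptions, not ScrapeGen's actual implementation; the code template, function names, and CLI handling are my own.

```
# Illustrative sketch of a YAML-driven scraper generator.
# NOT ScrapeGen's actual source; field names follow the YAML above,
# but the template and CLI handling are assumptions.
import sys

import yaml

FUNC_TEMPLATE = '''
def get_{name}(soup_object):
    # One function per rule: run the CSS selector and extract the text
    section = soup_object.select("{selector}")
    return section[0].text.strip() if section else None
'''


def generate(yaml_path):
    with open(yaml_path) as f:
        spec = yaml.safe_load(f)
    parts = ["import requests", "from bs4 import BeautifulSoup", ""]
    for rule in spec.get("rules", []):
        parts.append(FUNC_TEMPLATE.format(name=rule["name"], selector=rule["selector"]))
    with open(spec["script_name"], "w") as out:
        out.write("\n".join(parts))


if __name__ == "__main__":
    generate(sys.argv[1])
```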
Assuming you have installed all the required libs mentioned in `requirements.txt`, all you have to do is run the command:
`python parse_gen.py indi.yaml`
where `indi.yaml` is the file containing the content given above. If it runs successfully, it generates a file named `olx_indi_test.py` that looks like this:
```
import requests
from bs4 import BeautifulSoup


def get_price(soup_object):
    _price = None
    price_section = soup_object.select(
        "#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W._2r-Wm > div > section > span._2xKfz")
    if len(price_section) > 0:
        _price = price_section[0].text.strip()
    return _price


def get_seller(soup_object):
    _seller = None
    seller_section = soup_object.select(
        "#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W.YpyR- > div > div > div._1oSdP > div > a > div")
    if len(seller_section) > 0:
        _seller = seller_section[0].text.strip()
    return _seller


def parse(_url):
    r = requests.get(_url)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        price = get_price(soup)
        seller = get_seller(soup)


if __name__ == '__main__':
    main_url = "https://www.olx.com.pk/item/1-kanal-brand-bew-banglow-available-for-sale-in-wapda-town-iid-1009971253"
    parse(main_url)
```
The generated code can easily be modified to suit your needs.
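For instance, the generated `parse()` fetches the price and seller but never returns them. A small tweak to the tail of `olx_indi_test.py` makes the scraped fields usable; note that the `return` and `print` below are my additions, not part of ScrapeGen's output:

```
def parse(_url):
    r = requests.get(_url)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        # Return the extracted fields instead of discarding them
        return {"price": get_price(soup), "seller": get_seller(soup)}
    return None


if __name__ == '__main__':
    main_url = "https://www.olx.com.pk/item/1-kanal-brand-bew-banglow-available-for-sale-in-wapda-town-iid-1009971253"
    print(parse(main_url))
```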