https://github.com/kadnan/scrapegen
A simple python tool that generates a requests/bs4 based web scraper
- Host: GitHub
- URL: https://github.com/kadnan/scrapegen
- Owner: kadnan
- License: mit
- Created: 2019-11-08T19:52:26.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-06-08T07:28:14.000Z (over 3 years ago)
- Last Synced: 2025-04-03T10:51:20.414Z (6 months ago)
- Topics: beautiful, bs4, python, requests, scraper
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 26
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ScrapeGen
_ScrapeGen_ is a simple tool written in Python that generates the code of a web scraper based on rules given in a file.
## Why was it created?
No particular reason other than that I was bored and had come across Simone Giertz's TED talk [Why you should make useless things](https://www.ted.com/talks/simone_giertz_why_you_should_make_useless_things?language=en), so I thought I would create something useless as well.

## How does it work?
This tool generates a parser that relies on Python `requests` and `BeautifulSoup`. If I get bored again I might add other libraries too, who knows?

Anyway, you first create a YAML file that contains all the information about the parser. A typical YAML file that generates a parser looks like this:
```
script_name: olx_indi_test.py # Name of the scraper file
main: # Code under _main_ function
  entry_url: https://www.olx.com.pk/item/1-kanal-brand-bew-banglow-available-for-sale-in-wapda-town-iid-1009971253 # URL to be parsed
  entry_function: parse # The function that uses the requests library to fetch the data and call Bs4
rules: # Each selector will be a separate rule that itself will be a separate method
  - name: price
    type: single # Valid types: array, single
    selector: '#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W._2r-Wm > div > section > span._2xKfz'
    extract: # Either an attribute value or just text
      what: text
  - name: seller
    type: single # Valid types: array, single
    selector: '#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W.YpyR- > div > div > div._1oSdP > div > a > div'
    extract:
      what: text
```
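To give a sense of what such a generator involves, here is a minimal sketch that loads a rules file with PyYAML and emits one `get_<name>` function per rule. This is only an illustration under assumptions, not ScrapeGen's actual implementation; the code template, function names, and CLI handling are my own.

```
# Illustrative sketch of a YAML-driven scraper generator.
# NOT ScrapeGen's actual source; field names follow the YAML above,
# but the template and CLI handling are assumptions.
import sys

import yaml

FUNC_TEMPLATE = '''
def get_{name}(soup_object):
    # One function per rule: run the CSS selector and extract the text
    section = soup_object.select("{selector}")
    return section[0].text.strip() if section else None
'''


def generate(yaml_path):
    with open(yaml_path) as f:
        spec = yaml.safe_load(f)
    parts = ["import requests", "from bs4 import BeautifulSoup", ""]
    for rule in spec.get("rules", []):
        parts.append(FUNC_TEMPLATE.format(name=rule["name"], selector=rule["selector"]))
    with open(spec["script_name"], "w") as out:
        out.write("\n".join(parts))


if __name__ == "__main__":
    generate(sys.argv[1])
```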
Assuming you have installed all the required libs mentioned in `requirements.txt`, all you have to do is run the command:
`python parse_gen.py indi.yaml`
where `indi.yaml` is the file containing the content given above. If it runs successfully, it generates a file named `olx_indi_test.py` that looks like this:
```
import requests
from bs4 import BeautifulSoup


def get_price(soup_object):
    _price = None
    price_section = soup_object.select(
        "#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W._2r-Wm > div > section > span._2xKfz")
    if len(price_section) > 0:
        _price = price_section[0].text.strip()
    return _price


def get_seller(soup_object):
    _seller = None
    seller_section = soup_object.select(
        "#container > main > div > div > div.rui-2SwH7.rui-m4D6f.rui-1nZcN.rui-3CPXI.rui-3E1c2.rui-1JF_2 > div.rui-2ns2W.YpyR- > div > div > div._1oSdP > div > a > div")
    if len(seller_section) > 0:
        _seller = seller_section[0].text.strip()
    return _seller


def parse(_url):
    r = requests.get(_url)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        price = get_price(soup)
        seller = get_seller(soup)


if __name__ == '__main__':
    main_url = "https://www.olx.com.pk/item/1-kanal-brand-bew-banglow-available-for-sale-in-wapda-town-iid-1009971253"
    parse(main_url)
```
The generated code can easily be modified to suit your needs.
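For instance, the generated `parse()` fetches the price and seller but never returns them. A small tweak to the tail of `olx_indi_test.py` makes the scraped fields usable; note that the `return` and `print` below are my additions, not part of ScrapeGen's output:

```
def parse(_url):
    r = requests.get(_url)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        # Return the extracted fields instead of discarding them
        return {"price": get_price(soup), "seller": get_seller(soup)}
    return None


if __name__ == '__main__':
    main_url = "https://www.olx.com.pk/item/1-kanal-brand-bew-banglow-available-for-sale-in-wapda-town-iid-1009971253"
    print(parse(main_url))
```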