https://github.com/reagentx/scraper-platform

A simple multiprocessed Python scraper platform powered by RegEx and requests.
https://github.com/reagentx/scraper-platform

multithreading regex scraper

Last synced: 12 months ago
JSON representation

A simple multiprocessed Python scraper platform powered by RegEx and requests.

Host: GitHub
URL: https://github.com/reagentx/scraper-platform
Owner: ReagentX
License: gpl-3.0
Created: 2018-05-06T21:21:26.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2018-09-07T05:26:27.000Z (almost 8 years ago)
Last Synced: 2025-03-26T14:53:04.475Z (over 1 year ago)
Topics: multithreading, regex, scraper
Language: Python
Homepage:
Size: 23.4 KB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # scraper-platform

A simple multiprocessed Python scraper platform powered by RegEx and `requests`.

## Classes

### HTMLCache

This class contains the methods to construct a cache of HTML information for scraping.

This class is constructed with a `timedelta` for the expiration length of the cache database and a boolean for whether the cache will be used or bypassed.

### Scraper

This class contains the methods to apply a set of `Rules()` to a list of URLs in multiple threads.

This class is constructed with an integer that sets the maximum number of threads.

### Rules

This class contains the code to parse URLs. It is initialized with an `HTMLCache` object. Rules classes must be kept in the `scraper_rules` module.

This class is constructed with an HTMLCache which the rules will use to get the HTML of webpages.

The class methods are rules that are designed to take a single URL and apply some transformation to it, then return a list of data.

## Examples

To install the package, download/clone it, `cd` to the directory, and run `python setup.py develop`.

This allows for fast testing of RegEx rules across a set of URLs. To begin, we need to import a few things:

    from scraper_platform import scraper, cache

    from datetime import timedelta

    from scraper_rules import sample_rules

Here, we import the `scraper` and `cache` modules from the `scraper_platform` module as well as the `sample_rules` module from the `scraper_rules` module. We also import the `timedelta` datatype so we can construct an `HTMLCache` object.

Next, we need to create the objects we imported:

    s = scraper.Scraper(2)

    c = cache.HTMLCache(timedelta(days=2), True)

Here, we create a Scraper object that will use at maximum two threads and a Cache object where the HTML cached will have an expiration of 2 days from access. Changing the boolean argument to false will bypass the cache altogether.

If we print `s` and `c` we should get:

    

    

Next, we can create our sample rules class and define a list of URLs to analyze. For this example I use Google's store page and some regex to capture all of the http/https links:

    r = sample_rules.Rules(c)

    urls = [

    'https://store.google.com/category/phones',

    'https://store.google.com/category/home_entertainment',

    'https://store.google.com/category/laptops_tablets',

    'https://store.google.com/category/virtual_reality'

    ]

To get the data, we then run the scraper by passing the list of URLs and the function we want to apply to them:

    data = s.scrape(urls, r.get_links)

This maps the URLs to the rule method `get_links` and returns a list of the data we asked for in a variable called `data`.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/reagentx/scraper-platform

Awesome Lists containing this project

README