# Scrapyscript

Embed Scrapy jobs directly in your code
### What is Scrapyscript?

Scrapyscript is a Python library you can use to run [Scrapy](https://github.com/scrapy/scrapy) spiders directly from your code. Scrapy is a great framework to use for scraping projects, but sometimes you don't need the whole framework, and just want to run a small spider from a script or a [Celery](https://github.com/celery/celery) job. That's where Scrapyscript comes in.

With Scrapyscript, you can:

- wrap regular Scrapy [Spiders](https://docs.scrapy.org/en/latest/topics/spiders.html) in a `Job`
- load the `Job(s)` in a `Processor`
- call `processor.run()` to execute them

... returning all results when the last job completes.

Let's see an example.

```python
import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)
```

```json
[{ "title": "Welcome to Python.org" }]
```

See the [examples](examples/) directory for more, including a complete `Celery` example.
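To give a rough idea of what the Celery integration looks like, here is a minimal sketch of running a `Job` from a Celery task. The broker URL, task name, and `TitleSpider` are illustrative placeholders, not part of this project; see the bundled example for the complete, tested version.

```python
import scrapy
from celery import Celery
from scrapyscript import Job, Processor

# Hypothetical broker URL for illustration only
app = Celery("tasks", broker="redis://localhost:6379/0")

class TitleSpider(scrapy.spiders.Spider):
    name = "titlespider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}

@app.task
def crawl(url):
    # Build the Processor inside the task so each run gets its own reactor
    job = Job(TitleSpider, url=url)
    return Processor(settings=None).run(job)
```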

### Install

```bash
pip install scrapyscript
```

### Requirements

- Linux or macOS
- Python 3.8+
- Scrapy 2.5+

### API

#### Job (spider, \*args, \*\*kwargs)

A single request to call a spider, optionally passing in \*args or \*\*kwargs, which will be passed through to the spider constructor at runtime.

```python
# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')
```

#### Processor (settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a `scrapy.settings.Settings` object to configure the Scrapy runtime.

```python
settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)
```

#### Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list.
`jobs` can be a single instance of `Job`, or a list.

```python
results = processor.run(myjob)
```

or

```python
results = processor.run([myjob1, myjob2, ...])
```

#### A word about Spider outputs

As per the [scrapy docs](https://doc.scrapy.org/en/latest/topics/spiders.html), a `Spider`
must return an iterable of `Request` and/or `dict` or `Item` objects.

Requests will be consumed by Scrapy inside the `Job`. `dict` or `scrapy.Item` objects will be queued
and output together when all spiders are finished.

Due to the way billiard handles communication between processes, each `dict` or `Item` must be
pickle-able using pickle protocol 0. **It's generally best to output `dict` objects from your Spider.**
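For example, a spider can mix follow-up `Request` objects with plain `dict` results. A hedged sketch (the site URL and CSS selectors are assumptions chosen for illustration):

```python
import scrapy

class QuoteSpider(scrapy.spiders.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]  # example site, for illustration

    def parse(self, response):
        # dicts are queued and returned together by Processor.run()
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # Requests are consumed by Scrapy inside the Job and followed normally
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```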

### Contributing

Updates, additional features or bug fixes are always welcome.

#### Setup

- Install [Poetry](https://python-poetry.org/docs/#installation)
- `git clone [email protected]:jschnurr/scrapyscript.git`
- `poetry install`

#### Tests

- `make test` or `make tox`

### Version History

See [CHANGELOG.md](CHANGELOG.md)

### License

The MIT License (MIT). See the LICENCE file for details.