https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
- Owner: scrapy-plugins
- Created: 2021-11-26T15:29:35.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-05-18T16:36:27.000Z (about 2 years ago)
- Last Synced: 2025-04-07T08:03:13.880Z (about 1 year ago)
- Language: Python
- Size: 26.4 KB
- Stars: 3
- Watchers: 6
- Forks: 6
- Open Issues: 3
# Azure Exporter for Scrapy
[Scrapy feed export storage backend](https://doc.scrapy.org/en/latest/topics/feed-exports.html#storage-backends) for [Azure Storage](https://docs.microsoft.com/en-us/azure/storage/).
## Requirements
- Python 3.8+
## Installation
```bash
pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
```
## Usage
* Add this storage backend to the [FEED_STORAGES](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_STORAGES) Scrapy setting. For example:
```python
# settings.py
FEED_STORAGES = {'azure': 'scrapy_azure_exporter.AzureFeedStorage'}
```
* Configure [authentication](https://docs.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python) via any of the following settings:
  - `AZURE_CONNECTION_STRING`
  - `AZURE_ACCOUNT_URL_WITH_SAS_TOKEN`
  - `AZURE_ACCOUNT_URL` and `AZURE_ACCOUNT_KEY`. If you use this method, set both.

  For example:
```python
AZURE_ACCOUNT_URL = "https://<account_name>.blob.core.windows.net/"
AZURE_ACCOUNT_KEY = "Account key for the Azure account"
```
* In the [FEEDS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds) Scrapy setting, set the Azure URI that the feed should be exported to:
```python
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/<blob_name>": {
        "format": "json"
    }
}
```
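The authentication settings listed above are often kept out of version control. The following is a minimal sketch of loading them from environment variables, assuming the environment variables carry the same names as the settings. The helper `azure_auth_settings` and its lookup order are hypothetical, introduced only for illustration; they are not part of this plugin.

```python
import os


def azure_auth_settings(env=None):
    """Pick one complete authentication method from a mapping of variables.

    The lookup order is a choice made for this sketch, not plugin behaviour.
    """
    env = os.environ if env is None else env
    # Single-value methods: a connection string or an account URL with a SAS token.
    for name in ("AZURE_CONNECTION_STRING", "AZURE_ACCOUNT_URL_WITH_SAS_TOKEN"):
        if env.get(name):
            return {name: env[name]}
    # Account URL and account key are only valid as a pair.
    url, key = env.get("AZURE_ACCOUNT_URL"), env.get("AZURE_ACCOUNT_KEY")
    if url and key:
        return {"AZURE_ACCOUNT_URL": url, "AZURE_ACCOUNT_KEY": key}
    raise ValueError("no complete Azure authentication method configured")


# In settings.py, the result could be merged into the module namespace:
# globals().update(azure_auth_settings())
```

Failing early when only one half of the URL and key pair is present reflects the requirement that both be specified together.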
## Write mode and blob type
The `overwrite`
[feed option](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-options)
is `False` by default when using this feed export storage backend.
An extra feed option is also provided, `blob_type`, which can be `"BlockBlob"`
(default) or `"AppendBlob"`. See
[Understanding blob types](https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs).
The feed options `overwrite` and `blob_type` can be combined to set the write
mode of the feed export:
- `overwrite=False` and `blob_type="BlockBlob"` create the blob if it does not
exist, and fail if it exists.
- `overwrite=False` and `blob_type="AppendBlob"` append to the blob if it
exists and it is an `AppendBlob`, and create it otherwise.
- `overwrite=True` overwrites the blob, even if it exists. The `blob_type` must
match that of the target blob.
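
The combinations above can be summarised in code. This sketch restates the three cases as a small lookup function and shows both options set in a `FEEDS` entry. The function `write_mode`, its return labels, and the account, container, and blob names in the URI are hypothetical; the URI shape (account host, then container, then blob) is an assumption.

```python
def write_mode(overwrite=False, blob_type="BlockBlob"):
    """Describe the write mode implied by the two feed options."""
    if blob_type not in ("BlockBlob", "AppendBlob"):
        raise ValueError(f"unsupported blob_type: {blob_type!r}")
    if overwrite:
        # The blob is replaced; blob_type must match the existing blob.
        return "overwrite"
    if blob_type == "AppendBlob":
        # Appends to an existing AppendBlob, creates it otherwise.
        return "append-or-create"
    # BlockBlob without overwrite: create, and fail if the blob exists.
    return "create-or-fail"


# Hypothetical feed that accumulates items across runs in an append blob.
FEEDS = {
    "azure://myaccount.blob.core.windows.net/mycontainer/items.jsonl": {
        "format": "jsonlines",
        "overwrite": False,
        "blob_type": "AppendBlob",
    }
}
```

A line-oriented format such as `jsonlines` is used here because appending to a single JSON array across runs would not produce a valid JSON document.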
## Media pipeline usage
The Azure files pipeline lets [Scrapy media pipelines](https://docs.scrapy.org/en/latest/topics/media-pipeline.html) store downloaded files in Azure Blob Storage.
To enable it, add the pipeline to the `ITEM_PIPELINES` setting:
```python
ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}
```
## Azurite usage
You can use [Azurite](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio) as a local emulator for Azure Blob Storage
to test your application without an Azure account. Register the storage backend under the `azurite` scheme in `FEED_STORAGES`, either alongside or instead of `azure`:
```python
# settings.py
FEED_STORAGES = {'azurite': 'scrapy_azure_exporter.AzureFeedStorage'}
```
And add the Azurite URI to the `FEEDS` setting:
```python
FEEDS = {
    "azurite://<host>:<port>/<account_name>/<container_name>/[<blob_name>]": {
        # ...
    }
}
```
Finally, run your Scrapy project as you normally would with `FilesPipeline` or `ImagesPipeline`.
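
As an illustration of the Azurite URI, here is a minimal sketch that builds one from its parts. It assumes the URI has the shape host, port, account, container, and optional blob name, in that order. The helper `azurite_feed_uri` is hypothetical; its defaults are Azurite's documented default Blob endpoint (`127.0.0.1:10000`) and default development account (`devstoreaccount1`).

```python
def azurite_feed_uri(container, blob="", host="127.0.0.1", port=10000,
                     account="devstoreaccount1"):
    """Build an azurite:// feed URI (assumed shape, see the note above)."""
    if not container:
        raise ValueError("a container name is required")
    return f"azurite://{host}:{port}/{account}/{container}/{blob}"


# Hypothetical container and blob names, for local testing only.
FEEDS = {
    azurite_feed_uri("feeds", "items.json"): {"format": "json"},
}
```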