https://github.com/michelderu/ml-scrapy-pipeline
This Scrapy Pipeline makes it easy to ingest the scraped data directly into MarkLogic as XML or JSON.
- Host: GitHub
- URL: https://github.com/michelderu/ml-scrapy-pipeline
- Owner: michelderu
- License: gpl-3.0
- Created: 2016-03-14T13:41:42.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2019-11-22T10:25:45.000Z (over 5 years ago)
- Last Synced: 2025-01-20T08:49:25.714Z (5 months ago)
- Language: Python
- Size: 22.5 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MarkLogic Pipeline for Scrapy
This Scrapy pipeline makes it easy to ingest scraped content from websites directly into MarkLogic as XML or JSON. It also lets you define the content's collections and apply a transform on ingest.

## How it works
Uses the MarkLogic REST document endpoint to load scraped items.
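The pipeline code itself is not reproduced in this README, but the mechanism can be illustrated with a minimal sketch: a Scrapy item pipeline that reads the settings described below and PUTs each item to the `/v1/documents` endpoint. The class layout, the use of the `requests` library with digest authentication, and the JSON-only handling are assumptions for illustration, not the project's exact implementation.

```python
# pipelines.py -- a minimal sketch, not the repository's exact code.
# Assumes the `requests` library, MarkLogic digest authentication,
# and JSON content only (the XML path is omitted for brevity).
import json

import requests
from requests.auth import HTTPDigestAuth


class MarkLogicPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            endpoint=s.get('MARKLOGIC_DOC_ENDPOINT'),
            user=s.get('MARKLOGIC_USER'),
            password=s.get('MARKLOGIC_PASSWORD'),
            collections=s.get('MARKLOGIC_COLLECTIONS') or [],
        )

    def __init__(self, endpoint, user, password, collections):
        self.endpoint = endpoint
        self.auth = HTTPDigestAuth(user, password)
        self.collections = collections

    def process_item(self, item, spider):
        data = dict(item)
        # The item's 'uri' field becomes the document URI in MarkLogic.
        params = [('uri', data.pop('uri'))]
        # Fall back to the spider name when no collections are configured.
        for collection in (self.collections or [spider.name]):
            params.append(('collection', collection))
        # PUT the remaining fields to the /v1/documents endpoint as JSON.
        response = requests.put(
            self.endpoint,
            params=params,
            data=json.dumps(data),
            headers={'Content-Type': 'application/json'},
            auth=self.auth,
        )
        response.raise_for_status()
        return item
```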
## How to configure
Takes configuration from `settings.py` in the form of:

```python
ITEM_PIPELINES = {
    'recall.pipelines.MarkLogicPipeline': 300,
}

MARKLOGIC_DOC_ENDPOINT = 'http://localhost:8000/v1/documents'
MARKLOGIC_USER = 'admin'
MARKLOGIC_PASSWORD = ''
MARKLOGIC_CONTENT_TYPE = 'xml'
MARKLOGIC_COLLECTIONS = ['data', 'data/events']
MARKLOGIC_TRANSFORM = ''
```

In `items.py`:
Only the field `uri` is required, as it is used as the URI of the document in MarkLogic.
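For illustration, an item definition might look like the sketch below; apart from `uri`, the field names are hypothetical.

```python
# items.py -- illustrative only; all fields except 'uri' are hypothetical.
import scrapy


class EventItem(scrapy.Item):
    uri = scrapy.Field()    # required: used as the document URI in MarkLogic
    title = scrapy.Field()  # any other fields become part of the document content
    date = scrapy.Field()
```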
## Further configuration options
If JSON output is preferred, use `MARKLOGIC_CONTENT_TYPE = 'json'`.
If `MARKLOGIC_COLLECTIONS = ''`, the spider's name will be used as the collection.