https://github.com/michelderu/ml-scrapy-pipeline
This Scrapy Pipeline makes it easy to ingest the scraped data directly into MarkLogic as XML or JSON.
- Host: GitHub
- URL: https://github.com/michelderu/ml-scrapy-pipeline
- Owner: michelderu
- License: gpl-3.0
- Created: 2016-03-14T13:41:42.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2019-11-22T10:25:45.000Z (over 5 years ago)
- Last Synced: 2025-01-20T08:49:25.714Z (5 months ago)
- Language: Python
- Size: 22.5 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MarkLogic Pipeline for Scrapy
This Scrapy pipeline makes it easy to ingest scraped content from websites directly into MarkLogic as XML or JSON. It also lets you define the content's collections and apply a transform on ingest.

## How it works
Uses the MarkLogic REST document endpoint to load scraped items.
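The pipeline code itself is not reproduced in this README, but the mechanism can be illustrated with a minimal sketch: a Scrapy item pipeline that reads the settings described below and PUTs each item to the `/v1/documents` endpoint. The class layout, the use of the `requests` library with digest authentication, and the JSON-only handling are assumptions for illustration, not the project's exact implementation.

```python
# pipelines.py -- a minimal sketch, not the repository's exact code.
# Assumes the `requests` library, MarkLogic digest authentication,
# and JSON content only (the XML path is omitted for brevity).
import json

import requests
from requests.auth import HTTPDigestAuth


class MarkLogicPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            endpoint=s.get('MARKLOGIC_DOC_ENDPOINT'),
            user=s.get('MARKLOGIC_USER'),
            password=s.get('MARKLOGIC_PASSWORD'),
            collections=s.get('MARKLOGIC_COLLECTIONS') or [],
        )

    def __init__(self, endpoint, user, password, collections):
        self.endpoint = endpoint
        self.auth = HTTPDigestAuth(user, password)
        self.collections = collections

    def process_item(self, item, spider):
        data = dict(item)
        # The item's 'uri' field becomes the document URI in MarkLogic.
        params = [('uri', data.pop('uri'))]
        # Fall back to the spider name when no collections are configured.
        for collection in (self.collections or [spider.name]):
            params.append(('collection', collection))
        # PUT the remaining fields to the /v1/documents endpoint as JSON.
        response = requests.put(
            self.endpoint,
            params=params,
            data=json.dumps(data),
            headers={'Content-Type': 'application/json'},
            auth=self.auth,
        )
        response.raise_for_status()
        return item
```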
## How to configure
Takes configuration from `settings.py` in the form of:

```python
ITEM_PIPELINES = {
    'recall.pipelines.MarkLogicPipeline': 300,
}

MARKLOGIC_DOC_ENDPOINT = 'http://localhost:8000/v1/documents'
MARKLOGIC_USER = 'admin'
MARKLOGIC_PASSWORD = ''
MARKLOGIC_CONTENT_TYPE = 'xml'
MARKLOGIC_COLLECTIONS = ['data', 'data/events']
MARKLOGIC_TRANSFORM = ''
```

In `items.py`:
Only the field `uri` is required, as it is used as the URI of the document in MarkLogic.
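For illustration, an item definition might look like the sketch below; apart from `uri`, the field names are hypothetical.

```python
# items.py -- illustrative only; all fields except 'uri' are hypothetical.
import scrapy


class EventItem(scrapy.Item):
    uri = scrapy.Field()    # required: used as the document URI in MarkLogic
    title = scrapy.Field()  # any other fields become part of the document content
    date = scrapy.Field()
```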
## Further configuration options
If JSON output is preferred, use `MARKLOGIC_CONTENT_TYPE = 'json'`.
If `MARKLOGIC_COLLECTIONS = ''`, the spider's name will be used as the collection.