https://github.com/toyama0919/embulk-input-crawl
https://github.com/toyama0919/embulk-input-crawl
Last synced: 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/toyama0919/embulk-input-crawl
- Owner: toyama0919
- License: mit
- Created: 2016-10-25T12:12:25.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-12-21T11:59:18.000Z (over 8 years ago)
- Last Synced: 2025-01-18T17:47:47.782Z (4 months ago)
- Language: Ruby
- Size: 15.6 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# Crawl input plugin for Embulk
Crawl by Anemone input plugin for Embulk.
## Overview
* **Plugin type**: input
* **Resume supported**: yes
* **Cleanup supported**: yes
* **Guess supported**: no## Configuration
- **payloads**: payloads (array, default: `null`)
- **reject_url_regexp**: reject_url_regexp (string, default: `null`)
- **url_key_of_payload**: url_key_of_payload (string, default: `url`)
- **user_agent**: user_agent (string, default: `null`)
- **obey_robots_txt**: don't obey the robots exclusion protocol (bool, default: `false`)
- **skip_query_strings**: skip_query_strings (bool, default: `false`)
- **add_payload_to_record**: add_payload_to_record (bool, default: `false`)
- **accept_cookies**: accept_cookies (bool, default: `false`)
- **delay**: delay (integer, default: `null`)
- **depth_limit**: by default, don't limit the depth of the crawl (integer, default: `null`)
- **read_timeout**: HTTP read timeout in seconds (integer, default: `null`)
- **redirect_limit**: number of times HTTP redirects will be followed (integer, default: `null`)
- **cookies**: Hash of cookie name => value to send with HTTP requests (hash, default: `null`)## Example
```yaml
in:
type: crawl
payloads:
- {"id":3,"url":"http://www-stg1.jp"}
- {"id":4,"url":"http://www-stg2.jp"}
add_payload: true
user_agent: 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
obey_robots_txt: true
skip_query_strings: false
add_payload_to_record: true
accept_cookies: true
url_key_of_payload: url
depth_limit: 5
read_timeout: 10
delay: 4
redirect_limit: 3
cookies: { UUID: hoge }
crawl_url_regexp: '\/detail\/'
reject_url_regexp: '\.(jpg|JPG|png|PNG|jpeg|JPEG|gif|GIF|jpm|zip|ZIP|doc|docx|xls|xlsx|bmp|BMP|pdf|PDF|exe|EXE|dll|DLL)(\?|$)'```
## Output Schema
```
Column.new(0, "url", :string),
Column.new(1, "title", :string),
Column.new(2, "body", :string),
Column.new(3, "time", :timestamp),
Column.new(4, "code", :long),
Column.new(5, "response_time", :long),
Column.new(6, "payload", :json) if task['add_payload_to_record']```
## Build
```
$ rake
```