https://github.com/dakrone/itsy
A threaded web-spider written in Clojure
- Host: GitHub
- URL: https://github.com/dakrone/itsy
- Owner: dakrone
- Created: 2012-05-18T22:10:22.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2015-06-10T16:50:34.000Z (over 10 years ago)
- Last Synced: 2025-03-31T13:16:43.746Z (10 months ago)
- Language: Clojure
- Homepage:
- Size: 171 KB
- Stars: 182
- Watchers: 11
- Forks: 29
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
# Itsy
A threaded web spider, written in Clojure.
## Usage
In your project.clj:
```clojure
[itsy "0.1.1"]
```
In your project:
```clojure
(ns myns.foo
  (:require [itsy.core :refer :all]))

(defn my-handler [{:keys [url body]}]
  (println url "has a count of" (count body)))
(def c (crawl {;; initial URL to start crawling at (required)
               :url "http://aoeu.com"
               ;; handler to use for each page crawled (required)
               :handler my-handler
               ;; number of threads to use for crawling (optional,
               ;; defaults to 5)
               :workers 10
               ;; number of URLs to spider before crawling stops; note
               ;; that workers must still be stopped after crawling
               ;; stops. May be set to -1 to specify no limit.
               ;; (optional, defaults to 100)
               :url-limit 100
               ;; function used to extract URLs from a page; takes one
               ;; argument, the body of a page (optional, defaults to
               ;; itsy's extract-all)
               :url-extractor extract-all
               ;; http options for clj-http (optional, defaults to
               ;; {:socket-timeout 10000 :conn-timeout 10000 :insecure? true})
               :http-opts {}
               ;; whether to limit crawling to a single domain. If
               ;; false, the domain is not limited; if true, crawling is
               ;; limited to the same domain as the original :url; if
               ;; set to a string, crawling is limited to the hostname
               ;; of the given url
               :host-limit false
               ;; polite crawlers obey robots.txt directives;
               ;; by default this crawler is polite
               :polite? true}))
;; ... crawling ensues ...
(thread-status c)
;; returns a map of thread-id to Thread.State:
{33 #, 34 #, 35 #,
36 #, 37 #, 38 #,
39 #, 40 #, 41 #,
42 #}
(add-worker c)
;; adds an additional thread worker to the pool
(remove-worker c)
;; removes a worker from the pool
(stop-workers c)
;; stop-workers will return a collection of all threads it failed to
;; stop (it should be able to stop all threads unless something goes
;; very wrong)
```
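The `:url-extractor` option accepts any one-argument function of the page body that returns a collection of URL strings. A hedged sketch of a custom extractor that only follows links under a given prefix (the `links-under` name and the regex are illustrative, not part of itsy; itsy's own `extract-all` is more thorough):

```clojure
;; A sketch of a custom :url-extractor. It receives the raw page body
;; and must return a collection of URL strings. The naive href regex
;; here is a simplification for illustration only.
(defn links-under
  "Return a url-extractor fn that keeps only hrefs starting with `prefix`."
  [prefix]
  (fn [body]
    (->> (re-seq #"href=\"([^\"]+)\"" body)
         (map second)
         (filter #(.startsWith ^String % prefix)))))

;; hypothetical usage:
;; (crawl {:url "http://example.com"
;;         :handler my-handler
;;         :url-extractor (links-under "http://example.com/docs")})
```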
Upon completion, `c` will contain state that allows you to see what
happened:
```clojure
(clojure.pprint/pprint (:state c))
;; URLs still in the queue
{:url-queue #,
;; URLs that were seen/queued
:url-count #,
;; running worker threads (will contain thread objects while crawling)
:running-workers #,
;; canaries for running worker threads
:worker-canaries #,
;; a map of URL to times seen/extracted from the body of a page
:seen-urls
#}
```
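Given the keys above, a quick progress check could be sketched as follows. This assumes `:seen-urls` is an atom holding a map and `:url-queue` is a `java.util.concurrent` queue that responds to `.size` — assumptions inferred from the key names, not guaranteed by this README:

```clojure
;; A sketch of polling crawl progress, under the assumptions noted
;; above: :url-queue supports .size, :seen-urls is a deref-able map.
(defn progress [c]
  (let [state (:state c)]
    {:queued (.size (:url-queue state))
     :seen   (count @(:seen-urls state))}))
```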
## Features
- Multithreaded, with the ability to add and remove workers as needed
- No global state; run multiple crawlers with multiple threads at once
- Pre-written handlers for text files and ElasticSearch
- Skips URLs that have been seen before
- Domain limiting to crawl pages only belonging to a certain domain
## Included handlers
Itsy includes handlers for common actions, either to use directly or as
examples for writing your own.
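As shown in the Usage section, a handler is just a function of the crawled-page map (with at least `:url` and `:body`). A minimal custom handler sketch that appends each crawled URL to a file (the filename is illustrative):

```clojure
;; A minimal custom handler sketch: handlers receive a map with at
;; least :url and :body, and their return value is ignored.
(defn log-urls-handler [{:keys [url body]}]
  (spit "/tmp/crawled-urls.txt"
        (str url " (" (count body) " bytes)\n")
        :append true))

;; hypothetical usage:
;; (crawl {:url "http://example.com" :handler log-urls-handler})
```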
### Text file handler
The text file handler stores web pages in text files. It uses the
`html->str` method in `itsy.extract` to convert HTML documents to
plain text (which in turn uses [Tika](http://tika.apache.org) for the
conversion).
Usage:
```clojure
(ns bar
  (:require [itsy.core :refer :all]
            [itsy.handlers.textfiles :refer :all]))
;; The directory will be created when the handler is created if it
;; doesn't already exist
(def txt-handler (make-textfile-handler {:directory "/mnt/data" :extension ".txt"}))
(def c (crawl {:url "http://example.com" :handler txt-handler}))
;; then look in the /mnt/data directory
```
### [ElasticSearch](http://elasticsearch.org) handler
The ElasticSearch handler stores documents with the following mapping:
```clojure
{:id   {:type "string"
        :index "not_analyzed"
        :store "yes"}
 :url  {:type "string"
        :index "not_analyzed"
        :store "yes"}
 :body {:type "string"
        :store "yes"}}
```
Usage:
```clojure
(ns foo
  (:require [itsy.core :refer :all]
            [itsy.handlers.elasticsearch :refer :all]))
;; These are the default settings
(def index-settings {:settings
                     {:index
                      {:number_of_shards 2
                       :number_of_replicas 0}}})
;; If the ES index doesn't exist, make-es-handler will create it when called.
(def es-handler (make-es-handler {:es-url "http://localhost:9200/"
                                  :es-index "crawl"
                                  :es-type "page"
                                  :es-index-settings index-settings
                                  :http-opts {}}))
(def c (crawl {:url "http://example.com" :handler es-handler}))
;; ... crawling and indexing ensues ...
```
## Todo
- Relative URL extraction/crawling
- Always better URL extraction
- Handlers for common body actions
  - elasticsearch
  - text files
  - other?
- Helpers for dynamically raising/lowering thread count
- Timed crawling, have threads clean themselves up after a limit
- Have threads auto-clean when url-limit is hit
- Use Tika for HTML extraction
- Write tests
## License
Copyright © 2012 Lee Hinman
Distributed under the Eclipse Public License, the same as Clojure.