https://github.com/dakrone/itsy
A threaded web-spider written in Clojure
- Host: GitHub
- URL: https://github.com/dakrone/itsy
- Owner: dakrone
- Created: 2012-05-18T22:10:22.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2015-06-10T16:50:34.000Z (over 10 years ago)
- Last Synced: 2025-03-31T13:16:43.746Z (10 months ago)
- Language: Clojure
- Homepage:
- Size: 171 KB
- Stars: 182
- Watchers: 11
- Forks: 29
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
# Itsy
A threaded web spider, written in Clojure.
## Usage
In your project.clj:
```clojure
[itsy "0.1.1"]
```
In your project:
```clojure
(ns myns.foo
  (:require [itsy.core :refer :all]))

(defn my-handler [{:keys [url body]}]
  (println url "has a count of" (count body)))
(def c (crawl {;; initial URL to start crawling at (required)
               :url "http://aoeu.com"
               ;; handler to use for each page crawled (required)
               :handler my-handler
               ;; number of threads to use for crawling (optional,
               ;; defaults to 5)
               :workers 10
               ;; number of URLs to spider before crawling stops; note
               ;; that workers must still be stopped after crawling
               ;; stops. May be set to -1 to specify no limit.
               ;; (optional, defaults to 100)
               :url-limit 100
               ;; function used to extract URLs from a page; takes one
               ;; argument, the body of a page (optional, defaults to
               ;; itsy's extract-all)
               :url-extractor extract-all
               ;; http options for clj-http (optional, defaults to
               ;; {:socket-timeout 10000 :conn-timeout 10000 :insecure? true})
               :http-opts {}
               ;; whether to limit crawling to a single domain. If
               ;; false, the domain is not limited; if true, crawling is
               ;; limited to the same domain as the original :url; if
               ;; set to a string, crawling is limited to the hostname
               ;; of the given url
               :host-limit false
               ;; polite crawlers obey robots.txt directives;
               ;; by default this crawler is polite
               :polite? true}))
;; ... crawling ensues ...
(thread-status c)
;; returns a map of thread-id to Thread.State:
{33 #, 34 #, 35 #,
36 #, 37 #, 38 #,
39 #, 40 #, 41 #,
42 #}
(add-worker c)
;; adds an additional thread worker to the pool
(remove-worker c)
;; removes a worker from the pool
(stop-workers c)
;; stop-workers will return a collection of all threads it failed to
;; stop (it should be able to stop all threads unless something goes
;; very wrong)
```
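The `:url-extractor` option accepts any one-argument function of the page body that returns a collection of URL strings. A hedged sketch of a custom extractor that only follows links under a given prefix (the `links-under` name and the regex are illustrative, not part of itsy; itsy's own `extract-all` is more thorough):

```clojure
;; A sketch of a custom :url-extractor. It receives the raw page body
;; and must return a collection of URL strings. The naive href regex
;; here is a simplification for illustration only.
(defn links-under
  "Return a url-extractor fn that keeps only hrefs starting with `prefix`."
  [prefix]
  (fn [body]
    (->> (re-seq #"href=\"([^\"]+)\"" body)
         (map second)
         (filter #(.startsWith ^String % prefix)))))

;; hypothetical usage:
;; (crawl {:url "http://example.com"
;;         :handler my-handler
;;         :url-extractor (links-under "http://example.com/docs")})
```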
Upon completion, `c` will contain state that allows you to see what
happened:
```clojure
(clojure.pprint/pprint (:state c))
;; URLs still in the queue
{:url-queue #,
;; URLs that were seen/queued
:url-count #,
;; running worker threads (will contain thread objects while crawling)
:running-workers #,
;; canaries for running worker threads
:worker-canaries #,
;; a map of URL to times seen/extracted from the body of a page
:seen-urls
#}
```
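Given the keys above, a quick progress check could be sketched as follows. This assumes `:seen-urls` is an atom holding a map and `:url-queue` is a `java.util.concurrent` queue that responds to `.size` — assumptions inferred from the key names, not guaranteed by this README:

```clojure
;; A sketch of polling crawl progress, under the assumptions noted
;; above: :url-queue supports .size, :seen-urls is a deref-able map.
(defn progress [c]
  (let [state (:state c)]
    {:queued (.size (:url-queue state))
     :seen   (count @(:seen-urls state))}))
```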
## Features
- Multithreaded, with the ability to add and remove workers as needed
- No global state; run multiple crawlers with multiple threads at once
- Pre-written handlers for text files and ElasticSearch
- Skips URLs that have been seen before
- Domain limiting to crawl pages only belonging to a certain domain
## Included handlers
Itsy includes handlers for common actions, either to use directly or as
examples for writing your own.
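As shown in the Usage section, a handler is just a function of the crawled-page map (with at least `:url` and `:body`). A minimal custom handler sketch that appends each crawled URL to a file (the filename is illustrative):

```clojure
;; A minimal custom handler sketch: handlers receive a map with at
;; least :url and :body, and their return value is ignored.
(defn log-urls-handler [{:keys [url body]}]
  (spit "/tmp/crawled-urls.txt"
        (str url " (" (count body) " bytes)\n")
        :append true))

;; hypothetical usage:
;; (crawl {:url "http://example.com" :handler log-urls-handler})
```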
### Text file handler
The text file handler stores web pages in text files. It uses the
`html->str` method in `itsy.extract` to convert HTML documents to
plain text (which in turn uses [Tika](http://tika.apache.org) for the
conversion).
Usage:
```clojure
(ns bar
  (:require [itsy.core :refer :all]
            [itsy.handlers.textfiles :refer :all]))
;; The directory will be created when the handler is created if it
;; doesn't already exist
(def txt-handler (make-textfile-handler {:directory "/mnt/data" :extension ".txt"}))
(def c (crawl {:url "http://example.com" :handler txt-handler}))
;; then look in the /mnt/data directory
```
### [ElasticSearch](http://elasticsearch.org) handler
The ElasticSearch handler stores documents with the following mapping:
```clojure
{:id   {:type "string"
        :index "not_analyzed"
        :store "yes"}
 :url  {:type "string"
        :index "not_analyzed"
        :store "yes"}
 :body {:type "string"
        :store "yes"}}
```
Usage:
```clojure
(ns foo
  (:require [itsy.core :refer :all]
            [itsy.handlers.elasticsearch :refer :all]))
;; These are the default settings
(def index-settings {:settings
                     {:index
                      {:number_of_shards 2
                       :number_of_replicas 0}}})
;; If the ES index doesn't exist, make-es-handler will create it when called.
(def es-handler (make-es-handler {:es-url "http://localhost:9200/"
                                  :es-index "crawl"
                                  :es-type "page"
                                  :es-index-settings index-settings
                                  :http-opts {}}))
(def c (crawl {:url "http://example.com" :handler es-handler}))
;; ... crawling and indexing ensues ...
```
## Todo
- Relative URL extraction/crawling
- Always better URL extraction
- Handlers for common body actions
  - elasticsearch
  - text files
  - other?
- Helpers for dynamically raising/lowering thread count
- Timed crawling, have threads clean themselves up after a limit
- Have threads auto-clean when url-limit is hit
- Use Tika for HTML extraction
- Write tests
## License
Copyright © 2012 Lee Hinman
Distributed under the Eclipse Public License, the same as Clojure.