{"id":13722623,"url":"https://github.com/dakrone/itsy","last_synced_at":"2025-04-07T15:08:30.111Z","repository":{"id":3330384,"uuid":"4374033","full_name":"dakrone/itsy","owner":"dakrone","description":"A threaded web-spider written in Clojure","archived":false,"fork":false,"pushed_at":"2015-06-10T16:50:34.000Z","size":175,"stargazers_count":182,"open_issues_count":5,"forks_count":29,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-03-31T13:16:43.746Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dakrone.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-05-18T22:10:22.000Z","updated_at":"2025-02-07T17:53:48.000Z","dependencies_parsed_at":"2022-08-26T22:50:44.120Z","dependency_job_id":null,"html_url":"https://github.com/dakrone/itsy","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fitsy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fitsy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fitsy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fitsy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dakrone","download_url":"https://codeload.github.com/dakrone/itsy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247675597,"owners_count":20977376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T01:01:30.977Z","updated_at":"2025-04-07T15:08:30.090Z","avatar_url":"https://github.com/dakrone.png","language":"Clojure","funding_links":[],"categories":["Clojure"],"sub_categories":[],"readme":"# Itsy\n\nA threaded web spider, written in Clojure.\n\n## Usage\n\nIn your project.clj:\n\n```clojure\n[itsy \"0.1.1\"]\n```\n\nIn your project:\n\n```clojure\n(ns myns.foo\n  (:require [itsy.core :refer :all]))\n\n(defn my-handler [{:keys [url body]}]\n  (println url \"has a count of\" (count body)))\n\n(def c (crawl {;; initial URL to start crawling at (required)\n               :url \"http://aoeu.com\"\n               ;; handler to use for each page crawled (required)\n               :handler my-handler\n               ;; number of threads to use for crawling, (optional,\n               ;; defaults to 5)\n               :workers 10\n               ;; number of urls to spider before crawling stops, note\n               ;; that workers must still be stopped after crawling\n               ;; stops. May be set to -1 to specify no limit.\n               ;; (optional, defaults to 100)\n               :url-limit 100\n               ;; function to use to extract urls from a page, a\n               ;; function that takes one argument, the body of a page.\n               ;; (optional, defaults to itsy's extract-all)\n               :url-extractor extract-all\n               ;; http options for clj-http, (optional, defaults to\n               ;; {:socket-timeout 10000 :conn-timeout 10000 :insecure? true})\n               :http-opts {}\n               ;; specifies whether to limit crawling to a single\n               ;; domain. If false, does not limit domain, if true,\n               ;; limits to the same domain as the original :url, if set\n               ;; to a string, limits crawling to the hostname of the\n               ;; given url\n               :host-limit false\n               ;; polite crawlers obey robots.txt directives\n               ;; by default this crawler is polite\n               :polite? true}))\n\n;; ... crawling ensues ...\n\n(thread-status c)\n;; returns a map of thread-id to Thread.State:\n{33 #\u003cState RUNNABLE\u003e, 34 #\u003cState RUNNABLE\u003e, 35 #\u003cState RUNNABLE\u003e,\n 36 #\u003cState RUNNABLE\u003e, 37 #\u003cState RUNNABLE\u003e, 38 #\u003cState RUNNABLE\u003e,\n 39 #\u003cState RUNNABLE\u003e, 40 #\u003cState RUNNABLE\u003e, 41 #\u003cState RUNNABLE\u003e,\n 42 #\u003cState RUNNABLE\u003e}\n\n(add-worker c)\n;; adds an additional thread worker to the pool\n\n(remove-worker c)\n;; removes a worker from the pool\n\n(stop-workers c)\n;; stop-workers will return a collection of all threads it failed to\n;; stop (it should be able to stop all threads unless something goes\n;; very wrong)\n```\n\nUpon completion, `c` will contain state that allows you to see what\nhappened:\n\n```clojure\n(clojure.pprint/pprint (:state c))\n;; URLs still in the queue\n{:url-queue #\u003cLinkedBlockingQueue []\u003e,\n;; URLs that were seen/queued\n :url-count #\u003cAtom@67d6b87e: 2\u003e,\n ;; running worker threads (will contain thread objects while crawling)\n :running-workers #\u003cRef@decdc7b: []\u003e,\n ;; canaries for running worker threads\n :worker-canaries #\u003cRef@397f1661: {}\u003e,\n ;; a map of URL to times seen/extracted from the body of a page\n :seen-urls\n #\u003cAtom@469657c4:\n   {\"http://www.phpbb.com\" 1,\n    \"http://pagead2.googlesyndication.com/pagead/show_ads.js\" 2,\n    \"http://www.subBlue.com/\" 1,\n    \"http://www.phpbb.com/\" 1,\n    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\" 1,\n    \"http://www.w3.org/1999/xhtml\" 1,\n    \"http://forums.asdf.com\" 1,\n    \"http://www.google.com/images/poweredby_transparent/poweredby_000000.gif\" 1,\n    \"http://asdf.com\" 1,\n    \"http://www.google.com/cse/api/branding.css\" 1,\n    \"http://www.google.com/cse\" 1}\u003e}\n```\n\n## Features\n- Multithreaded, with the ability to add and remove workers as needed\n- No global state, run multiple crawlers with multiple threads at once\n- Pre-written handlers for text files and ElasticSearch\n- Skips URLs that have been seen before\n- Domain limiting to crawl pages only belonging to a certain domain\n\n## Included handlers\n\nItsy includes handlers for common actions, either to be used, or\nexamples for writing your own.\n\n### Text file handler\n\nThe text file handler stores web pages in text files. It uses the\n`html-\u003estr` method in `itsy.extract` to convert HTML documents to\nplain text (which in turn uses [Tika](http://tika.apache.org) to\nextract HTML to plain text).\n\nUsage:\n\n```clojure\n(ns bar\n  (:require [itsy.core :refer :all]\n            [itsy.handlers.textfiles :refer :all]))\n\n;; The directory will be created when the handler is created if it\n;; doesn't already exist\n(def txt-handler (make-textfile-handler {:directory \"/mnt/data\" :extension \".txt\"}))\n\n(def c (crawl {:url \"http://example.com\" :handler txt-handler}))\n\n;; then look in the /mnt/data directory\n```\n\n### [ElasticSearch](http://elasticsearch.org) handler\n\nThe elasticsearch handler stores documents with the following mapping:\n\n```clojure\n{:id {:type \"string\"\n      :index \"not_analyzed\"\n      :store \"yes\"}\n :url {:type \"string\"\n       :index \"not_analyzed\"\n       :store \"yes\"}\n :body {:type \"string\"\n        :store \"yes\"}}\n```\n\nUsage:\n\n```clojure\n(ns foo\n  (:require [itsy.core :refer :all]\n            [itsy.handlers.elasticsearch :refer :all]))\n\n;; These are the default settings\n(def index-settings {:settings\n                     {:index\n                      {:number_of_shards 2\n                       :number_of_replicas 0}}})\n\n;; If the ES index doesn't exist, make-es-handler will create it when called.\n(def es-handler (make-es-handler {:es-url \"http://localhost:9200/\"\n                                  :es-index \"crawl\"\n                                  :es-type \"page\"\n                                  :es-index-settings index-settings\n                                  :http-opts {}}))\n\n(def c (crawl {:url \"http://example.com\" :handler es-handler}))\n\n;; ... crawling and indexing ensues ...\n```\n\n\n## Todo\n\n- \u003cdel\u003eRelative URL extraction/crawling\u003c/del\u003e\n- Always better URL extraction\n- Handlers for common body actions\n  - \u003cdel\u003eelasticsearch\u003c/del\u003e\n  - \u003cdel\u003etext files\u003c/del\u003e\n  - other?\n- \u003cdel\u003eHelpers for dynamically raising/lowering thread count\u003c/del\u003e\n- Timed crawling, have threads clean themselves up after a limit\n- \u003cdel\u003eHave threads auto-clean when url-limit is hit\u003c/del\u003e\n- \u003cdel\u003eUse Tika for HTML extraction\u003c/del\u003e\n- Write tests\n\n## License\n\nCopyright © 2012 Lee Hinman\n\nDistributed under the Eclipse Public License, the same as Clojure.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdakrone%2Fitsy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdakrone%2Fitsy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdakrone%2Fitsy/lists"}