https://github.com/7bridges-eu/shelob

Clojure scraping framework wanna-be

clojure framework jsoup scraping selenium

Shelob
------

Shelob wraps https://www.seleniumhq.org/[Selenium] to let you browse a website
and scrape its contents.

Rationale
~~~~~~~~~

Selenium automates web browsing, primarily for testing and administration
purposes.

Shelob wraps Selenium in a more idiomatic Clojure API and adds facilities for
scraping web pages.

Version
~~~~~~~

image:https://img.shields.io/clojars/v/eu.7bridges/shelob.svg[link="https://clojars.org/eu.7bridges/shelob"]

Example
~~~~~~~

A simple DuckDuckGo search:

* Type "Clojure" in the search field
* Click the magnifying glass to run the search
* Retrieve the URLs of the visible results

[source,clojure]
----
(require '[shelob.core :as sh])
(require '[shelob.browser :as shb])
(require '[shelob.scraper :as shs])

(def context
  {:driver-options {:browser :firefox}
   :pool-size 2
   :init-messages [{:msg :go :url "https://duckduckgo.com/html/"}]})

(defn scrape-result
  [document]
  (println (map
            #(shs/attribute % "href")
            (shs/select document ".result__a"))))

(defn example
  []
  (sh/init context)
  (let [msg [{:msg :fill
              :locator (shb/by-css-selector "#search_form_input_homepage")
              :text "Clojure"}
             {:msg :click :locator (shb/by-css-selector "#search_button_homepage")}
             {:msg :wait-for :condition
              (shb/presence-of-element-located
               (shb/by-css-selector ".serp__results"))}]]
    (sh/send-message context scrape-result msg))
  (sh/stop))
----

Running `(example)` results in:

[source,clojure]
----
user> (https://clojure.org/ https://en.wikipedia.org/wiki/Clojure https://github.com/clojure/clojure https://www.reddit.com/r/Clojure/ https://clojuredocs.org/ https://clojuredocs.org/clojure.core/when https://www.braveclojure.com/ https://repl.it/languages/clojure https://www.zhihu.com/question/21446061 https://leiningen.org/ https://clojure.github.io/clojure/ https://github.com/clojure https://www.tutorialspoint.com/clojure/clojure_basic_syntax.htm https://learnxinyminutes.com/docs/clojure/ https://cursive-ide.com/ https://clojurescript.org/ https://www.tutorialspoint.com/clojure/clojure_loops.htm http://www.clojurekoans.com/ https://www.braveclojure.com/clojure-for-the-brave-and-true/ https://marketplace.visualstudio.com/items?itemName=avli.clojure https://www.youtube.com/user/ClojureTV https://www.slant.co/options/1538/~clojure-review https://www.amazon.com/Clojure-Programming-Practical-Lisp-World/dp/1449394701 https://kimh.github.io/clojure-by-example/ https://ja.wikipedia.org/wiki/Clojure
http://www.4clojure.com/ https://developer.mozilla.org/en-US/docs/Web/JavaScript/Closures https://www.clojure.org/guides/getting_started https://en.wikibooks.org/wiki/Clojure_Programming)
----
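
The printed sequence is plain data, so it can be post-processed with ordinary
Clojure. The following is a minimal sketch, assuming the hrefs are absolute
URLs like those in the output above. `hosts-by-frequency` is a hypothetical
helper written for illustration and is not part of Shelob.

[source,clojure]
----
(import '(java.net URI))

(defn hosts-by-frequency
  "Counts how many of the given absolute URLs point at each host.
  Hypothetical helper for illustration; not part of Shelob."
  [hrefs]
  (frequencies (map #(.getHost (URI. %)) hrefs)))

;; Sample input taken from the first results shown above.
(hosts-by-frequency ["https://clojure.org/"
                     "https://clojuredocs.org/"
                     "https://clojuredocs.org/clojure.core/when"])
;; => {"clojure.org" 1, "clojuredocs.org" 2}
----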

Exception management
~~~~~~~~~~~~~~~~~~~~

By default, Shelob prints exceptions to standard output and continues
crawling. To customise this behaviour, pass your own exception-handling
function to `send-message`, after the scraping function:

[source,clojure]
----

;; ...

(defn exception-custom-fn
  "Prints out exception with a custom message"
  [_source e]
  (println "An exception occurred, this is a custom message -" (.getMessage e)))

(defn example
  []
  (sh/init context)
  (let [msg [...]]
    (sh/send-message context scrape-result exception-custom-fn msg))
  (sh/stop))
----
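
Because the handler is an ordinary function, it could also record failures
instead of printing them. The sketch below assumes only what the example above
shows: the handler is called with two arguments, the second being the
exception. The names `errors` and `collecting-exception-fn` are hypothetical,
and since this README does not say what Shelob passes as the first argument,
it is stored unchanged.

[source,clojure]
----
(def errors
  "Accumulates one map per exception seen during a crawl (hypothetical name)."
  (atom []))

(defn collecting-exception-fn
  "Records the exception message instead of printing it.
  Assumes the two-argument handler shape used by `exception-custom-fn`."
  [source e]
  (swap! errors conj {:source source
                      :message (.getMessage e)}))

;; Simulated call, standing in for Shelob invoking the handler:
(collecting-exception-fn :example (ex-info "element not found" {}))

@errors
;; => [{:source :example, :message "element not found"}]
----

Such a function would be passed to `send-message` in the same position as
`exception-custom-fn`, and `@errors` could be inspected once the crawl ends.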

License
~~~~~~~

Copyright © 2019 7bridges s.r.l. — Distributed under the Apache License
2.0.