https://github.com/cmiles74/scraper
A simple web scraper built around the JavaFX WebEngine
clojure java javafx javascript scraper
- Host: GitHub
- URL: https://github.com/cmiles74/scraper
- Owner: cmiles74
- License: epl-1.0
- Created: 2014-08-27T13:49:57.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2021-02-15T17:09:57.000Z (over 5 years ago)
- Last Synced: 2025-12-20T15:02:26.312Z (6 months ago)
- Topics: clojure, java, javafx, javascript, scraper
- Language: Clojure
- Homepage:
- Size: 49.8 KB
- Stars: 13
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Scraper
This project provides a web scraping library built around the JavaFX
[WebEngine][0], which in turn is built on top of [WebKit][1]. The goal of
this project is to provide a robust and easy-to-use web scraper that
doesn't require an external binary in order to function. With the
introduction of Java 8, this is finally beginning to seem feasible.
If you find this code useful in any way, please feel free to...
# Usage
It's still early days, and this project hasn't reached the point where
we're releasing builds of the library. Still, you can check out the
project, build it yourself, and depend on the snapshot version:
````clojure
[com.nervestaple/scraper "0.1.0-SNAPSHOT"]
````
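Once a local build exists, a consuming project can reference the snapshot from its own `project.clj`. The fragment below is a minimal sketch, assuming the library was installed into the local Maven repository with `lein install`. The project name `my-scraper-app` and the Clojure version are placeholders, not something this README prescribes.

```clojure
;; project.clj for a hypothetical application (the name is made up for
;; illustration) that depends on a locally installed snapshot of this library.
(defproject my-scraper-app "0.1.0-SNAPSHOT"
  ;; The Clojure version here is an assumption; use whatever your project targets.
  :dependencies [[org.clojure/clojure "1.10.1"]
                 ;; Coordinate taken from the README above.
                 [com.nervestaple/scraper "0.1.0-SNAPSHOT"]])
```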
Probably more fun is to check out the project and then interact with
it directly via the REPL.

    $ cd scraper
    $ lein repl

From there it's easy to get a handle on a WebEngine instance and
scrape out some content.
````
user> (def we (scraper/get-web-engine))
#'user/we
user> (scraper/load-url we "http://twitch.nervestaple.com")
{:state :ready}
user> (scraper/load-artoo we)
{:state :ready}
user> (scraper/scrape we "h1" {:title "text"})
{"title" "Bishop: Makes Your Web Service Shiny"} {"title" "Why Is My Web Service
API Crappy?"} {"title" "All Your HBase Are Belong to Clojure"}) ({"title" "Work
In Progress"} {"title" "Linux Is All About Choices"} {"title" "Real Life Web App
Integration Testing (IT) with Spring"} {"title" "Bishop: Makes Your Web Service
Shiny"} {"title" "Why Is My Web Service API Crappy?"} {"title" "All Your HBase
Are Belong to Clojure"})
````
As you can see in the example above, the [Artoo.js][2] JavaScript
scraping library is injected into the loaded page in order to make
your scraping easier. You are welcome! ;-)
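Judging by the printed output of the `scrape` call earlier, the results come back as a sequence of maps keyed by strings (`"title"`, not `:title`), and the same heading can appear more than once. The sketch below shows one way to tidy that up in plain Clojure. The helper name `unique-titles` is hypothetical, and the sample data is copied from the output above rather than fetched live.

```clojure
;; Sample data copied from the REPL output above; a real call to
;; scraper/scrape would supply this sequence instead.
(def sample-results
  [{"title" "Work In Progress"}
   {"title" "Linux Is All About Choices"}
   {"title" "Bishop: Makes Your Web Service Shiny"}
   {"title" "Bishop: Makes Your Web Service Shiny"}])

(defn unique-titles
  "Hypothetical helper: pulls the \"title\" value out of each result map,
  skips maps without one, and removes duplicates in first-seen order."
  [results]
  (->> results
       (keep #(get % "title")) ; string keys, as in the printed output
       distinct
       vec))

(unique-titles sample-results)
```

Because the helper only touches ordinary Clojure data, it can be tried at the REPL without a running WebEngine.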
If you're interested in being able to see the content that your
WebEngine instance is loading, you can get a handle on a WebView
instead. This will bring up a new window displaying the WebView.
````
user> (def wv (scraper/get-web-view))
#'user/wv
user> (def we (:web-engine wv))
#'user/we
````
Work on the project continues, but this should be enough to get you
started.
----
[0]: http://docs.oracle.com/javafx/2/api/javafx/scene/web/WebEngine.html "Web Engine API"
[1]: http://en.wikipedia.org/wiki/WebKit "WebKit"
[2]: http://medialab.github.io/artoo "Artoo.js"
