https://github.com/bplawler/crawler


# crawler - A DSL for web crawling in Scala

The purpose of this project is to provide a nice DSL wrapper around the
cumbersome HtmlUnit Java library. Here is an example taken from a unit
test in this package:

    import crawler._

    class TestCrawler(output: java.io.OutputStream) extends Crawler {
      var result = ""
      def crawl = {
        navigateTo("http://www.google.com") {
          in(form having id("tsf")) {
            in(textField having id("lst-ib")) {
              typeIn("bplawler")
            }
            in(submit having name("btnK")) {
              click ==>
            }
          }
        }
        onCurrentPage {
          result = from(div having id("resultStats")) getTextContent
          val url = from(
            anchor having xPath("//div[@id='field_timetable_file-wrapper']/a")
          ).getAttributes.getNamedItem("href").getTextContent
          download(url).writeTo(output)
        }
      }
    }

This `TestCrawler` class defines a crawl that navigates to Google, finds
the form whose id is `tsf`, types a search term into its text field, then
clicks the submit button named `btnK`. The click takes us to a new page
(the search results), where we grab the content of the `resultStats` div.

It also takes the URL from a link located by XPath and downloads it to the
given `OutputStream`.

Alternatively, you can get the downloaded data as bytes instead of writing
it to an `OutputStream`: `download(url).getBytes`.

## Background

This DSL was created to simplify the code needed to programmatically access
web pages and do something meaningful with the content. It is backed by the
Java [HtmlUnit](http://htmlunit.sourceforge.net/) library, which, according to
their web site, provides a "GUI-less browser for Java programs." The library
is very good at what it does, but I found that using it generally resulted in
code that was pretty difficult to read.

This DSL is also my first attempt to write such a thing in Scala, so really
this is just sort of an academic project to learn as much as I can about Scala
and about writing DSLs in Scala. There are a few brittle areas in this thing
that could greatly benefit from some clear error handling, but for what I
was trying to do, the code here did the trick just fine.

## Basic Language Structure

The first part of any web crawl is to provide a starting point. This is
done with the `navigateTo` method. `navigateTo` takes a string and is
followed by the code block that contains the stuff you want to do with
this page.

    navigateTo("http://www.google.com") { ... }

Inside the code block, you can use the DSL keywords to find individual HTML
elements and operate on them. One of the more common keywords is `in`, which
takes as its argument an expression that identifies an HTML node on the
current page, then opens another code block to do processing within that
HTML node. The following excerpt navigates to the Google home page and finds
the form element that has an id of `tsf`.

    navigateTo("http://www.google.com") {
      in(form having id("tsf")) { ... }
    }
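
To make the `in(form having id("tsf")) { ... }` shape less mysterious, here is a minimal, self-contained sketch of how such a construct could be modelled in plain Scala. It is not this library's implementation: `Node`, `Selector`, `Discriminator`, `withPage`, `current` and the hand-built page are all hypothetical stand-ins for the real HtmlUnit-backed types, and the sketch assumes Scala 2.12 or 2.13.

```scala
// Toy model only: none of these types come from the crawler library or HtmlUnit.
object InHavingSketch {
  // Minimal stand-in for an HTML element.
  final case class Node(tag: String, attrs: Map[String, String], children: List[Node] = Nil)

  // id("tsf") and name("btnK") are just predicates over nodes.
  final case class Discriminator(matches: Node => Boolean)
  def id(value: String): Discriminator   = Discriminator(_.attrs.get("id").contains(value))
  def name(value: String): Discriminator = Discriminator(_.attrs.get("name").contains(value))

  // `form having id("tsf")` is infix notation for form.having(id("tsf")).
  final case class Selector(tag: String, extra: Node => Boolean = _ => true) {
    def having(d: Discriminator): Selector = copy(extra = d.matches)
    def accepts(n: Node): Boolean = n.tag == tag && extra(n)
  }
  val form: Selector      = Selector("form")
  val textField: Selector = Selector("input")

  // Stack of nodes being processed; the head is the node the current block works on.
  private var context: List[Node] = Nil
  def current: Node = context.head

  private def descendants(n: Node): List[Node] =
    n.children.flatMap(c => c :: descendants(c))

  // Push `n` for the duration of `block`, then restore the previous context.
  private def scoped(n: Node)(block: => Unit): Unit = {
    context = n :: context
    try { block } finally { context = context.tail }
  }

  // Hypothetical entry point standing in for navigateTo: makes `root` the current node.
  def withPage(root: Node)(block: => Unit): Unit = scoped(root)(block)

  // Find the first descendant of the current node that the selector accepts,
  // then run `block` with that node as the new current node.
  def in(selector: Selector)(block: => Unit): Unit = {
    val found = descendants(current).find(selector.accepts).getOrElse(
      throw new NoSuchElementException("no <" + selector.tag + "> element matched"))
    scoped(found)(block)
  }

  // Records the id of the current node at each nesting level of a tiny page.
  def demo(): List[String] = {
    val page = Node("html", Map.empty, List(
      Node("form", Map("id" -> "tsf"), List(
        Node("input", Map("id" -> "lst-ib"))))))
    val seen = List.newBuilder[String]
    withPage(page) {
      in(form having id("tsf")) {
        seen += current.attrs("id")
        in(textField having id("lst-ib")) {
          seen += current.attrs("id")
        }
      }
    }
    seen.result()
  }
}
```

The design point is that each nested block only ever sees the node its enclosing `in` matched, which is why the DSL code reads like the structure of the page itself.
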

The code block that follows `in(form having id("tsf"))` operates on the form
element that was found. If the page has no form with that id, you'll get an
error, but it will be the raw one generated by HtmlUnit; I have not yet made
any effort to wrap the errors nicely. Inside the code block for this form,
we can do things like access the individual input fields and enter
values.

    navigateTo("http://www.google.com") {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
        ...
      }
    }

Here we have expanded the example to find the text field on the Google home
page and type in a search term. With this typed in, the next thing we will
want to do is submit the form and do our search.

    navigateTo("http://www.google.com") {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
        in(submit having name("btnK")) {
          click ==>
        }
      }
    }

Clicking the button is as easy as finding the submit button in the HTML and
calling `click`. But what is that weird `==>` operator? It turns out that
this click on our GUI-less browser will take us to a new web page. The
`==>` operator without an argument signifies that this new page is the next
page we will be working with. So rather than having to use `navigateTo`
again, we can simply end this code block and use the `onCurrentPage` method
to start our next code block.

    navigateTo("http://www.google.com") { ... }
    onCurrentPage {
      result = from(div having id("resultStats")) getTextContent
    }
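
As an aside, the hand-off between `click ==>` and `onCurrentPage` can be pictured as a small piece of shared state. The following is a minimal sketch under that assumption, not the library's actual code: `Page`, `ClickResult`, the one-argument `click(target)` and the `Page => A` signature of `onCurrentPage` are hypothetical simplifications, and no real browser is involved.

```scala
import scala.language.postfixOps

// Toy model only: no real browser, and none of these names come from the crawler library.
object CurrentPageSketch {
  final case class Page(url: String)

  // The page that onCurrentPage blocks operate on.
  private var currentPage: Option[Page] = None

  // Stand-in for the outcome of a click: the page the browser landed on.
  final class ClickResult(landedOn: Page) {
    // Parameterless, so it can be written postfix: `click(...) ==>`.
    def ==> : Unit = { currentPage = Some(landedOn) }
  }

  // Curried signature: a URL, then a by-name block to run against that page.
  def navigateTo(url: String)(block: => Unit): Unit = {
    currentPage = Some(Page(url))
    block
  }

  // Hypothetical click that always "lands" on the given target URL.
  def click(target: String): ClickResult = new ClickResult(Page(target))

  // Hands the current page to the block; fails if nothing has been loaded yet.
  def onCurrentPage[A](block: Page => A): A =
    block(currentPage.getOrElse(sys.error("no page loaded yet")))

  // Navigate, click through to a second page, then read the current page's URL.
  def demo(): String = {
    navigateTo("http://example.com/search") {
      click("http://example.com/results") ==>
    }
    onCurrentPage(page => page.url)
  }
}
```

In this model, dropping the `==>` would leave the search page as the current page, which is the behaviour the operator exists to change.
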

In the `onCurrentPage` block above, we use the `from` keyword to find a
particular HTML element (just as with `in`), but this time we get something
out of the element and put that value in a variable. Remember that this DSL
is just an extension of Scala, so we could just as easily call out to
another method from within here and do some meaningful work. One other
supported keyword is `forAll`, which receives an element selector (typically
built from an [XPath](http://www.w3schools.com/xpath/) expression) followed
by a code block that is run once for each matching element.

    navigateTo("http://www.google.com") { ... }
    onCurrentPage {
      result = from(div having id("resultStats")) getTextContent

      forAll(div having xPath("""//ol[@id = "rso"]/li/div[@class = "vsc"]""")) {
        println(from(anchor having xPath("h3/a")) getTextContent)
      }
    }

This invocation of `forAll` will loop through each individual search result
and print out the main anchor text for each.
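
The pattern in the `forAll` example, an outer XPath that selects a list of nodes and an inner relative XPath (`h3/a`) evaluated against each one, can be tried without HtmlUnit at all. The sketch below uses only the JDK's `javax.xml.xpath` API on a small hand-written XML fragment. The `forAll` and `textOf` helpers here are hypothetical and merely imitate the DSL's shape; they are not this library's functions.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.{Node, NodeList}

// Illustration with JDK classes only; the crawler library itself works on HtmlUnit pages.
object ForAllSketch {
  private val xpath = XPathFactory.newInstance().newXPath()

  // Evaluate `expr` relative to `context` and run `body` once per matching node.
  def forAll(context: Node, expr: String)(body: Node => Unit): Unit = {
    val nodes = xpath.evaluate(expr, context, XPathConstants.NODESET).asInstanceOf[NodeList]
    for (i <- 0 until nodes.getLength) body(nodes.item(i))
  }

  // Evaluate `expr` relative to `context` and return the text of the first match.
  def textOf(context: Node, expr: String): String =
    xpath.evaluate(expr, context)

  // Collects the anchor text of each "search result" in a tiny stand-in document.
  def demo(): List[String] = {
    val xml =
      """<ol id="rso">
        |  <li><div class="vsc"><h3><a>First result</a></h3></div></li>
        |  <li><div class="vsc"><h3><a>Second result</a></h3></div></li>
        |</ol>""".stripMargin
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

    val titles = List.newBuilder[String]
    forAll(doc, """//ol[@id = "rso"]/li/div[@class = "vsc"]""") { div =>
      // "h3/a" has no leading slash, so it is resolved relative to this div.
      titles += textOf(div, "h3/a")
    }
    titles.result()
  }
}
```

Note that the inner expression must stay relative: writing `//h3/a` instead would search the whole document on every iteration and return the first result's text each time.
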

## Releases

* 0.6.0 (2013.08.16)
    * Switched to Scala 2.10.2 for building (with 2.10 binary compatibility).
    * Added #download method.
* 0.5.0 (2012.06.24)
    * Bumping the version number to something that is completely unrelated
      to circupon.
    * During crawler construction, it is now possible to set whether
      css is supported (default is false) and whether JavaScript is supported
      (default is true).
* 0.3.3 (2012.06.15)
    * Added support for `mouseOver` in the DSL. Works just like `click`.