https://github.com/bplawler/crawler

Scala DSL for web crawling

Host: GitHub
URL: https://github.com/bplawler/crawler
Owner: bplawler
License: other
Created: 2011-11-19T06:31:24.000Z (about 14 years ago)
Default Branch: master
Last Pushed: 2016-08-02T21:33:54.000Z (over 9 years ago)
Last Synced: 2024-10-29T17:52:28.410Z (about 1 year ago)
Language: Scala
Homepage:
Size: 82 KB
Stars: 148
Watchers: 14
Forks: 40
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-crawler-cn - crawler - A web crawler based on a Scala DSL. (Scala)
awesome-crawler - crawler - Scala DSL for web crawling. (Scala)

README

# crawler - A DSL for web crawling in Scala

The purpose of this project is to provide a nice DSL wrapper around the
cumbersome htmlunit Java library.  Here is an example taken from a unit
test in this package:

    import crawler._

    class TestCrawler(output: java.io.OutputStream) extends Crawler {
      var result = ""

      def crawl = {
        navigateTo("http://www.google.com") {
          in(form having id("tsf")) {
            in(textField having id("lst-ib")) {
              typeIn("bplawler")
            }
            in(submit having name("btnK")) {
              click ==>
            }
          }
        }

        onCurrentPage {
          result = from(div having id("resultStats")) getTextContent

          val url = from(
            anchor having xPath("//div[@id='field_timetable_file-wrapper']/a")
          ).getAttributes.getNamedItem("href").getTextContent

          download(url).writeTo(output)
        }
      }
    }

This `TestCrawler` class defines a crawl that navigates to Google,
finds the form whose id is `tsf`, types a search term into its text field,
then clicks the submit button named `btnK`.  The click takes us to a
new page (the search results), where we grab the content of the
`resultStats` div.

It also grabs a URL from a link located by XPath and downloads it to the given
`OutputStream`.

Alternatively, you can get the downloaded data as bytes instead of writing it
to an `OutputStream`: `download(url).getBytes`.
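The two ways of consuming a download can be pictured with a small stand-alone sketch. The `Download` class below is a hypothetical stand-in for whatever `download(url)` returns; only the method names `writeTo` and `getBytes` come from this README, and the real implementation may differ.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}

object DownloadSketch {
  // Hypothetical stand-in for the value returned by `download(url)`.
  final class Download(bytes: Array[Byte]) {
    // Hand back a copy so callers cannot mutate the stored data.
    def getBytes: Array[Byte] = bytes.clone()

    // Push the same data into any OutputStream (a file, a socket, a buffer).
    def writeTo(out: OutputStream): Unit = {
      out.write(bytes)
      out.flush()
    }
  }

  // Stream-based consumption, collected into memory for comparison.
  def viaStream(d: Download): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    d.writeTo(out)
    out.toByteArray
  }
}
```

Writing to a stream suits large files that should go straight to disk, while `getBytes` is convenient when the content is small enough to process in memory.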

## Background

This DSL was created to simplify the code needed to programmatically access
web pages and do something meaningful with the content.  It is backed by the
Java [HtmlUnit](http://htmlunit.sourceforge.net/) library, which, according to
its web site, provides a "GUI-less browser for Java programs."  The library
is very good at what it does, but I found that using it generally resulted in
code that was pretty difficult to read.

This DSL is also my first attempt to write such a thing in Scala, so really
this is just sort of an academic project to learn as much as I can about Scala
and about writing DSLs in Scala.  There are a few brittle areas in this thing
that could greatly benefit from some clear error handling, but for what I
was trying to do, the code here did the trick just fine.

## Basic Language Structure

The first part of any web crawl is to provide a starting point.  This is
done with the `navigateTo` method.  `navigateTo` takes a string and is
followed by the code block that contains the stuff you want to do with
this page.

    navigateTo("http://www.google.com") { ... }

Inside the code block, you can use the DSL keywords to find individual HTML
elements and operate on them.  One of the more common keywords is
`in`, which takes an expression identifying an HTML node on the current
page and then opens another code block for processing within that node.
The following excerpt navigates to the Google home page
and finds the form element that has an id of `tsf`.

    navigateTo("http://www.google.com") {
      in(form having id("tsf")) { ... }
    }
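An expression such as `form having id("tsf")` needs no special syntax: it can be built from ordinary Scala methods used in infix position, plus a method with a by-name block parameter. The following is a minimal sketch of that idea and not the library's actual implementation. `MatcherSketch`, `Discriminator`, `Matcher`, `ElementKind`, and `currentContext` are names invented here for illustration.

```scala
object MatcherSketch {
  // How an element is located: by id, by name, or by XPath (hypothetical model).
  sealed trait Discriminator
  final case class ById(value: String) extends Discriminator
  final case class ByName(value: String) extends Discriminator
  final case class ByXPath(value: String) extends Discriminator

  def id(value: String): Discriminator = ById(value)
  def name(value: String): Discriminator = ByName(value)
  def xPath(value: String): Discriminator = ByXPath(value)

  // An element kind paired with a discriminator.
  final case class Matcher(tag: String, by: Discriminator)

  // `having` is a plain method, so `form having id("tsf")` is just
  // infix notation for `form.having(id("tsf"))`.
  final class ElementKind(val tag: String) {
    def having(by: Discriminator): Matcher = Matcher(tag, by)
  }

  val form = new ElementKind("form")
  val div = new ElementKind("div")

  // Stack of matchers for the blocks we are currently nested inside.
  private var context: List[Matcher] = Nil
  def currentContext: List[Matcher] = context

  // The second parameter list is by-name, which is what lets callers
  // write `in(matcher) { ... }` with a trailing block.
  def in[A](m: Matcher)(body: => A): A = {
    context = m :: context
    try body
    finally { context = context.tail }
  }

  // Usage mirroring the README: the block sees the form as its context.
  def demo: List[Matcher] =
    in(form having id("tsf")) { currentContext }
}
```

Here `demo` returns a list holding the single matcher for the `tsf` form, and the context is empty again once the block exits. A real implementation would resolve the matcher against an HtmlUnit page instead of just recording it.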

The code block after the `in` call operates on the form element that was
found.  If there is no form with that id on the page, you'll get an error,
but it will be the one generated by HtmlUnit; I have not yet made
any effort to wrap the errors nicely.  Inside the code block for this form,
we can do things like access the individual input fields and enter
values.

    navigateTo("http://www.google.com") {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
        ...
      }
    }

Here we have expanded the example to find the text field on the Google home
page and type in a search term.  With this typed in, the next thing we will
want to do is submit the form and do our search.

    navigateTo("http://www.google.com") {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
        in(submit having name("btnK")) {
          click ==>
        }
      }
    }

Clicking the button is as easy as finding the submit button in the HTML and
calling `click`.  But what is that weird `==>` operator?  It turns out that
this click on our GUI-less browser will take us to a new web page.  The
`==>` operator without an argument signifies that this new page is the next
page we will be working with.  So rather than having to use `navigateTo`
again, we can simply end this code block and use the `onCurrentPage` method
to start our next code block.

    navigateTo("http://www.google.com") { ... }

    onCurrentPage {
      result = from(div having id("resultStats")) getTextContent
    }
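To picture the hand-off that `==>` performs, here is a minimal, library-independent sketch. `NavigationSketch`, `Session`, `Page`, and `ClickResult` are assumptions made for illustration and are not taken from the crawler source. In this model a click only produces a candidate page, and `==>` is what promotes it to the current page.

```scala
object NavigationSketch {
  // Minimal stand-in for a loaded page; the real library wraps HtmlUnit pages.
  final case class Page(url: String)

  final class Session(start: Page) {
    private var current: Page = start
    def currentPage: Page = current

    // Result of a click: the page it led to, not yet adopted by the session.
    final class ClickResult(val target: Page) {
      // Adopt the page produced by the click as the session's current page.
      def ==>(): Unit = { current = target }
    }

    // Hypothetical click that is told where it leads, so the sketch
    // needs no real browser.
    def click(leadsTo: Page): ClickResult = new ClickResult(leadsTo)

    // Run a block against whatever page is current right now.
    def onCurrentPage[A](body: Page => A): A = body(current)
  }
}
```

With this model, `session.click(resultsPage).==>()` followed by `session.onCurrentPage(_.url)` yields the results page URL, whereas a click without `==>` leaves the session on the original page. The README's `click ==>` is the same kind of call written in postfix form.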

In the `onCurrentPage` block, we use the `from` keyword to find a
particular HTML element (just as with `in`), but this time we get
something out of the element and put that value in a variable.  Remember
that this DSL is just an extension of Scala, so we could just as easily
call out to another method from within here and do some meaningful
work.  One other supported keyword is `forAll`, which receives an element
matcher built from an [XPath](http://www.w3schools.com/xpath/) expression and
a code block that is run once for each matching element.

    navigateTo("http://www.google.com") { ... }

    onCurrentPage {
      result = from(div having id("resultStats")) getTextContent

      forAll(div having xPath("""//ol[@id = "rso"]/li/div[@class = "vsc"]""")) {
        println(from(anchor having xPath("h3/a")) getTextContent)
      }
    }

This invocation of `forAll` will loop through each individual search result
and print out the main anchor text for each.
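The pattern behind `forAll`, selecting a list of nodes with one XPath and then evaluating a relative XPath inside each of them, can be tried out with the XPath support that ships with the JDK. This sketch does not use the crawler library or HtmlUnit. `ForAllSketch`, `select`, and the sample markup are assumptions for illustration, and unlike the DSL, the context node is passed explicitly rather than implicitly.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.{Node, NodeList}

object ForAllSketch {
  private val xpath = XPathFactory.newInstance().newXPath()

  // Parse a well-formed XML/XHTML string into a DOM tree.
  def parse(xml: String): Node =
    DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

  // Evaluate `expr` relative to `context` and collect the matching nodes.
  def select(context: Node, expr: String): List[Node] = {
    val nodes = xpath.evaluate(expr, context, XPathConstants.NODESET)
      .asInstanceOf[NodeList]
    List.tabulate(nodes.getLength())(i => nodes.item(i))
  }

  // Run `body` once per match; each match becomes the context for the
  // relative lookups made inside the block.
  def forAll[A](context: Node, expr: String)(body: Node => A): List[A] =
    select(context, expr).map(body)

  // Made-up markup shaped like the search results in the README example.
  val page: String =
    """<ol id="rso">
      |  <li><div class="vsc"><h3><a>First result</a></h3></div></li>
      |  <li><div class="vsc"><h3><a>Second result</a></h3></div></li>
      |</ol>""".stripMargin

  // Collect the anchor text of every result.
  def titles: List[String] =
    forAll(parse(page), """//ol[@id = "rso"]/li/div[@class = "vsc"]""") { div =>
      select(div, "h3/a").head.getTextContent()
    }
}
```

The inner expression `h3/a` has no leading slash, so it is resolved relative to each matched `div` rather than from the document root, which is what makes the per-result lookup work.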

## Releases

* 0.6.0 (2013.08.16)
  * Switched to Scala 2.10.2 for building (with 2.10 binary compatibility).
  * Added the `download` method.
* 0.5.0 (2012.06.24)
  * Bumping the version number to something that is completely unrelated
    to circupon.
  * During crawler construction, it is now possible to set whether
    CSS is supported (default is false) and whether JavaScript is supported
    (default is true).
* 0.3.3 (2012.06.15)
  * Added support for `mouseOver` in the DSL.  Works just like `click`.
