https://github.com/dyweb/scrala

Unmaintained :whale: :coffee: :spider: Scala crawler(spider) framework, inspired by scrapy, created by @gaocegege

actor-model docker scala scrapy spider


# scrala

[![Codacy Badge](https://api.codacy.com/project/badge/grade/563bbcd12d874610bca7313abe6e6fdd)](https://www.codacy.com/app/gaocegege/scrala)
[![Build Status](https://travis-ci.org/gaocegege/scrala.svg?branch=master)](https://travis-ci.org/gaocegege/scrala)
![License](https://img.shields.io/pypi/l/Django.svg)
[![scrala published](https://jitpack.io/v/gaocegege/scrala.svg)](https://jitpack.io/#gaocegege/scrala)
[![Docker Pulls](https://img.shields.io/docker/pulls/gaocegege/scrala.svg)](https://hub.docker.com/r/gaocegege/scrala/)
[![Join the chat at https://gitter.im/gaocegege/scrala](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/gaocegege/scrala?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

scrala is a web crawling framework for Scala, inspired by [scrapy](https://github.com/scrapy/scrapy).

## Installation

### From Docker

[![](https://images.microbadger.com/badges/image/gaocegege/scrala.svg)](https://microbadger.com/images/gaocegege/scrala "Get your own image badge on microbadger.com")

[gaocegege/scrala in dockerhub](https://hub.docker.com/r/gaocegege/scrala/)

#### Create a Dockerfile in your project.

```
FROM gaocegege/scrala:latest

# COPY the build.sbt and the src to the container
```

#### Run a single command in docker

```
docker run -v :/app/src -v :/root/.ivy2 gaocegege/scrala
```

### From SBT

**Step 1.** Add the JitPack repository to the end of the resolvers in your build.sbt:

```
resolvers += "jitpack" at "https://jitpack.io"
```

**Step 2.** Add the dependency:

```
libraryDependencies += "com.github.gaocegege" % "scrala" % "0.1.5"
```

### From Source Code

```
git clone https://github.com/gaocegege/scrala.git
cd ./scrala
sbt assembly
```

You will get the jar in `./target/scala-/`.

## Example

```
import com.gaocegege.scrala.core.spider.impl.DefaultSpider
import com.gaocegege.scrala.core.common.response.impl.HttpResponse

class TestSpider extends DefaultSpider {
  // URLs the crawl starts from
  def startUrl = List[String]("http://www.gaocegege.com/resume")

  // Called with the response of each start URL
  def parse(response: HttpResponse): Unit = {
    val links = response.getContentParser.select("a")
    for (i <- 0 until links.size()) {
      // Request every link on the page; printIt handles each response
      request(links.get(i).attr("href"), printIt)
    }
  }

  // Callback: print the title of the fetched page
  def printIt(response: HttpResponse): Unit = {
    println(response.getContentParser.title)
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    val test = new TestSpider
    test.begin
  }
}
```

As in scrapy, you define `startUrl` to tell the spider where to start and override `parse(...)` to handle the response of each start URL. The `request(...)` function plays the role of `yield scrapy.Request(...)` in scrapy: it takes a URL together with the callback that will process the response.

An example project is available in `./example/`.
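
To make the callback flow described above more concrete, here is a minimal, self-contained sketch of the general idea: a queue of (URL, callback) pairs that is drained until empty. It is not scrala's implementation. `ToySpider`, `Page`, `TitleSpider`, and the in-memory `web` map are hypothetical names invented for this illustration, and no network access is involved.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a fetched page; NOT scrala's HttpResponse.
final case class Page(url: String, title: String, links: List[String])

// Toy spider (illustration only): a FIFO queue of (url, callback) pairs.
abstract class ToySpider(web: Map[String, Page]) {
  private val pending = mutable.Queue.empty[(String, Page => Unit)]

  def startUrl: List[String]
  def parse(page: Page): Unit

  // Schedule a URL; the callback runs later, once the page is "fetched".
  def request(url: String, callback: Page => Unit): Unit =
    pending.enqueue((url, callback))

  // Seed the queue with the start URLs, then drain it.
  def begin(): Unit = {
    startUrl.foreach(url => request(url, parse))
    while (pending.nonEmpty) {
      val (url, callback) = pending.dequeue()
      web.get(url).foreach(callback) // URLs missing from the map are skipped
    }
  }
}

object ToySpiderDemo {
  // A tiny in-memory "web" so the sketch runs offline.
  val web: Map[String, Page] = Map(
    "/"  -> Page("/", "Home", List("/a", "/b")),
    "/a" -> Page("/a", "Page A", Nil),
    "/b" -> Page("/b", "Page B", Nil)
  )

  class TitleSpider extends ToySpider(web) {
    val titles = mutable.ListBuffer.empty[String]

    def startUrl = List("/")

    // Follow every link on the start page, handing each to recordTitle.
    def parse(page: Page): Unit =
      page.links.foreach(link => request(link, recordTitle))

    def recordTitle(page: Page): Unit = titles += page.title
  }

  def main(args: Array[String]): Unit = {
    val spider = new TitleSpider
    spider.begin()
    println(spider.titles.mkString(", ")) // prints "Page A, Page B"
  }
}
```

The sketch processes requests sequentially for simplicity; a real crawler would typically fetch pages concurrently and would also need to deduplicate URLs.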

## For Developers

scrala is under active development. Feel free to contribute documentation, test cases, pull requests, issues, or anything else you like. I'm a newcomer to Scala, so the code may be hard to read. I'd be glad if someone familiar with Scala coding standards could review the code in this repo :)