Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dyweb/scrala
Unmaintained :whale: :coffee: :spider: Scala crawler(spider) framework, inspired by scrapy, created by @gaocegege
- Host: GitHub
- URL: https://github.com/dyweb/scrala
- Owner: dyweb
- Created: 2015-11-04T09:37:40.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2019-10-05T15:36:58.000Z (about 5 years ago)
- Last Synced: 2024-10-29T17:52:47.131Z (about 2 months ago)
- Topics: actor-model, docker, scala, scrapy, spider
- Language: Scala
- Homepage: http://dongyueweb.com/scrala/
- Size: 83 KB
- Stars: 113
- Watchers: 12
- Forks: 23
- Open Issues: 6
Metadata Files:
- Readme: readme.md
README
# scrala
[![Codacy Badge](https://api.codacy.com/project/badge/grade/563bbcd12d874610bca7313abe6e6fdd)](https://www.codacy.com/app/gaocegege/scrala)
[![Build Status](https://travis-ci.org/gaocegege/scrala.svg?branch=master)](https://travis-ci.org/gaocegege/scrala)
![License](https://img.shields.io/pypi/l/Django.svg)
[![scrala published](https://jitpack.io/v/gaocegege/scrala.svg)](https://jitpack.io/#gaocegege/scrala)
[![Docker Pulls](https://img.shields.io/docker/pulls/gaocegege/scrala.svg)](https://hub.docker.com/r/gaocegege/scrala/)
[![Join the chat at https://gitter.im/gaocegege/scrala](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/gaocegege/scrala?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

scrala is a web crawling framework for Scala, inspired by [scrapy](https://github.com/scrapy/scrapy).
## Installation
### From Docker
[![](https://images.microbadger.com/badges/image/gaocegege/scrala.svg)](https://microbadger.com/images/gaocegege/scrala "Get your own image badge on microbadger.com")
[gaocegege/scrala in dockerhub](https://hub.docker.com/r/gaocegege/scrala/)
#### Create a Dockerfile in your project.
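```
FROM gaocegege/scrala:latest

# COPY the build.sbt and the src to the container
```

#### Run a single command in docker

Mount your sources at `/app/src` and your Ivy cache at `/root/.ivy2`, putting the host path before each `:`.

```
docker run -v :/app/src -v :/root/.ivy2 gaocegege/scrala
```

### From SBT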
**Step 1.** Add the JitPack resolver at the end of the resolvers in your `build.sbt`:

    resolvers += "jitpack" at "https://jitpack.io"

**Step 2.** Add the dependency:

    libraryDependencies += "com.github.gaocegege" % "scrala" % "0.1.5"

### From Source Code
    git clone https://github.com/gaocegege/scrala.git
    cd ./scrala
    sbt assembly

You will get the jar in `./target/scala-/`.
## Example
    import com.gaocegege.scrala.core.spider.impl.DefaultSpider
    import com.gaocegege.scrala.core.common.response.Response
    import java.io.BufferedReader
    import java.io.InputStreamReader
    import com.gaocegege.scrala.core.common.response.impl.HttpResponse

    class TestSpider extends DefaultSpider {
      // Where the crawl starts
      def startUrl = List[String]("http://www.gaocegege.com/resume")

      // Called with the response for each start URL: follow every link on the page
      def parse(response: HttpResponse): Unit = {
        val links = response.getContentParser.select("a")
        for (i <- 0 until links.size()) {
          request(links.get(i).attr("href"), printIt)
        }
      }

      // Callback for the followed links: print the page title
      def printIt(response: HttpResponse): Unit = {
        println(response.getContentParser.title)
      }
    }

    object Main {
      def main(args: Array[String]): Unit = {
        val test = new TestSpider
        test.begin
      }
    }

As in scrapy, you define `startUrl` to tell the spider where to start and override `parse(...)` to handle the response for each start URL. The `request(...)` function plays the role of `yield scrapy.Request(...)` in scrapy: it schedules another URL together with the callback that should handle its response.

An example project is available in `./example/`.
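To make the `request(url, callback)` idea concrete, here is a minimal, self-contained sketch of a callback-driven crawl loop. It is not scrala's implementation (the project's topics mention the actor model, whereas this toy is single-threaded), and every name in it (`Page`, `MiniSpider`, `fetch`, `TitleSpider`, `Demo`) is hypothetical. It only assumes that a spider keeps a queue of URL/callback pairs and hands each fetched page to the callback queued with it.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a fetched page: a title plus outgoing links.
final case class Page(title: String, links: List[String])

// Toy spider, NOT scrala's code: it only illustrates the request/callback idea.
abstract class MiniSpider(fetch: String => Option[Page]) {
  // A callback receives the page fetched for a queued URL.
  type Callback = Page => Unit

  private val queue = mutable.Queue.empty[(String, Callback)]
  private val seen  = mutable.Set.empty[String]

  def startUrl: List[String]
  def parse(page: Page): Unit

  // Queue a URL with the callback that should handle its response.
  // URLs that were already requested are skipped.
  def request(url: String, callback: Callback): Unit =
    if (seen.add(url)) queue.enqueue((url, callback))

  // Seed the queue with the start URLs, then drain it.
  def begin(): Unit = {
    startUrl.foreach(url => request(url, parse))
    while (queue.nonEmpty) {
      val (url, callback) = queue.dequeue()
      fetch(url).foreach(callback)
    }
  }
}

object Demo {
  // In-memory "web" so the sketch runs without network access.
  val site: Map[String, Page] = Map(
    "/"  -> Page("Home", List("/a", "/b", "/a")),
    "/a" -> Page("Page A", Nil),
    "/b" -> Page("Page B", List("/"))
  )

  val titles = mutable.ListBuffer.empty[String]

  class TitleSpider extends MiniSpider(site.get) {
    def startUrl = List("/")

    // Follow every link on the start page, collecting titles in a callback.
    def parse(page: Page): Unit =
      page.links.foreach(link => request(link, collect))

    def collect(page: Page): Unit = { titles += page.title }
  }

  def main(args: Array[String]): Unit = {
    new TitleSpider().begin()
    println(titles.toList) // List(Page A, Page B)
  }
}
```

The start page itself goes to `parse`, so only the pages reached through `request(link, collect)` contribute titles, mirroring how `printIt` in the example above sees only the followed links.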
## For Developers
scrala is under active development. Feel free to contribute documentation, test cases, pull requests, issues, or anything else you like. I'm a newcomer to Scala, so the code may be hard to read. I'd be glad if someone familiar with Scala coding standards could review the code in this repo :)