An Erlang/Elixir behaviour to define Spiders
https://github.com/sntran/gen_spider


# GenSpider

[![Build Status](https://img.shields.io/travis/sntran/gen_spider/master.svg)](https://travis-ci.org/sntran/gen_spider)
[![Test Coverage](https://img.shields.io/coveralls/github/sntran/gen_spider.svg)](https://coveralls.io/github/sntran/gen_spider)
[![Hex Version](https://img.shields.io/hexpm/v/gen_spider.svg)](https://hex.pm/packages/gen_spider)
[![License](https://img.shields.io/github/license/sntran/gen_spider.svg)](https://choosealicense.com/licenses/apache-2.0/)

GenSpider is a behaviour for defining Spiders.

Spiders are modules that define how a site (or a group of sites) will be
scraped: how to perform the crawl (i.e. follow links) and how to extract
structured data from the pages (i.e. scrape items). In other words, a Spider is
where you define the custom crawling and parsing behaviour for a particular
site or group of sites.
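
To make those two responsibilities concrete, here is a minimal sketch in plain Elixir. It does not use GenSpider's callbacks; the module name `QuotesSketch`, its functions, and the shape of the returned map are all hypothetical and chosen only for illustration. It also matches HTML with regular expressions to stay dependency-free, which a real spider would likely replace with a proper HTML parser.

```elixir
defmodule QuotesSketch do
  @moduledoc """
  Hypothetical illustration only; this is not GenSpider's API.
  It shows the two jobs a spider performs on each fetched page.
  """

  # Scraping items: collect the text inside every <span class="text">.
  def extract_items(html) do
    ~r{<span class="text">(.*?)</span>}s
    |> Regex.scan(html, capture: :all_but_first)
    |> List.flatten()
  end

  # Performing the crawl: collect href values and resolve them
  # against the URL of the page they were found on.
  def follow_links(html, base_url) do
    ~r{<a\s[^>]*href="([^"]+)"}
    |> Regex.scan(html, capture: :all_but_first)
    |> List.flatten()
    |> Enum.map(fn href -> base_url |> URI.merge(href) |> URI.to_string() end)
  end

  # One parsing step: the items found on this page plus the URLs to visit next.
  def parse(html, base_url) do
    %{items: extract_items(html), requests: follow_links(html, base_url)}
  end
end
```

Calling `QuotesSketch.parse(html, url)` on a fetched page returns both the scraped items and the absolute URLs to request next. Fetching, scheduling, and deduplicating those requests is the part a spider behaviour would be expected to take care of.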

## Hello World

The basic Quotes Spider from Scrapy is implemented with `gen_spider` in both
[Erlang](examples/quotes_spider.erl) and [Elixir](examples/quotes_spider.ex).

## Generic Spiders

GenSpider also comes with some useful generic spiders that can be found in the
[examples](examples) directory. Their aim is to provide convenient functionality
for a few common scraping cases, like following all links on a site based on
certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.

## Installation

If [available in Hex](https://hex.pm/docs/publish), the package can be installed
by adding `gen_spider` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:gen_spider, "~> 0.1.0"}
  ]
end
```

Documentation can be generated with [ExDoc](https://github.com/elixir-lang/ex_doc)
and published on [HexDocs](https://hexdocs.pm). Once published, the docs can
be found at [https://hexdocs.pm/gen_spider](https://hexdocs.pm/gen_spider).

## Contributing

We welcome everyone to contribute to GenSpider and help us tackle existing issues!

Use the [issue tracker][issues] for bug reports or feature requests. Open a [pull request][pulls] when you are ready to contribute.

When submitting a pull request, do not update `CHANGELOG.md`.

## License

GenSpider source code is released under the Apache 2 License.
Check the LICENSE file for more information.

[issues]: https://github.com/sntran/gen_spider/issues
[pulls]: https://github.com/sntran/gen_spider/pulls