# Spidey

A dead-simple, concurrent web crawler which focuses on ease of use and speed.

## Installation

The package can be installed by adding `spidey` to your list of dependencies in
`mix.exs`:

```elixir
def deps do
  [
    {:spidey, "~> 0.3"}
  ]
end
```

The docs can be found at https://hexdocs.pm/spidey.

## Usage

Spidey has been designed with ease of use in mind, so all you have to do to get
started is:

```elixir
iex> Spidey.crawl("https://manzanit0.github.io", :crawler_name, pool_size: 15)
[
"https://https://manzanit0.github.io/foo",
"https://https://manzanit0.github.io/bar-baz/#",
...
]
```

In a nutshell, the above line will:

1. Spin up a new supervision tree under the `Spidey` OTP application that will
   supervise a task supervisor and the queue of URLs.
2. Create an ETS table to store crawled URLs.
3. Crawl the website.
4. Return all the URLs as a list.
5. Tear down the supervision tree and the ETS table.

The function is blocking, but if you call it asynchronously multiple times,
each invocation will spin up a new supervision tree with a new task supervisor
and a new queue, as the sketch below illustrates.
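A minimal sketch of that pattern, wrapping each blocking crawl in a `Task` (the
site URLs and crawler names here are made up for illustration):

```elixir
sites = [
  {"https://manzanit0.github.io", :blog_crawler},
  {"https://example.com", :example_crawler}
]

# Each Spidey.crawl/3 call blocks within its own task; every invocation
# gets its own supervision tree, queue and ETS table.
results =
  sites
  |> Enum.map(fn {url, name} ->
    Task.async(fn -> Spidey.crawl(url, name, pool_size: 15) end)
  end)
  |> Task.await_many(:infinity)
```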

### But why is it blocking?

It has been made blocking, rather than non-blocking, because there are already
multiple libraries out there which do async crawling... and I needed one that
blocked, letting me decide when to run it synchronously and when not to.

### Specifying your own filter

Furthermore, if you want to specify your own filter for crawled URLs, you can
do so by implementing the `Spidey.Filter` behaviour:

```elixir
defmodule MyApp.RssFilter do
  @behaviour Spidey.Filter

  @impl true
  def filter_urls(urls, _opts) do
    urls
    |> Stream.reject(&String.ends_with?(&1, "feed/"))
    |> Stream.reject(&String.ends_with?(&1, "feed"))
  end
end
```

And simply pass it down to the crawler as an option:

```elixir
Spidey.crawl("https://manzanit0.github.io", :crawler_name, filter: MyApp.RssFilter)
```

It's encouraged to use the `Stream` module instead of `Enum`, since the code
that handles the filtering uses streams.

## Configuration

Currently Spidey supports the following configuration:

- `:log` - the log level used when logging events with Elixir's
  `Logger`. If `false`, disables logging. Defaults to `:debug`.

```elixir
config :spidey, log: :info
```
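For example, to disable logging altogether:

```elixir
config :spidey, log: false
```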

## Using the CLI

To be able to run the application, make sure you have Elixir installed. Please
check the official instructions: [link](https://elixir-lang.org/install.html).

Once you have Elixir installed, set up the application by running:

```
git clone https://github.com/Manzanit0/spidey
cd spidey
mix deps.get
mix escript.build
```

To crawl websites, run the escript `./spidey`:

```
./spidey --site https://manzanit0.github.io/
```

[Escripts](https://hexdocs.pm/mix/master/Mix.Tasks.Escript.Build.html)
will run on any system which has Erlang/OTP installed, regardless of
whether it has Elixir or not.

### CLI options

Spidey provides two main functionalities: crawling a specific domain, and
saving the results to a file according to the
[plain text sitemap protocol](https://www.sitemaps.org/protocol.html).
For the latter, simply append `--save` to the execution, as shown below.
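For instance (the site URL is just an example), this crawls the domain and
saves the results to a file:

```
./spidey --site https://manzanit0.github.io/ --save
```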