Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Manzanit0/spidey
A dead-simple crawler which focuses on ease of use and speed.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/Manzanit0/spidey
- Owner: Manzanit0
- License: mit
- Created: 2019-08-19T06:05:55.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-05-03T17:56:20.000Z (over 3 years ago)
- Last Synced: 2024-06-22T16:45:46.353Z (5 months ago)
- Language: Elixir
- Homepage:
- Size: 1.87 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Spidey
A dead-simple, concurrent web crawler which focuses on ease of use and speed.
## Installation
The package can be installed by adding `spidey` to your list of dependencies in
`mix.exs`:

```elixir
def deps do
[
{:spidey, "~> 0.3"}
]
end
```

The docs can be found at https://hexdocs.pm/spidey.
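After adding the entry, fetch the dependency with the standard Mix command:

```
mix deps.get
```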
## Usage
Spidey has been designed with ease of use in mind, so all you have to do to get
started is:

```elixir
iex> Spidey.crawl("https://manzanit0.github.io", :crawler_name, pool_size: 15)
[
  "https://manzanit0.github.io/foo",
  "https://manzanit0.github.io/bar-baz/#",
  ...
]
```

In a nutshell, the above line will:
1. Spin up a new supervision tree under the `Spidey` OTP Application that will
supervise a task supervisor and the queue of URLs.
2. Create an ETS table to store crawled URLs.
3. Crawl the website.
4. Return all the URLs as a list.
5. Tear down the supervision tree and the ETS table.

The function is blocking, but if you were to call it asynchronously
multiple times, each invocation would spin up a new supervision tree with a
new task supervisor and a new queue, as in the sketch below.
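A minimal sketch of such concurrent invocations, assuming the site URLs and
crawler names below are placeholders:

```elixir
# Each Spidey.crawl/3 call blocks within its own task, and each invocation
# gets its own supervision tree, task supervisor and URL queue.
sites = [
  {"https://manzanit0.github.io", :blog_crawler},
  {"https://example.com", :example_crawler}
]

sites
|> Task.async_stream(fn {url, name} -> Spidey.crawl(url, name, pool_size: 15) end,
  timeout: :infinity
)
|> Enum.flat_map(fn {:ok, urls} -> urls end)
```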
### But why is it blocking?
It has been made blocking rather than non-blocking because there are already
multiple libraries out there which do async crawling... and I needed a
blocking one, which lets me decide when to run it synchronously and when
not to.

### Specifying your own filter
Furthermore, if you want to specify your own filter for crawled
URLs, you can do so by implementing the `Spidey.Filter` behaviour:

```elixir
defmodule MyApp.RssFilter do
  @behaviour Spidey.Filter

  @impl true
  def filter_urls(urls, _opts) do
    urls
    |> Stream.reject(&String.ends_with?(&1, "feed/"))
    |> Stream.reject(&String.ends_with?(&1, "feed"))
  end
end
```

And simply pass it down to the crawler as an option:
```elixir
Spidey.crawl("https://manzanit0.github.io", :crawler_name, filter: MyApp.RssFilter)
```

It's encouraged to use the `Stream` module instead of `Enum`, since the code
that handles the filtering uses streams.

## Configuration
Currently Spidey supports the following configuration:
- `:log` - the log level used when logging events with Elixir's
  Logger. If `false`, disables logging. Defaults to `:debug`.

```elixir
config :spidey, log: :info
```

## Using the CLI
To be able to run the application, make sure you have Elixir installed. Please
check the official instructions: [link](https://elixir-lang.org/install.html)

Once you have Elixir installed, set up the application by running:
```
git clone https://github.com/Manzanit0/spidey
cd spidey
mix deps.get
mix escript.build
```

To crawl websites, run the escript `./spidey`:
```
./spidey --site https://manzanit0.github.io/
```

[Escripts](https://hexdocs.pm/mix/master/Mix.Tasks.Escript.Build.html)
will run on any system which has Erlang/OTP installed, regardless of
whether it has Elixir or not.

### CLI options
Spidey provides two main functionalities: crawling a specific domain, and
saving the result to a file according to the
[plain text sitemap protocol](https://www.sitemaps.org/protocol.html).
For the latter, simply append `--save` to the invocation, as shown below.
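For example, crawling the same site as above and persisting the sitemap to a
file:

```
./spidey --site https://manzanit0.github.io/ --save
```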