Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/gfarrell/knobble

Observable-based framework for extracting data from websites, especially when it's a bit more complex or bespoke than a simple crawler will do.
https://github.com/gfarrell/knobble

Last synced: about 22 hours ago
JSON representation

Observable-based framework for extracting data from websites, especially when it's a bit more complex or bespoke than a simple crawler will do.

Awesome Lists containing this project

README

        

# Knobble: a framework for crawling websites

Knobble is a framework to make it really easy to write fast, custom
crawlers for websites. You can download files, write data to a variety
of common formats, or do pretty much whatever you like. It is designed
to be extremely flexible and easy to build on.

## Motivation

Knobble came about when I had to extract some data from a friend's
website for him (it's a long story). The website was using
all sorts of annoying infini-scroll pages which meant that
[HTTrack](https://www.httrack.com/) just didn't capture most of the
data. The search pages also had lots of circular references, and the
data were often copied to multiple locations. I needed a really fast way
to write custom crawlers for the different pages whose data I really
needed to extract on a very tight timeline. The original use-case was
downloading PDFs and creating CSVs, but you can really extend this to
any application you would like. Check out the `examples` folder for more
ideas as to how it can be used.

## TODO

- [x] Finish off rewrite
- [x] Create pipeline functions for connecting things
- [x] Create download helper for pooling connections
- [x] Create a link extraction helper
- [x] Create basic factories for crawling pages as a SourceSink
- [x] See if you can replace the download pool with HTTP[S].Agent
- [x] Create basic sinks for downloading files and writing data files (CSV, JSON)
- [x] Add a Sink-retry wrapper (higher-order sink)
- [x] Add a URL filter
- [x] Write some examples
- [ ] Find a nice way to handle errors
- [ ] Make it a proper library with a good build toolchain
- [ ] Make a nice CLI to show the queue count, progress indicators, etc.
- [ ] Finish README documentation

## Basic Usage

## Basic Concepts

### Pipelines

### Sources

### Sinks

### Crawlers

### Helpers

#### Link Extractor

#### Downloader