Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gfarrell/knobble
Observable-based framework for extracting data from websites, especially when it's a bit more complex or bespoke than a simple crawler will do.
- Host: GitHub
- URL: https://github.com/gfarrell/knobble
- Owner: gfarrell
- Created: 2020-08-17T22:28:28.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-08-30T14:35:27.000Z (over 4 years ago)
- Last Synced: 2024-11-16T14:11:40.105Z (2 months ago)
- Language: TypeScript
- Size: 188 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Knobble: a framework for crawling websites
Knobble is a framework to make it really easy to write fast, custom
crawlers for websites. You can download files, write data to a variety
of common formats, or do pretty much whatever you like. It is designed
to be extremely flexible and easy to build on.

## Motivation
Knobble came about when I had to extract some data from a friend's
website for him (it's a long story). The website was using
all sorts of annoying infini-scroll pages which meant that
[HTTrack](https://www.httrack.com/) just didn't capture most of the
data. The search pages also had lots of circular references, and the
data were often copied to multiple locations. I needed a fast way to
write custom crawlers for the different pages whose data I had to
extract on a very tight timeline. The original use-case was
downloading PDFs and creating CSVs, but you can extend it to just
about any application you like. Check out the `examples` folder for more
ideas for how it can be used.

## TODO
- [x] Finish off rewrite
- [x] Create pipeline functions for connecting things
- [x] Create download helper for pooling connections
- [x] Create a link extraction helper
- [x] Create basic factories for crawling pages as a SourceSink
- [x] See if you can replace the download pool with HTTP[S].Agent
- [x] Create basic sinks for downloading files and writing data files (CSV, JSON)
- [x] Add a Sink-retry wrapper (higher-order sink)
- [x] Add a URL filter
- [x] Write some examples
- [ ] Find a nice way to handle errors
- [ ] Make it a proper library with a good build toolchain
- [ ] Make a nice CLI to show the queue count, progress indicators, etc.
- [ ] Finish README documentation

## Basic Usage
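Knobble's own API is still being documented (see the TODO above), so the sketch below shows the general Observable-based idea using plain RxJS rather than Knobble's exports. The `Page` shape, the `visit` helper, and the URLs are placeholders for your own site-specific code.

```typescript
import { of, from } from "rxjs";
import { expand, mergeMap, filter, toArray } from "rxjs/operators";
import { writeFileSync } from "fs";

// Placeholder page model: each visit yields extracted rows plus outgoing links.
interface Page {
  url: string;
  rows: string[][];
  links: string[];
}

// Placeholder fetch-and-parse step (uses the global fetch from Node 18+).
async function visit(url: string): Promise<Page> {
  const html = await (await fetch(url)).text();
  // ...site-specific parsing of `html` goes here...
  return { url, rows: [], links: [] };
}

const seen = new Set<string>(); // guard against circular references

of("https://example.com/listing")
  .pipe(
    mergeMap((url) => from(visit(url))),
    // Recursively follow the links discovered on each page, 4 at a time.
    expand((page) =>
      from(page.links).pipe(
        filter((url) => {
          if (seen.has(url)) return false;
          seen.add(url);
          return true;
        }),
        mergeMap((url) => from(visit(url)), 4)
      )
    ),
    mergeMap((page) => from(page.rows)), // flatten every page's data rows
    toArray()
  )
  .subscribe((rows) =>
    writeFileSync("out.csv", rows.map((r) => r.join(",")).join("\n"))
  );
```

The same shape works for downloading files instead of collecting rows: swap the final sink for one that writes each response body to disk.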
## Basic Concepts
### Pipelines
### Sources
### Sinks
### Crawlers
### Helpers
#### Link Extractor
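The general idea behind link extraction, sketched here with [cheerio](https://cheerio.js.org/) rather than Knobble's own helper: pull every `href` out of a page and resolve it against the page's URL, so relative links can be fed back into the crawl.

```typescript
import * as cheerio from "cheerio";

// Illustrative only: collect every href on a page and resolve relative URLs.
function extractLinks(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html);
  return $("a[href]")
    .toArray()
    .map((el) => $(el).attr("href"))
    .filter((href): href is string => typeof href === "string")
    .map((href) => new URL(href, baseUrl).toString());
}
```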
#### Downloader
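As noted in the TODO, connection pooling can be handled by Node's built-in `http.Agent`/`https.Agent`. A rough sketch of that approach (the `download` helper below is illustrative, not a Knobble export):

```typescript
import * as https from "https";
import { createWriteStream } from "fs";

// A keep-alive agent caps concurrent sockets, so every download shares the pool.
const agent = new https.Agent({ keepAlive: true, maxSockets: 4 });

function download(url: string, dest: string): Promise<void> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, (res) => {
        const file = createWriteStream(dest);
        res.pipe(file);
        file.on("finish", () => resolve());
        file.on("error", reject);
      })
      .on("error", reject);
  });
}

// All calls share the pooled agent, e.g.:
// Promise.all(pdfUrls.map((u, i) => download(u, `doc-${i}.pdf`)));
```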