Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gfarrell/knobble
Observable-based framework for extracting data from websites, especially when it's a bit more complex or bespoke than a simple crawler will do.
- Host: GitHub
- URL: https://github.com/gfarrell/knobble
- Owner: gfarrell
- Created: 2020-08-17T22:28:28.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-08-30T14:35:27.000Z (over 4 years ago)
- Last Synced: 2024-11-16T14:11:40.105Z (2 months ago)
- Language: TypeScript
- Size: 188 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Knobble: a framework for crawling websites
Knobble is a framework to make it really easy to write fast, custom
crawlers for websites. You can download files, write data to a variety
of common formats, or do pretty much whatever you like. It is designed
to be extremely flexible and easy to build on.

## Motivation
Knobble came about when I had to extract some data from a friend's
website for him (it's a long story). The website was using
all sorts of annoying infini-scroll pages which meant that
[HTTrack](https://www.httrack.com/) just didn't capture most of the
data. The search pages also had lots of circular references, and the
data were often copied to multiple locations. I needed a fast way to
write custom crawlers for the different pages whose data I had to
extract on a very tight timeline. The original use-case was
downloading PDFs and creating CSVs, but you can extend it to just
about any application you like. Check out the `examples` folder for more
ideas for how it can be used.

## TODO
- [x] Finish off rewrite
- [x] Create pipeline functions for connecting things
- [x] Create download helper for pooling connections
- [x] Create a link extraction helper
- [x] Create basic factories for crawling pages as a SourceSink
- [x] See if you can replace the download pool with HTTP[S].Agent
- [x] Create basic sinks for downloading files and writing data files (CSV, JSON)
- [x] Add a Sink-retry wrapper (higher-order sink)
- [x] Add a URL filter
- [x] Write some examples
- [ ] Find a nice way to handle errors
- [ ] Make it a proper library with a good build toolchain
- [ ] Make a nice CLI to show the queue count, progress indicators, etc.
- [ ] Finish README documentation

## Basic Usage
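Knobble's own API is still being documented (see the TODO above), so the sketch below shows the general Observable-based idea using plain RxJS rather than Knobble's exports. The `Page` shape, the `visit` helper, and the URLs are placeholders for your own site-specific code.

```typescript
import { of, from } from "rxjs";
import { expand, mergeMap, filter, toArray } from "rxjs/operators";
import { writeFileSync } from "fs";

// Placeholder page model: each visit yields extracted rows plus outgoing links.
interface Page {
  url: string;
  rows: string[][];
  links: string[];
}

// Placeholder fetch-and-parse step (uses the global fetch from Node 18+).
async function visit(url: string): Promise<Page> {
  const html = await (await fetch(url)).text();
  // ...site-specific parsing of `html` goes here...
  return { url, rows: [], links: [] };
}

const seen = new Set<string>(); // guard against circular references

of("https://example.com/listing")
  .pipe(
    mergeMap((url) => from(visit(url))),
    // Recursively follow the links discovered on each page, 4 at a time.
    expand((page) =>
      from(page.links).pipe(
        filter((url) => {
          if (seen.has(url)) return false;
          seen.add(url);
          return true;
        }),
        mergeMap((url) => from(visit(url)), 4)
      )
    ),
    mergeMap((page) => from(page.rows)), // flatten every page's data rows
    toArray()
  )
  .subscribe((rows) =>
    writeFileSync("out.csv", rows.map((r) => r.join(",")).join("\n"))
  );
```

The same shape works for downloading files instead of collecting rows: swap the final sink for one that writes each response body to disk.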
## Basic Concepts
### Pipelines
### Sources
### Sinks
### Crawlers
### Helpers
#### Link Extractor
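The general idea behind link extraction, sketched here with [cheerio](https://cheerio.js.org/) rather than Knobble's own helper: pull every `href` out of a page and resolve it against the page's URL, so relative links can be fed back into the crawl.

```typescript
import * as cheerio from "cheerio";

// Illustrative only: collect every href on a page and resolve relative URLs.
function extractLinks(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html);
  return $("a[href]")
    .toArray()
    .map((el) => $(el).attr("href"))
    .filter((href): href is string => typeof href === "string")
    .map((href) => new URL(href, baseUrl).toString());
}
```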
#### Downloader
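As noted in the TODO, connection pooling can be handled by Node's built-in `http.Agent`/`https.Agent`. A rough sketch of that approach (the `download` helper below is illustrative, not a Knobble export):

```typescript
import * as https from "https";
import { createWriteStream } from "fs";

// A keep-alive agent caps concurrent sockets, so every download shares the pool.
const agent = new https.Agent({ keepAlive: true, maxSockets: 4 });

function download(url: string, dest: string): Promise<void> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, (res) => {
        const file = createWriteStream(dest);
        res.pipe(file);
        file.on("finish", () => resolve());
        file.on("error", reject);
      })
      .on("error", reject);
  });
}

// All calls share the pooled agent, e.g.:
// Promise.all(pdfUrls.map((u, i) => download(u, `doc-${i}.pdf`)));
```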