https://github.com/owainlewis/falkor
Open Source web scraping API. Falkor turns web pages into queryable JSON
webscraping webscrapper
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/owainlewis/falkor
- Owner: owainlewis
- License: epl-1.0
- Created: 2015-06-13T18:27:42.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2016-02-12T20:40:43.000Z (over 10 years ago)
- Last Synced: 2025-03-27T23:33:05.774Z (about 1 year ago)
- Topics: webscraping, webscrapper
- Language: Clojure
- Homepage:
- Size: 21.5 KB
- Stars: 188
- Watchers: 11
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Falkor
A web service for turning HTML pages into traversable JSON documents
Falkor is at a very early stage of development. If you have a feature request, create an issue on the project.
## Getting started
To run the server locally:
```
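# Build the standalone jar
lein uberjar
# Build the Docker image
docker build -t falkor .
# Run the container; -p publishes the port (assumes the server listens on 5000 inside the container)
docker run -t -p 5000:5000 falkor
# Visit http://localhost:5000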
```
## Coming soon
+ Better error handling
+ CORS
+ Query filtering (return only certain attributes)
+ Fetching multiple elements in a single request (e.g. `[h1 > a, .subtitle]`)
## Usage
Get all the title links from the Reddit.com home page
https://falkor-api.herokuapp.com/api/query?url=http://reddit.com&query=a.title
Grab all the news stories from Digg.com
https://falkor-api.herokuapp.com/api/query?url=http://digg.com&query=.story-title%20a
Extract all the images from Digg.com
https://falkor-api.herokuapp.com/api/query?url=http://digg.com&query=img[src]
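The example URLs above mix raw and percent-encoded characters (`%20` for the space, but unescaped `[` and `]`). The following is a minimal sketch of building such URLs with consistent encoding. It assumes only the endpoint and the `url` and `query` parameters shown above. The helper names `build_query_url` and `fetch_json` are hypothetical and not part of Falkor, and the hosted instance may no longer be reachable.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint taken from the usage examples; swap in http://localhost:5000/api/query
# for a local instance (assumes the same path is served locally).
FALKOR_ENDPOINT = "https://falkor-api.herokuapp.com/api/query"


def build_query_url(page_url, selector, endpoint=FALKOR_ENDPOINT):
    """Return a query URL with the page URL and CSS selector percent-encoded."""
    params = {"url": page_url, "query": selector}
    # quote_via=quote encodes spaces as %20 rather than "+"
    return endpoint + "?" + urlencode(params, quote_via=quote)


def fetch_json(page_url, selector, endpoint=FALKOR_ENDPOINT):
    """Fetch and decode the response (assumes the service replies with JSON)."""
    with urlopen(build_query_url(page_url, selector, endpoint)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(build_query_url("http://digg.com", ".story-title a"))
    print(build_query_url("http://digg.com", "img[src]"))
```

Encoding the selector matters for queries such as `h1 > a`, where the spaces and `>` would otherwise have to be escaped by hand.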
## TODO
Add filters to remove some of the attribute cruft.
For example, to extract only the text of an element and ignore its other attributes:
```
&filter=[text]
```
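Until a server-side filter exists, a client can approximate it after the fact. The sketch below assumes each matched element arrives as a JSON object with keys such as `text`. The response shape is not documented here, so the sample data and the `keep_fields` helper are purely illustrative.

```python
def keep_fields(elements, fields):
    """Return copies of each element dict containing only the requested keys."""
    wanted = set(fields)
    return [{k: v for k, v in el.items() if k in wanted} for el in elements]


# Hypothetical response data; the real Falkor output shape may differ.
sample = [
    {"tag": "a", "text": "First story", "href": "/one", "class": "title"},
    {"tag": "a", "text": "Second story", "href": "/two", "class": "title"},
]

if __name__ == "__main__":
    print(keep_fields(sample, ["text"]))
```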
## License
Copyright © 2015 Forward Digital Limited
Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.