https://github.com/coderhs/leo
Web Scrapper API
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/coderhs/leo
- Owner: coderhs
- Created: 2016-09-07T07:46:36.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-09-07T08:32:45.000Z (over 8 years ago)
- Last Synced: 2024-10-30T09:42:04.926Z (3 months ago)
- Language: Ruby
- Size: 28.3 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# README
This is an API-only Ruby on Rails application that scrapes the h1, h2, and h3 tags and the links present on the page at a given URL.
## API End Points
```
POST '/v1/websites', params: domain (include the http:// or https:// scheme)
GET '/v1/websites/:key', where :key is the key of the scraping job
GET '/v1/websites', lists all the websites scraped so far
```

To keep the API responsive, I use background jobs to do the scraping. When a user submits a domain, a job is created, and the result URL is sent in the response once the job has been created. The user can check the result URL for the status of the job as well.

## Example
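
The submit-then-poll flow described above could be driven from Ruby roughly as follows. This is a minimal sketch, not code from the project: the names `wait_for_result`, `HTTP_GET`, `fetch`, `interval`, and `max_attempts` are hypothetical, and it assumes the result URL returns a JSON object whose status sits either at the top level or under a `"result"` key (as in the submit response). The README does not document the shape of the result payload.

```ruby
require "json"
require "net/http"
require "uri"

# Default fetcher: GET the URL and parse the body as JSON.
HTTP_GET = ->(url) { JSON.parse(Net::HTTP.get(URI(url))) }

# Polls result_url until the job is no longer PENDING.
# Returns the parsed response, or nil if max_attempts is exhausted.
# Assumption: the response is a JSON object carrying a "status" field,
# either at the top level or nested under "result".
def wait_for_result(result_url, fetch: HTTP_GET, interval: 1, max_attempts: 30)
  max_attempts.times do
    body = fetch.call(result_url)
    result = body["result"]
    status = result.is_a?(Hash) ? result["status"] : body["status"]
    return body unless status == "PENDING"
    sleep interval
  end
  nil
end
```

With the server running locally, `wait_for_result("http://localhost:3000/v1/websites/389b76561f52f5f0337742b68354c106")` would block until the job leaves the `PENDING` state or roughly 30 seconds pass. The `fetch:` parameter exists only so the HTTP call can be swapped out, for example in tests.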
**Submit a domain**
```sh
# command
curl -X POST "http://localhost:3000/v1/websites?domain=https://simple.wikipedia.org/wiki/Wikipedia"
```

```json
{"result":{"domain":"https://simple.wikipedia.org/wiki/Wikipedia","status":"PENDING","result_url":"http://localhost:3000/v1/website/389b76561f52f5f0337742b68354c106"}}
```

**Fetch Result**
```sh
# command
curl http://localhost:3000/v1/websites/389b76561f52f5f0337742b68354c106
```

Result: https://gist.github.com/coderhs/9d84b96875fa996a7a80195cbe96425f

**Display all websites**
```sh
curl http://localhost:3000/v1/websites
```

```json
[
  {
    "domain": "http://csnipp.com",
    "status": "COMPLETED",
    "result_url": "http://localhost:3000/v1/websites/3bae0b276b4c475c1e6bd43f2266b80e"
  },
  {
    "domain": "https://redpanthers.co",
    "status": "COMPLETED",
    "result_url": "http://localhost:3000/v1/websites/e14dd438487e385054747f1091e86a2e"
  },
  {
    "domain": "https://simple.wikipedia.org/wiki/Wikipedia",
    "status": "COMPLETED",
    "result_url": "http://localhost:3000/v1/websites/389b76561f52f5f0337742b68354c106"
  }
]
```

## To Do

Implement Priority Queue: Presently all the scraping is done through a single queue, which does not hold up well when a lot of users are using the website at once. We need a priority queue system that lets people submit to a separate, faster queue when they need a result quickly.
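
As a rough illustration of the idea, here is a plain-Ruby sketch of a two-level queue in which anything submitted to a `:fast` queue is served before the `:default` queue. The class `PriorityJobQueue` and the queue names are hypothetical and not part of this project. In the Rails app itself this would more likely be handled through named queues in whichever background job backend is in use, which the README does not specify.

```ruby
# Hypothetical two-level job queue: :fast is always drained before :default.
class PriorityJobQueue
  # Queue names in the order they are served.
  QUEUES = %i[fast default].freeze

  def initialize
    @queues = QUEUES.map { |name| [name, []] }.to_h
  end

  # Adds a domain to the named queue (:default unless told otherwise).
  def push(domain, queue: :default)
    raise ArgumentError, "unknown queue: #{queue}" unless @queues.key?(queue)
    @queues[queue] << domain
    self
  end

  # Returns the next domain to scrape, or nil when every queue is empty.
  def pop
    QUEUES.each do |name|
      job = @queues[name].shift
      return job if job
    end
    nil
  end
end

queue = PriorityJobQueue.new
queue.push("http://csnipp.com")
queue.push("https://redpanthers.co", queue: :fast)
queue.pop # => "https://redpanthers.co"
queue.pop # => "http://csnipp.com"
```

Within each queue, jobs are still served in submission order, so the single-queue behaviour is preserved for everyone who does not ask for the fast queue.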