Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fschaeffler/lambda-web-scrapper
- Host: GitHub
- URL: https://github.com/fschaeffler/lambda-web-scrapper
- Owner: fschaeffler
- Created: 2019-07-30T08:10:28.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-08-03T21:23:31.000Z (over 5 years ago)
- Last Synced: 2024-12-23T20:24:23.637Z (29 days ago)
- Language: JavaScript
- Size: 99.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AWS Lambda Web Scrapper
This web scraper runs on AWS Lambda with a headless Chromium browser and Puppeteer.
## Development Setup
- install Node.js 8.10, the version the Serverless service runs on
- install yarn from https://yarnpkg.com
- install dependencies via `yarn install`
- install serverless globally via `yarn global add serverless`
## Run (Local)
- run the service via `yarn serve`
## Deploy (Production)
- deploy the Serverless service via `yarn deploy`
- note the API key that was created; it is shown in the deployment output
## Usage (Production)
The actual endpoint URL is displayed in the deployment output. It looks like `https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape`.
To ensure that only authorized clients can call this API, every call requires an API key. Pass the key in the `x-api-key` HTTP request header.
An example call to the endpoint looks like this.
Content of `request.json`:
```
{
  "url": "https://www.sfit.services",
  "xpathes": [
    "//*/div/div/div/a",
    "//*/div/div/div"
  ],
  "uniqueResults": true
}
```

`curl -X POST -H "x-api-key: API-SECRET-KEY" -d @request.json https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape`
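The same call can be made from Node.js with the built-in `https` module. This is a minimal sketch, not part of the project: the function names `buildScrapeRequest` and `scrape` are hypothetical, `GATEWAY-ID` and `API-SECRET-KEY` are the placeholders from the curl call, and the `Content-Type: application/json` header is an assumption (the curl call does not set it). The response format is not documented here, so the sketch returns the raw response body.

```javascript
const https = require('https');

// Build the https.request options and the serialized body for a scrape call.
// gatewayId and apiKey are placeholders taken from the deployment output.
function buildScrapeRequest(gatewayId, apiKey, payload) {
  const body = JSON.stringify(payload);
  return {
    options: {
      method: 'POST',
      hostname: `${gatewayId}.execute-api.eu-central-1.amazonaws.com`,
      path: '/prod/scrape',
      headers: {
        'x-api-key': apiKey,
        // Assumption: the endpoint accepts a JSON content type.
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    },
    body,
  };
}

// Send the request and resolve with the status code and the raw response body.
function scrape(gatewayId, apiKey, payload) {
  const { options, body } = buildScrapeRequest(gatewayId, apiKey, payload);
  return new Promise((resolve, reject) => {
    const req = https.request(options, (res) => {
      let data = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { data += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body: data }));
    });
    req.on('error', reject);
    req.end(body);
  });
}
```

A call would then look like `scrape('GATEWAY-ID', 'API-SECRET-KEY', require('./request.json')).then(console.log)`.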
### Request Options
The request options are passed as JSON data with the following structure.
- `url`: [string] The URL to scrape
- `xpathes`: [array of strings] A list of XPath expressions whose matching content is retrieved
- `uniqueResults` (optional): [boolean] When set to `true`, only unique results are returned
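
A client can check a request body against this structure before sending it. The following is a minimal sketch under that assumption; `validateScrapeRequest` is a hypothetical helper, and the service itself may validate requests differently.

```javascript
// Return a list of problems found in a request body.
// An empty list means the body matches the documented structure.
function validateScrapeRequest(request) {
  if (request === null || typeof request !== 'object' || Array.isArray(request)) {
    return ['request must be a JSON object'];
  }
  const errors = [];
  if (typeof request.url !== 'string' || request.url.length === 0) {
    errors.push('url must be a non-empty string');
  }
  if (!Array.isArray(request.xpathes) ||
      !request.xpathes.every((xpath) => typeof xpath === 'string')) {
    errors.push('xpathes must be an array of strings');
  }
  // uniqueResults is optional, so only check its type when it is present.
  if (request.uniqueResults !== undefined && typeof request.uniqueResults !== 'boolean') {
    errors.push('uniqueResults must be a boolean when present');
  }
  return errors;
}
```

For the `request.json` example above, `validateScrapeRequest` returns an empty list.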