https://github.com/ajmeese7/dynamic-page-retrieval

Scrape data from JS-rendered pages
dynamic-content puppeteer scraping

# dynamic-page-retrieval

This project aims to make web scraping easier for developers in any language.
You send a URL as a parameter in a GET request to a Heroku application and receive
the page's HTML in response. Its main benefit is that the HTML is returned after
JavaScript has dynamically populated the page, so you can scrape nearly any page.
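
This README does not show the server code, so the following is only a rough sketch of how a service like this could work. It assumes Node's built-in `http` module and the `puppeteer` package (suggested by the Puppeteer buildpack in the setup steps). The names `renderPage`, `extractTarget`, and `createScrapeServer` are hypothetical and are not taken from the project.

```javascript
const http = require("http");

// Hypothetical renderer: loads the page in headless Chrome and returns the
// HTML after the page's own JavaScript has run.
async function renderPage(targetUrl) {
  // Loaded lazily so the rest of the sketch works without puppeteer installed
  const puppeteer = require("puppeteer");
  // These flags are commonly needed in container environments (assumption)
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before reading the DOM
    await page.goto(targetUrl, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Returns the value of the URL parameter for /scrape requests, otherwise null
function extractTarget(requestUrl) {
  const parsed = new URL(requestUrl, "http://localhost");
  if (parsed.pathname !== "/scrape") return null;
  return parsed.searchParams.get("URL");
}

// The renderer is injectable so the server can be tried without a browser
function createScrapeServer(render = renderPage) {
  return http.createServer(async (req, res) => {
    const target = extractTarget(req.url);
    if (!target) {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("Expected a request like /scrape?URL=<page to scrape>");
      return;
    }
    try {
      const html = await render(target);
      res.writeHead(200, { "Content-Type": "text/html; charset=utf-8" });
      res.end(html);
    } catch (err) {
      res.writeHead(502, { "Content-Type": "text/plain" });
      res.end("Could not render the requested page");
    }
  });
}

// To run it: createScrapeServer().listen(process.env.PORT || 3000);
```

Keeping the rendering step in its own function means the request handling can be exercised with a stub renderer, which may help with the "better tests" item under Contributing.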

## Usage

Simply send a GET request to `https://dynamic-page-retrieval.herokuapp.com/scrape` with a URL
parameter, which should be formatted like so: `?URL=https://www.google.com`.

For example, to scrape `https://www.google.com` with the pre-hosted Heroku application,
the full URL for your GET request would be
`https://dynamic-page-retrieval.herokuapp.com/scrape?URL=https://www.google.com`.
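
If the page you want to scrape has its own query string, its `?` and `&` characters can be confused with the parameters of the scrape request itself. The sketch below builds the request URL with the standard `URL` class, which percent-encodes the value. `buildScrapeUrl` is a hypothetical helper, not part of the project, and the sketch assumes the service decodes the `URL` parameter in the usual way.

```javascript
// Endpoint taken from the usage instructions above
const SCRAPE_ENDPOINT = "https://dynamic-page-retrieval.herokuapp.com/scrape";

// Hypothetical helper: returns the full GET request URL for a target page
function buildScrapeUrl(targetUrl) {
  const requestUrl = new URL(SCRAPE_ENDPOINT);
  // searchParams.set() percent-encodes the value, so characters such as
  // "?" and "&" inside targetUrl stay part of the URL parameter
  requestUrl.searchParams.set("URL", targetUrl);
  return requestUrl.toString();
}

console.log(buildScrapeUrl("https://example.com/search?q=cats&page=2"));
```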

An example of how to format this GET request in JavaScript:
```javascript
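const Http = new XMLHttpRequest();
const url = "https://dynamic-page-retrieval.herokuapp.com/scrape?URL=https://www.google.com";
Http.open("GET", url);
// Register the handler before sending so no state change is missed
Http.onreadystatechange = () => {
  // Ignore intermediate states and failed requests
  if (Http.readyState !== XMLHttpRequest.DONE || Http.status !== 200) return;
  // Replace console.log() with what you need the HTML for,
  // or assign it to a global variable for use elsewhere
  console.log(Http.responseText);
};
Http.send();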
```
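
The same request can also be written with `fetch`, which is available in current browsers and in Node.js 18 or later. This is a minimal sketch rather than the project's own client: `getRenderedHtml` and its `fetchImpl` parameter are hypothetical names, and the sketch assumes the service accepts a percent-encoded `URL` value.

```javascript
// Endpoint taken from the usage instructions above
const SCRAPE_SERVICE = "https://dynamic-page-retrieval.herokuapp.com/scrape";

// Hypothetical helper: resolves with the rendered HTML of targetUrl.
// fetchImpl can be swapped for a stub, so the function runs without network access.
async function getRenderedHtml(targetUrl, fetchImpl = fetch) {
  const requestUrl = `${SCRAPE_SERVICE}?URL=${encodeURIComponent(targetUrl)}`;
  const response = await fetchImpl(requestUrl);
  // Fail loudly instead of returning an error page as if it were the HTML
  if (!response.ok) {
    throw new Error(`Scrape request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage:
// getRenderedHtml("https://www.google.com")
//   .then((html) => console.log(html))
//   .catch((err) => console.error(err.message));
```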

## Set up your own

First, create a free [Heroku](https://signup.heroku.com) account. If you already have one, there is
no need to make a new one.

Next, make sure you have [Node.js and npm](https://nodejs.org/en/download/) installed locally.
While creating this project, I used Node v9.3.0 and npm v6.4.1, but the exact versions
shouldn't matter much if you are only deploying to Heroku. If you are going to run this
locally, the version will likely be more of a factor.

Clone this project to your machine and open a terminal in the folder. Enter the following
sequence of commands:

`heroku create`

`heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack`

`git push heroku master`

`heroku ps:scale web=1`

`heroku open`

And you should have a working copy of the project!

I am using [kaffeine](http://kaffeine.herokuapp.com/) to keep my dyno alive and reduce loading
times. It is currently set to sleep at 12:00 AM to conserve hours, so it will not be awake from
12:00 AM to 6:00 AM unless someone sends a request during that interval. An alternative is to add
something like this to the app to help it keep itself awake:
```javascript
const http = require("http");
setInterval(function() {
  // Put your app's name before ".herokuapp.com"
  http.get("http://.herokuapp.com");
}, 300000); // every 5 minutes (300,000 ms)
```
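
If you want the self-ping approach to respect a sleep window like the one described above, one option is to skip pings during those hours. The following is a sketch under stated assumptions: `isQuietHour` and `startKeepAlive` are hypothetical helpers, the hours are read from the server's local clock (which may not match your own time zone), and `appUrl` stands in for your own app's address. Once the dyno falls asleep its timer stops too, so an outside request is still needed to wake it again.

```javascript
const https = require("https");

// Quiet window matching the 12:00 AM to 6:00 AM sleep period described above
const QUIET_START_HOUR = 0;
const QUIET_END_HOUR = 6;

// Returns true when the given time falls inside the quiet window
function isQuietHour(date) {
  const hour = date.getHours();
  return hour >= QUIET_START_HOUR && hour < QUIET_END_HOUR;
}

// appUrl is a placeholder for your own app's address (assumption)
function startKeepAlive(appUrl, intervalMs = 5 * 60 * 1000) {
  return setInterval(() => {
    if (isQuietHour(new Date())) return; // skip the ping and let the dyno idle
    https
      .get(appUrl, (res) => res.resume()) // discard the body to free the socket
      .on("error", (err) => console.error("Keep-alive ping failed:", err.message));
  }, intervalMs);
}

// Usage: const timer = startKeepAlive("https://<your-app>.herokuapp.com");
```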

## Contributing

Feel free to open a PR for README examples of GET requests in other languages, a prettier
homepage that displays the scraped page's information in a nicer format, better tests,
better error handling, etc.

### Ideas
- Make an npm package where you just put in the URL and get back the scraped content
- Make similar projects in other languages (even though that was a bust before)