Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ajmeese7/dynamic-page-retrieval
Scrape data from JS-rendered pages
- Host: GitHub
- URL: https://github.com/ajmeese7/dynamic-page-retrieval
- Owner: ajmeese7
- License: MIT
- Created: 2018-09-03T22:10:33.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-06-18T15:17:31.000Z (7 months ago)
- Last Synced: 2024-11-24T17:58:32.991Z (about 1 month ago)
- Topics: dynamic-content, puppeteer, scraping
- Language: EJS
- Homepage: https://dynamic-page-retrieval.herokuapp.com/scrape
- Size: 194 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# dynamic-page-retrieval
The point of this project is to make web scraping easier for developers in any language.
This allows you to send a URL as a parameter to a Heroku application via a GET request
and receive the scraped HTML as a result. The most helpful part of this project is that
it returns the web page after it has been dynamically populated by JavaScript, so you
can scrape nearly any page.

## Usage
Simply send a GET request to `https://dynamic-page-retrieval.herokuapp.com/scrape` with a URL
parameter, which should be formatted like so: `?URL=https://www.google.com`.

So, the entire URL for your GET request, if you were going to use the pre-hosted Heroku
application, would be `https://dynamic-page-retrieval.herokuapp.com/scrape?URL=https://www.google.com`
if you wanted to scrape `https://www.google.com`.

An example of how to format this GET request in JavaScript:
```javascript
const Http = new XMLHttpRequest();
const url = "https://dynamic-page-retrieval.herokuapp.com/scrape?URL=https://www.google.com";
Http.open("GET", url);
Http.onreadystatechange = () => {
  if (Http.readyState === XMLHttpRequest.DONE) {
    // Replace console.log() with what you need the HTML for,
    // or assign it to a global variable for use elsewhere
    console.log(Http.responseText);
  }
};
Http.send();
```

## Set up your own
First, create a free [Heroku](https://signup.heroku.com/) account. If you already have one, there is
no need to make a new one.

Next, make sure you have [Node.js and npm](https://nodejs.org/en/download/) installed locally.
In the creation of this project, I used Node v9.3.0 and npm v6.4.1, but the exact versions
shouldn't matter much since you are just going to be deploying to Heroku. If you are going
to run this locally, then the version will likely be more of a factor.

Clone this project to your machine and open a terminal in the folder. Enter the following
sequence of commands:

`heroku create`
`heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack`
`git push heroku master`
`heroku ps:scale web=1`
`heroku open`
And you should have a working copy of the project!
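Under the hood, a service like this presumably pairs an HTTP server with headless Chrome. A minimal sketch of such a `/scrape` route, assuming `express` and `puppeteer` as dependencies (this is illustrative, not necessarily the project's actual implementation; the function names are made up):

```javascript
// Handles GET /scrape?URL=<target>: loads the page in headless Chrome
// and responds with the fully rendered HTML
async function scrapeHandler(req, res) {
  const puppeteer = require("puppeteer"); // loaded lazily inside the handler
  const target = req.query.URL;
  if (!target) {
    return res.status(400).send("Missing URL parameter");
  }
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
  try {
    const page = await browser.newPage();
    // "networkidle0" waits until client-side JavaScript has finished loading
    await page.goto(target, { waitUntil: "networkidle0" });
    res.send(await page.content());
  } catch (err) {
    res.status(500).send(String(err));
  } finally {
    await browser.close();
  }
}

// Wires the handler into an Express app
function createApp() {
  const express = require("express");
  const app = express();
  app.get("/scrape", scrapeHandler);
  return app;
}

// createApp().listen(process.env.PORT || 3000);
```

The `--no-sandbox` flag is commonly required for Chrome to launch inside Heroku's container environment, which is also why the Puppeteer buildpack above is needed.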
I am using [kaffeine](http://kaffeine.herokuapp.com/) to keep my dyno alive to reduce loading
times. It is currently set to sleep at 12:00 AM to conserve hours, so it will not be awake from
12:00-6:00 unless someone sends a request during that time interval. An alternative is to add
something like this to the app to help it keep itself awake:
```javascript
const http = require("http");
// Replace <your-app-name> with your Heroku app's name
setInterval(() => {
  http.get("http://<your-app-name>.herokuapp.com");
}, 300000); // every 5 minutes (300,000 ms)
```

## Contributing
Feel free to open a PR for README examples of GET requests in other languages, a prettier
homepage that displays the scraped page's content in a nicer format, better tests,
better error handling, etc.

### Ideas
- Make npm package where you just put in the URL and get back the scraped content
- Make similar projects in other languages (even though that was a bust before)