https://github.com/newamericafoundation/miniscraper

Tiny Node.js web scraping tool.

Host: GitHub
URL: https://github.com/newamericafoundation/miniscraper
Owner: newamericafoundation
Created: 2015-08-06T15:24:41.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2015-09-29T19:19:51.000Z (over 9 years ago)
Last Synced: 2024-04-14T23:57:17.700Z (9 months ago)
Language: JavaScript
Size: 8.78 MB
Stars: 0
Watchers: 8
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

README

A scraping utility for highly customized mass-data collection.

# Usage

The main scraper module expects a ``job`` object as follows:

	var job = {
		id: 'find_favorite_foods',
		saveFileName: 'cartoon_characters.json',
		extractables: [
			{
				field: 'favorite_food',
				extractMethodName: 'extractOne',
				location: {
					selector: '.favorite-food p'
				}
			}
		],
		getEntries: function() {
			return [
				{
					name: 'Jerry',
					species: 'mouse'
				},
				{
					name: 'Tom',
					species: 'cat'
				}
			];
		},
		// The URL where the entry can be found, e.g. http://www.cartoonnetwork.com/Jerry-mouse
		getEntryUrl: function(entry) {
			return ('http://www.cartoonnetwork.com/' + entry.name + '-' + entry.species);
		}
	};
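
Before running a job, it can help to check which URLs it resolves to. The following is a minimal sketch that uses only the ``getEntries`` and ``getEntryUrl`` functions of a ``job`` object; it does not call the scraper, and the ``urls`` variable is introduced here purely for illustration.

```javascript
// A trimmed-down job: only the two functions needed to compute URLs.
var job = {
	getEntries: function() {
		return [
			{ name: 'Jerry', species: 'mouse' },
			{ name: 'Tom', species: 'cat' }
		];
	},
	getEntryUrl: function(entry) {
		return 'http://www.cartoonnetwork.com/' + entry.name + '-' + entry.species;
	}
};

// Map every entry to the URL built from its fields.
var urls = job.getEntries().map(function(entry) {
	return job.getEntryUrl(entry);
});

console.log(urls);
```

Note that ``entry.name`` is concatenated verbatim, so its capitalization carries over into the URL.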

For each entry, the scraper requests the URL returned by ``getEntryUrl``, extracts the inner HTML of ``.favorite-food p``, and adds it to the entry under the ``favorite_food`` field. The following code:

	var scraper = new Scraper(job);
	scraper.scrape(function(data) {
		console.log(data);
	});

will log:

	{
		name: 'Jerry',
		species: 'mouse',
		favorite_food: 'cheese'
	},
	{
		name: 'Tom',
		species: 'cat',
		favorite_food: 'milk'
	}
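
The way extracted fields end up on each entry can be illustrated without any network access. This sketch is not miniscraper's implementation: ``extendEntries``, ``fakeExtract`` and ``fakePages`` are hypothetical names, and the page contents are hard-coded stand-ins for what a real request would return.

```javascript
var job = {
	extractables: [
		{
			field: 'favorite_food',
			extractMethodName: 'extractOne',
			location: { selector: '.favorite-food p' }
		}
	],
	getEntries: function() {
		return [
			{ name: 'Jerry', species: 'mouse' },
			{ name: 'Tom', species: 'cat' }
		];
	},
	getEntryUrl: function(entry) {
		return 'http://www.cartoonnetwork.com/' + entry.name + '-' + entry.species;
	}
};

// Hard-coded stand-in for fetched pages (assumption, for illustration only).
var fakePages = {
	'http://www.cartoonnetwork.com/Jerry-mouse': { '.favorite-food p': 'cheese' },
	'http://www.cartoonnetwork.com/Tom-cat': { '.favorite-food p': 'milk' }
};

// Hypothetical extractor: looks the value up instead of fetching and parsing HTML.
function fakeExtract(url, selector) {
	return fakePages[url][selector];
}

// Hypothetical helper: copies each entry and adds one property per extractable.
function extendEntries(job, extract) {
	return job.getEntries().map(function(entry) {
		var extended = {};
		Object.keys(entry).forEach(function(key) {
			extended[key] = entry[key];
		});
		job.extractables.forEach(function(extractable) {
			var url = job.getEntryUrl(entry);
			extended[extractable.field] = extract(url, extractable.location.selector);
		});
		return extended;
	});
}

var extended = extendEntries(job, fakeExtract);
console.log(extended);
```

Each original entry is left untouched; the extended copies carry the extra ``favorite_food`` property.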

# Customize

The scraper supports further options for multiple extracts, table lookups, and file downloads. The available customization options for ``job`` fields are listed below.

## extractMethodName

This option accepts the following method names, which are implemented on the scraper:

* extractOne

* extractAll

* extractAndDownloadUrl
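
Because ``extractMethodName`` is a plain string, a typo is easy to miss. The sketch below checks a job's extractables against the three names listed above. ``KNOWN_METHODS`` and ``findUnknownMethods`` are hypothetical helpers, not part of the scraper, and the ``friends`` extractable with its ``extractEvery`` name is a made-up example of a mistake.

```javascript
// Method names listed in this README.
var KNOWN_METHODS = ['extractOne', 'extractAll', 'extractAndDownloadUrl'];

// Hypothetical helper: returns the extractables whose method name is not listed.
function findUnknownMethods(job) {
	return job.extractables.filter(function(extractable) {
		return KNOWN_METHODS.indexOf(extractable.extractMethodName) === -1;
	});
}

var job = {
	extractables: [
		{
			field: 'favorite_food',
			extractMethodName: 'extractOne',
			location: { selector: '.favorite-food p' }
		},
		{
			// Deliberately misspelled method name, for illustration only.
			field: 'friends',
			extractMethodName: 'extractEvery',
			location: { selector: '.friends li' }
		}
	]
};

var unknown = findUnknownMethods(job);

// Print the fields whose method name needs fixing.
console.log(unknown.map(function(extractable) {
	return extractable.field;
}));
```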

Extend from the scraper class