https://github.com/wikimedia/html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)

javascript metadata-extraction metadata-extractor node-module nodejs web-scraper web-scraping

Host: GitHub
URL: https://github.com/wikimedia/html-metadata
Owner: wikimedia
License: mit
Created: 2014-12-17T12:54:28.000Z (over 10 years ago)
Default Branch: main
Last Pushed: 2025-03-07T14:59:17.000Z (4 months ago)
Last Synced: 2025-04-01T00:34:01.425Z (3 months ago)
Topics: javascript, metadata-extraction, metadata-extractor, node-module, nodejs, web-scraper, web-scraping
Language: JavaScript
Homepage:
Size: 464 KB
Stars: 171
Watchers: 27
Forks: 43
Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE.md

html-metadata
=============

[![npm](https://img.shields.io/npm/v/html-metadata.svg)](https://www.npmjs.com/package/html-metadata)

> MetaData html scraper and parser for Node.js (supports Promises only; callbacks were deprecated in 3.0.0)

The aim of this library is to be a comprehensive source for extracting all metadata embedded in HTML. It currently supports Schema.org microdata through a third-party library, and has native implementations for BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS. It also extracts some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag, or meta description tags).

Support for RDFa, AGLS, and other metadata types is planned. Contributions and requests for other metadata types are welcome!

## Install

	npm install html-metadata

## Usage

```js
var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

scrape(url).then(function(metadata){
	console.log(metadata);
});
```

The scrape method used here invokes the parseAll() method, which runs all of the parsing methods registered in metadataFunctions(). Each of these methods is also available for use separately, for example:

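```js
var cheerio = require('cheerio');
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

// response.body is a stream when using fetch(), so read the HTML with text()
fetch(url).then(function(response){
	return response.text();
}).then(function(html){
	// Load the HTML into cheerio and hand the document to a single parser
	var $ = cheerio.load(html);
	return parseDublinCore($).then(function(metadata){
		console.log(metadata);
	});
});
```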
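The idea of a registry of parsing methods that are all run together can be sketched as follows. This is only an illustration and not the library's actual implementation: `exampleParsers`, `runAllParsers`, and the plain `page` object (standing in for a loaded cheerio document) are hypothetical names, and it assumes that a parser rejects when its metadata type is absent.

```js
// Hypothetical registry: each entry maps a metadata type to a function
// that returns a Promise. These are illustrative stand-ins, not the
// parsers shipped with html-metadata.
var exampleParsers = {
	general: function (page) {
		return Promise.resolve({ title: page.title });
	},
	openGraph: function (page) {
		// Reject when this type of metadata is absent from the page
		if (!page.og) {
			return Promise.reject(new Error('No openGraph metadata found'));
		}
		return Promise.resolve(page.og);
	}
};

// Run every registered parser and keep only the results that resolved,
// keyed by the parser's name in the registry.
function runAllParsers(parsers, page) {
	var names = Object.keys(parsers);
	var pending = names.map(function (name) {
		return parsers[name](page);
	});
	return Promise.allSettled(pending).then(function (outcomes) {
		var merged = {};
		outcomes.forEach(function (outcome, i) {
			if (outcome.status === 'fulfilled') {
				merged[names[i]] = outcome.value;
			}
		});
		return merged;
	});
}

runAllParsers(exampleParsers, { title: 'Example page' }).then(function (metadata) {
	console.log(metadata); // only the "general" key is present here
});
```

Using `Promise.allSettled` means one missing metadata type does not discard the results of the others. Whether the library behaves this way is an assumption of the sketch.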

Options dictionary:

You can also pass an [options dictionary](https://developer.mozilla.org/en-US/docs/Web/API/RequestInit) as the first argument to supply extra parameters. Some websites require the user-agent or cookies to be set before they return a response. The dictionary is identical to the RequestInit dictionary, except that it must also contain the requested URL.

```js
var scrape = require('html-metadata');

var options = {
	url: "http://example.com",
	headers: {
		'User-Agent': 'webscraper'
	}
};

// Callbacks were deprecated in 3.0.0, so use the returned Promise
scrape(options).then(function(metadata){
	console.log(metadata);
});
```
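To make the relationship between the options dictionary and RequestInit concrete, here is a minimal sketch of how such a dictionary could be split into a URL and a RequestInit-style object. `splitOptions` is a hypothetical helper written for this example and is not part of html-metadata; the cookie value is made up.

```js
// Hypothetical helper (not part of html-metadata): separates the `url`
// key from the remaining RequestInit-style fields.
function splitOptions(options) {
	// Copy first so the caller's dictionary is left untouched
	var init = Object.assign({}, options);
	var url = init.url;
	delete init.url;
	return { url: url, init: init };
}

var parts = splitOptions({
	url: 'http://example.com',
	headers: {
		'User-Agent': 'webscraper',
		'Cookie': 'session=abc123' // made-up value for illustration
	}
});

console.log(parts.url);  // the requested URL
console.log(parts.init); // everything else, shaped like RequestInit
```

The two parts have the shape that `fetch(url, init)` expects, which is why any other RequestInit field can sit alongside `url` in the same dictionary.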

The method parseGeneral obtains the following general metadata:

```html

```

## Tests

```npm test``` runs the mocha tests

```npm run-script coverage``` runs the tests and reports code coverage

## Contributing

Contributions welcome! All contributions should use [Promises](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise) instead of callbacks.
