https://github.com/sc10ntech/site-metadata-extractor

Cleans and extracts a web resource's metadata

extractor metadata metadata-extraction opengraph webpage-extractor

Host: GitHub
URL: https://github.com/sc10ntech/site-metadata-extractor
Owner: sc10ntech
License: apache-2.0
Created: 2019-08-16T20:36:51.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2025-10-28T21:03:03.000Z (8 months ago)
Last Synced: 2025-11-27T10:37:07.153Z (7 months ago)
Topics: extractor, metadata, metadata-extraction, opengraph, webpage-extractor
Language: TypeScript
Homepage:
Size: 2.79 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 13
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE


# Site Metadata Extractor

Cleans and extracts a web(site) resource's metadata.

Metadata extraction fields currently supported:

| Name                     | Data Type      |
| ------------------------ | -------------- |
| author                   | array (jsonb)  |
| canonical_url            | string         |
| copyright                | string         |
| date (publish date)      | date           |
| description              | text           |
| favicon                  | text           |
| image (primary/og image) | text           |
| jsonld (structured data) | object (jsonb) |
| keywords                 | array (jsonb)  |
| lang                     | string         |
| locale                   | string         |
| origin                   | string         |
| publisher                | string         |
| site_name                | string         |
| tags                     | array (jsonb)  |
| title                    | string         |
| type                     | string         |
| truncated_text           | text           |
| status                   | string         |
| videos                   | array (jsonb)  |
| links                    | array (jsonb)  |
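
The field list above can be turned into a quick completeness check for an extraction result. The following is a minimal sketch, not part of the library: `SUPPORTED_FIELDS` and `summarizeFields` are hypothetical names, and it assumes the returned object uses exactly the key names shown in the table (for example `date` and `image` without the parenthetical notes), which this README does not confirm.

```js
// Field names and data types as listed in the table above.
// Assumption: the result object uses these exact keys.
const SUPPORTED_FIELDS = {
  author: "array",
  canonical_url: "string",
  copyright: "string",
  date: "date",
  description: "text",
  favicon: "text",
  image: "text",
  jsonld: "object",
  keywords: "array",
  lang: "string",
  locale: "string",
  origin: "string",
  publisher: "string",
  site_name: "string",
  tags: "array",
  title: "string",
  type: "string",
  truncated_text: "text",
  status: "string",
  videos: "array",
  links: "array"
};

// Hypothetical helper (not part of the library): splits the supported
// field names into those that have a value in `metadata` and those that
// are undefined or null.
function summarizeFields(metadata) {
  const present = [];
  const missing = [];
  for (const name of Object.keys(SUPPORTED_FIELDS)) {
    const value = metadata[name];
    if (value === undefined || value === null) {
      missing.push(name);
    } else {
      present.push(name);
    }
  }
  return { present, missing };
}

// Usage with a hand-written object standing in for a real result.
const summary = summarizeFields({ title: "Example", keywords: [] });
console.log(summary.present, summary.missing.length);
```

A check like this may help when deciding which fields to rely on for a given page, since many pages will not expose every field.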

## Install

NPM:

```bash
$ npm install site-metadata-extractor --save
```

Yarn:

```bash
$ yarn add site-metadata-extractor
```

## Usage

Pass in a web page's raw markup to get the extracted metadata fields.

**From `.html` file:**

```js
import fs from "fs";
import path from "path";
import siteMetadataExtractor from "site-metadata-extractor";

const getMetadataFromFile = (filename) => {
  // Resolve the HTML file relative to this script
  // (__dirname is only defined in CommonJS or transpiled modules)
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  const markup = fs.readFileSync(filepath, "utf8");

  // Feel free to use localhost as the second parameter for testing
  const metadata = siteMetadataExtractor(markup, "YOUR_SITE_ORIGIN_HERE");
  return metadata;
};

getMetadataFromFile("example");
```

**From a server request:**

```js
import axios from 'axios';
import siteMetadataExtractor from 'site-metadata-extractor';

const processSite = async (url) => {
  return axios
    .get(url)
    .then((res) => {
      const { headers } = res;
      // Fall back to an empty string when the header is absent
      const contentType = headers['content-type'] || '';

      // Only HTML responses carry markup worth extracting
      if (contentType.includes('text/html')) {
        return {
          body: res.data,
          url
        };
      } else {
        return {};
      }
    })
    .catch((err) => {
      console.log(err);
    });
};

const url = 'https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/';

processSite(url).then((data) => {
  // Nothing to extract for failed requests or non-HTML responses
  if (!data || !data.body) return;

  // ...
  // Pass the markup string, as in the file example above
  siteMetadataExtractor(data.body, url, 'en');
  // ...
});
```
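
The content-type check in the request example can be pulled into a small standalone helper. This is a minimal sketch rather than anything the library provides: `isHtmlResponse` is a hypothetical name, and it assumes header names arrive in lower case, as they do in the example above. Unlike a plain `includes` check, it tolerates a missing header and ignores parameters such as `charset`.

```js
// Hypothetical helper (not part of the library): returns true when the
// response headers describe an HTML document.
function isHtmlResponse(headers) {
  // A missing headers object or missing header is treated as "not HTML".
  const contentType = (headers && headers["content-type"]) || "";

  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  const mediaType = String(contentType).split(";")[0].trim().toLowerCase();

  return mediaType === "text/html";
}

// Usage with hand-written header objects.
console.log(isHtmlResponse({ "content-type": "text/html; charset=utf-8" }));
console.log(isHtmlResponse({ "content-type": "application/json" }));
```

Comparing the media type exactly, instead of searching for a substring, avoids matching values that merely contain `text/html` somewhere in their parameters.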

## Development

1. Run: `git clone https://github.com/sc10ntech/site-metadata-extractor.git`

2. Change into the project directory and install dependencies: `cd site-metadata-extractor && npm i`

## Credits & Disclaimer

site-metadata-extractor was inspired by, and tries to be the spiritual successor to, [node-unfluff](https://github.com/ageitgey/node-unfluff).
