https://github.com/sc10ntech/extract-site-metadata

Metadata extractor for the sprawling web ⚙️
metadata-extraction open-graph-protocol web-data-extraction

Host: GitHub
URL: https://github.com/sc10ntech/extract-site-metadata
Owner: sc10ntech
License: apache-2.0
Created: 2020-10-23T15:28:11.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-01-08T13:38:48.000Z (over 3 years ago)
Last Synced: 2025-09-30T08:36:10.375Z (9 months ago)
Topics: metadata-extraction, open-graph-protocol, web-data-extraction
Language: TypeScript
Homepage:
Size: 2.69 MB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# Extract Site Metadata

Cleans and extracts a web resource's metadata.

Metadata extraction fields currently supported:

| Name                     | Data Type      |
| ------------------------ | -------------- |
| author                   | array (jsonb)  |
| canonical_url            | string         |
| copyright                | string         |
| date (publish date)      | date           |
| description              | text           |
| favicon                  | text           |
| image (primary/og image) | text           |
| jsonld (structured data) | object (jsonb) |
| keywords                 | array (jsonb)  |
| lang                     | string         |
| locale                   | string         |
| origin                   | string         |
| publisher                | string         |
| site_name                | string         |
| tags                     | array (jsonb)  |
| title                    | string         |
| type                     | string         |
| truncated_text           | text           |
| status                   | string         |
| videos                   | array (jsonb)  |
| links                    | array (jsonb)  |
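
As a rough illustration of working with these fields, the sketch below lists which supported fields a result actually contains. It assumes the extractor returns a plain object keyed by the names in the table (with `date`, `image` and `jsonld` as the keys for the rows that carry a parenthetical description); that shape is an assumption, not something the table states. `SUPPORTED_FIELDS` and `presentFields` are hypothetical names introduced only for this example.

```js
// Field names copied from the table above (parenthetical descriptions dropped).
const SUPPORTED_FIELDS = [
  'author', 'canonical_url', 'copyright', 'date', 'description', 'favicon',
  'image', 'jsonld', 'keywords', 'lang', 'locale', 'origin', 'publisher',
  'site_name', 'tags', 'title', 'type', 'truncated_text', 'status',
  'videos', 'links'
];

// Hypothetical helper: return the supported field names that hold a usable
// value, assuming `metadata` is a plain object keyed by those names.
const presentFields = (metadata) =>
  SUPPORTED_FIELDS.filter((name) => {
    const value = metadata[name];
    // Treat missing values and empty strings as absent
    if (value === undefined || value === null || value === '') {
      return false;
    }
    // Treat empty arrays (e.g. no keywords found) as absent
    return !(Array.isArray(value) && value.length === 0);
  });
```

A check like this can be handy when deciding which columns to write, since pages rarely provide every field.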

## Install

NPM:

```bash
$ npm install extract-site-metadata --save
```

Yarn:

```bash
$ yarn add extract-site-metadata
```

## Usage

Pass in a web page's raw markup, together with the site's origin, to get the extracted metadata fields.

**From `.html` file:**

```js
import fs from 'fs';
import path from 'path';
import extractSiteMetadata from 'extract-site-metadata';

const getMetadataFromFile = (filename) => {
  // Resolve the HTML file relative to this module
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  const markup = fs.readFileSync(filepath, 'utf8');
  // The second parameter is the site origin; feel free to use localhost for testing
  const metadata = extractSiteMetadata(markup, 'YOUR_SITE_ORIGIN_HERE');
  return metadata;
};

getMetadataFromFile('example');
```

**From a server request:**

```js
import axios from 'axios';
import extractSiteMetadata from 'extract-site-metadata';

const processSite = async (url) => {
  // An axios request config object can be passed as a second argument if needed
  return axios.get(url)
    .then((res) => {
      const { headers } = res;
      const contentType = headers['content-type'] || '';
      // Only HTML responses are passed on; anything else resolves to undefined
      if (contentType.includes('text/html')) {
        return {
          body: res.data,
          url
        };
      }
    })
    .catch((err) => {
      console.log(err);
    });
};

processSite('https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/')
  .then((data) => {
    // ...
  });
```
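
The request example stops before the extraction step. Below is a minimal sketch of how the `{ body, url }` result might be handed to the extractor, assuming the same two-argument call as in the file example (markup first, site origin second). `originOf` and `extractFromResponse` are hypothetical helper names. The extractor is passed in as a parameter so that the snippet assumes nothing further about the library.

```js
// Hypothetical helper: reduce a full page URL to its origin (scheme, host and port).
const originOf = (pageUrl) => new URL(pageUrl).origin;

// Hypothetical helper: `extract` is expected to be the function imported from
// 'extract-site-metadata', called as extract(markup, origin) like in the file example.
const extractFromResponse = (extract, response) => {
  // processSite resolves to undefined for non-HTML responses and failed requests
  if (!response) {
    return null;
  }
  return extract(response.body, originOf(response.url));
};
```

With these helpers, the final step could read `processSite(url).then((data) => extractFromResponse(extractSiteMetadata, data))`. Deriving the origin from the requested URL avoids hard-coding it per site. If the request was redirected to another host, the final response URL may be the better source.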

## Development

1. Run: `git clone https://github.com/sc10ntech/extract-site-metadata.git`

2. Change into the project directory and install dependencies: `cd extract-site-metadata && npm i`

## Credits & Disclaimer

extract-site-metadata was inspired by, and aims to be the spiritual successor to, [node-unfluff](https://github.com/ageitgey/node-unfluff).
