https://github.com/sc10ntech/extract-site-metadata
Metadata extractor for the sprawling web ⚙️
metadata-extraction open-graph-protocol web-data-extraction
- Host: GitHub
- URL: https://github.com/sc10ntech/extract-site-metadata
- Owner: sc10ntech
- License: apache-2.0
- Created: 2020-10-23T15:28:11.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-01-08T13:38:48.000Z (over 3 years ago)
- Last Synced: 2025-09-30T08:36:10.375Z (9 months ago)
- Topics: metadata-extraction, open-graph-protocol, web-data-extraction
- Language: TypeScript
- Homepage:
- Size: 2.69 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# Extract Site Metadata
Cleans and extracts a web resource's metadata.
Metadata extraction fields currently supported:
| Name | Data Type |
| ------------------------ | -------------- |
| author | array (jsonb) |
| canonical_url | string |
| copyright | string |
| date (publish date) | date |
| description | text |
| favicon | text |
| image (primary/og image) | text |
| jsonld (structured data) | object (jsonb) |
| keywords | array (jsonb) |
| lang | string |
| locale | string |
| origin | string |
| publisher | string |
| site_name | string |
| tags | array (jsonb) |
| title | string |
| type | string |
| truncated_text | text |
| status | string |
| videos | array (jsonb) |
| links | array (jsonb) |
## Install
NPM:
```bash
$ npm install extract-site-metadata --save
```
Yarn:
```bash
$ yarn add extract-site-metadata
```
## Usage
Pass in the raw markup of a webpage, along with the site's origin, to get the extracted metadata fields.
**From `.html` file:**
```js
import fs from 'fs';
import path from 'path';
import extractSiteMetadata from 'extract-site-metadata';

const getMetadataFromFile = (filename) => {
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  // Read the file as a UTF-8 string of raw markup
  const markup = fs.readFileSync(filepath, 'utf8');
  // feel free to use localhost as the second parameter for testing
  const metadata = extractSiteMetadata(markup, 'YOUR_SITE_ORIGIN_HERE');
  return metadata;
};

getMetadataFromFile('example');
```
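Once metadata has been extracted, the fields listed in the table above can be read off the result. The following is a minimal sketch, assuming the extractor returns a plain object keyed by those field names; `toPreview` is a hypothetical helper, not part of the package.

```js
// Hypothetical helper: builds a compact link-preview object.
// Assumes `metadata` is a plain object keyed by the field names
// from the supported-fields table (title, description, image, ...).
const toPreview = (metadata) => ({
  // Fall back to the site name when no title was found
  title: metadata.title || metadata.site_name || '',
  description: metadata.description || '',
  image: metadata.image || null,
  // Prefer the canonical URL, then the origin
  url: metadata.canonical_url || metadata.origin || null,
  // `author` is listed as an array in the table; guard against other shapes
  authors: Array.isArray(metadata.author) ? metadata.author : []
});
```

Missing fields fall back to empty values, so the helper can be called on partial results without extra checks.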
**From a server request:**
```js
import axios from 'axios';
import extractSiteMetadata from 'extract-site-metadata';

const processSite = async (url) => {
  return axios.get(url)
    .then(res => {
      const { headers } = res;
      const contentType = headers['content-type'];
      // Only return the body when the response is an HTML document
      if (contentType && contentType.includes('text/html')) {
        return {
          body: res.data,
          url
        };
      }
    })
    .catch(err => {
      console.log(err);
    });
};

processSite('https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/')
  .then((data) => {
    // ...
  });
```
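The object resolved by `processSite` can then be handed to the extractor. The following is a minimal sketch, assuming the extractor takes `(markup, origin)` as in the file example; `extractFromResponse` is a hypothetical helper name, and the extractor is passed in as an argument so the snippet does not depend on how the package is imported.

```js
// Hypothetical helper: turns the { body, url } object resolved by
// processSite into extracted metadata.
// `extract` is expected to be extractSiteMetadata
// (assumed signature: extract(markup, origin)).
const extractFromResponse = (data, extract) => {
  // processSite resolves with undefined for non-HTML responses
  // and for failed requests, so bail out early in those cases
  if (!data || typeof data.body !== 'string') {
    return null;
  }
  // Derive the origin (scheme, host and port) from the requested URL
  const { origin } = new URL(data.url);
  return extract(data.body, origin);
};
```

It could be chained as `processSite(url).then((data) => extractFromResponse(data, extractSiteMetadata))`. Deriving the origin from the requested URL avoids hard-coding it per site.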
## Development
1. Run: `git clone https://github.com/sc10ntech/extract-site-metadata.git`
2. Change into project directory and install deps: `cd extract-site-metadata && npm i`
## Credits & Disclaimer
extract-site-metadata was inspired by, and aims to be a spiritual successor to, [node-unfluff](https://github.com/ageitgey/node-unfluff).