https://github.com/cmg8431/web-meta-scraper

A URL scraper for extracting various metadata, including Open Graph, JSON-LD, and more
Host: GitHub
URL: https://github.com/cmg8431/web-meta-scraper
Owner: cmg8431
Created: 2025-01-21T01:29:43.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-02-23T07:06:10.000Z (4 months ago)
Last Synced: 2026-02-23T11:28:31.307Z (4 months ago)
Topics: html, json-ld, metadata, node, nodejs, og, open-graph, scraper, twitter
Language: TypeScript
Homepage: https://radiant-malabi-26e1e6.netlify.app/
Size: 1.58 MB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README-ko_kr.md
- Changelog: CHANGELOG.md
Awesome Lists containing this project

awesome-seo-mcp-servers - web-meta-scraper - LD, media extraction with SEO validation | (:globe_with_meridians: MCP Servers / :wrench: Technical SEO & Auditing)
README

![](https://github.com/user-attachments/assets/d90c0d88-c820-4ad7-ab28-193fd6491c6e)

# web-meta-scraper

[![npm version](https://img.shields.io/npm/v/web-meta-scraper)](https://www.npmjs.com/package/web-meta-scraper)

[![npm downloads](https://img.shields.io/npm/dm/web-meta-scraper)](https://www.npmjs.com/package/web-meta-scraper)

[![bundle size](https://img.shields.io/bundlephobia/minzip/web-meta-scraper)](https://bundlephobia.com/package/web-meta-scraper)

[![license](https://img.shields.io/npm/l/web-meta-scraper)](https://github.com/cmg8431/web-meta-scraper/blob/main/LICENSE)

[![TypeScript](https://img.shields.io/badge/TypeScript-5.9-blue)](https://www.typescriptlang.org/)

[![GitHub stars](https://img.shields.io/github/stars/cmg8431/web-meta-scraper)](https://github.com/cmg8431/web-meta-scraper)

[English](https://github.com/cmg8431/web-meta-scraper/blob/main/README.md) | Korean

A lightweight, plugin-based TypeScript library for extracting web page metadata. It supports Open Graph, Twitter Cards, JSON-LD, oEmbed, and standard meta tags, and merges them automatically by priority.

## Why web-meta-scraper?

| | web-meta-scraper | [metascraper](https://github.com/microlinkhq/metascraper) | [open-graph-scraper](https://github.com/jshemas/openGraphScraper) |
|---|---|---|---|
| **Dependencies** | 1 (`cheerio`) | 10+ | 4+ |
| **Bundle size** | ~5KB min+gzip | ~50KB+ | ~15KB+ |
| **Plugin system** | Composable plugins | Rule-based | Monolithic |
| **Custom plugins** | Simple functions | Complex rules | Not supported |
| **TypeScript** | First-class | Partial | Partial |
| **oEmbed support** | Built-in plugin | Separate package | Not supported |
| **Custom priority rules** | Configurable | Fixed | Fixed |
| **Native fetch** | Uses native `fetch()` | Uses `got` | Uses `undici` |

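- **Single dependency**: only [cheerio](https://cheerio.js.org/) for HTML parsing. HTTP requests use native `fetch()`.
- **Plugin architecture**: pick only the extractors you need. Custom plugins are plain functions.
- **Priority-based merging**: conflicts are resolved automatically when the same field appears in several sources. The priority rules are customizable.
- **TypeScript first**: complete type definitions, including `ResolvedMetadata`, `ScraperResult`, and the plugin types.
- **Structured results**: returns the merged `metadata` together with each plugin's raw `sources`.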

## Installation

```bash
npm install web-meta-scraper
# or
pnpm add web-meta-scraper
# or
yarn add web-meta-scraper
# or
bun add web-meta-scraper
```

## Quick start

### Simple usage: the `scrape()` function

This is the simplest approach. It auto-detects whether the input is a URL or an HTML string and runs the default built-in plugins:

```typescript
import { scrape } from 'web-meta-scraper';

// Scrape from a URL
const result = await scrape('https://example.com');

// Scrape from an HTML string
const htmlResult = await scrape('<html><head><title>Hello</title></head></html>');

console.log(result.metadata);
// {
//   title: "Example",
//   description: "An example page",
//   image: "https://example.com/og-image.png",
//   url: "https://example.com",
//   type: "website",
//   siteName: "Example",
//   ...
// }

// The raw data from each plugin is also available
console.log(result.sources);
// { "open-graph": { title: "Example", ... }, "meta-tags": { ... }, ... }
```
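The README does not say how `scrape()` tells a URL from an HTML string. The sketch below shows one plausible heuristic, assuming that anything starting with `<` is markup and that only absolute `http:`/`https:` strings count as URLs. `detectInputKind` is a hypothetical helper for illustration, not part of the library's API.

```typescript
// Hypothetical helper: NOT part of web-meta-scraper's public API.
type InputKind = 'url' | 'html';

function detectInputKind(input: string): InputKind {
  const trimmed = input.trim();
  // Assumption: anything that starts with "<" is treated as markup.
  if (trimmed.startsWith('<')) return 'html';
  try {
    const parsed = new URL(trimmed);
    // Only absolute http(s) URLs are treated as fetchable.
    return parsed.protocol === 'http:' || parsed.protocol === 'https:' ? 'url' : 'html';
  } catch {
    // Not parseable as an absolute URL, so fall back to treating it as HTML.
    return 'html';
  }
}

console.log(detectInputKind('https://example.com')); // "url"
console.log(detectInputKind('<title>Hello</title>')); // "html"
```

If the input type is known in advance, calling `scrapeUrl()` or `scrape(html, ...)` on a `createScraper()` instance avoids relying on detection at all.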

### Advanced usage: `createScraper()`

Gives fine-grained control over plugins, priority rules, fetch options, and post-processing:

```typescript
import { createScraper, metaTags, openGraph, twitter, jsonLd, oembed } from 'web-meta-scraper';

const scraper = createScraper({
  plugins: [metaTags, openGraph, twitter, jsonLd, oembed],
  fetch: {
    timeout: 10000,
    userAgent: 'MyBot/1.0',
  },
  postProcess: {
    maxDescriptionLength: 150,
    secureImages: true,
  },
});

// Scrape from a URL
const urlResult = await scraper.scrapeUrl('https://example.com');

// Parse HTML directly (placeholder markup; use HTML you already have)
const html = '<html><head><title>Example</title></head></html>';
const htmlResult = await scraper.scrape(html, { url: 'https://example.com' });
```

## Plugins

| Plugin | Import | Extracts |
|---------|--------|----------|
| **Meta Tags** | `metaTags` | `title`, `description`, `keywords`, `author`, `favicon`, `canonicalUrl` |
| **Open Graph** | `openGraph` | `og:title`, `og:description`, `og:image`, `og:url`, `og:type`, `og:site_name`, `og:locale` |
| **Twitter Cards** | `twitter` | `twitter:title`, `twitter:description`, `twitter:image`, `twitter:card`, `twitter:site`, `twitter:creator` |
| **JSON-LD** | `jsonLd` | Structured data (`Article`, `Product`, `Organization`, `FAQPage`, `BreadcrumbList`, and others) |
| **oEmbed** | `oembed` | oEmbed data (`title`, `author_name`, `thumbnail_url`, `html`, and others) |
| **Favicons** | `favicons` | All icon links (`icon`, `apple-touch-icon`, `mask-icon`, `manifest`) plus `sizes` and `type` |
| **Feeds** | `feeds` | RSS (`application/rss+xml`) and Atom (`application/atom+xml`) feed links plus `title` |
| **Robots** | `robots` | Robots meta directives (`noindex`, `nofollow`, `noarchive`, `nosnippet`, and others) plus an indexability flag |
| **Date** | `date` | Publication date (`article:published_time`, Dublin Core, JSON-LD, `<time>`) and modification date |
| **Logo** | `logo` | Site logo URL from `og:logo`, Schema.org microdata, or JSON-LD Organization/Publisher |
| **Lang** | `lang` | BCP 47 language tag from `<html lang>`, `og:locale`, `content-language`, or JSON-LD |
| **Video** | `video` | Video resources from `og:video`, `twitter:player`, `<video>` elements, and JSON-LD `VideoObject` |
| **Audio** | `audio` | Audio resources from `og:audio`, `<audio>` elements, and JSON-LD `AudioObject` |
| **iFrame** | `iframe` | Embeddable iframe HTML from `twitter:player`, with an oEmbed fallback |

```typescript
import { createScraper, openGraph, twitter } from 'web-meta-scraper';

// Use only what you need
const scraper = createScraper({
  plugins: [openGraph, twitter],
});
```

> **Note:** By default, the `scrape()` shortcut uses only the core plugins (`metaTags`, `openGraph`, `twitter`, `jsonLd`). To use `favicons`, `feeds`, `robots`, `date`, `logo`, `lang`, `video`, `audio`, `iframe`, and the like, pass them explicitly to `createScraper()`.
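To make the Robots row in the table more concrete, here is a minimal sketch of turning a robots meta `content` string into directive flags. It is an illustration only: `parseRobotsContent` and the `RobotsFlags` shape are hypothetical names, and the plugin's real output shape may differ. It relies on the standard convention that `none` is equivalent to `noindex, nofollow`.

```typescript
// Hypothetical shape: the real plugin's result fields may differ.
interface RobotsFlags {
  directives: string[];
  indexable: boolean;
  followable: boolean;
}

function parseRobotsContent(content: string): RobotsFlags {
  // Directives are comma-separated and case-insensitive.
  const directives = content
    .split(',')
    .map((directive) => directive.trim().toLowerCase())
    .filter((directive) => directive.length > 0);
  // "none" is shorthand for "noindex, nofollow".
  const blocksAll = directives.includes('none');
  return {
    directives,
    indexable: !blocksAll && !directives.includes('noindex'),
    followable: !blocksAll && !directives.includes('nofollow'),
  };
}

console.log(parseRobotsContent('NOINDEX, follow'));
// { directives: [ 'noindex', 'follow' ], indexable: false, followable: true }
```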

## Batch scraping

`batchScrape()` scrapes multiple URLs concurrently. It uses a Promise-based worker pool with no external dependencies. Each URL is processed independently, so one failure does not affect the others.

```typescript
import { batchScrape } from 'web-meta-scraper';

const results = await batchScrape(
  ['https://example.com', 'https://github.com', 'https://nodejs.org'],
  { concurrency: 3 },
);

for (const r of results) {
  if (r.success) {
    console.log(r.url, r.result.metadata.title);
  } else {
    console.error(r.url, r.error);
  }
}
```
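For readers curious what a dependency-free, Promise-based worker pool looks like, the sketch below shows the general technique: a fixed number of async workers pull from a shared index, and each item's failure is captured instead of rejecting the whole batch. `runPool` and `Settled` are hypothetical names for illustration and are not claimed to match the library's internals.

```typescript
// Hypothetical names: a generic pool sketch, not the library's implementation.
type Settled<T> =
  | { success: true; input: string; result: T }
  | { success: false; input: string; error: unknown };

async function runPool<T>(
  inputs: string[],
  worker: (input: string) => Promise<T>,
  concurrency: number,
): Promise<Settled<T>[]> {
  const results: Settled<T>[] = new Array(inputs.length);
  let next = 0; // shared cursor; safe because JavaScript runs callbacks one at a time

  async function runWorker(): Promise<void> {
    while (next < inputs.length) {
      const index = next++;
      const input = inputs[index] as string;
      try {
        results[index] = { success: true, input, result: await worker(input) };
      } catch (error) {
        // A failure is recorded for this item only; the other items keep going.
        results[index] = { success: false, input, error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, inputs.length));
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results; // same order as the inputs
}
```

Because each worker only claims the next index after finishing its current item, at most `concurrency` calls are in flight at any time.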

## Priority-based merging

When the same field exists in several sources, the value from the highest-priority source is used:

| Field | Priority (high → low) |
|------|---------------------|
| `title` | Open Graph → Meta Tags → Twitter |
| `description` | Open Graph → Meta Tags → Twitter |
| `image` | Open Graph → Twitter |
| `url` | Open Graph → Meta Tags (canonical) |

Source-specific fields such as `twitterCard`, `siteName`, `locale`, `jsonLd`, and `oembed` are always included as-is.

You can override the default rules:

```typescript
import { createScraper, metaTags, openGraph, twitter } from 'web-meta-scraper';

const scraper = createScraper({
  plugins: [metaTags, openGraph, twitter],
  rules: [
    {
      field: 'title',
      sources: [
        { plugin: 'twitter', key: 'title', priority: 3 },   // Twitter first
        { plugin: 'open-graph', key: 'title', priority: 2 },
        { plugin: 'meta-tags', key: 'title', priority: 1 },
      ],
    },
    // ... other rules
  ],
});
```
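The resolution step in this section can be sketched as follows. The rule shape (`field`, `sources`, `plugin`, `key`, `priority`) is taken from the example above. The `resolveFields` function itself is a hypothetical illustration, and treating `undefined`, `null`, and the empty string as "missing" is an assumption.

```typescript
interface RuleSource { plugin: string; key: string; priority: number }
interface Rule { field: string; sources: RuleSource[] }
type Sources = Record<string, Record<string, unknown>>;

// Hypothetical helper: picks, per field, the first non-empty value by descending priority.
function resolveFields(rules: Rule[], sources: Sources): Record<string, unknown> {
  const merged: Record<string, unknown> = {};
  for (const rule of rules) {
    // Highest priority first; copy so the caller's array is not reordered.
    const ordered = [...rule.sources].sort((a, b) => b.priority - a.priority);
    for (const source of ordered) {
      const value = sources[source.plugin]?.[source.key];
      if (value !== undefined && value !== null && value !== '') {
        merged[rule.field] = value;
        break;
      }
    }
  }
  return merged;
}

const mergedExample = resolveFields(
  [{
    field: 'title',
    sources: [
      { plugin: 'twitter', key: 'title', priority: 3 },
      { plugin: 'open-graph', key: 'title', priority: 2 },
      { plugin: 'meta-tags', key: 'title', priority: 1 },
    ],
  }],
  { 'open-graph': { title: 'OG title' }, 'meta-tags': { title: 'Meta title' } },
);
console.log(mergedExample); // { title: 'OG title' }
```

Here the Twitter source has the highest priority but provided no title, so the Open Graph value wins.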

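## Configuration

### `ScraperConfig`

```typescript
// Assumes DEFAULT_RULES is exported from the package root alongside the plugins.
import { createScraper, DEFAULT_RULES, metaTags, openGraph, twitter, jsonLd, oembed } from 'web-meta-scraper';

const scraper = createScraper({
  // Plugins to use
  plugins: [metaTags, openGraph, twitter, jsonLd, oembed],

  // Priority rules (default: DEFAULT_RULES)
  rules: DEFAULT_RULES,

  // Fetch options (applied to scrapeUrl)
  fetch: {
    timeout: 30000,             // Request timeout in ms (default: 30000)
    userAgent: 'MyBot/1.0',     // Custom User-Agent header
    followRedirects: true,      // Follow HTTP redirects (default: true)
    maxContentLength: 5242880,  // Maximum response size in bytes (default: 5MB)
  },

  // Post-processing options
  postProcess: {
    maxDescriptionLength: 200,  // Maximum description length (default: 200)
    secureImages: true,         // Convert image URLs to HTTPS (default: true)
    omitEmpty: true,            // Remove empty/null values (default: true)
    fallbacks: true,            // Apply fallback logic (default: true)
  },
});
```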
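As a rough illustration of what the `postProcess` options describe, the sketch below applies a length cap, an HTTPS upgrade, and empty-value removal to a plain metadata object. The `applyPostProcess` function is hypothetical, and the exact truncation behavior (for example, whether the library appends an ellipsis or cuts at a word boundary) is not stated in this README, so a plain cut is assumed.

```typescript
interface PostProcessOptions {
  maxDescriptionLength: number;
  secureImages: boolean;
  omitEmpty: boolean;
}

// Hypothetical sketch of the three options; not the library's implementation.
function applyPostProcess(
  metadata: Record<string, unknown>,
  options: PostProcessOptions,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...metadata };

  // Assumption: descriptions are cut at the limit with no ellipsis.
  const description = out['description'];
  if (typeof description === 'string' && description.length > options.maxDescriptionLength) {
    out['description'] = description.slice(0, options.maxDescriptionLength);
  }

  // Rewrite http:// image URLs to https://.
  const image = out['image'];
  if (options.secureImages && typeof image === 'string' && image.startsWith('http://')) {
    out['image'] = 'https://' + image.slice('http://'.length);
  }

  // Drop undefined, null, and empty-string values.
  if (options.omitEmpty) {
    for (const [key, value] of Object.entries(out)) {
      if (value === undefined || value === null || value === '') delete out[key];
    }
  }
  return out;
}
```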

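### Stealth mode

Some websites block automated requests through TLS fingerprinting. Enabling stealth mode sends requests over HTTP/2 with a browser-like TLS fingerprint:

```typescript
import { createScraper, metaTags, openGraph } from 'web-meta-scraper';

const scraper = createScraper({
  plugins: [metaTags, openGraph],
  fetch: {
    stealth: true,
  },
});
```

> **Caution:** Stealth mode is disabled by default. Rapid repeated requests in stealth mode can trigger rate limiting (for example, JS challenge pages). Always respect `robots.txt` and each site's terms of service, and use this mode responsibly.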

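### Fallback behavior

When `fallbacks: true` (the default):

- If `title` is missing, `siteName` is used instead
- If `description` is missing, it is extracted from JSON-LD structured data
- Relative image and favicon URLs are converted to absolute URLs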
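Two of the fallbacks listed above can be sketched with standard APIs alone. The `applyFallbacks` function and the `Meta` shape below are hypothetical, and the JSON-LD description fallback is left out because the README does not say which JSON-LD properties are consulted. Relative URLs are resolved with the standard `URL` constructor.

```typescript
interface Meta {
  title?: string;
  siteName?: string;
  image?: string;
  favicon?: string;
}

// Hypothetical sketch: title fallback plus relative-to-absolute URL resolution.
function applyFallbacks(meta: Meta, baseUrl: string): Meta {
  const out: Meta = { ...meta };

  // Use the site name when no title was found.
  if (!out.title && out.siteName) out.title = out.siteName;

  // new URL(relative, base) leaves absolute URLs unchanged and resolves relative ones.
  for (const key of ['image', 'favicon'] as const) {
    const value = out[key];
    if (value) out[key] = new URL(value, baseUrl).href;
  }
  return out;
}

console.log(applyFallbacks({ siteName: 'Example', image: '/og.png' }, 'https://example.com/blog/post'));
// { siteName: 'Example', image: 'https://example.com/og.png', title: 'Example' }
```

This is also why passing `{ url: ... }` when parsing raw HTML matters: without a base URL there is nothing to resolve relative paths against.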

## Custom plugins

A plugin is a function that receives a `ScrapeContext` and returns a `PluginResult`:

```typescript
// Assumes DEFAULT_RULES is exported from the package root alongside the plugins.
import { createScraper, DEFAULT_RULES, openGraph } from 'web-meta-scraper';
import type { Plugin } from 'web-meta-scraper';

const pricePlugin: Plugin = (ctx) => {
  const { $ } = ctx; // Cheerio instance
  const price = $('[itemprop="price"]').attr('content');
  const currency = $('[itemprop="priceCurrency"]').attr('content');

  return {
    name: 'price',
    data: { price, currency },
  };
};

const scraper = createScraper({
  plugins: [openGraph, pricePlugin],
  rules: [
    ...DEFAULT_RULES,
    { field: 'price', sources: [{ plugin: 'price', key: 'price', priority: 1 }] },
    { field: 'currency', sources: [{ plugin: 'price', key: 'currency', priority: 1 }] },
  ],
});
```

## Error handling

```typescript
import { scrape, ScraperError } from 'web-meta-scraper';

try {
  const result = await scrape('https://example.com');
} catch (error) {
  if (error instanceof ScraperError) {
    console.error(error.message); // e.g. "Request timeout after 30000ms"
    console.error(error.cause);   // Original error, if any
  }
}
```

## Metadata validation

`validateMetadata()` scores metadata quality (0–100) against 14 SEO rules and reports any issues:

```typescript
import { scrape, validateMetadata } from 'web-meta-scraper';

const result = await scrape('https://example.com');
const validation = validateMetadata(result);

console.log(validation.score);  // 85
console.log(validation.issues);
// [
//   { field: "description", severity: "warning", message: "Description is too short (under 50 characters)" },
// ]
```
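The README does not list the 14 rules or their weights. Purely to illustrate how a rule-based score with an issue list can work, here is a sketch with three made-up rules and made-up penalties. Every name in it (`sketchRules`, `scoreMetadata`) and every number is hypothetical. Only the issue shape (`field`, `severity`, `message`) follows the example output above.

```typescript
type Severity = 'error' | 'warning';
interface Issue { field: string; severity: Severity; message: string }
interface SketchRule extends Issue {
  penalty: number; // hypothetical weight, not the library's
  failed: (metadata: Record<string, unknown>) => boolean;
}

// Reads a field as a string, treating non-strings as empty.
function textOf(metadata: Record<string, unknown>, key: string): string {
  const value = metadata[key];
  return typeof value === 'string' ? value : '';
}

// Three illustrative rules; the real library is documented as having 14.
const sketchRules: SketchRule[] = [
  { field: 'title', severity: 'error', penalty: 30, message: 'Title is missing',
    failed: (m) => textOf(m, 'title') === '' },
  { field: 'description', severity: 'warning', penalty: 15,
    message: 'Description is too short (under 50 characters)',
    failed: (m) => textOf(m, 'description').length < 50 },
  { field: 'image', severity: 'warning', penalty: 10, message: 'Image is missing',
    failed: (m) => textOf(m, 'image') === '' },
];

function scoreMetadata(metadata: Record<string, unknown>): { score: number; issues: Issue[] } {
  const failedRules = sketchRules.filter((rule) => rule.failed(metadata));
  const penalty = failedRules.reduce((sum, rule) => sum + rule.penalty, 0);
  return {
    score: Math.max(0, 100 - penalty), // clamp so the score stays within 0-100
    issues: failedRules.map(({ field, severity, message }) => ({ field, severity, message })),
  };
}

const sketch = scoreMetadata({ title: 'Example', description: 'Short', image: 'https://example.com/og.png' });
console.log(sketch.score); // 85
```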

## Content extraction

`extractContent()` strips navigation, ads, and sidebars, then extracts the main body text of a web page:

```typescript
import { extractContent } from 'web-meta-scraper';

const content = await extractContent('https://example.com/article');

console.log(content.content);   // "Article body text..."
console.log(content.wordCount); // 1234
console.log(content.language);  // "en"
console.log(content.metadata);  // { title: "Article title", description: "..." }
```

Word counting supports CJK text, and `extractFromHtml()` is also provided for parsing HTML strings.
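Whitespace splitting undercounts Chinese and Japanese text, which is written without spaces. One common approach, sketched below, counts each Han or kana character as a word and splits the rest on whitespace. `countWords` is a hypothetical helper and the library's actual counting rules may differ. Korean is left to the whitespace path here because Hangul text is space-delimited.

```typescript
// Hypothetical sketch of CJK-aware word counting; not the library's implementation.
// Ranges: Hiragana and Katakana (U+3040-U+30FF), CJK Extension A (U+3400-U+4DBF),
// CJK Unified Ideographs (U+4E00-U+9FFF).
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/g;

function countWords(text: string): number {
  // Each CJK character counts as one word.
  const cjkCount = (text.match(CJK_CHAR) ?? []).length;
  // Replace CJK characters with spaces, then count whitespace-delimited tokens.
  const rest = text.replace(CJK_CHAR, ' ');
  const otherCount = rest.split(/\s+/).filter((token) => token.length > 0).length;
  return cjkCount + otherCount;
}

console.log(countWords('hello world'));      // 2
console.log(countWords('Visit 東京 today')); // 4
```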

## MCP server

[`web-meta-scraper-mcp`](https://www.npmjs.com/package/web-meta-scraper-mcp) exposes web-meta-scraper as an [MCP (Model Context Protocol)](https://modelcontextprotocol.io) server. MCP clients such as Claude Code and Claude Desktop can use the metadata extraction tools directly.

### Setup

**Claude Code:**

```bash
claude mcp add web-meta-scraper -- npx -y web-meta-scraper-mcp
```

**Claude Desktop / Cursor:**

Add the following to your configuration file:

```json
{
  "mcpServers": {
    "web-meta-scraper": {
      "command": "npx",
      "args": ["-y", "web-meta-scraper-mcp"]
    }
  }
}
```

### Available tools

| Tool | Description |
|------|------|
| `scrape_url` | Extract metadata from a URL (Open Graph, Twitter Cards, JSON-LD, meta tags, favicons, feeds, robots) |
| `scrape_html` | Extract metadata from an HTML string (with an optional base URL for resolving relative paths) |
| `batch_scrape` | Extract metadata from multiple URLs concurrently |
| `detect_feeds` | Detect RSS/Atom feed links on a web page |
| `check_robots` | Check robots meta tag directives and indexing status |
| `validate_metadata` | Validate metadata quality and generate an SEO score report |
| `extract_content` | Extract the main body text from a web page |

For detailed usage and examples, see the [MCP package README](./mcp/README.md).

## License

MIT