https://github.com/bozzhik/meta-scraper

Python script that automates the collection of metadata from a list of provided websites.

Host: GitHub
URL: https://github.com/bozzhik/meta-scraper
Owner: bozzhik
Created: 2024-12-27T18:13:56.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-01-26T14:59:53.000Z (over 1 year ago)
Last Synced: 2025-06-05T10:03:23.090Z (12 months ago)
Topics: bozzhik, fetch-data, json, markdown, metadata, python
Language: Python
Homepage:
Size: 40 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Meta Scraper: Website Metadata Extractor

**meta-scraper** is a simple Python script that automates the collection of metadata (title, description, keywords, and author) from websites. It’s useful for competitor analysis or website auditing.

## Features

- Extracts metadata from a list of websites provided in a `websites.json` file.
- Saves metadata as Markdown `.md` files in the `output` folder.
- Filenames include a timestamp (`YYYYMMDD`) for better organization.
- Gracefully handles errors for invalid or inaccessible URLs without creating unnecessary `.md` files.
- Single-file script for easy use.
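
The extraction step listed above can be sketched roughly as follows. This is an illustration, not the code in `main.py`: the script itself depends on `requests` and `beautifulsoup4`, whereas this sketch swaps in the standard library's `html.parser` so that it runs without extra dependencies. The names `MetaExtractor`, `META_FIELDS`, and `extract_metadata` are hypothetical.

```python
from html.parser import HTMLParser

# Fields the README says are collected; <title> is handled separately.
META_FIELDS = ("description", "keywords", "author")


class MetaExtractor(HTMLParser):
    """Hypothetical helper: collects <title> text and selected <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute names arrive lowercased; the value of `name` may not be.
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            # Keep only the first non-empty occurrence of each field.
            if name in META_FIELDS and content and name not in self.metadata:
                self.metadata[name] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Return a dict with title, description, keywords, author (None if missing)."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    title = "".join(parser._title_parts).strip()
    result = {"title": title or None}
    for field in META_FIELDS:
        result[field] = parser.metadata.get(field)
    return result
```

Missing fields come back as `None`, which leaves the choice of how to display them to whatever writes the Markdown file.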

## Requirements

- Python 3.12.6 or higher
- Libraries: `requests`, `beautifulsoup4`

Install the required libraries:

```bash
pip install requests beautifulsoup4
```

## Usage

1. Add websites to the `websites.json` file in the following format:

```json
{
  "urls": ["https://bozzhik.com", "https://example.com"]
}
```

2. Run the script:

```bash
python main.py
```

3. Check the `output` folder for `.md` files containing the metadata. Filenames consist of a date stamp (`YYYYMMDD`) followed by the website's domain, e.g., `20231227_bozzhik.com.md`.

4. If a URL is invalid or inaccessible, the script will display an error message in the console but will not create an `.md` file for that URL.
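
As a rough sketch of steps 1 and 3, the snippet below reads the `urls` list from `websites.json` and derives an output filename for each URL. The helper names `load_urls` and `output_path` are hypothetical, and the naming rule (date stamp, underscore, then the URL's host part) is inferred from the `20231227_bozzhik.com.md` example rather than taken from `main.py`.

```python
import json
from datetime import date
from pathlib import Path
from urllib.parse import urlparse


def load_urls(path="websites.json"):
    """Read the {"urls": [...]} structure shown in step 1."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return list(data["urls"])


def output_path(url, today=None, output_dir="output"):
    """Build output/<YYYYMMDD>_<domain>.md for a URL (assumed naming rule)."""
    today = today or date.today()
    domain = urlparse(url).netloc  # e.g. "bozzhik.com" for "https://bozzhik.com"
    return Path(output_dir) / f"{today:%Y%m%d}_{domain}.md"
```

For example, `output_path("https://bozzhik.com", date(2023, 12, 27))` yields a path whose filename is `20231227_bozzhik.com.md`.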

## Example Output

For `https://bozzhik.com`, the output file in `output/` will look like this:

```markdown
# Metadata for https://bozzhik.com

**Title:** BOZZHIK

**Description:** I'm a website developer and user interface designer.

**Keywords:** bozzhik, bozhik, bojic, maxim bozhik, maxim bojic

**Author:** — — —
```
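
A file in this layout could be produced by a small rendering function along these lines. This is a sketch under assumptions: `render_markdown`, `FIELDS`, and `PLACEHOLDER` are hypothetical names, and treating the three dashes shown for **Author** as a generic placeholder for any missing value is a guess based on the example above.

```python
# Placeholder copied from the Author line in the example output
# (three em dashes separated by spaces); assumed to mark any missing value.
PLACEHOLDER = "\u2014 \u2014 \u2014"

# (label in the Markdown file, key in the metadata dict)
FIELDS = (
    ("Title", "title"),
    ("Description", "description"),
    ("Keywords", "keywords"),
    ("Author", "author"),
)


def render_markdown(url, metadata):
    """Return the Markdown text for one website's metadata."""
    lines = [f"# Metadata for {url}", ""]
    for label, key in FIELDS:
        value = metadata.get(key) or PLACEHOLDER
        lines.append(f"**{label}:** {value}")
        lines.append("")  # blank line between fields
    return "\n".join(lines).rstrip("\n") + "\n"
```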

If the URL is invalid, the script will display:

```bash
Error fetching metadata for https://some-website.com:
Skipping https://some-website.com: Unable to fetch metadata (URL might be invalid or inaccessible).
```
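
The skip-on-error behaviour can be sketched as a loop that records only the URLs that were fetched successfully. The function name `collect` and the injected `fetch` callable are hypothetical; the two messages mirror the console output above, while printing the exception text after the colon is an assumption, since the example shows nothing there.

```python
def collect(urls, fetch):
    """Return {url: metadata} for URLs that could be fetched; report the rest.

    `fetch` is any callable that returns a metadata dict or raises on failure.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: any failure means "skip"
            print(f"Error fetching metadata for {url}: {exc}")
            print(
                f"Skipping {url}: Unable to fetch metadata "
                "(URL might be invalid or inaccessible)."
            )
    return results
```

Because failed URLs never appear in the returned dict, a caller that writes one `.md` file per entry creates no file for them.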

## Project Structure

```bash
meta-scraper/
├── main.py          # Main script
├── websites.json    # JSON file with a list of website URLs
├── output/          # Folder for generated metadata files
└── README.md        # Project documentation
```
