https://github.com/timf34/pagesource

CLI to download websites' actual JS/CSS/assets (not flattened HTML)

Host: GitHub
URL: https://github.com/timf34/pagesource
Owner: timf34
License: mit
Created: 2025-12-29T17:53:19.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-12-30T14:45:06.000Z (7 months ago)
Last Synced: 2026-01-02T00:20:00.101Z (7 months ago)
Language: Python
Homepage: https://pypi.org/project/pagesource/
Size: 1.17 MB
Stars: 34
Watchers: 0
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# pagesource

A Python CLI tool that captures every resource a webpage loads (like the Sources tab in browser DevTools) and saves the files with their original directory structure.

## Installation

```bash
pip install pagesource

# IMPORTANT: Install Playwright browser after package installation
playwright install chromium
```

## Usage

### Basic Usage

```bash
# Capture all resources from a webpage
pagesource https://example.com
```

This saves the captured resources to `./pagesource_output/`, preserving the directory structure of the original URLs.

### Options

```bash
# Specify custom output directory
pagesource https://example.com -o ./my-output

# Wait extra time for JavaScript content (useful for SPAs)
pagesource https://example.com --wait 5

# Include external resources (CDN assets, third-party scripts)
pagesource https://example.com --include-external

# Combine options
pagesource https://example.com -o ./output --wait 3 --include-external
```
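To illustrate what `--wait` is for, here is a minimal sketch of capturing responses with Playwright's `response` event. This is not pagesource's actual implementation, which the README does not show; `CapturedResource` and `capture_resources` are hypothetical names, and waiting for the `load` event before the extra delay is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CapturedResource:
    """One network response seen while the page loaded (hypothetical type)."""
    url: str
    content_type: str
    body: bytes


def capture_resources(url: str, wait_seconds: int = 0) -> list[CapturedResource]:
    """Load `url` in headless Chromium and collect the response bodies."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    responses = []
    captured: list[CapturedResource] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record every network response the page triggers.
        page.on("response", responses.append)
        page.goto(url, wait_until="load")
        if wait_seconds > 0:
            # Give late JavaScript requests time to fire (the role of --wait).
            page.wait_for_timeout(wait_seconds * 1000)
        for response in list(responses):
            try:
                body = response.body()
            except Exception:
                # Redirects and aborted requests have no retrievable body.
                continue
            captured.append(
                CapturedResource(
                    url=response.url,
                    content_type=response.headers.get("content-type", ""),
                    body=body,
                )
            )
        browser.close()
    return captured
```

Single-page applications often request scripts and data after the `load` event, which is why an extra delay can surface files that an immediate snapshot would miss.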

### CLI Reference

```
pagesource [OPTIONS] URL

Arguments:
  url  URL of the webpage to capture resources from

Options:
  -o, --output PATH       Output directory (default: ./pagesource_output)
  -w, --wait INTEGER      Additional seconds to wait after page load
  -e, --include-external  Include external resources (CDN, third-party)
  -v, --version           Show version and exit
  --help                  Show help message
```
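If you want to drive the CLI from a Python script, one option is to build the argument list and run it with `subprocess`. The sketch below uses only the flags documented above; `build_command` and `run_pagesource` are hypothetical helper names, not part of the package.

```python
import subprocess


def build_command(url, output=None, wait=None, include_external=False):
    """Build the pagesource argument list from the documented options."""
    cmd = ["pagesource", url]
    if output is not None:
        cmd += ["--output", str(output)]
    if wait is not None:
        # --wait takes an integer number of seconds.
        cmd += ["--wait", str(int(wait))]
    if include_external:
        cmd.append("--include-external")
    return cmd


def run_pagesource(url, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(url, **options), check=True)
```

For example, `run_pagesource("https://example.com", output="./output", wait=3)` runs the same command as `pagesource https://example.com --output ./output --wait 3`.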

## Output Structure

Resources are saved preserving the URL path structure:

```
pagesource_output/
└── example.com/
    ├── index.html
    ├── assets/
    │   ├── css/
    │   │   └── style.css
    │   └── js/
    │       └── app.js
    └── images/
        └── logo.png
```
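The mapping from URL to output path can be sketched roughly as follows. This is an illustration using the standard library, not the tool's own code; `url_to_relative_path` is a hypothetical name, and treating a directory-style URL as `index.html` is an assumption based on the tree above.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def url_to_relative_path(url: str) -> PurePosixPath:
    """Map a resource URL to a host/path location under the output directory."""
    parts = urlsplit(url)  # the query string and fragment are split off here
    host = parts.hostname or "unknown-host"
    path = unquote(parts.path)
    if path == "" or path.endswith("/"):
        # Assumption: directory-style URLs are stored as index.html.
        path += "index.html"
    # Drop empty and dot segments so the result cannot escape the output root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return PurePosixPath(host, *segments)
```

With this sketch, `https://example.com/assets/js/app.js?v=3` maps to `example.com/assets/js/app.js`, matching the layout shown above.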

If `--include-external` is used, external resources are saved in their own host directories:

```
pagesource_output/
├── example.com/
│   └── ...
├── cdn.example.com/
│   └── libs/
│       └── library.js
└── fonts.googleapis.com/
    └── css/
        └── font.css
```
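One plausible way to decide what counts as "external" is to compare hostnames, which is consistent with `cdn.example.com` getting its own directory above. The README does not say how pagesource makes this decision, so the rule below is an assumption, and `is_external` and `filter_resources` are hypothetical names.

```python
from urllib.parse import urlsplit


def is_external(resource_url: str, page_url: str) -> bool:
    """Treat a resource as external when its host differs from the page's host."""
    resource_host = urlsplit(resource_url).hostname or ""
    page_host = urlsplit(page_url).hostname or ""
    return resource_host != page_host


def filter_resources(resource_urls, page_url, include_external=False):
    """Keep same-host resources, plus external ones when requested."""
    if include_external:
        return list(resource_urls)
    return [u for u in resource_urls if not is_external(u, page_url)]
```

A stricter or looser rule (for example, comparing registrable domains so that subdomains count as internal) would change which files land in the default output.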

## Features

- Captures all network resources loaded by the page (HTML, CSS, JS, images, fonts, etc.)
- Preserves original directory structure
- Handles query strings (strips them from filenames)
- Infers file extensions from Content-Type when missing
- Handles duplicate filenames
- Sanitizes paths for filesystem safety
- Optional wait time for JavaScript-heavy pages
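Several of the filename-related features above can be approximated with the standard library. The helpers below are a sketch under assumptions: the names, the set of replaced characters, and the `_1`, `_2` suffix scheme for duplicates are all hypothetical and may differ from what pagesource actually does.

```python
import mimetypes
import re
from pathlib import PurePosixPath


def ensure_extension(filename: str, content_type: str) -> str:
    """Append an extension guessed from Content-Type when the name has none."""
    if PurePosixPath(filename).suffix:
        return filename
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    ext = mimetypes.guess_extension(mime) if mime else None
    return filename + (ext or "")


def sanitize_segment(segment: str) -> str:
    """Replace characters that are unsafe in filenames on common filesystems."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", segment).strip(" .")
    return cleaned or "_"


def unique_name(filename: str, taken: set[str]) -> str:
    """Return a name not yet in `taken`, adding a numeric suffix if needed."""
    path = PurePosixPath(filename)
    candidate, n = filename, 1
    while candidate in taken:
        candidate = f"{path.stem}_{n}{path.suffix}"
        n += 1
    taken.add(candidate)
    return candidate
```

Duplicates can arise once query strings are stripped, because `app.js?v=1` and `app.js?v=2` then share a filename; a suffix scheme like the one sketched here keeps both copies.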

## Requirements

- Python 3.10+
- Playwright (with Chromium browser)

## License

MIT
