Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/laiso/site2pdf
Generate comprehensive PDFs of entire websites, ideal for RAG.
Last synced: 16 days ago
- Host: GitHub
- URL: https://github.com/laiso/site2pdf
- Owner: laiso
- License: mit
- Created: 2024-07-14T05:22:19.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-07-20T09:47:30.000Z (4 months ago)
- Last Synced: 2024-10-01T09:41:41.044Z (about 1 month ago)
- Language: TypeScript
- Size: 96.7 KB
- Stars: 159
- Watchers: 1
- Forks: 7
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
# site2pdf
This tool generates a PDF file containing the main page and all sub-pages of a website that match a provided URL pattern.
**📗The PDF generated by this tool is particularly well-suited for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks.📗**
## Motivation
**🧳Portability:** Combining multiple pages of a website into a single file enhances portability, making it easier to share and use the information.
**🤖AI Integration:** In some use cases, such as with [Google NotebookLM](https://notebooklm.google.com/) and [ChatGPT GPTs](https://chatgpt.com/gpts), providing a master dataset in PDF format helps in creating more efficient bots.
**🖼️Visual Information Preservation:** By generating results in PDF format, visual information like images is preserved, ensuring better recognition by multimodal models.
## Prerequisites
To run this software, you need to have Node.js installed on your machine. You can download and install the latest version of Node.js from [the official Node.js website](https://nodejs.org/).
### Dependencies (Linux)
On Linux, install the following system packages, which the headless Chrome launched by Puppeteer typically needs:
```bash
sudo apt-get update
sudo apt-get install -y libxkbcommon0
sudo apt-get install -y libnss3 libxss1 libasound2
sudo apt-get install -y fonts-liberation libappindicator3-1 libatk-bridge2.0-0 libatspi2.0-0 libgtk-3-0 libgbm-dev
```
## Usage
```bash
npx site2pdf-cli <main_url> [url_pattern]
```
### Arguments
* `<main_url>`: The main URL of the website to be converted to PDF.
* `[url_pattern]`: Optional regular expression to filter sub-links. Defaults to matching only links within the main URL domain.
### Example
```bash
npx site2pdf-cli "https://www.typescriptlang.org/docs/handbook/" "https://www.typescriptlang.org/docs/handbook/2/"
```
```bash
> [email protected] start
> tsx index.ts https://www.typescriptlang.org/docs/handbook/ https://www.typescriptlang.org/docs/handbook/2/
Generating PDF for: https://www.typescriptlang.org/docs/handbook/
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/basic-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/everyday-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/narrowing.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/functions.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/objects.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/classes.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/modules.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/types-from-types.html
PDF saved to ./out/www-typescriptlang-org-docs-handbook.pdf
```
This command generates a PDF file at `./out/www-typescriptlang-org-docs-handbook.pdf` containing the main page `https://www.typescriptlang.org/docs/handbook/` and every page linked from it whose URL matches the pattern `https://www.typescriptlang.org/docs/handbook/2/`.
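To make the effect of `url_pattern` concrete, here is a minimal sketch of the filtering idea. The `filterLinks` helper and the sample links are hypothetical and written for illustration only; the tool's actual implementation may differ.

```typescript
// Hypothetical helper (not taken from site2pdf's source): keep only the
// links that match the URL pattern, dropping duplicates along the way.
function filterLinks(links: string[], urlPattern: RegExp): string[] {
  return links.filter(
    (link, index) => links.indexOf(link) === index && urlPattern.test(link),
  );
}

// The second CLI argument from the example above, compiled as a regular expression.
const pattern = new RegExp("https://www.typescriptlang.org/docs/handbook/2/");

// Sample links as they might be collected from the main page (illustrative values).
const found = [
  "https://www.typescriptlang.org/docs/handbook/2/basic-types.html",
  "https://www.typescriptlang.org/docs/handbook/intro.html",
  "https://www.typescriptlang.org/play",
  "https://www.typescriptlang.org/docs/handbook/2/basic-types.html",
];

const matched = filterLinks(found, pattern);
console.log(matched);
```

Only the `/docs/handbook/2/` link survives, and it appears once even though it was found twice.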
## Troubleshooting for Windows
When running Puppeteer on Windows, Chrome may fail to generate PDFs because of sandbox permission errors. To resolve this, grant the required permissions on Puppeteer's Chrome cache directory by running the following in PowerShell:
```powershell
icacls "$env:USERPROFILE/.cache/puppeteer/chrome" /grant "*S-1-15-2-1:(OI)(CI)(RX)"
```
For details, see [Troubleshooting - Chrome reports sandbox errors on Windows | Puppeteer](https://pptr.dev/troubleshooting#chrome-reports-sandbox-errors-on-windows).
## Implementation Details
* Navigates to the main page using `puppeteer`.
* Finds all sub-links matching the provided `url_pattern`.
* Generates a PDF for the main page and each matching sub-link with `puppeteer`, then merges them into a single document using `pdf-lib`.
* Saves the final PDF file with a slugified name based on the main URL.
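The last step above, deriving a file name from the main URL, could look roughly like the sketch below. The `slugifyUrl` helper is an assumption made for illustration; it is chosen so that the result matches the file name shown in the example output, but the project's own slug logic may differ.

```typescript
// Hypothetical slug helper (an assumption, not the project's actual code).
function slugifyUrl(mainUrl: string): string {
  const { hostname, pathname } = new URL(mainUrl);
  return (hostname + pathname)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of other characters into "-"
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
}

const fileName = `${slugifyUrl("https://www.typescriptlang.org/docs/handbook/")}.pdf`;
console.log(fileName); // www-typescriptlang-org-docs-handbook.pdf
```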
**Note:** The provided `url_pattern` should be a valid regular expression. If no `url_pattern` is provided, the tool will default to matching only links within the main URL domain.

This tool is still under development and may have limitations. Feel free to contribute to the project by opening issues or pull requests!
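As a rough illustration of the note on `url_pattern`, the sketch below turns the optional argument into a `RegExp`, fails early with a clear message on an invalid expression, and falls back to a same-origin pattern when no argument is given. The `compileUrlPattern` helper is hypothetical, and the same-origin default is only one assumed reading of "within the main URL domain".

```typescript
// Hypothetical helper (not the project's actual code): turn the optional
// CLI argument into a RegExp.
function compileUrlPattern(mainUrl: string, urlPattern?: string): RegExp {
  if (urlPattern === undefined) {
    // Assumed default: match links that start with the main URL's origin.
    const origin = new URL(mainUrl).origin;
    // Escape regex metacharacters so the origin is matched literally.
    const escaped = origin.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp(`^${escaped}`);
  }
  try {
    return new RegExp(urlPattern);
  } catch (error) {
    // new RegExp throws a SyntaxError for an invalid expression.
    throw new Error(`Invalid url_pattern "${urlPattern}": ${(error as Error).message}`);
  }
}

const mainUrl = "https://www.typescriptlang.org/docs/handbook/";
const byDefault = compileUrlPattern(mainUrl);
console.log(byDefault.test("https://www.typescriptlang.org/play")); // true
console.log(byDefault.test("https://example.com/")); // false
```

Validating the pattern before any page is loaded means a typo in the expression surfaces immediately instead of after the browser has started.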
## Development
### Prerequisites
Ensure you have Node.js and npm installed. You will also need a modern version of TypeScript and other dependencies specified in `package.json`.
### Setup
Clone the repository and install the dependencies:
```bash
git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install
```
### Building
The project uses TypeScript. To compile the TypeScript files, run:
```bash
npx tsc
```
### Running the Project
You can run the project in development mode with:
```bash
npm run dev
```
This command uses `tsx` to watch for changes and recompile as necessary.
### Testing
The project uses Jest for testing. To run the tests, execute:
```bash
npm test
```
### Linting
Linting is configured using Biome. To check for linting issues, run:
```bash
npx biome lint
```
### Code Formatting
To check the code against the project's style guidelines, run the command below. Add the `--write` flag to apply the formatting changes to the files:
```bash
npx biome format
```
### Contributing
Feel free to open issues or pull requests. Make sure to follow the existing code style and include tests for new features or bug fixes.
### Notes
- The project uses ES modules. Ensure your Node.js version supports this.
- Update dependencies as necessary, and ensure compatibility with existing code.