Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sabber-slt/netextract
NetExtract: Efficiently extract core content from any webpage and convert it to clean, LLM-optimized Markdown with a simple API.
api crawling gemma2 llm markdown puppeteer
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/sabber-slt/netextract
- Owner: sabber-slt
- License: mit
- Created: 2024-08-10T21:42:46.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-18T14:01:37.000Z (5 months ago)
- Last Synced: 2024-09-23T20:31:19.954Z (4 months ago)
- Topics: api, crawling, gemma2, llm, markdown, puppeteer
- Language: TypeScript
- Homepage:
- Size: 984 KB
- Stars: 25
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
NetExtract
NetExtract extracts the core content from webpages and converts it into clean, LLM-friendly text. Built with Express.js, TypeScript, and Puppeteer, it exposes a simple API for content extraction and transformation, which makes it useful for supplying LLM and RAG systems with up-to-date web information and for API-based web scraping.
![preview](./assets/x.png)
## Features
1. Core Content Extraction: Seamlessly extracts essential content from any URL.
2. Markdown Conversion: Converts webpage content into clean, well-formatted Markdown.
3. Social Media Scraping: Efficiently scrapes and formats X (Twitter) posts.
4. Simple API Integration: Easily integrates with existing systems.
5. LLM-Powered Conversion: Utilizes open-source large language models to enhance the extraction and conversion process, ensuring high-quality output.
## 📖 Usage
To use NetExtract, prepend the API endpoint to your desired URL:
```bash
http://{your_address}/api?url={url}
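# Example (a sketch, not taken from the project docs): filling in the template above.
# "localhost:3000" is a placeholder address, not a documented default; use your own host and port.
NETEXTRACT_ADDRESS="localhost:3000"
TARGET_URL="https://example.com"
REQUEST_URL="http://${NETEXTRACT_ADDRESS}/api?url=${TARGET_URL}"
echo "$REQUEST_URL"
# With the service running, the result can be fetched with: curl "$REQUEST_URL"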
```
## 🗂️ Getting started with Docker
```bash
git clone https://github.com/sabber-slt/NetExtract
cd NetExtract
```
Then run the application with Docker:
```bash
docker compose up -d
```
## ⚡️ Acknowledgments
- Inspired by jina.ai
- Built with Node.js, Express.js, TypeScript, and Puppeteer
## 🧩 Structure
```
.
├── cookie
│   └── twitter.json # Twitter cookie for X (Twitter) post scraping
├── docs # Documentation files
├── search # SearXNG engine
├── src # Source code
│   ├── interfaces # TypeScript interfaces
│   ├── lib # Utility libraries
│   ├── routes # Express route handlers
│   ├── services # Core service layer for business logic
│   ├── utils # Helper functions and utilities
│   └── app.ts # Main application entry point
├── .env # Environment variables
├── .gitignore # Git ignored files
├── .prettierignore # Prettier ignored files
├── .prettierrc.js # Prettier configuration
├── app.log # Log file
├── Dockerfile # Dockerfile
├── docker-compose.yaml # Docker Compose configuration
├── package.json # Node.js project metadata
├── README.md # Project README
├── tsconfig.json # TypeScript configuration
└── yarn.lock # Yarn lockfile for dependency management
```
## 🤝 Contributing
I welcome and appreciate contributions! If you'd like to contribute, please feel free to submit issues, fork the repository, and send pull requests.