Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/DiscovAI/DiscovAI-crawl
DiscovAI Crawl API (Work in Progress): A powerful web scraping solution for AI tools and vector databases. Extract clean HTML, generate LLM-friendly content, and create embeddings from any URL.
ai api crawler embedding vector-database web-scraping
- Host: GitHub
- URL: https://github.com/DiscovAI/DiscovAI-crawl
- Owner: DiscovAI
- License: apache-2.0
- Created: 2024-08-01T04:28:43.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-05T11:45:54.000Z (5 months ago)
- Last Synced: 2025-01-01T02:05:02.660Z (7 days ago)
- Topics: ai, api, crawler, embedding, vector-database, web-scraping
- Language: TypeScript
- Homepage:
- Size: 391 KB
- Stars: 18
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Discovai-Crawl - DiscovAI Crawl API (Work in Progress) - A powerful web scraping solution for AI tools and vector databases. Extract clean HTML, gene… (Building / Tools)
README
# DiscovAI Crawl API
> One API to scrape everything you need from URLs for your AI tool and vector database.
**Work in Progress**
## Features
Our API provides a comprehensive suite of data extraction and processing capabilities:
- Clean HTML (JavaScript and CSS removed)
- LLM-friendly Markdown conversion
- Ad-free, cookie banner-free, and dialog-free content
- Website screenshots (auto-saved to AWS S3 or Cloudflare R2)
- LLM-generated SEO-friendly content
- LLM-extracted key information (summary, features, FAQs, etc.)
- Ready-to-use embeddings for vector database integration (auto-saved to db)
## Installation
```bash
pnpm i
cd apps/api && pnpm exec playwright install
```
## Usage
```bash
pnpm dev
open http://localhost:3000
```
## API Response Structure
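For TypeScript consumers, the response shape documented in this section could be mirrored as a type plus a runtime guard. This is a minimal sketch, not code from the repository: the names `CrawlResponse`, `KeyInfo`, `Faq`, and `isCrawlResponse` are hypothetical, and the element type of `embeddings` is assumed to be `number` because the sample only shows `[...]`.

```typescript
// Hypothetical types mirroring the field names in the sample response.
interface Faq {
  q: string;
  a: string;
}

interface KeyInfo {
  what: string;
  summary: string;
  features: string[];
  faqs: Faq[];
}

interface CrawlResponse {
  clean_html: string;
  LLM_friendly_markdown: string;
  clean_text: string;
  screenshot_url: string;
  llm_extracts_key_info: KeyInfo;
  llm_summarized_detail: string;
  // Assumption: a flat numeric vector; the README does not specify this.
  embeddings: number[];
}

// Top-level fields that the sample shows as plain strings.
const STRING_FIELDS = [
  "clean_html",
  "LLM_friendly_markdown",
  "clean_text",
  "screenshot_url",
  "llm_summarized_detail",
] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isKeyInfo(value: unknown): value is KeyInfo {
  if (!isRecord(value)) return false;
  const features = value["features"];
  const faqs = value["faqs"];
  return (
    typeof value["what"] === "string" &&
    typeof value["summary"] === "string" &&
    Array.isArray(features) &&
    features.every((f) => typeof f === "string") &&
    Array.isArray(faqs) &&
    faqs.every(
      (f) => isRecord(f) && typeof f["q"] === "string" && typeof f["a"] === "string",
    )
  );
}

// Runtime check that parsed JSON has the documented shape.
function isCrawlResponse(value: unknown): value is CrawlResponse {
  if (!isRecord(value)) return false;
  const embeddings = value["embeddings"];
  return (
    STRING_FIELDS.every((key) => typeof value[key] === "string") &&
    isKeyInfo(value["llm_extracts_key_info"]) &&
    Array.isArray(embeddings) &&
    embeddings.every((n) => typeof n === "number")
  );
}
```

A caller would typically run `isCrawlResponse(JSON.parse(body))` before touching any fields, so a change in the work-in-progress API surfaces as a failed check rather than an `undefined` access later on.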
```json
{
  "clean_html": "...",
  "LLM_friendly_markdown": "...",
  "clean_text": "...",
  "screenshot_url": "...",
  "llm_extracts_key_info": {
    "what": "...",
    "summary": "...",
    "features": ["...", "..."],
    "faqs": [{"q": "...", "a": "..."}]
  },
  "llm_summarized_detail": "...",
  "embeddings": [...]
}
```
## Documentation
TODO
## Contributing
TODO