https://github.com/discovai/discovai-crawl

🕷️ DiscovAI Crawl API (🚧 Work in Progress 🚧): A powerful web scraping solution for AI tools and vector databases. Extract clean HTML, generate LLM-friendly content, and create embeddings from any URL.
ai api crawler embedding vector-database web-scraping


# DiscovAI Crawl API 🕷️🔍

> One API to scrape everything you need from URLs for your AI tool and vector database.

🚧 **Work in Progress** 🚧

## 🌟 Features

Our API provides a comprehensive suite of data extraction and processing capabilities:

- 🧼 Clean HTML (JavaScript and CSS removed)
- 📝 LLM-friendly Markdown conversion
- 🚫 Ad-free, cookie banner-free, and dialog-free content
- 📸 Website screenshots (auto-saved to AWS S3 or Cloudflare R2)
- 🤖 LLM-generated SEO-friendly content
- 🔑 LLM-extracted key information (summary, features, FAQs, etc.)
- 🧠 Ready-to-use embeddings for vector database integration (auto-saved to the database)
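
As an illustration of what "ready-to-use embeddings" can be used for, here is a minimal sketch of comparing two embedding vectors with cosine similarity. The `cosineSimilarity` helper is hypothetical and not part of this project; it assumes embeddings are plain arrays of numbers with equal length.

```typescript
// Hypothetical helper (not part of DiscovAI Crawl): cosine similarity
// between two embedding vectors, assumed to be flat arrays of numbers.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A vector database typically performs this kind of comparison internally, so a helper like this is mainly useful for quick local checks.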

## 🔧 Installation

```bash
# Install workspace dependencies
pnpm i
# Download the browsers Playwright needs
cd apps/api && pnpm exec playwright install
```

## 🚀 Usage

```bash
# Start the development server
pnpm dev
# In a second terminal, open the app (macOS; use xdg-open on Linux)
open http://localhost:3000
```

## 📦 API Response Structure

```json
{
  "clean_html": "...",
  "LLM_friendly_markdown": "...",
  "clean_text": "...",
  "screenshot_url": "...",
  "llm_extracts_key_info": {
    "what": "...",
    "summary": "...",
    "features": ["...", "..."],
    "faqs": [{"q": "...", "a": "..."}]
  },
  "llm_summarized_detail": "...",
  "embeddings": [...]
}
```
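
For TypeScript consumers, the response above could be modeled roughly as follows. This is a sketch, not the project's published types: the field names come from the JSON example, while the field types (in particular `embeddings` as a flat `number[]`) and the `isCrawlResponse` helper are assumptions.

```typescript
// Types inferred from the README's JSON example; they are assumptions,
// not definitions exported by the project.
interface FaqEntry {
  q: string;
  a: string;
}

interface KeyInfo {
  what: string;
  summary: string;
  features: string[];
  faqs: FaqEntry[];
}

interface CrawlResponse {
  clean_html: string;
  LLM_friendly_markdown: string;
  clean_text: string;
  screenshot_url: string;
  llm_extracts_key_info: KeyInfo;
  llm_summarized_detail: string;
  embeddings: number[]; // assumed: one flat numeric vector
}

// Hypothetical runtime check: verifies the top-level string fields,
// that llm_extracts_key_info is an object, and that embeddings is numeric.
// It does not validate the nested fields of llm_extracts_key_info.
function isCrawlResponse(value: unknown): value is CrawlResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringKeys = [
    "clean_html",
    "LLM_friendly_markdown",
    "clean_text",
    "screenshot_url",
    "llm_summarized_detail",
  ];
  const stringsOk = stringKeys.every((key) => typeof v[key] === "string");
  const info = v["llm_extracts_key_info"];
  const infoOk = typeof info === "object" && info !== null;
  const emb = v["embeddings"];
  const embeddingsOk =
    Array.isArray(emb) && emb.every((n) => typeof n === "number");
  return stringsOk && infoOk && embeddingsOk;
}
```

Running a parsed response through a guard like this before use lets the rest of the code rely on the typed fields without repeated casts.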

## 📚 Documentation

TODO

## 🤝 Contributing

TODO