https://github.com/multix0/froxy

🕸️ Froxy – A chill open-source web indexing engine built with Go, Node.js, and Next.js. Crawls, analyzes, and serves structured web data with TF-IDF magic and Supabase as the brain.

# **🕷️ Froxy**

> A chill, open-source web engine that crawls, indexes, and vibes with web content using semantic search.

![froxy banner](https://github.com/MultiX0/froxy/blob/main/banner.png?raw=true)

---

## 💡 What is Froxy?

Froxy is a modular full-stack web engine designed to crawl web pages, extract content, and index it using **semantic embeddings** for intelligent search, all powered by modern tools. It includes:

* A **Go**-based crawler (aka the spider 🕷️) with real-time indexing
* **FastEmbed** service for generating semantic embeddings
* **Qdrant** vector database for semantic search
* **Froxy Apex** - AI-powered intelligent search (Perplexity-style)
* A **PostgreSQL** database for structured data
* A **Next.js** front-end UI (fully integrated with real APIs)

This project is built for learning, experimenting, and extending. It's great for developers who want to understand how modern semantic search engines work from scratch.

> Fun fact: I made this project in just **3 days**, so it might not be perfect, but you know what?
> **It works!**
>
> *(We'll keep evolving this codebase together ❤️)*

> Note: I prefer simplicity over unnecessary complexity. We might make the architecture more advanced in the future, but for now, it's simple, clean, and straightforward: no fancy stuff, no over-engineering. It's just a chill project for now. If needed, we can scale and make it more complex later. After all, it started as a fun project, nothing more. <3

---

## 🔍 Features

* 🌐 Crawl websites with real-time indexing (Go)
* 🧠 Semantic search using embeddings (FastEmbed + Qdrant)
* 🤖 AI-powered intelligent search with LLM integration (Froxy Apex)
* 🚀 Vector similarity search for intelligent results
* 📊 Chunk-based relevance scoring with cosine similarity
* 🕺 Store structured data in PostgreSQL
* 🎨 Modern UI in Next.js + Tailwind
* 🐳 Fully containerized with Docker

> The frontend is fully connected to the backend and provides semantic search capabilities.

---

## 📂 Folder Structure

```
froxy/
├── front-end/           # Next.js frontend
│   ├── app/             # App routes (search, terms, about, etc.)
│   ├── components/      # UI components (shadcn-style)
│   ├── hooks/           # React hooks
│   ├── lib/             # Utility logic
│   ├── public/          # Static assets
│   └── styles/          # TailwindCSS setup
├── indexer-search/      # Node.js search backend
│   └── lib/
│       ├── functions/
│       ├── services/    # DB + search service
│       └── utils/       # Helper utilities
├── froxy-apex/          # AI-powered intelligent search service
│   ├── api/             # API endpoints
│   ├── db/              # Database connections
│   ├── functions/       # AI processing logic
│   ├── llama/           # LLM integration
│   ├── models/          # Data models
│   └── utils/           # Helper utilities
├── spider/              # Web crawler in Go with real-time indexing
│   ├── db/              # DB handling (PostgreSQL + Qdrant)
│   ├── functions/       # Crawl + indexing logic + proxies (if needed)
│   ├── models/          # Data models
│   └── utils/           # Misc helpers
├── fastembed/           # FastEmbed embedding service
│   ├── models/          # Cached embedding models
│   └── docker-compose.yml
├── qdrant/              # Qdrant vector database
│   └── docker-compose.yml
├── db/                  # PostgreSQL database
│   ├── scripts/         # Shell backups
│   └── docker-compose.yml
├── froxy.sh             # Automated setup & runner script
├── LICENSE              # MIT License
└── readme.md            # This file
```

---

## ⚙️ Getting Started

### Requirements

* Node.js (18+)
* pnpm or npm
* Go (1.23+)
* Docker & Docker Compose
* At least 2GB RAM (for embedding service)

### Quick Setup (Recommended for Crawler)

For the fastest crawler setup without dealing with configuration details:

```bash
# Make the script executable and run it
chmod +x froxy.sh
./froxy.sh
```

The script will automatically:
- Set up all environment variables with default values
- Create the Docker network
- Start all required services (PostgreSQL, Qdrant, FastEmbed)
- Health check all containers
- Guide you through the crawling process

**Note**: The `froxy.sh` script only handles the crawler setup. You'll need to manually start the `froxy-apex` AI service and `front-end` after crawling.
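
If you skip the script and want your own "wait until the services are up" step, here is a minimal sketch in Go. It only checks that a TCP port accepts connections, which is weaker than a real container health check, and it is not how `froxy.sh` does it. The function name `waitForTCP` is made up for this example; the ports come from the default environment values in this README.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// waitForTCP polls addr until a TCP connection succeeds or timeout elapses.
// An open port is only a rough readiness signal, not a full health check.
func waitForTCP(addr string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, interval)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not reachable after %s: %w", addr, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Addresses are passed as arguments, for example:
	//   go run wait.go localhost:5432 localhost:6333 localhost:5050
	for _, addr := range os.Args[1:] {
		if err := waitForTCP(addr, 30*time.Second, time.Second); err != nil {
			fmt.Println("not ready:", err)
			os.Exit(1)
		}
		fmt.Println("ready:", addr)
	}
}
```

Run it before `go run main.go` in `spider/` so the crawler does not start against a database that is still booting.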

### Manual Setup

If you prefer to set things up manually:

```bash
# 1. Create Docker network
docker network create froxy-network

# 2. Start Qdrant vector database
cd qdrant
docker-compose up -d --build

# 3. Start PostgreSQL database
cd ../db
# Set proper permissions for PostgreSQL data directory
sudo chown -R 999:999 postgres_data/
docker-compose up -d --build

# 4. Start FastEmbed service
cd ../fastembed
docker-compose up -d --build

# 5. Wait for all services to be healthy, then run the crawler
cd ../spider
go run main.go

# 6. After crawling, start the search backend
cd ../indexer-search
npm install
npm start

# 7. Start the AI-powered search service (Froxy Apex)
# Make sure to configure froxy-apex/.env first
cd ../froxy-apex
go run main.go

# 8. Launch the front-end
cd ../front-end
npm i --legacy-peer-deps
npm run dev
```

---

## 🔐 Environment Variables

### Default Configuration

The services read the environment variables below. `froxy.sh` sets the crawler and database values automatically; the AI service keys must be filled in by hand:

```env
# Database Configuration (for spider & indexer-search)
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable

# Vector Database Configuration
QDRANT_API_KEY=froxy-secret-key
QDRANT_HOST=http://localhost:6333

# FastEmbed Service
EMBEDDING_HOST=http://localhost:5050

# AI Service (for froxy-apex)
LLM_API_KEY=your_groq_api_key
API_KEY=your_froxy_apex_api_key
```
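
As a rough illustration of how a Go service could consume these values, here is a small sketch that reads them with fallbacks to the defaults above. The `Config` type, `getenv`, `LoadConfig`, and `PostgresDSN` are hypothetical names for this example, not code from the Froxy repo, and the README does not say which PostgreSQL driver the spider uses; the keyword/value connection string is the common libpq-style format.

```go
package main

import (
	"fmt"
	"os"
)

// Config holds the connection settings listed in the README defaults.
type Config struct {
	DBHost, DBPort, DBUser, DBPassword, DBName, DBSSLMode string
	QdrantHost, QdrantAPIKey, EmbeddingHost               string
}

// getenv returns the value of key, or fallback when it is unset or empty.
func getenv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// LoadConfig reads the environment, falling back to the README defaults.
func LoadConfig() Config {
	return Config{
		DBHost:        getenv("DB_HOST", "localhost"),
		DBPort:        getenv("DB_PORT", "5432"),
		DBUser:        getenv("DB_USER", "froxy_user"),
		DBPassword:    getenv("DB_PASSWORD", "froxy_password"),
		DBName:        getenv("DB_NAME", "froxy_db"),
		DBSSLMode:     getenv("DB_SSLMODE", "disable"),
		QdrantHost:    getenv("QDRANT_HOST", "http://localhost:6333"),
		QdrantAPIKey:  getenv("QDRANT_API_KEY", "froxy-secret-key"),
		EmbeddingHost: getenv("EMBEDDING_HOST", "http://localhost:5050"),
	}
}

// PostgresDSN builds a libpq-style keyword/value connection string.
func (c Config) PostgresDSN() string {
	return fmt.Sprintf("host=%s port=%s user=%s password=%s dbname=%s sslmode=%s",
		c.DBHost, c.DBPort, c.DBUser, c.DBPassword, c.DBName, c.DBSSLMode)
}

func main() {
	cfg := LoadConfig()
	fmt.Println(cfg.PostgresDSN())
	fmt.Println("qdrant:", cfg.QdrantHost, "embeddings:", cfg.EmbeddingHost)
}
```

The default password and API key are fine for a local playground, but swap them out before exposing any of these services.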

### Service-Specific Variables

#### `db/.env`
```env
POSTGRES_DB=froxy_db
POSTGRES_USER=froxy_user
POSTGRES_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
```

#### `qdrant/.env`
```env
QDRANT_API_KEY=froxy-secret-key
```

#### `spider/.env` & `indexer-search/.env`
```env
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
QDRANT_API_KEY=froxy-secret-key
EMBEDDING_HOST=http://localhost:5050
```

#### `froxy-apex/.env`
```env
LLM_API_KEY=your_groq_api_key
QDRANT_HOST=http://localhost:6333
EMBEDDING_HOST=http://localhost:5050
API_KEY=your_froxy_apex_api_key
QDRANT_API_KEY=froxy-secret-key
```

#### `front-end/.env`
```env
API_URL=http://localhost:8080
API_KEY=your_api_key
WEBSOCKET_URL=ws://localhost:8080/ws/search
FROXY_APEX_API_KEY=your_froxy_apex_api_key
ACCESS_CODE=auth_access_for_froxy_apex_ui
AUTH_SECRET_TOKEN=jwt_token_for_apex_ui_to_calc_the_usage
```

> 💡 The `froxy.sh` script automatically creates `.env` files with working default values for the crawler and database services. You'll need to manually configure `froxy-apex/.env` and `front-end/.env` for the AI search and UI components.

---

## 🤔 How it works

### Traditional Search
1. **Crawler** pulls website content from your provided URLs
2. **Real-time indexing** generates semantic embeddings using FastEmbed
3. **Qdrant** stores vector embeddings for semantic similarity search
4. **PostgreSQL** stores structured metadata
5. **Frontend** provides an intelligent semantic search interface
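
To make step 3 a bit more concrete, here is a sketch of what a vector lookup against Qdrant can look like over its REST search endpoint (`POST /collections/{name}/points/search` with an `api-key` header). This is an assumption about the transport: Froxy may well use a client library instead. The collection name `pages` and the helper `newSearchRequest` are hypothetical, and the request is only built, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// searchBody mirrors the JSON body of Qdrant's REST search endpoint.
type searchBody struct {
	Vector      []float32 `json:"vector"`
	Limit       int       `json:"limit"`
	WithPayload bool      `json:"with_payload"`
}

// newSearchRequest builds (but does not send) a vector search request.
func newSearchRequest(host, apiKey, collection string, vector []float32, limit int) (*http.Request, error) {
	endpoint := strings.TrimRight(host, "/") +
		"/collections/" + url.PathEscape(collection) + "/points/search"
	body, err := json.Marshal(searchBody{Vector: vector, Limit: limit, WithPayload: true})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("api-key", apiKey) // header Qdrant checks when an API key is configured
	return req, nil
}

func main() {
	// "pages" is a hypothetical collection name; in a real run the vector
	// would be the query embedding returned by the FastEmbed service.
	req, err := newSearchRequest("http://localhost:6333", "froxy-secret-key",
		"pages", []float32{0.12, -0.48, 0.33}, 5)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending it is one `http.DefaultClient.Do(req)` away once Qdrant is running and the collection exists.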

### AI-Powered Search (Froxy Apex)
1. **User query** is received and processed
2. **Query enhancement** using Llama 3.1 8B via Groq API
3. **Embedding generation** for the enhanced query using FastEmbed
4. **Vector search** in Qdrant to find relevant pages
5. **Content chunking** of relevant pages for detailed analysis
6. **Cosine similarity** calculation for each chunk against the query
7. **LLM processing** to generate a structured response with:
   - Concise summary
   - Detailed results with sources
   - Relevance scores
   - Reference links and favicons
   - Confidence ratings
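
Steps 5 and 6 are the easiest to show in code. The sketch below splits a page into fixed-size word chunks and ranks them by cosine similarity against a query vector. It is an illustration under assumptions: the chunk size, the function names (`chunkWords`, `cosine`, `rankChunks`), and the tiny hand-written vectors are all made up here, and the real service gets its embeddings from FastEmbed and may chunk differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// scoredChunk pairs a chunk of page text with its similarity to the query.
type scoredChunk struct {
	Text  string
	Score float64
}

// chunkWords splits text into chunks of at most size words.
func chunkWords(text string, size int) []string {
	if size < 1 {
		return nil
	}
	words := strings.Fields(text)
	var chunks []string
	for start := 0; start < len(words); start += size {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
	}
	return chunks
}

// cosine returns the cosine similarity of a and b.
// It returns 0 for mismatched lengths or zero-norm vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// rankChunks scores each chunk embedding against the query embedding
// and returns the chunks sorted from most to least similar.
func rankChunks(query []float64, chunks []string, embeddings [][]float64) []scoredChunk {
	ranked := make([]scoredChunk, 0, len(chunks))
	for i, text := range chunks {
		if i >= len(embeddings) {
			break
		}
		ranked = append(ranked, scoredChunk{Text: text, Score: cosine(query, embeddings[i])})
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].Score > ranked[j].Score })
	return ranked
}

func main() {
	page := "Qdrant stores the vectors. PostgreSQL stores the metadata. The crawler indexes pages as it goes."
	chunks := chunkWords(page, 5)
	// Toy 3-D vectors stand in for real embeddings, one per chunk.
	embeddings := [][]float64{{0.9, 0.1, 0.0}, {0.2, 0.9, 0.1}, {0.1, 0.2, 0.9}}
	query := []float64{1, 0, 0}
	for _, c := range rankChunks(query, chunks, embeddings) {
		fmt.Printf("%.2f  %s\n", c.Score, c.Text)
	}
}
```

The top-ranked chunks are what you would hand to the LLM in step 7, along with their source URLs.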

### Response Format
```json
{
  "summary": "Concise overview addressing the query directly",
  "results": [
    {
      "point": "Detailed information in markdown format",
      "reference": "https://exact-source-url.com",
      "reference_favicon": "https://exact-source-url.com/favicon.ico",
      "relevance_score": 0.95,
      "timestamp": "when this info was published/updated"
    }
  ],
  "language": "detected_language_code",
  "last_updated": "timestamp",
  "confidence": 0.90
}
```
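
If you are consuming this response from Go, the shape above maps onto a pair of structs. This is a sketch derived only from the JSON sample: the type names `ApexResponse` and `Result` and the helper `parseApexResponse` are hypothetical, and the timestamps are kept as plain strings because the README does not pin down their format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result is one sourced point in the answer.
type Result struct {
	Point            string  `json:"point"`
	Reference        string  `json:"reference"`
	ReferenceFavicon string  `json:"reference_favicon"`
	RelevanceScore   float64 `json:"relevance_score"`
	Timestamp        string  `json:"timestamp"`
}

// ApexResponse mirrors the response format shown in the README sample.
type ApexResponse struct {
	Summary     string   `json:"summary"`
	Results     []Result `json:"results"`
	Language    string   `json:"language"`
	LastUpdated string   `json:"last_updated"`
	Confidence  float64  `json:"confidence"`
}

// parseApexResponse decodes a raw JSON payload into an ApexResponse.
func parseApexResponse(data []byte) (ApexResponse, error) {
	var resp ApexResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

func main() {
	raw := []byte(`{
	  "summary": "Concise overview",
	  "results": [{"point": "Some detail", "reference": "https://example.com", "relevance_score": 0.95}],
	  "language": "en",
	  "confidence": 0.9
	}`)
	resp, err := parseApexResponse(raw)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println(resp.Summary, len(resp.Results), resp.Results[0].Reference)
}
```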

### Crawling Process

When you run the spider, you'll be prompted to:
- Enter URLs you want to crawl
- Set the number of workers (default: 5)

The crawler will:
- Extract content from each page
- Generate embeddings in real-time
- Store vectors in Qdrant
- Store metadata in PostgreSQL
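
For a feel of what "number of workers" means, here is a generic worker-pool sketch in Go. It is not the spider's actual code: `crawlAll` and the injected `fetch` function are hypothetical, and the fallback of 5 workers simply mirrors the default mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll fans urls out to a fixed number of worker goroutines and
// collects one result per URL. fetch stands in for download + extraction.
func crawlAll(urls []string, workers int, fetch func(url string) string) map[string]string {
	if workers < 1 {
		workers = 5 // same default the spider prompt suggests
	}
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				out := fetch(u)
				mu.Lock() // the map is shared, so writes are serialized
				results[u] = out
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets the workers exit their range loops
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}
	// A real crawler would download and parse the page here.
	fetch := func(u string) string { return "content of " + u }
	for u, body := range crawlAll(urls, 5, fetch) {
		fmt.Println(u, "->", body)
	}
}
```

More workers means more pages in flight at once, which also means more load on the target sites and on the embedding service, so bump the number carefully.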

### Manual Service Configuration

Since `froxy.sh` only handles the crawler, you'll need to manually configure:

- **Froxy Apex**: Set up your Groq API key and other environment variables
- **Frontend**: Configure API endpoints and keys
- **Service startup**: Start each service individually after the crawler completes

---

## 🏗️ Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Next.js UI    │───▶│  Search Backend  │───▶│   PostgreSQL    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │
         │                      ▼
         │             ┌──────────────────┐    ┌─────────────────┐
         │             │      Qdrant      │◀───│    FastEmbed    │
         │             │ (Vector Search)  │    │  (Embeddings)   │
         │             └──────────────────┘    └─────────────────┘
         │                      ▲                       ▲
         │                      │                       │
         │             ┌──────────────────┐             │
         │             │    Go Crawler    │─────────────┘
         │             │    (Real-time    │
         │             │    Indexing)     │
         │             └──────────────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Froxy Apex    │───▶│   Groq LLM API   │    │ Chunk Analysis  │
│   (AI Search)   │    │  (Llama 3.1 8B)  │◀───│  (Cosine Sim)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

---

## 📙 Tech Stack

* 🕷️ **Go (Golang)** – crawler with real-time indexing
* 🧠 **FastEmbed** – embedding generation service
* 🚀 **Qdrant** – vector database for semantic search
* 🤖 **Froxy Apex** – AI-powered search with LLM integration
* 🦙 **Llama 3.1 8B** – language model via Groq API
* 💪 **Node.js** – search backend API
* 📀 **PostgreSQL** – structured data storage
* ⚛️ **Next.js** – frontend interface
* 🎨 **TailwindCSS + shadcn/ui** – UI components
* 🐳 **Docker** – containerized services
* 🌐 **Docker Network** – service communication

---

## 🚀 Key Improvements

* **AI-Powered Search**: Perplexity-style intelligent search with LLM integration
* **Semantic Search**: Find content by meaning, not just keywords
* **Real-time Indexing**: Content is indexed as it's crawled
* **Vector Similarity**: Intelligent search results based on context
* **Chunk Analysis**: Deep content analysis with cosine similarity
* **Structured Responses**: Rich JSON responses with sources and confidence scores
* **Query Enhancement**: AI-powered query understanding and improvement
* **Scalable Architecture**: Microservices with Docker containers
* **Automated Setup**: One-command deployment with `froxy.sh`

---

## 📬 Want to contribute?

* Fork it 🌛
* Open a PR 🚰
* Share your ideas 💡

---

## 📜 License

**MIT**: feel free to fork, remix, and learn from it.

---

Made with ❤️ for the curious minds of the internet.

Stay weird. Stay building.

> "Not all who wander are lost; some are just crawling the web with semantic understanding."