https://github.com/multix0/froxy
🕸️ Froxy - A chill open-source web indexing engine built with Go, Node.js, and Next.js. Crawls, analyzes, and serves structured web data with TF-IDF magic and Supabase as the brain.
- Host: GitHub
- URL: https://github.com/multix0/froxy
- Owner: MultiX0
- License: mit
- Created: 2025-05-29T04:15:36.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-09T22:59:19.000Z (4 months ago)
- Last Synced: 2025-06-10T02:01:52.419Z (4 months ago)
- Topics: go, indexing-engine, nextjs, nodejs, open-source, search-engine, seo-tools, supabase, tf-idf, web-crawler
- Language: TypeScript
- Homepage: https://froxy.atlasapp.app/
- Size: 725 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
# **🕷️ Froxy**
> A chill, open-source web engine that crawls, indexes, and vibes with web content using semantic search.

---
## 💡 What is Froxy?
Froxy is a modular full-stack web engine designed to crawl web pages, extract content, and index it using **semantic embeddings** for intelligent search, all powered by modern tools. It includes:
* A **Go**-based crawler (aka the spider 🕷️) with real-time indexing
* **FastEmbed** service for generating semantic embeddings
* **Qdrant** vector database for semantic search
* **Froxy Apex** - AI-powered intelligent search (Perplexity-style)
* A **PostgreSQL** database for structured data
* A **Next.js** front-end UI (fully integrated with real APIs)

This project is built for learning, experimenting, and extending: great for developers who want to understand how modern semantic search engines work from scratch.
> Fun fact: I made this project in just **3 days**, so it might not be perfect, but you know what?
> **It works!**
>
> *(We'll keep evolving this codebase together ❤️)*

> Note: I prefer simplicity over unnecessary complexity. We might make the architecture more advanced in the future, but for now it's simple, clean, and straightforward: no fancy stuff, no over-engineering. It's just a chill project for now; if needed, we can scale it up and make it more complex later. After all, it started as a fun project, nothing more. <3
---
## 🚀 Features
* 🌐 Crawl websites with real-time indexing (Go)
* 🧠 Semantic search using embeddings (FastEmbed + Qdrant)
* 🤖 AI-powered intelligent search with LLM integration (Froxy Apex)
* 🔍 Vector similarity search for intelligent results
* 📊 Chunk-based relevance scoring with cosine similarity
* 🗄️ Store structured data in PostgreSQL
* 🎨 Modern UI in Next.js + Tailwind
* 🐳 Fully containerized with Docker

> The frontend is fully connected to the backend and provides semantic search capabilities.
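To make the vector-search feature concrete, here is a minimal sketch of the request a search backend could issue against Qdrant's REST search endpoint (`POST /collections/{name}/points/search`). The collection name `froxy_pages` is a hypothetical stand-in; the real name lives in the spider's Qdrant setup.

```python
import json

# Hypothetical collection name; the actual one is defined by the spider.
COLLECTION = "froxy_pages"

def build_search_request(query_vector, limit=5):
    """Build the URL path and JSON body for a Qdrant similarity search."""
    path = f"/collections/{COLLECTION}/points/search"
    body = json.dumps({
        "vector": query_vector,   # embedding of the user's query
        "limit": limit,           # top-k results to return
        "with_payload": True,     # also return the stored page metadata
    })
    return path, body
```

The body would be POSTed to `QDRANT_HOST` with the `api-key` header set to `QDRANT_API_KEY` (see the Environment Variables section below).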
---
## 📁 Folder Structure
```
froxy/
├── front-end/          # Next.js frontend
│   ├── app/            # App routes (search, terms, about, etc.)
│   ├── components/     # UI components (shadcn-style)
│   ├── hooks/          # React hooks
│   ├── lib/            # Utility logic
│   ├── public/         # Static assets
│   └── styles/         # TailwindCSS setup
├── indexer-search/     # Node.js search backend
│   ├── lib/
│   ├── functions/
│   ├── services/       # DB + search service
│   └── utils/          # Helper utilities
├── froxy-apex/         # AI-powered intelligent search service
│   ├── api/            # API endpoints
│   ├── db/             # Database connections
│   ├── functions/      # AI processing logic
│   ├── llama/          # LLM integration
│   ├── models/         # Data models
│   └── utils/          # Helper utilities
├── spider/             # Web crawler in Go with real-time indexing
│   ├── db/             # DB handling (PostgreSQL + Qdrant)
│   ├── functions/      # Crawl + indexing logic + proxies (if needed)
│   ├── models/         # Data models
│   └── utils/          # Misc helpers
├── fastembed/          # FastEmbed embedding service
│   ├── models/         # Cached embedding models
│   └── docker-compose.yml
├── qdrant/             # Qdrant vector database
│   └── docker-compose.yml
├── db/                 # PostgreSQL database
│   ├── scripts/        # Shell backups
│   └── docker-compose.yml
├── froxy.sh            # Automated setup & runner script
├── LICENSE             # MIT License
└── readme.md           # This file
```

---
## ⚙️ Getting Started
### Requirements
* Node.js (18+)
* pnpm or npm
* Go (1.23+)
* Docker & Docker Compose
* At least 2GB RAM (for embedding service)

### Quick Setup (Recommended for Crawler)
For the fastest crawler setup without dealing with configuration details:
```bash
# Make the script executable and run it
chmod +x froxy.sh
./froxy.sh
```

The script will automatically:
- Set up all environment variables with default values
- Create the Docker network
- Start all required services (PostgreSQL, Qdrant, FastEmbed)
- Health check all containers
- Guide you through the crawling process

**Note**: The `froxy.sh` script only handles the crawler setup. You'll need to manually start the `froxy-apex` AI service and `front-end` after crawling.
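The "health check" step boils down to polling each service's HTTP endpoint until it answers. Here's a minimal sketch, assuming the services expose health endpoints over HTTP (Qdrant serves `/healthz` on port 6333; the FastEmbed URL below is an assumption):

```python
import time
import urllib.error
import urllib.request

def wait_for_service(url, retries=30, delay=1.0):
    """Poll a health endpoint until it answers HTTP 200, or give up."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; retry after a short pause
        time.sleep(delay)
    return False

# Example: wait_for_service("http://localhost:6333/healthz")  # Qdrant
```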
### Manual Setup
If you prefer to set things up manually:
```bash
# 1. Create Docker network
docker network create froxy-network

# 2. Start Qdrant vector database
cd qdrant
docker-compose up -d --build

# 3. Start PostgreSQL database
cd ../db
# Set proper permissions for PostgreSQL data directory
sudo chown -R 999:999 postgres_data/
docker-compose up -d --build

# 4. Start FastEmbed service
cd ../fastembed
docker-compose up -d --build

# 5. Wait for all services to be healthy, then run the crawler
cd ../spider
go run main.go

# 6. After crawling, start the search backend
cd ../indexer-search
npm install
npm start

# 7. Start the AI-powered search service (Froxy Apex)
# Make sure to configure froxy-apex/.env first
cd ../froxy-apex
go run main.go

# 8. Launch the front-end
cd ../front-end
npm i --legacy-peer-deps
npm run dev
```

---
## 🔐 Environment Variables
### Default Configuration
All services use these environment variables (automatically set by `froxy.sh`):
```env
# Database Configuration (for spider & indexer-search)
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable

# Vector Database Configuration
QDRANT_API_KEY=froxy-secret-key
QDRANT_HOST=http://localhost:6333

# FastEmbed Service
EMBEDDING_HOST=http://localhost:5050

# AI Service (for froxy-apex)
LLM_API_KEY=your_groq_api_key
API_KEY=your_froxy_apex_api_key
```

### Service-Specific Variables
#### `db/.env`
```env
POSTGRES_DB=froxy_db
POSTGRES_USER=froxy_user
POSTGRES_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
```

#### `qdrant/.env`
```env
QDRANT_API_KEY=froxy-secret-key
```

#### `spider/.env` & `indexer-search/.env`
```env
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
QDRANT_API_KEY=froxy-secret-key
EMBEDDING_HOST=http://localhost:5050
```

#### `froxy-apex/.env`
```env
LLM_API_KEY=your_groq_api_key
QDRANT_HOST=http://localhost:6333
EMBEDDING_HOST=http://localhost:5050
API_KEY=your_froxy_apex_api_key
QDRANT_API_KEY=froxy-secret-key
```

#### `front-end/.env`
```env
API_URL=http://localhost:8080
API_KEY=your_api_key
WEBSOCKET_URL=ws://localhost:8080/ws/search
FROXY_APEX_API_KEY=your_froxy_apex_api_key
ACCESS_CODE=auth_access_for_froxy_apex_ui
AUTH_SECRET_TOKEN=jwt_token_for_apex_ui_to_calc_the_usage
```

> 💡 The `froxy.sh` script automatically creates `.env` files with working default values for the crawler and database services. You'll need to manually configure `froxy-apex/.env` and `front-end/.env` for the AI search and UI components.
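As a quick sketch of how a service might consume these variables at startup (the helper below is illustrative, not part of the codebase; the defaults mirror the values `froxy.sh` writes):

```python
import os

def db_config():
    """Collect the database settings documented above, falling back to
    the defaults that froxy.sh writes into the .env files."""
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "port": int(os.environ.get("DB_PORT", "5432")),
        "user": os.environ.get("DB_USER", "froxy_user"),
        "password": os.environ.get("DB_PASSWORD", "froxy_password"),
        "dbname": os.environ.get("DB_NAME", "froxy_db"),
        "sslmode": os.environ.get("DB_SSLMODE", "disable"),
    }
```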
---
## 🤖 How it works
### Traditional Search
1. **Crawler** pulls website content from your provided URLs
2. **Real-time indexing** generates semantic embeddings using FastEmbed
3. **Qdrant** stores vector embeddings for semantic similarity search
4. **PostgreSQL** stores structured metadata
5. **Frontend** provides an intelligent semantic search interface

### AI-Powered Search (Froxy Apex)
1. **User query** is received and processed
2. **Query enhancement** using Llama 3.1 8B via Groq API
3. **Embedding generation** for the enhanced query using FastEmbed
4. **Vector search** in Qdrant to find relevant pages
5. **Content chunking** of relevant pages for detailed analysis
6. **Cosine similarity** calculation for each chunk against the query
7. **LLM processing** to generate a structured response with:
   - Concise summary
   - Detailed results with sources
   - Relevance scores
   - Reference links and favicons
   - Confidence ratings

### Response Format
```json
{
  "summary": "Concise overview addressing the query directly",
  "results": [
    {
      "point": "Detailed information in markdown format",
      "reference": "https://exact-source-url.com",
      "reference_favicon": "https://exact-source-url.com/favicon.ico",
      "relevance_score": 0.95,
      "timestamp": "when this info was published/updated"
    }
  ],
  "language": "detected_language_code",
  "last_updated": "timestamp",
  "confidence": 0.90
}
```

### Crawling Process
When you run the spider, you'll be prompted to:
- Enter URLs you want to crawl
- Set the number of workers (default: 5)

The crawler will:
- Extract content from each page
- Generate embeddings in real-time
- Store vectors in Qdrant
- Store metadata in PostgreSQL

### Manual Service Configuration
Since `froxy.sh` only handles the crawler, you'll need to manually configure:
- **Froxy Apex**: Set up your Groq API key and other environment variables
- **Frontend**: Configure API endpoints and keys
- **Service startup**: Start each service individually after the crawler completes

---
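Steps 5 and 6 of the Apex flow above (content chunking and cosine scoring) boil down to something like this minimal sketch; the chunk size and top-k values are illustrative, not the project's actual parameters:

```python
import math

def chunk_text(text, size=50):
    """Split a page into fixed-size word chunks for fine-grained scoring."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_chunks(query_vec, chunks, chunk_vecs, k=3):
    """Rank chunks by similarity to the query embedding, best first."""
    scored = zip(chunks, (cosine_similarity(query_vec, v) for v in chunk_vecs))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

The top-scoring chunks are what the LLM then sees when it assembles the structured response shown above.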
## 🏗️ Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Next.js UI    │───▶│  Search Backend  │───▶│   PostgreSQL    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │
         │                      ▼
         │             ┌──────────────────┐    ┌─────────────────┐
         │             │      Qdrant      │◀───│    FastEmbed    │
         │             │ (Vector Search)  │    │  (Embeddings)   │
         │             └──────────────────┘    └─────────────────┘
         │                      ▲                       ▲
         │                      │                       │
         │             ┌──────────────────┐             │
         │             │    Go Crawler    │─────────────┘
         │             │   (Real-time     │
         │             │    Indexing)     │
         │             └──────────────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Froxy Apex    │───▶│   Groq LLM API   │    │ Chunk Analysis  │
│   (AI Search)   │    │  (Llama 3.1 8B)  │◀───│  (Cosine Sim)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

---
## 🛠️ Tech Stack
* 🕷️ **Go (Golang)** - crawler with real-time indexing
* 🧠 **FastEmbed** - embedding generation service
* 🔍 **Qdrant** - vector database for semantic search
* 🤖 **Froxy Apex** - AI-powered search with LLM integration
* 🦙 **Llama 3.1 8B** - language model via Groq API
* 🟩 **Node.js** - search backend API
* 🐘 **PostgreSQL** - structured data storage
* ⚛️ **Next.js** - frontend interface
* 🎨 **TailwindCSS + shadcn/ui** - UI components
* 🐳 **Docker** - containerized services
* 🔗 **Docker Network** - service communication

---
## 🔑 Key Improvements
* **AI-Powered Search**: Perplexity-style intelligent search with LLM integration
* **Semantic Search**: Find content by meaning, not just keywords
* **Real-time Indexing**: Content is indexed as it's crawled
* **Vector Similarity**: Intelligent search results based on context
* **Chunk Analysis**: Deep content analysis with cosine similarity
* **Structured Responses**: Rich JSON responses with sources and confidence scores
* **Query Enhancement**: AI-powered query understanding and improvement
* **Scalable Architecture**: Microservices with Docker containers
* **Automated Setup**: One-command deployment with `froxy.sh`

---
## 💬 Want to contribute?
* Fork it 🔀
* Open a PR 🍰
* Share your ideas 💡

---
## 📄 License
**MIT** - feel free to fork, remix, and learn from it.
---
Made with ❤️ for the curious minds of the internet.
Stay weird. Stay building.
> "Not all who wander are lost β some are just crawling the web with semantic understanding."