Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mohdjey123/roosearch
Last synced: about 22 hours ago
JSON representation
- Host: GitHub
- URL: https://github.com/mohdjey123/roosearch
- Owner: Mohdjey123
- Created: 2024-10-15T18:52:46.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-10-17T19:52:39.000Z (30 days ago)
- Last Synced: 2024-10-21T00:22:15.931Z (27 days ago)
- Language: JavaScript
- Size: 91.8 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# RooSearch
RooSearch is a lightweight, open-source search engine designed to crawl and index web pages. It allows for searching indexed content and provides basic web crawling capabilities.
## Features
- **Crawling**: Crawls web pages and extracts text and links.
- **Indexing**: Stores crawled content for efficient searching.
- **Search**: Enables searching the indexed content with a simple keyword-based search.
- **Automatic Crawling**: Supports automatic crawling with predefined seed URLs.
- **Web Interface**: Provides a basic web interface to interact with the search engine.

## Getting Started
### Prerequisites
Make sure you have the following installed:
- Node.js (v14 or higher)
- npm (v6 or higher)

### Installation
1. **Clone the repository**

   ```bash
   git clone https://github.com/Mohdjey123/RooSearch.git
   cd RooSearch
   ```
2. **Install dependencies**

   ```bash
   npm install
   ```
3. **Start the server**

   ```bash
   node server.js
   ```

The server will run on http://localhost:3000.

### Usage
**Web Interface**
1. Open a web browser and go to http://localhost:3000.
2. Use the search bar to find indexed content.
3. View the crawling status by navigating to /status.
**API Endpoints**

- `GET /search?q=<query>`: Searches the indexed content for the given query.
- `GET /crawl?url=<url>`: Crawls the specified URL and adds it to the index.
- `GET /view-index`: Returns the current in-memory index.
- `GET /crawl-status`: Reports the crawling status, including the number of crawled URLs and the queue length.
- `GET /status`: Displays the current server status.
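
For example, the search endpoint can be queried from Node using axios (already a project dependency). This is a minimal sketch that assumes the server is running locally on the default port; the shape of the response payload is not documented here, so inspect it in your own setup.

```javascript
// Query the /search endpoint (sketch; assumes the server runs on localhost:3000).
const axios = require("axios");

async function searchIndex(query) {
  const { data } = await axios.get("http://localhost:3000/search", {
    params: { q: query }, // axios URL-encodes the query string
  });
  return data; // the response shape depends on search.js; log it to inspect
}

searchIndex("open source").then(console.log).catch(console.error);
```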
**Configuration**
Modify the server and crawler settings in server.js and crawler.js as needed:
- Port: Change the port in server.js.
- Seed URLs and max pages: Configure these in server.js for automatic crawling.
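
As a rough illustration, the settings described above might look like the following in server.js. The variable names here are assumptions for the sketch and may differ from the actual code.

```javascript
// Hypothetical configuration values in server.js; real names may differ.
const PORT = process.env.PORT || 3000;  // port the web server listens on
const SEED_URLS = [                     // starting points for automatic crawling
  "https://example.com",
];
const MAX_PAGES = 50;                   // stop automatic crawling after this many pages
```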
**Project Structure**
- crawler.js: Handles web crawling and extracting content.
- indexer.js: Manages indexing, saving, and loading the search index.
- search.js: Implements the search functionality and result processing.
- server.js: The main server script that integrates all components and provides the web interface.

**Example**
To start a crawl with automatic seeding:
1. Start the server:

   ```bash
   node --watch server.js
   ```

2. Automatic crawling will begin with the predefined seed URLs, or you can manually crawl new URLs via the /crawl endpoint, as sketched below.
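
For the manual route, a URL can be queued through the /crawl endpoint and progress checked via /crawl-status. This is a sketch assuming the default port; the exact fields of the status payload depend on server.js.

```javascript
// Manually trigger a crawl, then check progress (sketch; default port assumed).
const axios = require("axios");

async function crawlAndCheck(url) {
  await axios.get("http://localhost:3000/crawl", { params: { url } });
  const { data: status } = await axios.get("http://localhost:3000/crawl-status");
  console.log("Crawl status:", status); // e.g. crawled count and queue length
}

crawlAndCheck("https://example.com").catch(console.error);
```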
**Dependencies**

- axios: For HTTP requests.
- cheerio: For parsing HTML content.
- express: Web server framework.
- zlib: Compression utilities for indexing.
- fs: File system operations.

**Contributing**
Contributions are welcome! Please submit a pull request or open an issue to discuss improvements.
**License**
This project is open-source under the MIT License.
This `README.md` file provides an overview of the RooSearch project, setup instructions, and usage details.