https://github.com/pavi2410/semsearch
This project implements a web search engine command-line interface (CLI) using the BM25 (Best Matching 25) algorithm. It is written in TypeScript and utilizes Bun APIs for improved performance.
hacktoberfest search-engine semantic-web tf-idf
- Host: GitHub
- URL: https://github.com/pavi2410/semsearch
- Owner: pavi2410
- License: mit
- Created: 2024-10-10T12:34:48.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-10-31T08:06:43.000Z (about 2 months ago)
- Last Synced: 2024-12-10T17:53:21.175Z (12 days ago)
- Topics: hacktoberfest, search-engine, semantic-web, tf-idf
- Language: TypeScript
- Homepage:
- Size: 56.6 KB
- Stars: 5
- Watchers: 0
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Semsearch
This project implements a web search engine command-line interface (CLI) using the BM25 (Best Matching 25) algorithm. It is written in TypeScript and utilizes Bun APIs for improved performance.
![image](https://github.com/user-attachments/assets/c25dfcf4-b7ce-4c16-a0d2-8dad3785ba55)
## Getting Started
### Installation
The latest version of Bun is required.
Install dependencies:
```
bun install
```
### Usage
First, crawl websites and index their content:
```
bun crawl
bun index
```
Then, use the CLI to search:
```
bun search [search terms ...]
```
## How It Works
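The ranking behind the indexing and searching steps described in this section can be illustrated with a small, self-contained sketch of BM25. It is not the project's actual code: the `Doc` shape, the `buildIndex` and `search` helpers, the tokenizer, and the `K1 = 1.2` / `B = 0.75` parameters are assumptions chosen for illustration, and the in-memory index only stands in for whatever `docs.json` and `index.json` really contain.

```typescript
// Hypothetical document shape; the project's real docs.json layout may differ.
interface Doc {
  id: string;
  text: string;
}

// Commonly used BM25 defaults (assumed; the README does not state the values).
const K1 = 1.2;
const B = 0.75;

// Naive tokenizer: lowercase, then split on anything that is not a-z or 0-9.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

interface Bm25Index {
  docIds: string[];
  docLengths: number[];
  avgDocLength: number;
  // term -> (document position -> term frequency)
  postings: Map<string, Map<number, number>>;
}

// Build an inverted index with per-document term frequencies and lengths.
function buildIndex(docs: Doc[]): Bm25Index {
  const postings = new Map<string, Map<number, number>>();
  const docLengths: number[] = [];
  docs.forEach((doc, i) => {
    const tokens = tokenize(doc.text);
    docLengths.push(tokens.length);
    for (const term of tokens) {
      let perDoc = postings.get(term);
      if (!perDoc) {
        perDoc = new Map<number, number>();
        postings.set(term, perDoc);
      }
      perDoc.set(i, (perDoc.get(i) ?? 0) + 1);
    }
  });
  const total = docLengths.reduce((sum, n) => sum + n, 0);
  const avgDocLength = docs.length > 0 ? total / docs.length : 0;
  return { docIds: docs.map((d) => d.id), docLengths, avgDocLength, postings };
}

// Score every document that contains a query term and return the best topK.
function search(index: Bm25Index, query: string, topK = 10): { id: string; score: number }[] {
  const n = index.docIds.length;
  const scores = new Map<number, number>();
  Array.from(new Set(tokenize(query))).forEach((term) => {
    const perDoc = index.postings.get(term);
    if (!perDoc) return;
    // Smoothed IDF; the "1 +" keeps the weight positive for very common terms.
    const idf = Math.log(1 + (n - perDoc.size + 0.5) / (perDoc.size + 0.5));
    perDoc.forEach((tf, docPos) => {
      // Length normalization: longer-than-average documents are penalized.
      const norm = 1 - B + B * (index.docLengths[docPos] / index.avgDocLength);
      const termScore = (idf * tf * (K1 + 1)) / (tf + K1 * norm);
      scores.set(docPos, (scores.get(docPos) ?? 0) + termScore);
    });
  });
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([docPos, score]) => ({ id: index.docIds[docPos], score }));
}

// Tiny made-up corpus, standing in for crawled pages.
const docs: Doc[] = [
  { id: "a", text: "Bun is a fast JavaScript runtime" },
  { id: "b", text: "BM25 ranks documents for a search query" },
  { id: "c", text: "A search engine crawls, indexes, and ranks web pages" },
];
const index = buildIndex(docs);
const results = search(index, "search ranks");
console.log(results.map((r) => r.id)); // prints [ "b", "c" ]
```

In this toy corpus both `b` and `c` contain each query term once, so the shorter document `b` ranks first because of length normalization, and `a` is not returned at all. The default `topK = 10` mirrors the top-10 cutoff described for the search step. The pipeline itself runs in three stages: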
1. **Crawling**: The engine crawls the specified websites depth-first, collecting each page's HTML content and extracting links to other pages for further crawling. The collected content is written to the `webpages` directory.
2. **Indexing**: The engine processes the collected pages and builds an index using the BM25 algorithm. It writes the list of documents to `docs.json` and their corresponding BM25 scores to `index.json`.
3. **Searching**: Users enter search queries, and the engine returns the top 10 results ranked by BM25 score.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.