https://github.com/duaa-a/web-crawler-with-tf-idf
A simple search engine using the Term Frequency - Inverse Document Frequency (TF-IDF) algorithm
java search-engine term-frequency term-frequency-inverse-document-frequency web-crawler
Last synced: 11 months ago
- Host: GitHub
- URL: https://github.com/duaa-a/web-crawler-with-tf-idf
- Owner: DuaA-A
- Created: 2025-04-21T16:02:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-12T20:55:54.000Z (11 months ago)
- Last Synced: 2025-07-12T22:25:32.893Z (11 months ago)
- Topics: java, search-engine, term-frequency, term-frequency-inverse-document-frequency, web-crawler
- Language: Java
- Homepage:
- Size: 1.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Web Crawler With TF-IDF
A Java project that implements a simple web crawler and search engine using the TF-IDF (Term Frequency - Inverse Document Frequency) algorithm. It processes crawled web pages, builds an inverted index, calculates TF-IDF scores, and ranks results for search queries through a query processor.
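For reference, one common textbook formulation of TF-IDF is shown below. This is an assumption for illustration: the README does not state which weighting variant the project uses, and the symbols ($f_{t,d}$ for the raw count of term $t$ in document $d$, $N$ for the number of crawled documents, $n_t$ for the number of documents containing $t$) are introduced here, not taken from the source.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this variant, a term that appears in every document gets $\mathrm{idf}(t) = \log 1 = 0$, so it contributes nothing to the ranking.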
Features
- WebCrawler.java: Crawls web pages to collect data.
- TextProcessing.java: Tokenizes, filters, and cleans text data.
- Stemmer.java: Performs word stemming for normalization.
- InvertedIndex.java: Builds and stores the inverted index for quick lookup.
- TFIDFCalculator.java: Calculates TF-IDF scores for indexed terms.
- QueryProcessor.java: Handles user queries and ranks results using TF-IDF.
- Main.java: Entry point for running the application.
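To illustrate how the indexing, scoring, and query components listed above can fit together, here is a minimal sketch. It is not the project's actual code: the class name `TfIdfSketch`, its method names, and the specific weighting (term count divided by document length, times the natural log of total documents over documents containing the term) are assumptions made for this example. Crawling and stemming are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and weighting are assumptions, not the repository's API.
public class TfIdfSketch {
    // Inverted index: term -> (document id -> raw count of the term in that document)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();
    // Document id -> number of tokens in the document
    private final Map<Integer, Integer> docLengths = new HashMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Add one document's tokens to the inverted index.
    public void addDocument(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (String token : tokens) {
            index.computeIfAbsent(token, k -> new HashMap<>())
                 .merge(docId, 1, Integer::sum);
        }
        docLengths.put(docId, tokens.size());
    }

    // tf = count / document length; idf = ln(total docs / docs containing the term).
    public double tfIdf(String term, int docId) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null || !postings.containsKey(docId)) {
            return 0.0;
        }
        double tf = postings.get(docId) / (double) docLengths.get(docId);
        double idf = Math.log(docLengths.size() / (double) postings.size());
        return tf * idf;
    }

    // Sum the TF-IDF of each query term per document; return ids by descending score.
    public List<Integer> search(String query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> postings =
                index.getOrDefault(term, Collections.emptyMap());
            for (int docId : postings.keySet()) {
                scores.merge(docId, tfIdf(term, docId), Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        TfIdfSketch engine = new TfIdfSketch();
        engine.addDocument(1, "Java web crawler");
        engine.addDocument(2, "Java search engine");
        engine.addDocument(3, "cooking recipes");
        // Document 1 matches both query terms, document 2 only "java".
        System.out.println(engine.search("java crawler")); // prints [1, 2]
    }
}
```

Keeping per-document counts inside the index means the query step only touches documents that contain at least one query term, which is the usual reason for building an inverted index in the first place.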
Technologies Used
- Java
- Basic File I/O
- Collections Framework
- String Processing
How to Run
javac *.java
java Main
Ensure all Java files are in the same directory or set up your project structure accordingly.
License
This project is for educational purposes.