https://github.com/duaa-a/web-crawler-with-tf-idf

a simple search engine using Term Frequency - Inverse Document Frequency algorithm
java search-engine term-frequency term-frequency-inverse-document-frequency web-crawler



Web Crawler With TF-IDF


A Java project that implements a simple web crawler and search engine using the TF-IDF (Term Frequency - Inverse Document Frequency) algorithm. It processes crawled web pages, builds an inverted index, calculates TF-IDF scores, and answers search queries through a query processor.
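
For reference, one common textbook formulation of TF-IDF is given below. This README does not say which variant the project uses, so treat the raw-count term frequency and the natural logarithm as assumptions. Here f_{t,d} is the number of times term t occurs in document d, N is the number of crawled documents, and n_t is the number of documents containing t.

```latex
\mathrm{tf}(t, d) = f_{t,d}
\qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tfidf}(t, d)
```

A term that appears in every document gets idf = log(1) = 0, so it contributes nothing to the ranking. Other variants divide f_{t,d} by the document length or smooth the idf term.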

Features

  • WebCrawler.java: Crawls web pages to collect data.
  • TextProcessing.java: Tokenizes, filters, and cleans text data.
  • Stemmer.java: Performs word stemming for normalization.
  • InvertedIndex.java: Builds and stores the inverted index for quick lookup.
  • TFIDFCalculator.java: Calculates TF-IDF scores for indexed terms.
  • QueryProcessor.java: Handles user queries and ranks results using TF-IDF.
  • Main.java: Entry point for running the application.
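
The indexing, scoring, and query steps listed above can be sketched in a single class. This is a minimal illustration, not the repository's code: the class name `TfIdfSketch` and the methods `tokenize`, `buildIndex`, `score`, and `search` are hypothetical. It assumes raw-count term frequency, idf = ln(N / df), and a tokenizer that only lowercases and splits (no stemming or stop-word filtering).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not taken from the repository.
class TfIdfSketch {

    // Lowercase the text and split on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Inverted index: term -> (document id -> raw count of the term in that document).
    // Document ids are simply positions in the input list.
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : tokenize(docs.get(id))) {
                index.computeIfAbsent(term, k -> new HashMap<>())
                     .merge(id, 1, Integer::sum);
            }
        }
        return index;
    }

    // Score of a document = sum over query terms of tf * ln(N / df).
    // Only documents containing at least one query term appear in the result.
    static Map<Integer, Double> score(String query,
                                      Map<String, Map<Integer, Integer>> index,
                                      int docCount) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> postings = index.get(term);
            if (postings == null) {
                continue; // term was never indexed
            }
            double idf = Math.log((double) docCount / postings.size());
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        return scores;
    }

    // Ids of matching documents, ordered by descending score.
    static List<Integer> search(String query,
                                Map<String, Map<Integer, Integer>> index,
                                int docCount) {
        Map<Integer, Double> scores = score(query, index, docCount);
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ids;
    }
}
```

With the three documents "the cat sat", "the dog barked", and "the cat chased the dog", calling `search("cat chased", index, 3)` ranks document 2 ahead of document 0, because only document 2 contains the rarer term "chased". Keeping per-document counts inside the postings map means scoring touches only documents that contain a query term, which is the main reason to use an inverted index.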

Technologies Used

  • Java
  • Basic File I/O
  • Collections Framework
  • String Processing

How to Run

javac *.java
java Main

These commands assume all .java files are in the current directory. If the sources are organized into packages or subdirectories, adjust the javac paths and the main class name accordingly.

License


This project is for educational purposes.