https://github.com/duaa-a/web-crawler-with-tf-idf

A simple search engine using the Term Frequency - Inverse Document Frequency (TF-IDF) algorithm




Host: GitHub
URL: https://github.com/duaa-a/web-crawler-with-tf-idf
Owner: DuaA-A
Created: 2025-04-21T16:02:33.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-12T20:55:54.000Z (11 months ago)
Last Synced: 2025-07-12T22:25:32.893Z (11 months ago)
Topics: java, search-engine, term-frequency, term-frequency-inverse-document-frequency, web-crawler
Language: Java
Homepage:
Size: 1.3 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


README

Web Crawler With TF-IDF

A Java project that implements a simple web crawler and search engine using the TF-IDF (Term Frequency - Inverse Document Frequency) algorithm. It processes crawled web pages, builds an inverted index, calculates TF-IDF scores, and answers search queries through a query processor that ranks pages by those scores.
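For reference, one common textbook formulation of the TF-IDF weight is given below. The README does not say which variant the project uses, so the raw-count term frequency and the unsmoothed logarithm are assumptions here: $\mathrm{tf}(t,d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of crawled documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tfidf}(t, d)
```

Under this formulation a term that appears in every document gets weight $\log 1 = 0$, so very common words contribute nothing to the ranking.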

Features

WebCrawler.java: Crawls web pages to collect data.

TextProcessing.java: Tokenizes, filters, and cleans text data.

Stemmer.java: Performs word stemming for normalization.

InvertedIndex.java: Builds and stores the inverted index for quick lookup.

TFIDFCalculator.java: Calculates TF-IDF scores for indexed terms.

QueryProcessor.java: Handles user queries and ranks results using TF-IDF.

Main.java: Entry point for running the application.
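The way these components could fit together is sketched below in a single file. This is an illustrative sketch, not the repository's code: the class name `TfIdfSketch` and its methods are hypothetical, the crawler is replaced by hard-coded strings, stemming is omitted, and the score is assumed to be raw term count times `log10(N / df)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the TextProcessing / InvertedIndex /
// TFIDFCalculator / QueryProcessor classes described above.
class TfIdfSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    // (Assumption: no stop-word filtering or stemming in this sketch.)
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Inverted index: term -> (document id -> number of occurrences).
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : tokenize(docs.get(id))) {
                index.computeIfAbsent(term, k -> new HashMap<>())
                     .merge(id, 1, Integer::sum);
            }
        }
        return index;
    }

    // Assumed weighting: raw count * log10(numDocs / documentFrequency).
    static double tfIdf(Map<String, Map<Integer, Integer>> index,
                        String term, int docId, int numDocs) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null || !postings.containsKey(docId)) {
            return 0.0;
        }
        double idf = Math.log10((double) numDocs / postings.size());
        return postings.get(docId) * idf;
    }

    // Sum the TF-IDF weights of the query terms per document and
    // return the matching document ids, best score first.
    static List<Integer> search(Map<String, Map<Integer, Integer>> index,
                                String query, int numDocs) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> postings =
                index.getOrDefault(term, Collections.emptyMap());
            for (int docId : postings.keySet()) {
                scores.merge(docId, tfIdf(index, term, docId, numDocs), Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the cat sat on the mat",
            "the dog chased the cat",
            "dogs and cats are pets");
        Map<String, Map<Integer, Integer>> index = buildIndex(docs);
        System.out.println(search(index, "dog cat", docs.size())); // prints [1, 0]
    }
}
```

Document 1 ranks first because it contains both query terms, and "dog" is rarer than "cat" in this tiny corpus. Document 2 does not match at all, since "dogs" and "cats" are not reduced to "dog" and "cat" without a stemmer, which is the gap a component like Stemmer.java would close.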

Technologies Used

Java

Basic File I/O

Collections Framework

String Processing

How to Run


javac *.java

java Main

Ensure all Java files are in the same directory, or adjust the commands to match your project structure.

License

This project is for educational purposes.
