https://github.com/xvxvdee/cps842-a1

A Python program that builds an inverted index from a collection of Wikipedia articles and performs single-term queries.
index information-retrieval web-search

Host: GitHub
URL: https://github.com/xvxvdee/cps842-a1
Owner: xvxvdee
Created: 2023-12-25T23:55:22.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-12-25T23:59:57.000Z (almost 2 years ago)
Last Synced: 2025-02-09T21:15:33.692Z (9 months ago)
Topics: index, information-retrieval, web-search
Language: Jupyter Notebook
Homepage:
Size: 2.9 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

This assignment was done in a group of 2.

# Inverted Index and Query

This is a Python program that builds an inverted index from a collection of Wikipedia articles and performs single-term queries. Stemming and stop word removal can each be turned on or off.

## Dataset

The dataset consists of more than 200,000 English Wikipedia pages submitted for TREC fair ranking. Each page includes an ID, the article's title, the page's URL, and the HTML of its content, all stored in a JSON file compressed with gzip. Because of RAM restrictions, this assignment uses a subset of 50,000 documents.
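As a rough illustration of how such a subset could be streamed without loading the whole file, here is a minimal sketch. It assumes the gzip file holds one JSON object per line (JSON Lines), which the description above does not confirm; the function name `read_documents` and the `limit` parameter are hypothetical and not taken from the repository.

```python
import gzip
import json
from itertools import islice


def read_documents(file_path, limit=50_000):
    """Yield up to `limit` records from a gzip-compressed JSON Lines file.

    Assumption: one JSON object per line. The real dataset layout may differ.
    """
    # "rt" decompresses on the fly and decodes the bytes as text.
    with gzip.open(file_path, "rt", encoding="utf-8") as handle:
        # Lazily parse non-empty lines so only one record is in memory at a time.
        records = (json.loads(line) for line in handle if line.strip())
        yield from islice(records, limit)
```

Because the function is a generator, stopping after `limit` records means the rest of the file is never decompressed.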

## Dependencies

- Python 3.8 or higher
- BeautifulSoup from bs4
- gzip
- regex

## Usage

### Indexing

To create the index, run the `indexConstruction.py` file with the following arguments:

- `file_path`: the path to the gzip dataset file
- `stopwords_path`: the path to the cacm_stopwords text file
- `stem_on`: a boolean value indicating whether to apply stemming or not
- `stopwords_on`: a boolean value indicating whether to remove stop words or not

For example:

```shell
python indexConstruction.py "path/to/zipped/data" "cacm_stopwords.txt" True True
```

This will create the index with stemming and stop word removal enabled, and save it as JSON files in the same directory.
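To make the idea concrete, the following is a minimal sketch of a positional inverted index with optional stop word removal and stemming. It is not the code in `indexConstruction.py`: the function names, the tokenization rule, and the index layout (`term -> {doc_id: [positions]}`) are all assumptions made for illustration, and the stemmer is passed in as a callable because the README does not say which stemming algorithm is used.

```python
import re
from collections import defaultdict

# Assumed tokenization: lowercase runs of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_index(documents, stopwords=frozenset(), stem=None):
    """Build a positional inverted index from (doc_id, text) pairs.

    stopwords: tokens to skip (empty means stop word removal is off).
    stem: optional callable mapping a token to its stem (None means off).
    Positions refer to the token stream before stop words are removed.
    """
    index = defaultdict(dict)
    for doc_id, text in documents:
        for position, token in enumerate(tokenize(text)):
            if token in stopwords:
                continue
            term = stem(token) if stem else token
            # Append this position to the posting list for (term, doc_id).
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)
```

With this layout, the document frequency of a term is the number of keys in its posting dictionary, and the term frequency in a document is the length of that document's position list. A structure like this can be written to disk with `json.dump`, provided the document IDs are strings.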

### Querying

To query the index, run the `query.py` file with no arguments. The program first asks whether to query with stemming and whether to query with stop word removal; answer each prompt with `y` for yes or `n` for no. It then loads the corresponding index files and asks for a word to query. For each query, the program displays the document frequency and the top 20 documents that contain the word, along with each document's term frequency and the positions of the term. It also shows the time taken to process the query. To exit the program, type `ZZEND`.

For example:

```text
python query.py
Do you want to query with stemming? (y/n) y
Do you want to query with stopwords? (y/n) n
Enter a word to be queried: algorithm
Document Frequency: 104
Doc ID: 5a0c1c9e55b14b36ce7b2c6f
Title: Algorithm
Term Frequency: 38
Positions: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
...
Query time: 0.001 seconds
Enter a word to be queried: ZZEND
```
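The lookup behind a session like the one above can be sketched as follows. This is an illustrative sketch, not the logic in `query.py`: it assumes an in-memory index shaped as `term -> {doc_id: [positions]}`, and it assumes that the "top 20" documents are ranked by term frequency, which the README does not state. The name `query_term` and its parameters are hypothetical.

```python
import time


def query_term(index, term, top_k=20):
    """Look up one term and return its statistics and top documents.

    Assumptions: `index` maps term -> {doc_id: [positions]}, and documents
    are ranked by descending term frequency.
    """
    start = time.perf_counter()
    postings = index.get(term, {})
    # Term frequency is the number of recorded positions in a document.
    ranked = sorted(postings.items(), key=lambda item: len(item[1]), reverse=True)
    results = [
        {"doc_id": doc_id, "term_frequency": len(positions), "positions": positions}
        for doc_id, positions in ranked[:top_k]
    ]
    elapsed = time.perf_counter() - start
    return {
        "document_frequency": len(postings),
        "results": results,
        "seconds": elapsed,
    }
```

The query word has to be normalized the same way as the indexed terms (for example lowercased, and stemmed if the stemmed index was loaded); otherwise the lookup will miss terms that are in the index.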
