Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
A mini Wikipedia search engine that uses Blocked Sort-Based Indexing (BSBI) to build an inverted index of a given Wikipedia dump, runs queries against the index, and retrieves the top N results via relevance ranking (weighted TF-IDF) of the documents.
https://github.com/ananyaarun/wikipedia-search-engine
information-retrieval python3
- Host: GitHub
- URL: https://github.com/ananyaarun/wikipedia-search-engine
- Owner: ananyaarun
- Created: 2020-08-26T18:49:00.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-11-29T15:37:15.000Z (about 4 years ago)
- Last Synced: 2024-11-09T09:53:11.316Z (2 months ago)
- Topics: information-retrieval, python3
- Language: Python
- Homepage:
- Size: 38.1 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Wikipedia-Search-Engine
A mini Wikipedia search engine that uses Blocked Sort-Based Indexing (BSBI) to build the inverted index of a given Wikipedia dump, runs queries against the index, and retrieves the top N results via relevance ranking of the documents.
[Link to Wikipedia XML dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download)
Python 3, nltk, and PyStemmer (used for stemming) are required to run the scripts.
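If nltk has just been installed, the stopword list used during indexing must be fetched once; a minimal setup snippet (assuming nltk and PyStemmer are installed, e.g. via pip):

```python
# One-time setup: download the stopword corpus that the indexing
# step pulls from nltk.corpus (see the steps described below).
import nltk
nltk.download('stopwords')
```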
## Utility
- wiki_indexer.py - parses the Wikipedia dump and builds the inverted index
- merge_index.py - merges the index files created
- split_index.py - splits the index into smaller, alphabetically sorted chunks for easy searching
- wiki_search.py - script for querying

# Creation of Inverted Index
Creation of the inverted index happens in three steps (a sketch of the pipeline follows the list).
- First, the individual XML dumps are indexed in the following steps:
  - Parsing: the given XML corpus is parsed using a SAX parser
  - Casefolding: upper case is converted to lower case
  - Tokenisation: sentences are split into tokens using regex
  - Stop Word Removal: stop words are removed using stopwords from nltk.corpus
  - Stemming: PyStemmer is used to stem individual words
- Once preprocessing is done, individual index files are created with tokens and their respective posting lists
- All the index files created in the previous step are merged
- The merged index is then split into separate files in alphabetically sorted order
- The index is now ready for querying
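A minimal sketch of this pipeline, assuming plain text has already been pulled out of the XML (the actual wiki_indexer.py parses the dump with a SAX parser) and using an illustrative `term doc:tf,...` line format on disk rather than the repo's real one:

```python
import heapq
import re
from collections import defaultdict
from itertools import groupby

import Stemmer                      # PyStemmer
from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))
STEMMER = Stemmer.Stemmer('english')
TOKEN = re.compile(r'[a-z0-9]+')

def preprocess(text):
    """Casefold, tokenise with a regex, drop stop words, stem."""
    tokens = TOKEN.findall(text.lower())              # casefolding + tokenisation
    return STEMMER.stemWords([t for t in tokens if t not in STOP])

def index_block(pages, path):
    """Index one block of (doc_id, text) pairs and write it sorted by token."""
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, text in pages:
        for term in preprocess(text):
            postings[term][doc_id] += 1               # term frequency per document
    with open(path, 'w') as f:
        for term in sorted(postings):
            plist = ','.join(f'{d}:{tf}' for d, tf in sorted(postings[term].items()))
            f.write(f'{term} {plist}\n')

def merge_blocks(paths, out_path):
    """k-way merge of the sorted block files (the merge_index.py step);
    posting lists of a term appearing in several blocks are concatenated."""
    files = [open(p) for p in paths]
    rows = (line.rstrip('\n').split(' ', 1) for line in heapq.merge(*files))
    with open(out_path, 'w') as out:
        for term, group in groupby(rows, key=lambda r: r[0]):
            out.write(term + ' ' + ','.join(r[1] for r in group) + '\n')
    for f in files:
        f.close()
```

Since each block file is already sorted by term, the k-way merge keeps equal terms adjacent, and splitting the merged file into alphabetically sorted chunks (the split_index.py step) is then a single sequential pass.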
# Steps for querying

The script for searching handles both simple and multi-field queries.
- Input for searching is given as a text file, with each line containing N (the number of results to return) followed by a comma and then the query string
- The query string is parsed differently for simple and field queries
- All words present in the query string, along with the field value (if present), are searched for in the index
- The document IDs and term frequencies are retrieved
- A weighted TF-IDF scheme with predefined weights is then used to rank the documents, and the top N titles are written to the output file for all queries
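An input line such as `5, world cup cricket` would therefore ask for the top 5 results for that query. Below is a sketch of weighted TF-IDF ranking; the field weights and the in-memory index shape are assumptions for illustration, not the predefined values in wiki_search.py:

```python
import math
from collections import defaultdict

# Hypothetical field weights; the actual predefined weights in
# wiki_search.py may differ.
FIELD_WEIGHTS = {'title': 3.0, 'body': 1.0}

def rank(query_terms, index, n_docs, top_n):
    """Rank documents by weighted TF-IDF and return the top N doc IDs.

    `index` maps term -> {doc_id: {field: tf}} (simplified shape)."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term)
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))           # inverse document frequency
        for doc_id, field_tfs in postings.items():
            wtf = sum(FIELD_WEIGHTS.get(f, 1.0) * tf for f, tf in field_tfs.items())
            scores[doc_id] += (1 + math.log(wtf)) * idf  # log-scaled weighted tf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```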
# Steps to run the code

## Indexing
Create the index:

`python3 wiki_indexer.py`
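The script's arguments are not documented in the README; a plausible invocation, with hypothetical placeholders for the dump path and output location:

`python3 wiki_indexer.py <path_to_xml_dump> <index_output_dir>`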