https://github.com/4rivappa/ir-wikipedia-cities-search
- Host: GitHub
- URL: https://github.com/4rivappa/ir-wikipedia-cities-search
- Owner: 4rivappa
- Created: 2022-12-08T08:26:18.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T08:26:42.000Z (over 2 years ago)
- Last Synced: 2025-03-06T00:29:27.060Z (3 months ago)
- Language: Python
- Size: 2.89 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.txt
README
___ _ _ ____ _ _ _ ____ _
|_ _|_ __ __| (_) __ _ _ __ / ___(_) |_(_) ___ ___ / ___| ___ __ _ _ __ ___| |__
| || '_ \ / _` | |/ _` | '_ \ | | | | __| |/ _ \/ __| \___ \ / _ \/ _` | '__/ __| '_ \
| || | | | (_| | | (_| | | | | | |___| | |_| | __/\__ \ ___) | __/ (_| | | | (__| | | |
|___|_| |_|\__,_|_|\__,_|_| |_| \____|_|\__|_|\___||___/ |____/ \___|\__,_|_| \___|_| |_|
Information Retrieval for cities dataset
Motivation:
Getting all information about cities in one place: hotels, travel, education, climate, geography...

Running:
$> python indexer.py
# You will be presented with a simple command-line interface
# ctrl-c to exit

Process:
Web crawling:
Collecting data from Wikipedia
Includes collecting seed URLs from the main page
Extracting info about cities for every URL using Selenium and BeautifulSoup
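A minimal, hypothetical sketch of the extraction step (not the repository's code): it uses requests + BeautifulSoup where the project drives Selenium, and the URL, selector, and function name are illustrative only.

import requests
from bs4 import BeautifulSoup

def fetch_city_text(url):
    # Download the page; the actual crawler drives Selenium instead.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Keep only the article paragraphs, skipping infoboxes and navigation.
    paragraphs = soup.select("div.mw-parser-output > p")
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs)

print(fetch_city_text("https://en.wikipedia.org/wiki/Hyderabad")[:300])
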
Creating Inverted Index:
Creating tokens for the given dataset
Includes removing special characters and stop words
Generating the inverted index
Includes storing data as [term][doc_freq]->[p|o|s|t|i|n|g| |l|i|s|t]
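A small, hypothetical sketch of the [term][doc_freq]->[posting list] layout described above; the stop-word set, function names, and sample documents are illustrative, not the repository's code.

import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}  # tiny example set

def tokenize(text):
    # Remove special characters, lowercase, and drop stop words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_inverted_index(docs):
    # docs: {doc_id: raw_text} -> {term: [doc_freq, sorted posting list]}
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: [len(ids), sorted(ids)] for term, ids in postings.items()}

index = build_inverted_index({1: "Hyderabad is a city in India.",
                              2: "The city has many hotels and colleges."})
print(index["city"])   # -> [2, [1, 2]]
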
Processing Query:
Applying the same tokenization process to the query
Getting the documents which contain the query's tokens
Applying the search feature, which ranks documents by the number of query tokens present in each document
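Continuing the sketch above, a hypothetical version of the search step that ranks documents by how many of the query's tokens they contain; it reuses the illustrative tokenize() and index from the previous snippet.

def search(query, index):
    # Apply the same tokenization to the query as to the documents.
    scores = {}
    for term in tokenize(query):
        if term in index:
            doc_freq, posting_list = index[term]
            for doc_id in posting_list:
                scores[doc_id] = scores.get(doc_id, 0) + 1
    # Documents matching more query tokens come first.
    return sorted(scores, key=scores.get, reverse=True)

print(search("city hotels", index))   # -> [2, 1]
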
Collecting relevance feedback:
Getting relevance feedback using documentIds and the text displayed for each document
Collecting the sequence of documentIds, in the order the user interacted with them
Storing the collected sequence in feedback_dict, which has a data structure like
{ query: { docId1: 2, docId2: 3, docId3: 6, ... }, ... }
Updating the global feedback dictionary
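A hypothetical sketch of the feedback bookkeeping, assuming the counts in feedback_dict record how often each document was picked for a query; the function name and sample values are made up for illustration.

feedback_dict = {}   # global feedback dictionary

def record_feedback(query, clicked_doc_ids):
    # clicked_doc_ids: the documentIds the user interacted with, in order.
    per_query = feedback_dict.setdefault(query, {})
    for doc_id in clicked_doc_ids:
        per_query[doc_id] = per_query.get(doc_id, 0) + 1

record_feedback("hotels in hyderabad", [12, 7, 12, 3])
print(feedback_dict)   # {'hotels in hyderabad': {12: 2, 7: 1, 3: 1}}
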
Evaluating the system:
Calculating recall and precision
Based on the sequence of documentIds the user interacted with
Plotting the recall vs precision curve
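A hypothetical sketch of the evaluation step, assuming precision and recall are computed at each rank of the result list against the documents the user interacted with; matplotlib and all names/values here are illustrative.

import matplotlib.pyplot as plt

def precision_recall_points(ranked_doc_ids, relevant_doc_ids):
    relevant = set(relevant_doc_ids)
    hits, precisions, recalls = 0, [], []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
        precisions.append(hits / rank)        # precision at this rank
        recalls.append(hits / len(relevant))  # recall at this rank
    return recalls, precisions

recalls, precisions = precision_recall_points([12, 7, 3, 9], {12, 3})
plt.plot(recalls, precisions, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()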