{"id":19560398,"url":"https://github.com/ashad001/ir-indexing","last_synced_at":"2026-05-03T17:31:25.735Z","repository":{"id":227707624,"uuid":"759734930","full_name":"Ashad001/IR-Indexing","owner":"Ashad001","description":"CS4051 - Information Retrieval Course Assignment ","archived":false,"fork":false,"pushed_at":"2024-05-07T07:37:44.000Z","size":37303,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-26T08:42:11.073Z","etag":null,"topics":["ai","boolean-model","indexing","information-retrieval","inverted-index","kmeans","knn","python","react","streamlit","vectorspacemodel"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ashad001.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-19T08:34:28.000Z","updated_at":"2024-05-10T17:33:15.000Z","dependencies_parsed_at":"2024-03-22T16:31:20.548Z","dependency_job_id":"76ff622c-23c0-4a45-86ab-d21a587219a1","html_url":"https://github.com/Ashad001/IR-Indexing","commit_stats":null,"previous_names":["ashad001/ir-indexing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Ashad001/IR-Indexing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashad001%2FIR-Indexing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashad001%2FIR-Indexing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashad001%2FIR-Indexing/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/Ashad001%2FIR-Indexing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ashad001","download_url":"https://codeload.github.com/Ashad001/IR-Indexing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashad001%2FIR-Indexing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32578578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","boolean-model","indexing","information-retrieval","inverted-index","kmeans","knn","python","react","streamlit","vectorspacemodel"],"created_at":"2024-11-11T05:07:26.948Z","updated_at":"2026-05-03T17:31:25.720Z","avatar_url":"https://github.com/Ashad001.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Information Retrieval Using Vector Space Model\n\nThis Flask app supports information retrieval using several indexing techniques and includes a React frontend for user interaction. Documents are ranked using the Vector Space Model based on TF-IDF (Term Frequency-Inverse Document Frequency). 
The project structure below gives an overview of the organization and key components.\n\n## Project Structure\n\n```plaintext\nindexer/\n│\n├── api/\n│   ├── data/\n│   │   ├── ResearchPapers/\n│   │   │   ├── (Research papers files)\n│   │   └── Stopword-List.txt\n│   ├── docs/\n│   ├── logs/\n│   ├── src/\n│   │   ├── indexer/\n│   │   ├── indexes/\n│   │   ├── models/\n│   │   ├── processing/\n│   │   ├── vocab/\n│   │   ├── __pycache__/\n│   │   ├── __init__.py\n│   │   ├── logger.py\n│   │   ├── retreival.py\n│   │   └── utils.py\n│   ├── .flaskenv\n│   ├── app.py\n│   ├── README.md\n│   └── requirements.txt\n│\n├── node_modules/\n├── public/\n├── src/\n│   ├── App.css\n│   ├── App.js\n│   ├── App.test.js\n│   ├── index.css\n│   ├── index.js\n│   ├── logo.svg\n│   ├── reportWebVitals.js\n│   └── setupTests.js\n│\n├── .gitignore\n├── package-lock.json\n├── package.json\n└── README.md\n\n```\n\n## Project Components\n\n- **`data/`:** Contains research papers in the `ResearchPapers/` directory and a file `Stopword-List.txt` with common stop words.\n  - Place new files in the `ResearchPapers/` folder, and the app will index them automatically.\n  - Do not remove `Stopword-List.txt`, as it is used for stop word removal; you can edit its contents manually.\n\n- **`flows/`:** Contains diagrams and drawings illustrating data flows and UI design.\n\n- **`src/`:** Contains the source code for the app and various modules for indexing and retrieval.\n\n- **`static/`:** Includes JavaScript (`script.js`) and CSS (`styles.css`) files for static content.\n\n- **`templates/`:** Contains the HTML template for rendering pages.\n  \n- **`tests/`:** Contains unit tests with corresponding test sets for various functionalities.\n  \n  - **`tests/test_sets`:** Add your test sets to these files:\n    - `golden_boolean_queries.txt`\n    - `golden_proximity_queries.txt`\n    - Enter queries in the form:\n      - Example Query: YOUR_QUERY\n      - Result-Set: 
EXPECTED_RESULTS\n\n- **`app.py`:** The main Flask application file.\n\n## Flow \u0026 Design\n### Data Flow\n![Data Flow](flows/dataflow.png)\n\n### UI Design\n![UI](flows/ui.png)\n\n\n## Functionality\n\nThe app offers efficient retrieval capabilities, emphasizing performance and user experience.\n\n### Index Generation and Metadata Logging\n\n- Indexes are generated once at startup and saved to files.\n- Metadata, including information about file structure and indexes, is logged for future reference.\n- If indexes are requested again, the app checks for changes in the data and regenerates only the necessary indexes.\n\n### Performance Logging\n\n- Essential performance metrics are logged, providing insights into processing times for index formation, search operations, and more.\n- This information helps in monitoring and optimizing the efficiency of the retrieval system.\n\n### Query Processing\n\n- The app prompts users to enter queries, whether boolean or proximity-based.\n- The algorithm determines the query type and performs the search accordingly.\n- Word suggestions are offered to help users enter queries.\n- A trie is used to look up word suggestions quickly.\n\n### Search Results Presentation\n\n- Documents are ranked using the Vector Space Model based on TF-IDF scores.\n- If documents match the user's query, the app presents the corresponding document IDs along with their relevance scores.\n- If no documents match, the app attempts to correct the query word by word using Levenshtein distance.\n- The corrected query is presented to the user; if the original and corrected queries are identical, the user is informed that no documents match the query.\n- Each document in the search result is accompanied by a static summary. 
Hovering over the document displays its rank/score.\n\n### Logging User Interaction\n\n- The app logs important information about user queries, errors, and search results.\n- This logging allows for a comprehensive review of user interactions, aiding in system analysis and improvement.\n\nThe combined features ensure a seamless and efficient experience for users interacting with the IR-Indexing app, promoting effective information retrieval and user-friendly query processing.\n\n## Running the Project\n\nTo run the project, follow these steps:\n\n1. Set up a Python environment and install dependencies:\n\n    ```bash\n    cd api\n    python -m venv venv\n    venv\\Scripts\\activate  # Windows; on macOS/Linux: source venv/bin/activate\n    ```\n    \n    ```bash\n    pip install -r requirements.txt\n    ```\n\n2. Install the Node dependencies and run the Flask app:\n\n    ```bash\n    cd ..  # go back to the root directory\n    npm install\n    yarn start-api\n    ```\n\n3. Open a web browser and navigate to `http://127.0.0.1:5000/` to interact with the app.\n\nFor the React frontend:\n\n1. Start the React development server:\n\n    ```bash\n    yarn start\n    ```\n\n2. 
The React app will be running on `http://localhost:3000/` by default.\n\n## Acknowledgements\n\n- The Porter Stemmer implementation is based on the original algorithm by Martin Porter.\n  - Source: [https://vijinimallawaarachchi.com/2017/05/09/porter-stemming-algorithm/](https://vijinimallawaarachchi.com/2017/05/09/porter-stemming-algorithm/)\n  - GitHub Repository: [https://github.com/jedijulia/porter-stemmer/blob/master/stemmer.py](https://github.com/jedijulia/porter-stemmer/)\n\n- The Levenshtein distance algorithm is based on the original algorithm by Vladimir Levenshtein.\n  - Source: [https://en.wikipedia.org/wiki/Levenshtein_distance](https://en.wikipedia.org/wiki/Levenshtein_distance)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashad001%2Fir-indexing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashad001%2Fir-indexing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashad001%2Fir-indexing/lists"}