{"id":25819484,"url":"https://github.com/matthewlabasan/cs6111-project1","last_synced_at":"2026-06-16T09:32:07.258Z","repository":{"id":279856722,"uuid":"937270219","full_name":"MatthewLabasan/CS6111-Project1","owner":"MatthewLabasan","description":"Project 1 for COMS W6111: Advanced Database Systems. Developed by Matthew Labasan and Phoebe Tang.","archived":false,"fork":false,"pushed_at":"2025-03-27T21:43:26.000Z","size":55,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-27T22:31:46.219Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MatthewLabasan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-22T18:29:09.000Z","updated_at":"2025-03-27T21:43:30.000Z","dependencies_parsed_at":"2025-02-28T05:19:23.485Z","dependency_job_id":"85fbc290-d7f6-478c-b083-69df1b999bc1","html_url":"https://github.com/MatthewLabasan/CS6111-Project1","commit_stats":null,"previous_names":["matthewlabasan/cs6111-project1"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MatthewLabasan/CS6111-Project1","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project1","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project1/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project1/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project1/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MatthewLabasan","download_url":"https://codeload.github.com/MatthewLabasan/CS6111-Project1/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project1/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34400451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-28T08:56:14.956Z","updated_at":"2026-06-16T09:32:07.253Z","avatar_url":"https://github.com/MatthewLabasan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CS6111-Project1\n\n# Table of Contents\n1. [Introduction](#introduction)\n2. [Getting Started](#getting-started)\n    - [Prerequisites](#prerequisites)\n    - [Installation](#installation)\n3. [Usage](#usage)\n4. 
[Description of Project](#description-of-project)\n    - [Internal Design](#internal-design)\n        - [Notable External Libraries Used](#notable-external-libraries-used)\n    - [Query-Modification Method](#query-modification-method)\n\n# Introduction\nThis project is an exploration of query expansion using the Google Custom Search API. Through this project, we learned about possible methods of query expansion: word-scoring schemes such as tf-idf, word-placement algorithms using bigrams, and user relevance feedback. It was built for Project 1 of COMS W6111: Advanced Database Systems.\n\nDeveloped by Matthew Labasan and Phoebe Tang.\n\n# Getting Started\n## Prerequisites\n1. Python 3.8.1 or above\n2. Google Custom Search API Key and Google Search Engine Key\n\n## Installation\nClone this repository to your system, navigate to the directory, and run the following commands:\n1. `python3 -m venv venv`\n2. `source ./venv/bin/activate`\n3. `pip install -r requirements.txt`\n\n# Usage\n1. Run the following command, replacing the placeholders with your own parameters and wrapping the query in quotation marks:\n    - `python main.py \u003cAPI Key\u003e \u003cEngine Key\u003e \u003cPrecision\u003e \u003cQuery\u003e`\n    - Example usage: `python main.py \u003cAPI Key\u003e \u003cEngine Key\u003e 0.9 \"hello world\"`\n2. Type Y or N to give user-relevance feedback. No other input is accepted.\n3. Sample results can be viewed in the `transcript_nokeys.txt` file.\n\n# Description of Project\n## Internal Design\nHere, we outline our `main()` method:\n1. Extract the arguments to obtain the search engine ID, API key, precision, and query\n2. Enter a loop. The program calls the Google Custom Search API to retrieve search results for a given query. 
This call is implemented in `search()`\n    - It uses the `build` function from `googleapiclient.discovery`\n    - Makes a request through `cse().list()` with the given engine ID\n    - Returns a dictionary of the top 10 query results\n    - Source code via [Google](https://github.com/googleapis/google-api-python-client/blob/main/samples/customsearch/main.py)\n3. Then, call the `user_relevance()` function to collect user feedback\n    - Iterate through the top 10 results and display the dictionary content of each\n    - Store relevant and irrelevant documents in separate lists\n    - Calculate the precision score and return the score together with the relevant and irrelevant lists\n4. Next, call our `expand()` and `insert_keywords()` functions to find new keywords and insert them into the new query\n    - More details below in [Query-Modification Method](#query-modification-method)\n5. Until the target precision is reached, repeat steps 2-4 with the new query\n    - Break out of the loop if the precision equals or exceeds the target precision\n    - Break if we receive a precision of 0 (unable to expand the query)\n\n### Notable External Libraries Used\n1. `numpy`: For assistance in matrix operations\n2. `nltk`: For word tokenization to help find bigrams for keyword insertion\n3. `sklearn`: For tf-idf calculations\n4. `googleapiclient`: For calling the Google Custom Search API\n\n## Query-Modification Method\nOur query-expansion algorithm is based on the tf-idf scoring system. The tf-idf scores of all words in the relevant documents and in the non-relevant documents are computed separately. Then, we remove from the relevant word list any words that are also present in the non-relevant word list, so that only relevant, high-scoring tf-idf words are included in our query expansion.\n1. 
We call the `expand()` function to expand the query\n    - It takes in the original query, the number of keywords, the list of relevant results, and the list of non-relevant results\n    - We build two corpora: one containing the words from the titles and snippets of the relevant documents, and one for the non-relevant documents\n    - We then use the tf-idf implementation from scikit-learn to retrieve the tf-idf scores for the words of each document in both corpora. We average the scores across the documents of each corpus to get an overall value per word.\n    - We then prune the top non-relevant keywords from the top relevant keywords list to make sure that we won’t include any non-relevant keywords in the new query\n    - We then call our `insert_keywords` function to reorder the new query\n\nOur query-word order algorithm is based on the bigrams that are present in the relevant document corpus. We wrote two versions of the reordering step, `insert_keywords` and `insert_keywords_v2`. We ended up using `insert_keywords` but left the `insert_keywords_v2` function for reference.\n1. `insert_keywords`:\n\n    This method inserts keywords based on the bigrams present in the corpus. Keywords are only inserted at the start or end of the query, since we assume that the original user query is already in the correct order (we don’t want to risk modifying this order). The logic is below:\n    - We get all the bigrams from the document corpus provided (the relevant corpus) using the nltk methods `word_tokenize()` and `bigrams()`. It is worth noting that `word_tokenize()` does not ignore punctuation.\n    - If we have more than 1 keyword:\n        - Check for bigrams between each keyword and the start and end query terms. 
For example, for the query “hello world” and keywords (why, there), we would search for the bigrams: (why, hello), (world, why), etc.\n        - If a bigram is present, we place the keyword at either the start or end of the existing query, depending on the found bigram.\n        - We do this twice, once for each of the two keywords to add.\n        - If only one keyword has an associated bigram that allows proper placement, we append the remaining keyword to the end of the query.\n        - If no bigrams are found, we check for a bigram of the two keywords and use that bigram to determine the ordering of the keywords. We then append them to the end of the query.\n    - If there is only 1 keyword:\n        - Check if it has a bigram with the start or end word of the query\n        - If so, insert it in the proper place. Otherwise, append it to the end of the query.\n2. `insert_keywords_v2`: \n    - We use `Counter` from `collections` to keep track of the occurrences of bigrams\n    - Iterate through all of the documents in the corpus and update the occurrence count of each bigram accordingly.\n    - We then sort the bigrams by number of occurrences in decreasing order\n    - First, we check if the top bigram contains both keywords\n        - If so, then we look for any bigram containing (query term, keyword1). If found, then we append the bigram after the query term\n    - Next, we iterate through all the query terms.\n        - Check if we find any (query term, keyword) bigram. If found, insert the keyword after the query term and add it to the placed word set\n        - Check if we find any (keyword, query term) bigram. If found, insert the keyword before the query term and add it to the placed word set\n    - Finally, we check if there are any keywords left over. 
If so, we append them to the end of the query.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatthewlabasan%2Fcs6111-project1","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmatthewlabasan%2Fcs6111-project1","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatthewlabasan%2Fcs6111-project1/lists"}