{"id":26885315,"url":"https://github.com/matthewlabasan/cs6111-project2","last_synced_at":"2026-01-29T11:34:57.668Z","repository":{"id":284440961,"uuid":"954949412","full_name":"MatthewLabasan/CS6111-Project2","owner":"MatthewLabasan","description":"Project 2 of COMS6111: Advanced Database Systems - Information Extraction exploration using the ISE algorithm with SpanBERT \u0026 Gemini. Developed by Matthew Labasan and Phoebe Tang.","archived":false,"fork":false,"pushed_at":"2025-04-26T21:15:37.000Z","size":292,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-05T13:47:06.589Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MatthewLabasan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-25T21:42:16.000Z","updated_at":"2025-04-26T21:15:40.000Z","dependencies_parsed_at":"2025-03-25T23:23:43.609Z","dependency_job_id":"3bae937d-c61e-4225-8728-f677c47dec64","html_url":"https://github.com/MatthewLabasan/CS6111-Project2","commit_stats":null,"previous_names":["matthewlabasan/cs6111-project2"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MatthewLabasan/CS6111-Project2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project2/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MatthewLabasan","download_url":"https://codeload.github.com/MatthewLabasan/CS6111-Project2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MatthewLabasan%2FCS6111-Project2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28876729,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-29T10:31:27.438Z","status":"ssl_error","status_checked_at":"2026-01-29T10:31:01.017Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-31T18:52:35.081Z","updated_at":"2026-01-29T11:34:57.660Z","avatar_url":"https://github.com/MatthewLabasan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CS6111-Project2\n# Table of Contents\n1. [Introduction](#introduction)\n2. [Getting Started](#getting-started)\n    - [Prerequisites](#prerequisites)\n    - [Installation](#installation)\n3. [Usage](#usage)\n4. 
[Description of Project](#description-of-project)\n    - [Internal Design](#internal-design)\n        - [Notable External Libraries Used](#notable-external-libraries-used)\n\n# Introduction\nThis project focuses on information extraction from web sources using two different approaches: a traditional multi-step pipeline with SpanBERT and a modern LLM-based method using Google Gemini. Our goal is to extract structured data from unstructured web text by iteratively expanding a seed query and retrieving relevant tuples.\n\nBy implementing the Iterative Set Expansion (ISE) algorithm, we:\n- Retrieved and parsed web pages using Google Custom Search and Beautiful Soup.\n- Preprocessed and annotated text with spaCy for named entity recognition.\n- Extracted relations using either SpanBERT (fine-tuned for specific relations) or Google Gemini (a modern LLM-based approach).\n\nThis project reinforced our understanding of information extraction, specifically with the ISE algorithm, and gave us experience in implementing such algorithms with LLMs. This project was built for Project 2 of COMS6111 - Advanced Database Systems.\n\nDeveloped by Matthew Labasan and Phoebe Tang.\n\n# Getting Started\n## Prerequisites\n1. Python 3.10.1 or above\n2. Install `wget` using `brew install wget`\n    - Make sure to restart your terminal after this installation.\n3. Google Custom Search Engine API Key\n4. Google Custom Search Engine ID\n5. Google Gemini 2.0 Flash model (free tier) API Key\n\n## Installation\n1. Clone the repository  \n  `git clone https://github.com/MatthewLabasan/CS6111-Project2.git`  \n2. Move into the repository  \n  `cd ./CS6111-Project2`  \n3. Create a virtual environment and activate it  \n    - `python3 -m venv dbproj`  \n    - `source dbproj/bin/activate`  \n4. Install the dependencies listed in `requirements.txt`\n5. 
Install trained SpanBERT\n    The SpanBERT classifier will be used to extract the following four types of relations from text documents:\n    - Schools_Attended (internal name: per:schools_attended)\n    - Work_For (internal name: per:employee_of)\n    - Live_In (internal name: per:cities_of_residence)\n    - Top_Member_Employees (internal name: org:top_members/employees)  \n6. Run the following code to install it:  \n    - `git clone https://github.com/Shreyas200188/SpanBERT`  \n    - `cd SpanBERT`  \n    - `pip3 install -r requirements.txt`  \n    - `bash download_finetuned.sh`  \n7. Remove this file from the SpanBERT repository  \n    - `rm spacy_help_functions.py`\n8. Move these files in `/CS6111-Project2` into `/SpanBERT`  \n    - `cd ..`  \n    - `mv gemini_helper_6111.py ./SpanBERT`  \n    - `mv project2.py ./SpanBERT`  \n    - `mv spacy_help_functions.py ./SpanBERT`  \n    - `cd SpanBERT`  \n\n__Note__: For specific instructions on installation on a Google VM instance, view the [course website](https://www.cs.columbia.edu/~gravano/cs6111/Proj2/).\n\n# Usage\n1. 
Run the following command, replacing the placeholders with your own parameters and wrapping the query in double quotes:\n `python3 project2.py [-spanbert|-gemini] \u003cgoogle api key\u003e \u003cgoogle engine id\u003e \u003cgoogle gemini api key\u003e \u003cr\u003e \u003ct\u003e \u003cq\u003e \u003ck\u003e`\n    - `[-spanbert|-gemini]` is either `-spanbert` or `-gemini`, indicating which relation extraction method to use\n    - `\u003cgoogle api key\u003e` is your Google Custom Search Engine JSON API Key (see above)\n    - `\u003cgoogle engine id\u003e` is your Google Custom Search Engine ID (see above)\n    - `\u003cgoogle gemini api key\u003e` is your Google Gemini API key (see above)\n    - `\u003cr\u003e` is an integer between 1 and 4, indicating the relation to extract: 1 is for Schools_Attended, 2 is for Work_For, 3 is for Live_In, and 4 is for Top_Member_Employees\n    - `\u003ct\u003e` is a real number between 0 and 1, indicating the \"extraction confidence threshold,\" which is the minimum extraction confidence required for the tuples in the output; `t` is ignored when `-gemini` is specified\n    - `\u003cq\u003e` is a \"seed query,\" which is a list of words in double quotes corresponding to a plausible tuple for the relation to extract (e.g., \"bill gates microsoft\" for relation Work_For)\n    - `\u003ck\u003e` is an integer greater than 0, indicating the number of tuples requested in the output\n    - Example usage: `python3 project2.py -gemini \u003cgoogle api key\u003e \u003cgoogle engine id\u003e \u003cgoogle gemini api key\u003e 1 0.8 \"Obama Columbia\" 10`\n\n# Description of Project\n## Internal Design\nFor a description of the internal design and the SpanBERT / Gemini extraction methods, please see pp. 4-8 of our report [here](./transcripts/Project2_Report.pdf).\nFor sample transcripts and usage, please see our result transcripts for each extraction method [here](./transcripts).\n\n### Notable External Libraries Used\n1. `googleapiclient`: For Google Search\n2. 
`google.generativeai`: For using Google Gemini to extract relations\n3. `time`: For spacing out requests to avoid rate limiting\n4. `requests`: For fetching web pages from URLs\n5. `BeautifulSoup`: For extracting raw text from web pages while ignoring HTML tags, images, and other content that would interfere with the information extraction process\n6. `re`: For using regular expressions to parse returned text\n7. `spacy`: For processing and annotating text through linguistic analysis\n8. `spanbert`: For extracting relations using SpanBERT\n9. `spacy_help_functions`: Starter functions from the [SpanBERT](https://github.com/Shreyas200188/SpanBERT) repository, modified for our project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatthewlabasan%2Fcs6111-project2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmatthewlabasan%2Fcs6111-project2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatthewlabasan%2Fcs6111-project2/lists"}