{"id":27426165,"url":"https://github.com/posgnu/tiny-search","last_synced_at":"2025-04-14T12:31:35.646Z","repository":{"id":186170124,"uuid":"615623669","full_name":"posgnu/tiny-search","owner":"posgnu","description":"Tiny search engine for UCI ICS domain webpages","archived":false,"fork":false,"pushed_at":"2023-03-23T04:52:05.000Z","size":384,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-08-04T19:07:51.785Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/posgnu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-03-18T07:23:38.000Z","updated_at":"2023-08-04T19:08:25.416Z","dependencies_parsed_at":null,"dependency_job_id":"fffbf18a-54b0-440c-9644-bf7faa276d21","html_url":"https://github.com/posgnu/tiny-search","commit_stats":null,"previous_names":["posgnu/tiny-search"],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posgnu%2Ftiny-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posgnu%2Ftiny-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posgnu%2Ftiny-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posgnu%2Ftiny-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/posgnu","download_url":"https://codeload.github.com/posgnu/tiny-search/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"
github","repositories_count":248881452,"owners_count":21176858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-14T12:31:12.207Z","updated_at":"2025-04-14T12:31:35.631Z","avatar_url":"https://github.com/posgnu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tiny Search Engine\nThis repository implements a simple search engine that works with UCI ICS domain webpages.\n\n## Documents data\nThe provided dataset, developer.zip, contains data that was gathered by web crawlers from 88 domains associated with ICS. The dataset consists of a little under 56,000 web pages, which are stored in folders with one folder per domain. Each file inside a folder corresponds to one web page, and the files are stored in JSON format with two fields: \"url\" and \"content.\"\n\nThe \"url\" field contains the URL of the web page, and the \"content\" field contains the content of the page as found during crawling. Note that some of the pages in the dataset may not contain any HTML at all, and when they do, it may not be well-formed. For example, there might be an opening \u003cstrong\u003e tag whose closing \u003c/strong\u003e tag is missing. Therefore, it is important to select a parser library that can handle broken HTML when working with this dataset.\n## How to run\n### Step 0\nSet up the Python environment.\n```sh\npip install -r requirements.txt\n```\nPrepare the dataset. 
Download [developer.zip](https://www.dropbox.com/s/vcfy7ad3osqyx23/developer.zip?dl=0) and unzip it under the pageset directory.\n\n### Step 1\nBuild the inverted index and calculate the tf-idf score matrix. The command below generates `inverted_index.pkl` and `tfvectorizer.pkl`. If `tfvectorizer.pkl` already exists, the script loads the existing tf-idf matrix and uses it to build the inverted index.\n```sh\npython build_index.py\n```\n\n### Step 2\nRun the GUI with the following command.\n```sh\npython manage.py runserver\n```\nThis runs the tiny search server locally.\n\n### Test\nRun a test search on the 20 evaluation queries.\n```sh\npython search.py\n```\nThis writes the retrieval results for the 20 test queries to the tests directory.\n\n## Preview\n#### Main search page\n![](./screenshot/main.png)\n#### Result page\n![](./screenshot/search.png)\n\n## Design\n\n### Detect and eliminate duplicate pages\nTiny search uses simhash to remove duplicate pages from the dataset.\n\n[paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf)\n\n### Inverted index\n\ntoken | (document_id, location_list, tf-idf_score, important_multiplier)\n\nwhere document_id is the URL of the document.\n\nimportant_multiplier:\n* 2: if the token is in an h3 tag\n* 3: if the token is in an h2 tag\n* 4: if the token is in an h1 tag\n* 5: if the token is in the title tag\n\n#### Merging\n1. Build the inverted index until 5000 URLs have been processed.\n2. Write the partial index to disk. This writes a number of shard{idx}.json files to the index directory.\n3. Merge the partial indexes in alphabetical order.\n4. Store the index in multiple files. This stores the entire inverted index in multiple token_shard{idx}.json files in the index directory.\n\n\n### Search\n* term-at-a-time\nTiny search processes a query term by term. It fetches the inverted index entry for each token in the query in turn and accumulates the weight for each URL. 
In the end, the fetched URLs are sorted by their weights.\n\n\n### Ranking\nfinal_score = tf_idf_cosine_score * important_multiplier\n\n* TF-IDF score\n\n* Important words\nThe h1, h2, h3, and title tags are considered.\n\n### Graphical user interface\nThe GUI for tiny search is built with Django.\n\n## Test queries\nThe tiny search engine was evaluated on 20 test queries for both ranking performance (effectiveness) and runtime performance (efficiency).\n\n### Queries that produce satisfactory outcomes\n* NIPS\n* Iftekhar Ahmed\n* how to write code\n* graph algorithm\n* apple\n* famous conference\n* Chicago Recommendation Data\n* michael franz\n* python\n* linear regression\n\n\n### Queries that produce unsatisfactory outcomes\n* machine learning\n* information retrieval\n* cs221\n* pierre baldi\n* Definition of search engine\n* mobile security\n* dean of computer science\n* computer network\n* compiler lecture\n* This is very long queries without any special meaning but you need to find out the proper results\n\n1. Some results are not about machine learning but merely contain many occurrences of either machine or learning.\nFix: keep only pages that contain all the words in the query, and use word positions.\n\n2. Some dummy pages are unimportant but contain many keywords.\nFix: weight important words that appear in special tags.\n\n3. There are too many duplicate pages.\nFix: simhash.\n\n4. Queries with common words take too much time.\nFix: reduce the size of each token shard and skip some URLs.\n\n5. Important words in the query should be weighted more.\nFix: tf-idf.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposgnu%2Ftiny-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fposgnu%2Ftiny-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposgnu%2Ftiny-search/lists"}