{"id":16815916,"url":"https://github.com/jmdeldin/forager","last_synced_at":"2025-03-17T12:44:21.748Z","repository":{"id":8002057,"uuid":"9409536","full_name":"jmdeldin/forager","owner":"jmdeldin","description":"A proof-of-concept search engine implemented in Clojure.","archived":false,"fork":false,"pushed_at":"2013-05-17T05:39:43.000Z","size":180,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-23T22:27:55.781Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jmdeldin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-04-13T06:11:04.000Z","updated_at":"2023-11-30T09:54:57.000Z","dependencies_parsed_at":"2022-07-11T04:46:07.721Z","dependency_job_id":null,"html_url":"https://github.com/jmdeldin/forager","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jmdeldin%2Fforager","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jmdeldin%2Fforager/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jmdeldin%2Fforager/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jmdeldin%2Fforager/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jmdeldin","download_url":"https://codeload.github.com/jmdeldin/forager/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244037978,"owners_coun
t":20387810,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T10:36:46.215Z","updated_at":"2025-03-17T12:44:21.728Z","avatar_url":"https://github.com/jmdeldin.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Forager\n[![Build Status](https://travis-ci.org/jmdeldin/forager.png)](https://travis-ci.org/jmdeldin/forager)\n\nForager is a proof-of-concept search engine in Clojure. I am writing it\nfor my NLP independent study's final project. The goal is to get a\nnuts-and-bolts understanding of language modeling (n-grams and\ntokenizing), indexing with term frequency-inverse document frequency\n(tf-idf), and retrieving documents with Boolean queries.\n\n## Features\n\n- interactive interface via a Clojure REPL\n- indexing of documents via tf-idf\n- Boolean query operators (`AND`, `OR`, `NOT`)\n  - `NEAR` or `WITHIN` a certain distance (low priority)\n- methods to evaluate information retrieval precision and recall\n\n### Optional Features\n\nIf time allows, these features would be nice to have:\n\n- compressed indexing\n- `NEAR` or `WITHIN` a certain distance operators\n\n## Data\n\nInitially, Forager will work on the plain-text short stories of Rider\nHaggard, as the Coursera NLP course provides the data and a few sample\nqueries to evaluate Forager on.\n\nIf time allows, implementing support for one of the following data sets:\n\n- Plain text from [Project Gutenberg](http://www.gutenberg.org)\n- [Wikipedia's data dump](http://en.wikipedia.org/wiki/Wikipedia:Database_download)\n- [Reuters-21578 data 
set](http://www.daviddlewis.com/resources/testcollections/reuters21578/)\n\n## Interface\n\nI haven't put much thought towards this yet, but I imagine interaction\nwill be something like this, returning the document identifier and an\nexcerpt in some kind of structure:\n\n```clojure\n(index \"path/to/reut2-000.sgm\")\n\n(query \"butter\")\n;; =\u003e FROM `EC MINISTERS CONSIDER BIG AGRICULTURE PRICE CUTS'\n;; =\u003e \"Routine sales of BUTTER were made.\"\n\n(query (AND \"butter\" \"cereal\"))\n;; =\u003e FROM `EC MINISTERS CONSIDER BIG AGRICULTURE PRICE CUTS'\n;; =\u003e \"...16 mln tonnes of unwanted CEREALS, over one mln tonnes of BUTTER...\"\n```\n\n## Background\n\n### Indexing\n\nIn small-scale information retrieval problems, one can create a matrix\nof documents and term frequencies. However, this requires a lot of\nmemory to construct a matrix that's the #documents by all keywords. An\nalternative, is to create an inverted index, which is a dictionary of\nkeywords, where each keyword points to a sorted list of document IDs.\n\n#### `WITHIN` Operator\n\nOne way to implement a k-word proximity search is with a\ndivide-and-conquer approach, as described by\n[this article](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.3610).\nThe algorithm proposed in that article is:\n\n1. Find the median *v* between the two keywords\n2. Scan the list of keyword positions and divide the list into two\nsmaller lists, L (positions \u003c *v*) and R (positions \u003e *v*). Keep the\nlargest positions of each keyword in L and the smallest positions of\neach keyword in R.\n3. Find the minimal intervals which lie on both L and R with the\nplane-sweep algorithm (this scans from left-to-right and finds intervals\n[left-start, right-end] containing all keywords).\n4. 
If L or R contains all *k* keywords, recursively find minimal\nintervals in that list.\n\n## References\n\n- [original tf-idf paper](http://dl.acm.org/citation.cfm?id=358466)\n- [comparison of tf-idf interpretations](http://dl.acm.org/citation.cfm?id=1390334.1390409)\n- [additional historical references](http://nlp.stanford.edu/IR-book/html/htmledition/references-and-further-reading-6.html)\n- [Google's original paper](http://infolab.stanford.edu/~backrub/google.html)\n- [Text retrieval by using k-word proximity search](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.3610)\n\n### Evaluation\n\n- Performance on Coursera data set\n- Performance on test queries\n\n## Author\n\nJon-Michael Deldin, `dev@jmdeldin.com`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjmdeldin%2Fforager","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjmdeldin%2Fforager","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjmdeldin%2Fforager/lists"}