{"id":39258577,"url":"https://github.com/hathix/searchbetter","last_synced_at":"2026-01-18T00:24:26.120Z","repository":{"id":41827774,"uuid":"86384080","full_name":"hathix/searchbetter","owner":"hathix","description":"SearchBetter: query rewriting for search engines on small corpuses (Harvard research project)","archived":false,"fork":false,"pushed_at":"2017-07-15T13:17:49.000Z","size":17861,"stargazers_count":33,"open_issues_count":0,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-08-30T04:32:14.617Z","etag":null,"topics":["edtech","harvard","python","query-rewriting","search-engine","word2vec"],"latest_commit_sha":null,"homepage":"http://searchbetter.readthedocs.io/en/latest/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hathix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-27T21:07:11.000Z","updated_at":"2025-01-23T07:18:30.000Z","dependencies_parsed_at":"2022-08-19T00:01:36.269Z","dependency_job_id":null,"html_url":"https://github.com/hathix/searchbetter","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/hathix/searchbetter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hathix%2Fsearchbetter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hathix%2Fsearchbetter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hathix%2Fsearchbetter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hathix%2Fsearchbetter/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/hathix","download_url":"https://codeload.github.com/hathix/searchbetter/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hathix%2Fsearchbetter/sbom","scorecard":{"id":457554,"data":{"date":"2025-08-11","repo":{"name":"github.com/hathix/searchbetter","commit":"bbcce5a9597adc82c59e236db94d69f5f2ec7217"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.9,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no 
pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: MIT License: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Security-Policy","score":0,"reason":"security policy file not 
detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":1,"reason":"9 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-55x5-fj6c-h6m8","Warn: Project is vulnerable to: PYSEC-2021-19 / GHSA-jq4v-f5q6-mjqq","Warn: Project is vulnerable to: PYSEC-2020-62 / GHSA-pgww-xf46-h92r","Warn: Project is vulnerable to: PYSEC-2022-230 / GHSA-wrxv-2j5q-m38w","Warn: Project is vulnerable to: PYSEC-2018-12 / GHSA-xp26-p53h-6h2p","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is 
vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: PYSEC-2018-28 / GHSA-x84v-xcm2-53pg"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-19T10:06:49.679Z","repository_id":41827774,"created_at":"2025-08-19T10:06:49.679Z","updated_at":"2025-08-19T10:06:49.679Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28523714,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T23:53:28.710Z","status":"ssl_error","status_checked_at":"2026-01-17T23:52:20.131Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["edtech","harvard","python","query-rewriting","search-engine","word2vec"],"created_at":"2026-01-18T00:24:20.968Z","updated_at":"2026-01-18T00:24:26.090Z","avatar_url":"https://github.com/hathix.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SearchBetter: query rewriting for search engines on small corpuses\n\n\u003e by Neel Mehta, Harvard University\n\nSearchBetter lets you make powerful, fast, and drop-in search engines for any dataset, no matter how small or how large. 
It also offers built-in query rewriting, which uses NLP to help your search engines find semantically related content to the user's search term.\n\nFor instance, a search for `machine learning` might only return results for items that contain the words \"machine learning\". But with query rewriting, you would get results not only for `machine learning` but also, say, `artificial intelligence` and `neural networks`.\n\nSearchBetter lets you power up your search engines with minimal effort. It's especially useful if you have a small dataset to search on, or if you don't have the time or data to make fancy bespoke query rewriting algorithms.\n\n## Getting started\n\nTo drop this module into your app:\n\n```\npip install searchbetter\n```\n\nFor more advanced analysis and research purposes, use the [interactive demo](https://github.com/hathix/searchbetter/blob/master/notebooks/searchbetter-demo.ipynb) to get yourself set up!\n\n## Usage\n\n[Try out the interactive demo](https://github.com/hathix/searchbetter/blob/master/notebooks/searchbetter-demo.ipynb)!\n\nFor a truly quick-and-dirty dive into SearchBetter (no setup required), use:\n\n```python\nfrom searchbetter import rewriter\n\nquery_rewriter = rewriter.WikipediaRewriter()\nquery_rewriter.rewrite('biochemistry')\n```\n\n## Documentation\n\nDocumentation is available online at \u003chttp://searchbetter.readthedocs.io/\u003e.\n\nTo build the docs yourself using Sphinx:\n\n```\ncd docs\nmake html\nopen _build/html/index.html\n```\n\n## Where to find data\n\nSome of this data is proprietary to Harvard and HarvardX. 
Other info, like the Udacity API and Wikipedia dump, is open to the public.\n\nName           | URL                                             | What to name file\n-------------- | ----------------------------------------------- | -------------------------------------------------------\nUdacity API    | \u003chttps://www.udacity.com/public-api/v0/courses\u003e | `udacity-api.json`\nWikipedia dump | See below                                       | `wikiclean8`\nedX courses    | Proprietary                                     | `Master CourseListings - edX.csv`\nDART data      | Proprietary                                     | `corpus_HarvardX_LatestCourses_based_on_2016-10-18.csv`\n\n### How to prepare Wikipedia data\n\nDownload and unzip the `enwik8` dataset from \u003chttp://www.mattmahoney.net/dc/enwik8.zip\u003e. Then run:\n\n```\nperl processing-scripts/wiki-clean.pl enwik8 \u003e wikiclean8\n```\n\nThis might take a minute or two to run.\n\n## Context\n\nSearchBetter was designed as part of a research project by [Neel Mehta](https://github.com/hathix), [Daniel Seaton](https://github.com/dseaton), and [Dustin Tingley](http://scholar.harvard.edu/dtingley/home) for Harvard's CS 91r, a research-for-credit course.\n\nIt was originally designed for [Harvard DART](https://dart.harvard.edu/), a tool that helps educators reuse HarvardX assets such as videos and exercises in their online or offline courses. SearchBetter is especially useful for MOOCs, which often have small corpuses and have to deal with many uncommon queries (students will search for the most unfamiliar terms, after all). 
Still, SearchBetter has been made general-purpose enough that it can be used with any corpus or any search engine.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhathix%2Fsearchbetter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhathix%2Fsearchbetter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhathix%2Fsearchbetter/lists"}