{"id":21051699,"url":"https://github.com/abdullahwaqar/docsearx","last_synced_at":"2025-03-13T23:25:32.934Z","repository":{"id":38059043,"uuid":"231071054","full_name":"abdullahwaqar/docsearx","owner":"abdullahwaqar","description":"A simple search engine that ranks pdfs based on search keyword \u0026 TF-IDF weights and cosine similarity.","archived":false,"fork":false,"pushed_at":"2023-03-02T10:36:25.000Z","size":2046,"stargazers_count":1,"open_issues_count":15,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-20T18:43:37.467Z","etag":null,"topics":["inmemory","nltk","search-engine"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abdullahwaqar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-31T10:25:22.000Z","updated_at":"2022-02-13T12:54:48.000Z","dependencies_parsed_at":"2024-11-19T15:59:24.618Z","dependency_job_id":"0e86f032-e175-4a58-849b-3f8131a1b766","html_url":"https://github.com/abdullahwaqar/docsearx","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdullahwaqar%2Fdocsearx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdullahwaqar%2Fdocsearx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdullahwaqar%2Fdocsearx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdullahwaqar%2Fdocsearx/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abdullahwaqar","download_url":"https://codeload.github.com/abdullahwaqar/docsearx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243496894,"owners_count":20300157,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inmemory","nltk","search-engine"],"created_at":"2024-11-19T15:59:16.870Z","updated_at":"2025-03-13T23:25:32.911Z","avatar_url":"https://github.com/abdullahwaqar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# docsearcx | A minimal document search engine.\n![alt text](https://github.com/abdullahwaqar/docsearx/blob/master/docs/Screenshot.png \"Web App Screenshot\")\n\ndocsearcx is a simple search engine that retrieves relevant ***PDFs*** by ranking them with term frequency-inverse document frequency (TF-IDF) weights and cosine similarity.\n\n\n## Limitation\nAs a proof of concept, this application relies on in-memory storage.\n\n---\n\n## Setup\n### Installing Pipenv\nIf pipenv is already installed, skip this step.\n\n```pip install pipenv```\n\n\n### Installing Dependencies\n\n```pipenv install```\n\nThen activate the virtual environment shell with:\n\n```pipenv shell```\n\n### Running the Flask app\n\n```python app.py```\n\n### Running Client\n\n```\ncd client/\n\nnpm install\n\nnpm run serve\n```\n\n---\n\n### Term Frequency-Inverse Document Frequency\nTF-IDF is a numerical statistic that reflects how important a word is to a document. 
The TF-IDF value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to control for the fact that some words are generally more common than others.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdullahwaqar%2Fdocsearx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabdullahwaqar%2Fdocsearx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdullahwaqar%2Fdocsearx/lists"}