{"id":17526843,"url":"https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi","last_synced_at":"2025-03-06T06:30:56.679Z","repository":{"id":250523733,"uuid":"834698940","full_name":"SomeOddCodeGuy/OfflineWikipediaTextApi","owner":"SomeOddCodeGuy","description":"This small API downloads and exposes access to NeuML's txtai-wikipedia and full wikipedia datasets, taking in a query and returning full article text","archived":false,"fork":false,"pushed_at":"2024-08-01T00:47:52.000Z","size":35,"stargazers_count":29,"open_issues_count":0,"forks_count":3,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-08-01T03:30:07.966Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SomeOddCodeGuy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-28T04:57:12.000Z","updated_at":"2024-08-01T00:47:51.000Z","dependencies_parsed_at":"2024-08-01T03:25:04.049Z","dependency_job_id":null,"html_url":"https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi","commit_stats":null,"previous_names":["someoddcodeguy/offlinewikipediatextapi"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SomeOddCodeGuy%2FOfflineWikipediaTextApi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SomeOddCodeGuy%2FOfflineWikipediaTextApi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SomeOddCodeGuy%2FOfflineWikipediaTextApi/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SomeOddCodeGuy%2FOfflineWikipediaTextApi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SomeOddCodeGuy","download_url":"https://codeload.github.com/SomeOddCodeGuy/OfflineWikipediaTextApi/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242161429,"owners_count":20081871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-20T15:02:33.786Z","updated_at":"2025-03-06T06:30:56.654Z","avatar_url":"https://github.com/SomeOddCodeGuy.png","language":"Python","funding_links":[],"categories":["Python","self-hosted"],"sub_categories":[],"readme":"# Offline Wikipedia Text API\n\nWelcome to the Offline Wikipedia Text API! This project provides a simple way to search and retrieve Wikipedia articles from an offline dataset using the `txtai` library. 
The API offers endpoints to get the top article or top N articles by search prompt, full articles by title or by search prompt, and summary snippets of articles by search prompt.\n\n## Features\n\n- **Offline Access**: All Wikipedia article texts are stored offline, allowing for fast and private access.\n- **Search Functionality**: Uses the powerful `txtai` library to search for articles by prompts.\n\n## Requirements\n\n* This project requires a minimum of 60GB of hard disk space to store the related datasets\n* This project utilizes Git to pull down the needed datasets (https://git-scm.com/downloads)\n  * This can be skipped by downloading the datasets into their respective folders in the project directory.\n    * \"wiki-dataset\" folder: https://huggingface.co/datasets/NeuML/wikipedia-20240101\n    * \"txtai-wikipedia\" folder: https://huggingface.co/NeuML/txtai-wikipedia\n  * If both dataset folders already exist, the git calls should be skipped, so Git is not needed.\n* This project is a Python project, and requires Python to run.\n\n## Important Notes\n\nThere ARE scripts for Mac and Windows, but they are in the \"Untested\" folder because of two reasons:\n- A) On Mac, I ran into an issue with the Xcode-supplied git: it doesn't handle large files well. The result\n  is that I can't download the wikipedia datasets cleanly in that script. Once the sets are in their respective locations, the\n  script works great. You can find more in the \"Untested\" folder readme.\n- B) I don't have a Linux machine to test with. I've had a couple of people tell me it works fine, so I have\n  an expectation that it will.\n\nDuring first run, the app will first download about 60GB worth of datasets (see above), and then will take about 10-15\nminutes to do some indexing. This will only occur on first run; just let it do its thing. If, for any reason, you kill\nthe process halfway through and need to redo it, you can simply delete the \"title_to_index.json\" file and it will be\nrecreated. 
You can also delete the \"wiki-dataset\" and \"txtai-wikipedia\" folders to redownload.\n\nIf you're dataset savvy and want to make new, more up-to-date datasets to use with this, NeuML's Hugging Face repos give\ninstructions on how.\n\nThis project relies heavily on [txtai](https://github.com/neuml/txtai/), which uses various libraries to download\nand utilize small models itself for searching. Please see that project for an understanding of what gets downloaded\nand where.\n\n### Installation via Scripts\n\n1. **Clone the Repository**\n    ```sh\n    git clone https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi\n    cd OfflineWikipediaTextApi\n    ```\n\n2. **Run the API** \n    - **For Windows**:\n    \n        *To run with the default configuration (current directory as the base for datasets):*\n        ```cmd\n        run_windows.bat\n        ```\n        *To run with a custom directory for the wiki data (parent of `wiki-dataset` and `txtai-wikipedia`):*\n        ```cmd\n        run_windows.bat --database_dir path\\to\\datadirs\n        ```\n    \n    - **For Linux or MacOS**:\n        \n        *To run with the default configuration (current directory as the base for datasets):*\n        ```sh\n        ./run_linux.sh\n        ```\n        *To run with a custom directory for the wiki data (parent of `wiki-dataset` and `txtai-wikipedia`):*\n        ```sh\n        ./run_linux.sh --database_dir path/to/datadirs\n        ```\n      - The script was tested on Linux and it might work on MacOS.\n      - There are currently scripts within \"Untested\", though there is a known issue for MacOS related to git. 
A workaround \n        is presented in the README for that folder.\n\n### Manual Installation\n\n1) Pull down the code from https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi\n   `git clone https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi`\n2) Open command prompt and navigate to the folder containing the code\n   `cd OfflineWikipediaTextApi`\n3) Optional: create a python virtual environment.\n   1) Windows: `python -m venv venv`\n   2) MacOS: `python3 -m venv venv`\n   3) Linux: `python -m venv venv`\n4) Optional: activate python virtual environment.\n   1) Windows: `venv\\Scripts\\activate`\n   2) MacOS/Linux: `source venv/bin/activate`\n   3) Fish shell: `source venv/bin/activate.fish`\n5) Pip install the requirements from requirements.txt\n   1) Windows: `python -m pip install -r requirements.txt`\n   2) MacOS: `python3 -m pip install -r requirements.txt`\n   3) Linux: `python -m pip install -r requirements.txt`\n6) Pull down the two needed datasets into the following folders within the project folder:\n   1) `wiki-dataset` folder: https://huggingface.co/datasets/NeuML/wikipedia-20240901 \n        You need git-lfs installed to clone it:\n        Windows: https://git-lfs.com/\n        Mac: https://git-lfs.com/ or `brew install git-lfs`\n        Linux Ubuntu/Debian: `sudo apt install git-lfs`\n        Then run:\n        `git lfs install`\n        `git clone https://huggingface.co/datasets/NeuML/wikipedia-20240901`\n        The dataset folder must be named `wiki-dataset`, so rename it:\n        `mv wikipedia-20240901 wiki-dataset`      \n   2) `txtai-wikipedia` folder: https://huggingface.co/NeuML/txtai-wikipedia\n        `git clone https://huggingface.co/NeuML/txtai-wikipedia`\n   3) See project structure below to make sure you did it right\n7) Run start_api.py\n   1) Windows: `python start_api.py`\n   2) MacOS/Linux: `python3 start_api.py`\n\nStep 7 will take 10-15 minutes on the first run only, while it builds an index for future runs. 
After that\nit should be fast.\n\nYour project should look like this:\n\n```plain\n\n- OfflineWikipediaTextApi/\n   - wiki-dataset/\n       - train/\n           - data-00000-of-00044.arrow\n           - data-00001-of-00044.arrow\n           - ...\n       - pageviews.sqlite\n       - README.md\n   - txtai-wikipedia\n       - config.json\n       - documents\n       - embeddings\n       - README.md\n   - start_api.py\n   - ...\n```\n\n\n## Configuration\n\nThe API configuration is managed through the `config.json` file:\n\n```json\n{\n    \"host\": \"0.0.0.0\",\n    \"port\": 5728,\n    \"verbose\": false\n}\n```\n\nThe \"verbose\" setting controls whether uvicorn, the API server library, outputs all logs or only warning-level logs. It is set to \nwarning-only (`false`) by default.\n\n## Endpoints\n\n### 1. Get Top Article by Prompt Query\n\n**Endpoint**: `/top_article`\n\n#### Example cURL Command\n```sh\ncurl -G \"http://localhost:5728/top_article\" --data-urlencode \"prompt=Quantum Physics\" --data-urlencode \"percentile=0.5\" --data-urlencode \"num_results=10\"\n```\n\n`NOTE: The num_results for top_article is the number of results to compare to find the top article. This endpoint\nalways returns a single result, but the higher your num_results, the more articles it will compare in an attempt to\nfind the top-scoring one.`\n\n### 2. 
Get Top N Articles by Prompt Query\n\n**Endpoint**: `/top_n_articles`\n\n#### Example cURL Command\n```sh\ncurl -G \"http://localhost:5728/top_n_articles\" --data-urlencode \"prompt=quantum physics and gravity\" --data-urlencode \"percentile=0.4\" --data-urlencode \"num_results=80\" --data-urlencode \"num_top_articles=6\"\n```\n\n`NOTE: The num_results for top_n_articles is the number of results to compare to find the top N articles, where num_top_articles is N.\nThe output articles are given in order of score, with the highest-scoring article first by default (descending).\nIf percentile, num_results, and num_top_articles are not specified, then default values of 0.5, 20, and 8 will be used respectively.\nnum_top_articles can also be negative, in which case the results are given in ascending order of score rather than descending - this is useful\nwhen the context is truncated by the LLM.`\n\n### 3. Get Full Article by Title\n\n**Endpoint**: `/articles/{title}`\n\n#### Example cURL Command\n```sh\ncurl -X GET \"http://localhost:5728/articles/Applications%20of%20quantum%20mechanics\"\n```\n\n### 4. Get Wiki Summaries by Prompt Query\n\n**Endpoint**: `/summaries`\n\n#### Example cURL Command\n```sh\ncurl -G \"http://localhost:5728/summaries\" --data-urlencode \"prompt=Quantum Physics\" --data-urlencode \"percentile=0.5\" --data-urlencode \"num_results=1\"\n```\n\n### 5. Get Full Wiki Articles by Prompt Query\n\n**Endpoint**: `/articles`\n\n#### Example cURL Command\n```sh\ncurl -G \"http://localhost:5728/articles\" --data-urlencode \"prompt=Artificial Intelligence\" --data-urlencode \"percentile=0.5\" --data-urlencode \"num_results=1\"\n```\n\n## License\n\nThis project is licensed under the Apache 2.0 License. 
See the `LICENSE` file for more details.\n\n### Third-Party Licenses\n\nThis project imports dependencies in the requirements.txt:\n\n- [Uvicorn](https://github.com/encode/uvicorn/)\n- [FastAPI](https://github.com/tiangolo/fastapi/)\n- [Datasets](https://github.com/huggingface/datasets/)\n- [Txtai](https://github.com/neuml/txtai/)\n- [Faiss-cpu](https://github.com/facebookresearch/faiss/)\n- [Colorama](https://github.com/tartley/colorama/)\n- [NumPy](https://github.com/numpy/numpy/)\n\nPlease see ThirdParty-Licenses directory for details on their licenses.\n\n## License and Copyright\n\n    OfflineWikipediaTextApi\n    Copyright (C) 2024 Christopher Smith","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSomeOddCodeGuy%2FOfflineWikipediaTextApi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSomeOddCodeGuy%2FOfflineWikipediaTextApi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSomeOddCodeGuy%2FOfflineWikipediaTextApi/lists"}