{"id":13566900,"url":"https://github.com/geeks-of-data/knowledge-gpt","last_synced_at":"2025-04-04T00:32:28.368Z","repository":{"id":148442124,"uuid":"612770423","full_name":"geeks-of-data/knowledge-gpt","owner":"geeks-of-data","description":"Extract knowledge from all information sources using gpt and other language models. Index and make Q\u0026A session with information sources.","archived":false,"fork":false,"pushed_at":"2023-04-25T06:55:08.000Z","size":3522,"stargazers_count":283,"open_issues_count":9,"forks_count":54,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-28T16:40:23.484Z","etag":null,"topics":["context","embedding","embedding-vectors","gpt","gpt3-turbo","gpt4","huggingface","huggingface-transformers","information-extraction","language-model","llama","llm","natural-language-processing","openai","python","question-answering","scraper","sentence-embeddings","sentence-similarity","vector-search"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/knowledgegpt/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/geeks-of-data.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-11T23:15:30.000Z","updated_at":"2025-03-12T02:32:28.000Z","dependencies_parsed_at":"2023-05-20T07:00:58.899Z","dependency_job_id":null,"html_url":"https://github.com/geeks-of-data/knowledge-gpt","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geeks-of-data%2Fknowledge-gpt","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/geeks-of-data%2Fknowledge-gpt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geeks-of-data%2Fknowledge-gpt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geeks-of-data%2Fknowledge-gpt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/geeks-of-data","download_url":"https://codeload.github.com/geeks-of-data/knowledge-gpt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247103290,"owners_count":20884023,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["context","embedding","embedding-vectors","gpt","gpt3-turbo","gpt4","huggingface","huggingface-transformers","information-extraction","language-model","llama","llm","natural-language-processing","openai","python","question-answering","scraper","sentence-embeddings","sentence-similarity","vector-search"],"created_at":"2024-08-01T13:02:19.124Z","updated_at":"2025-04-04T00:32:26.737Z","avatar_url":"https://github.com/geeks-of-data.png","language":"Python","funding_links":[],"categories":["Python","Packages","Langchain","ChatGPT Integrated Projects","SDK, Libraries, Frameworks"],"sub_categories":["Python","Python library, sdk or frameworks"],"readme":"\u003c!-- Use the context of other files to complete here --\u003e\n![knowledgegpt](static_files/logo.png)\n\n\n# knowledgegpt\n\n***knowledgegpt*** is designed to gather information from various sources, including the internet and local data, which\ncan be used to create prompts. 
These prompts can then be utilized by OpenAI's GPT-3 model to generate answers that are\nsubsequently stored in a database for future reference.\n\nTo accomplish this, the text is first transformed into a fixed-size vector using either open-source or OpenAI models.\nWhen a query is submitted, the query text is also transformed into a vector and compared to the stored knowledge embeddings.\nThe most relevant information is then selected and used to generate a prompt context.\n\n***knowledgegpt*** supports various information sources including websites, PDFs, PowerPoint files (PPTX), and\ndocuments (DOCX). Additionally, it can extract text from YouTube subtitles and audio (using speech-to-text technology)\nand use it as a source of information. This allows for a diverse range of information to be gathered and used for\ngenerating prompts and answers.\n\n## PyPI Link: https://pypi.org/project/knowledgegpt/\n\n# Installation\n\n1. PyPI installation, run in terminal:  `pip install knowledgegpt`\n\n2. Or you can use the latest version from the repository: `pip install -r requirements.txt` and then `pip install .`\n\n3. Download the language model needed for parsing: `python3 -m spacy download en_core_web_sm`\n\n## How to use\n\n#### RESTful API\n\n```uvicorn server:app --reload```\n\n#### Set Your API Key\n\n1. Go to [OpenAI \u003e Account \u003e API Keys](https://platform.openai.com/account/api-keys)\n2. Create a new secret key and copy it\n3. 
Enter the key to [example_config.py](./examples/example_config.py)\n\n#### How to use the library\n\n```python\n# Import the library\nfrom knowledgegpt.extractors.web_scrape_extractor import WebScrapeExtractor\n\n# Import OpenAI and Set the API Key\nimport openai\nfrom example_config import SECRET_KEY \nopenai.api_key = SECRET_KEY\n\n# Define target website\nurl = \"https://en.wikipedia.org/wiki/Bombard_(weapon)\"\n\n# Initialize the WebScrapeExtractor\nscrape_website = WebScrapeExtractor( url=url, embedding_extractor=\"hf\", model_lang=\"en\")\n\n# Prompt the OpenAI Model\nanswer, prompt, messages = scrape_website.extract(query=\"What is a bombard?\",max_tokens=300,  to_save=True, mongo_client=db)\n\n# See the answer\nprint(answer)\n\n# Output: 'A bombard is a type of large cannon used during the 14th to 15th centuries.'\n\n```\n\nOther examples can be found in the [examples](./examples) folder.\nBut to give a better idea of how to use the library, here is a simple example:\n\n```python\n# Basic Usage\nbasic_extractor = BaseExtractor(df)\nanswer, prompt, messages = basic_extractor.extract(\"What is the title of this PDF?\", max_tokens=300)\n```\n\n```python\n# PDF Extraction\npdf_extractor = PDFExtractor( pdf_file_path, extraction_type=\"page\", embedding_extractor=\"hf\", model_lang=\"en\")\nanswer, prompt, messages = pdf_extractor.extract(query, max_tokens=1500)\n```\n\n```python\n# PPTX Extraction\nppt_extractor = PowerpointExtractor(file_path=ppt_file_path, embedding_extractor=\"hf\", model_lang=\"en\")\nanswer, prompt, messages = ppt_extractor.extract( query,max_tokens=500)\n```\n\n```python\n# DOCX Extraction\ndocs_extractor = DocsExtractor(file_path=\"../example.docx\", embedding_extractor=\"hf\", model_lang=\"en\", is_turbo=False)\nanswer, prompt, messages = \\\n    docs_extractor.extract( query=\"What is an object detection system?\", max_tokens=300)\n```\n\n```python\n# Extraction from Youtube video (audio)\nscrape_yt_audio = 
YoutubeAudioExtractor(video_id=url, model_lang='tr', embedding_extractor='hf')\nanswer, prompt, messages = scrape_yt_audio.extract( query=query, max_tokens=1200)\n\n# Extraction from Youtube video (transcript)\nscrape_yt_subs = YTSubsExtractor(video_id=url, embedding_extractor='hf', model_lang='en')\nanswer, prompt, messages = scrape_yt_subs.extract( query=query, max_tokens=1200)\n```\n## Docker Usage\n\n```bash\ndocker build -t knowledgegptimage .\ndocker run -p 8888:8888 knowledgegptimage\n```\n\n## How to contribute\n\n0. Open an issue\n1. Fork the repo\n2. Create a new branch\n3. Make your changes\n4. Create a pull request\n\n## FEATURES\n\n- [x] Extract knowledge from the internet (i.e. Wikipedia)\n- [x] Extract knowledge from local data sources - PDF\n- [x] Extract knowledge from local data sources - DOCX\n- [x] Extract knowledge from local data sources - PPTX\n- [x] Extract knowledge from youtube audio (when caption is not available)\n- [x] Extract knowledge from youtube transcripts\n- [x] Extract knowledge from whole youtube playlist\n\n## TODO\n\n\n- [x] FAISS support \n- [ ] Add a vector database (Pinecone, Milvus, Qdrant etc.)\n- [x] Add Whisper Model\n- [x] Add Whisper Local Support (not over openai API)\n- [ ] Add Whisper for audio longer than 25MB\n- [ ] Add a web interface\n- [ ] Migrate to Promptify for prompt generation\n- [x] Add ChatGPT support\n- [ ] Add ChatGPT support with a better infrastructure and planning\n- [ ] Increase the number of prompts\n- [ ] Increase the number of supported knowledge sources\n- [ ] Increase the number of supported languages\n- [ ] Increase the number of open source models\n- [ ] Advanced web scraping\n- [ ] Prompt-Answer storage (the odds are that this will be done in a separate project)\n- [ ] Add a better documentation \n- [ ] Add a better logging system\n- [ ] Add a better error handling system\n- [ ] Add a better testing system\n- [ ] Add a better CI/CD system\n- [x] Dockerize the project\n- [ ] Add search 
engine support, such as Google, Bing, etc.\n- [ ] Add support for opensource OpenAI alternatives (for answer generation)\n- [ ] Evaluating dependencies and removing unnecessary ones\n- [ ] Providing prompt flexibility for using with whatever model\n\n( To be extended...)\n\n## System Architecture\n\n\u003c!-- ![System Architecture](static_files/Knowledge-ex.png) --\u003e\n(To be updated with a better image)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeeks-of-data%2Fknowledge-gpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeeks-of-data%2Fknowledge-gpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeeks-of-data%2Fknowledge-gpt/lists"}