{"id":19926367,"url":"https://github.com/fullstackwithlawrence/openai-embeddings","last_synced_at":"2026-04-13T06:45:33.622Z","repository":{"id":210105368,"uuid":"725741508","full_name":"FullStackWithLawrence/openai-embeddings","owner":"FullStackWithLawrence","description":"OpenAI chatGPT hybrid search and retrieval augmented generation","archived":false,"fork":false,"pushed_at":"2025-02-19T01:22:55.000Z","size":1162,"stargazers_count":12,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-26T06:05:44.896Z","etag":null,"topics":["ci","ci-cd","embeddings","github-actions","hybrid-search","langchain","langchain-python","openai","openai-api","pdf","pdf-document","pinecone","pre-commit","pre-commit-hooks","pydantic","pytest","python","rag","retrieval-augmented-generation","semantic-release"],"latest_commit_sha":null,"homepage":"https://www.youtube.com/@FullStackWithLawrence","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FullStackWithLawrence.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":".github/CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"lpm0073","patreon":"FullStackWithLawrence"}},"created_at":"2023-11-30T19:24:25.000Z","updated_at":"2025-02-07T14:36:49.000Z","dependencies_parsed_at":"2024-09-17T16:47:08.982Z","dependency_job_id":"0b7575d8-610b-46e3-b607-d9d015938dfd","html_url":"https://github.com/FullStackWithLawrence/openai-embeddings","commit_stats":null,"previous_names":["lpm0073/netec-llm","fullstackwithlawrence/hybrid-search-retriever","ful
lstackwithlawrence/openai-embeddings"],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FullStackWithLawrence%2Fopenai-embeddings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FullStackWithLawrence%2Fopenai-embeddings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FullStackWithLawrence%2Fopenai-embeddings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FullStackWithLawrence%2Fopenai-embeddings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FullStackWithLawrence","download_url":"https://codeload.github.com/FullStackWithLawrence/openai-embeddings/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241354994,"owners_count":19949291,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ci","ci-cd","embeddings","github-actions","hybrid-search","langchain","langchain-python","openai","openai-api","pdf","pdf-document","pinecone","pre-commit","pre-commit-hooks","pydantic","pytest","python","rag","retrieval-augmented-generation","semantic-release"],"created_at":"2024-11-12T22:29:00.910Z","updated_at":"2026-04-13T06:45:33.603Z","avatar_url":"https://github.com/FullStackWithLawrence.png","language":"Python","funding_links":["https://github.com/sponsors/lpm0073","https://patreon.com/FullStackWithLawrence"],"categories":[],"sub_categories":[],"readme":"# OpenAI Embeddings Example\n\n🤖 Retrieval Augmented Generation and 
Hybrid Search 🤖\n\n[![FullStackWithLawrence](https://a11ybadges.com/badge?text=FullStackWithLawrence\u0026badgeColor=orange\u0026logo=youtube\u0026logoColor=282828)](https://www.youtube.com/@FullStackWithLawrence)\u003cbr\u003e\n[![OpenAI](https://a11ybadges.com/badge?logo=openai)](https://platform.openai.com/)\n[![LangChain](https://a11ybadges.com/badge?text=LangChain\u0026badgeColor=0834ac)](https://www.langchain.com/)\n[![Pinecone](https://a11ybadges.com/badge?text=Pinecone\u0026badgeColor=000000)](https://www.pinecone.io/)\n[![Python](https://a11ybadges.com/badge?logo=python)](https://www.python.org/)\n[![Pydantic](https://a11ybadges.com/badge?text=Pydantic\u0026badgeColor=E520E9)](https://pydantic.dev/)\u003cbr\u003e\n[![Release Notes](https://img.shields.io/github/release/FullStackWithLawrence/openai-embeddings)](https://github.com/FullStackWithLawrence/openai-embeddings/releases)\n![GHA pushMain Status](https://img.shields.io/github/actions/workflow/status/FullStackWithLawrence/openai-embeddings/pushMain.yml?branch=main)\n[![AGPL License](https://img.shields.io/github/license/overhangio/tutor.svg?style=flat-square)](https://www.gnu.org/licenses/agpl-3.0.en.html)\n[![hack.d Lawrence McDaniel](https://img.shields.io/badge/hack.d-Lawrence%20McDaniel-orange.svg)](https://lawrencemcdaniel.com)\n\nA Hybrid Search and Retrieval Augmented Generation prompting solution using Python [OpenAI API Embeddings](https://platform.openai.com/docs/guides/embeddings) persisted to a [Pinecone](https://docs.pinecone.io/docs/python-client) vector database index and managed by [LangChain](https://www.langchain.com/). Demonstrates the following:\n\n- **System Prompting**. How to use the system prompt to modify LLM text completion behavior.\n- **Templates**. How to create templates to keep your prompts DRY.\n- **LangChain**. How to set up a project using LangChain as an alternative to vendor-specific LLM PyPI packages.\n- **PDF Loader**. 
A command-line PDF loader program that extracts text, vectorizes it, and\n  loads it into a Pinecone dot product vector database that is dimensioned to match OpenAI embeddings.\n- **Pinecone**. How to create, load, and query a Pinecone vector database.\n- **Retrieval Augmented Generation (RAG)**. A ChatGPT prompt based on a hybrid search retriever that locates relevant documents in the vector database and includes them in OpenAI prompts.\n\nI also use this repo to demonstrate how to set up [Pydantic](https://docs.pydantic.dev/latest/) to manage your project settings and how to work safely with sensitive credentials inside your project.\n\n## Installation\n\n```console\ngit clone https://github.com/FullStackWithLawrence/openai-embeddings.git\ncd openai-embeddings\nmake init\n\n# Linux/macOS\nsource venv/bin/activate\n\n# Windows PowerShell (admin)\nvenv\\Scripts\\activate\n```\n\nYou'll also need to add your API keys to the .env file in the root of the repo.\n\n- Get your [OpenAI API key](https://platform.openai.com/api-keys)\n- Get your [Pinecone API Key](https://app.pinecone.io/)\n\n```console\nOPENAI_API_ORGANIZATION=PLEASE-ADD-ME\nOPENAI_API_KEY=PLEASE-ADD-ME\nPINECONE_API_KEY=PLEASE-ADD-ME\n```\n\n## Usage\n\n```console\n# example 1 - generic assistant\npython3 -m models.examples.prompt \"you are a helpful assistant\" \"What analytics and accounting courses does Wharton offer?\"\n\n# example 2 - assistant with improved system prompting\npython3 -m models.examples.prompt \"You are a student advisor at University of Pennsylvania. 
You provide concise answers of 100 words or less.\" \"What analytics and accounting courses does Wharton offer?\"\n\n# example 3 - templated assistant: Online courses\npython3 -m models.examples.online_courses \"analytics and accounting\"\n\n# example 4 - templated assistant: Certification programs\npython3 -m models.examples.certification_programs \"analytics and accounting\"\n\n# example 5 - Retrieval Augmented Generation\npython3 -m models.examples.load \"/path/to/your/pdf/documents\"\npython3 -m models.examples.rag \"What analytics and accounting courses does Wharton offer?\"\n```\n\n### Retrieval Augmented Generation\n\nFor the question, _\"What analytics and accounting courses does Wharton offer?\"_, an\nembedding can potentially dramatically alter the response generated by chatGPT. To illustrate, I uploaded a batch of 21 sets of lecture notes in PDF format for an online analytics course taught by Wharton professor [Brian Bushee](https://accounting.wharton.upenn.edu/profile/bushee/). You can download these from https://cdn.lawrencemcdaniel.com/fswl/openai-embeddings-data.zip to test whether your results are consistent.\n\n#### The control set\n\nExample 1 above, a generic chatGPT prompt with no additional guidance provided by a system prompt nor an embedding, generates the following response:\n\n```console\nWharton offers a variety of analytics and accounting courses. Some of the analytics courses include:\n\n1. Introduction to Business Analytics: This course provides an overview of the fundamentals of business analytics, including data analysis, statistical modeling, and decision-making.\n\n2. Data Visualization and Communication: This course focuses on the effective presentation and communication of data through visualizations and storytelling techniques.\n\n3. Predictive Analytics: This course explores the use of statistical models and machine learning algorithms to predict future outcomes and make data-driven decisions.\n\n4. 
Big Data Analytics: This course covers the analysis of large and complex datasets using advanced techniques and tools, such as Hadoop and Spark.\n\nIn terms of accounting courses, Wharton offers:\n\n1. Financial Accounting: This course provides an introduction to the principles and concepts of financial accounting, including the preparation and analysis of financial statements.\n\n2. Managerial Accounting: This course focuses on the use of accounting information for internal decision-making and planning, including cost analysis and budgeting.\n\n3. Advanced Financial Accounting: This course delves into more complex accounting topics, such as consolidations, partnerships, and international accounting standards.\n\n4. Auditing and Assurance Services: This course covers the principles and practices of auditing, including risk assessment, internal controls, and audit procedures.\n\nThese are just a few examples of the analytics and accounting courses offered at Wharton. The school offers a wide range of courses to cater to different interests and skill levels in these fields.\n```\n\n#### Same prompt but with an embedding\n\nAfter creating an embedding from the sample set of PDF documents, you can prompt models.examples.rag with the same question, and it should return a noticeably different response from the control in example 1. It should resemble the following:\n\n```console\nWharton offers a variety of analytics and accounting courses. Some of the courses offered include:\n\n1. Accounting-Based Valuation: This course, taught by Professor Brian Bushee, focuses on using accounting information to value companies and make investment decisions.\n\n2. 
Review of Financial Statements: Also taught by Professor Brian Bushee, this course provides an in-depth understanding of financial statements and how to analyze them for decision-making purposes.\n\n3. Discretionary Accruals Model: Another course taught by Professor Brian Bushee, this course explores the concept of discretionary accruals and their impact on financial statements and financial analysis.\n\n4. Discretionary Accruals Cases: This course, also taught by Professor Brian Bushee, provides practical applications of the discretionary accruals model through case studies and real-world examples.\n\nThese are just a few examples of the analytics and accounting courses offered at Wharton. The school offers a wide range of courses in these areas to provide students with a comprehensive understanding of financial analysis and decision-making.\n```\n\n## Requirements\n\n- [git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git). _pre-installed on Linux and macOS_\n- [make](https://gnuwin32.sourceforge.net/packages/make.htm). _pre-installed on Linux and macOS._\n- [OpenAI platform API key](https://platform.openai.com/).\n  _If you're new to OpenAI API then see [How to Get an OpenAI API Key](./doc/OPENAI_API_GETTING_STARTED_GUIDE.md)_\n- [Pinecone](https://www.pinecone.io/) API key. A vector database for storing embedding results.\n- [Python 3.12](https://www.python.org/downloads/): for creating virtual environment. 
Also used by pre-commit linters and code formatters.\n- [NodeJS](https://nodejs.org/en/download): used with NPM for configuring/testing Semantic Release.\n\n## Configuration defaults\n\nSet these as environment variables on the command line, or in a .env file in the root of the repo.\n\n```console\n# OpenAI API\nOPENAI_API_ORGANIZATION=ADD-ME-PLEASE\nOPENAI_API_KEY=ADD-ME-PLEASE\nOPENAI_CHAT_MODEL_NAME=gpt-4\nOPENAI_PROMPT_MODEL_NAME=gpt-4\nOPENAI_CHAT_TEMPERATURE=0.0\nOPENAI_CHAT_MAX_RETRIES=3\n\n# Pinecone API\nPINECONE_API_KEY=ADD-ME-PLEASE\nPINECONE_ENVIRONMENT=gcp-starter\nPINECONE_INDEX_NAME=openai-embeddings\nPINECONE_VECTORSTORE_TEXT_KEY=lc_id\nPINECONE_METRIC=dotproduct\nPINECONE_DIMENSIONS=1536\n\n# This package\nDEBUG_MODE=False\n```\n\n## Contributing\n\nThis project uses a mostly automated pull request and unit testing process. See the resources in .github for additional details. You should also ensure that pre-commit is installed and working correctly on your dev machine by running the following command from the root of the repo.\n\n```console\npre-commit run --all-files\n```\n\nPull requests should pass these tests before being submitted:\n\n```console\nmake test\n```\n\n### Developer setup\n\n```console\ngit clone https://github.com/FullStackWithLawrence/openai-embeddings.git\ncd openai-embeddings\nmake init\nmake activate\n```\n\n### GitHub Actions\n\nActions requires the following secrets:\n\n```console\nPAT: {{ secrets.PAT }}  # a GitHub Personal Access Token\nOPENAI_API_ORGANIZATION: {{ secrets.OPENAI_API_ORGANIZATION }}\nOPENAI_API_KEY: {{ secrets.OPENAI_API_KEY }}\nPINECONE_API_KEY: {{ secrets.PINECONE_API_KEY }}\nPINECONE_ENVIRONMENT: {{ secrets.PINECONE_ENVIRONMENT }}\nPINECONE_INDEX_NAME: {{ secrets.PINECONE_INDEX_NAME }}\n```\n\n## Additional reading\n\n- [YouTube - Vector Embeddings Tutorial – Code Your Own AI Assistant with GPT-4 API + LangChain + NLP](https://www.youtube.com/watch?v=yfHHvmaMkcA)\n- [YouTube - LangChain 
Explained in 13 Minutes | QuickStart Tutorial for Beginners](https://www.youtube.com/watch?v=aywZrzNaKjs)\n- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)\n- [What is a Vector Database?](https://www.pinecone.io/learn/vector-database/)\n- [LangChain RAG](https://python.langchain.com/docs/use_cases/question_answering/)\n- [LangChain Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)\n- [LangChain Caching](https://python.langchain.com/docs/modules/model_io/llms/llm_caching)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffullstackwithlawrence%2Fopenai-embeddings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffullstackwithlawrence%2Fopenai-embeddings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffullstackwithlawrence%2Fopenai-embeddings/lists"}