{"id":21170217,"url":"https://github.com/dcarpintero/github-semantic-search","last_synced_at":"2025-04-13T15:04:53.562Z","repository":{"id":195824303,"uuid":"692490888","full_name":"dcarpintero/github-semantic-search","owner":"dcarpintero","description":"Semantic Search on Langchain Github Issues with Weaviate","archived":false,"fork":false,"pushed_at":"2023-09-25T10:16:03.000Z","size":2471,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T15:04:16.063Z","etag":null,"topics":["bm25","embedding-vectors","hybrid-search","langchain","large-language-models","python","semantic-search","streamlit","weaviate"],"latest_commit_sha":null,"homepage":"https://gh-semantic-search.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcarpintero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-16T16:34:10.000Z","updated_at":"2024-01-23T05:17:33.000Z","dependencies_parsed_at":"2025-01-21T10:43:18.992Z","dependency_job_id":"4b678e6b-f1e8-4560-85fa-e1ffd4d2fdec","html_url":"https://github.com/dcarpintero/github-semantic-search","commit_stats":null,"previous_names":["dcarpintero/github-semantic-search"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Fgithub-semantic-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Fgithub-semantic-search/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/dcarpintero%2Fgithub-semantic-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Fgithub-semantic-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcarpintero","download_url":"https://codeload.github.com/dcarpintero/github-semantic-search/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248732483,"owners_count":21152852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bm25","embedding-vectors","hybrid-search","langchain","large-language-models","python","semantic-search","streamlit","weaviate"],"created_at":"2024-11-20T15:57:10.790Z","updated_at":"2025-04-13T15:04:53.533Z","avatar_url":"https://github.com/dcarpintero.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Open_inStreamlit](https://img.shields.io/badge/Open%20In-Streamlit-red?logo=Streamlit)](https://gh-semantic-search.streamlit.app/)\n[![Python](https://img.shields.io/badge/python-%203.8-blue.svg)](https://www.python.org/)\n[![CodeFactor](https://www.codefactor.io/repository/github/dcarpintero/github-semantic-search/badge)](https://www.codefactor.io/repository/github/dcarpintero/github-semantic-search)\n[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](https://github.com/dcarpintero/st-newsapi-connector/blob/main/LICENSE)\n\n# 🦜 Semantic Search on Langchain Github Issues with Weaviate 🔍\n\n\u003cp align=\"center\"\u003e\n  \u003cimg 
src=\"./static/github-semantic-search.png\"\u003e\n\u003c/p\u003e\n\n## 🔍 What's Semantic Search?\n\n\u003e *Semantic search refers to search algorithms that consider the intent and contextual meaning of search phrases when generating results, rather than solely focusing on keyword matching. The goal is to provide more accurate and relevant results by understanding the semantics, or meaning, behind the query.*\n\n## 📋 How does it work?\n\n- **Ingesting GitHub Issues**: We use the [Langchain Github Loader](https://js.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_loaders/github) to connect to the [Langchain Repository](http://github.com/langchain-ai/langchain) and fetch the GitHub issues (nearly 2,000), which are then converted to a pandas DataFrame and stored in a pickle file. See [./data-pipeline/ingest.py](./data-pipeline/ingest.py).\n\n- **Generate and Index Vector Embeddings with Weaviate**: Weaviate generates vector embeddings at the object level (rather than for individual properties) and, by default, includes all properties that use the text data type. In our case, we skip the 'url' field (which is also neither filterable nor searchable) and set up the 'text2vec-openai' vectorizer. 
Given that our use case values fast queries over loading time, we have opted for the [HNSW](https://arxiv.org/abs/1603.09320) vector index type, which incrementally builds a multi-layer structure consisting of a hierarchical set of proximity graphs (layers).\n\n```python\nclass_obj = {\n        \"class\": \"GitHubIssue\",\n        \"description\": \"This class contains GitHub Issues from the langchain repository.\",\n        \"vectorIndexType\": \"hnsw\",\n        \"vectorizer\": \"text2vec-openai\",\n        \"moduleConfig\": {\n            \"text2vec-openai\": {\n                \"model\": \"ada\",\n                \"modelVersion\": \"002\",\n                \"type\": \"text\"\n            }\n        },\n        \"properties\": [\n            {\n                \"name\": \"title\",\n                \"dataType\": [\"text\"]\n            },\n            {\n                \"name\": \"url\",\n                \"dataType\": [\"text\"],\n                \"indexFilterable\": False,  \n                \"indexSearchable\": False,\n                \"vectorizePropertyName\": False\n            },\n            {\n                \"name\": \"description\",\n                \"dataType\": [\"text\"]\n            },\n            {\n                \"name\": \"creator\",\n                \"dataType\": [\"text\"],\n            },\n            {\n                \"name\": \"created_at\",\n                \"dataType\": [\"date\"]\n            },\n            {\n                \"name\": \"state\",\n                \"dataType\": [\"text\"],\n            },\n        ]\n    }\n```\n\nThe ingestion then proceeds in batches of 100 records:\n\n```python\nwith client.batch as batch: \n    batch.batch_size = 100\n    for item in df.itertuples():\n        properties = {\n            \"title\": item.title,\n            \"url\": item.url,\n            \"labels\": item.labels,\n            \"description\": item.description,\n            \"creator\": item.creator,\n            \"created_at\": 
item.created_at,\n            \"state\": item.state,\n        }\n\n        batch.add_data_object(\n            data_object=properties, \n            class_name=\"GitHubIssue\")\n```\n\n- **Searching with Weaviate**: Our app supports three search modes:\n\n[Near-Text-Vector-Search](https://weaviate.io/developers/weaviate/search/similarity):\n\n```python\n@st.cache_data\ndef query_with_near_text(_w_client: weaviate.Client, query, max_results=10) -\u003e pd.DataFrame:\n    \"\"\"\n    Search GitHub Issues in Weaviate with Near Text.\n    Weaviate converts the input query into a vector through the inference API (OpenAI) and uses that vector as the basis for a vector search.\n    \"\"\"\n\n    response = (\n        _w_client.query\n        .get(\"GitHubIssue\", [\"title\", \"url\", \"labels\", \"description\", \"created_at\", \"state\"])\n        .with_near_text({\"concepts\": [query]})\n        .with_limit(max_results)\n        .do()\n    )\n\n    data = response[\"data\"][\"Get\"][\"GitHubIssue\"]\n    return pd.DataFrame.from_dict(data, orient='columns')\n```\n\n[BM25-Search](https://weaviate.io/developers/weaviate/search/bm25):\n\n```python\n@st.cache_data\ndef query_with_bm25(_w_client: weaviate.Client, query, max_results=10) -\u003e pd.DataFrame:\n    \"\"\"\n    Search GitHub Issues in Weaviate with BM25.\n    Keyword search (also called sparse vector search) that looks for objects that contain the search terms in their properties according to \n    the selected tokenization. The results are scored according to the BM25F function. 
\n    \"\"\"\n\n    response = (\n        _w_client.query\n        .get(\"GitHubIssue\", [\"title\", \"url\", \"labels\", \"description\", \"created_at\", \"state\"])\n        .with_bm25(query=query)\n        .with_limit(max_results)\n        .with_additional(\"score\")\n        .do()\n    )\n\n    data = response[\"data\"][\"Get\"][\"GitHubIssue\"]\n    return pd.DataFrame.from_dict(data, orient='columns')\n```\n\n[Hybrid-Search](https://weaviate.io/developers/weaviate/search/hybrid):\n\n```python\n@st.cache_data\ndef query_with_hybrid(_w_client: weaviate.Client, query, max_results=10) -\u003e pd.DataFrame:\n    \"\"\"\n    Search GitHub Issues in Weaviate with Hybrid Search.\n    Combines the results of a keyword (BM25) search and a vector search, and ranks them by a fused score.\n    \"\"\"\n\n    response = (\n        _w_client.query\n        .get(\"GitHubIssue\", [\"title\", \"url\", \"labels\", \"description\", \"created_at\", \"state\"])\n        .with_hybrid(query=query)\n        .with_limit(max_results)\n        .with_additional([\"score\"])\n        .do()\n    )\n\n    data = response[\"data\"][\"Get\"][\"GitHubIssue\"]\n    return pd.DataFrame.from_dict(data, orient='columns')\n```\n\n## 🚀 Quickstart\n\n1. Clone the repository:\n```\ngit clone git@github.com:dcarpintero/github-semantic-search.git\n```\n\n2. Create and Activate a Virtual Environment:\n\n```\nWindows:\n\npy -m venv .venv\n.venv\\scripts\\activate\n\nmacOS/Linux:\n\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\n3. Install dependencies:\n\n```\npip install -r requirements.txt\n```\n\n4. Ingest Data\n```\npython ./data-pipeline/ingest.py\n```\n\n5. Index Data\n```\npython ./data-pipeline/index.py\n```\n\n6. 
Launch Web Application\n\n```\nstreamlit run ./app.py\n```\n\n## 👩‍💻 Streamlit Web App\n\nDemo Web App deployed to [Streamlit Cloud](https://streamlit.io/cloud) and available at https://gh-semantic-search.streamlit.app/ \n\n## 📚 References\n\n- [Langchain Document Loaders - Github](https://js.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_loaders/github)\n- [Weaviate Vector Search](https://weaviate.io/developers/weaviate/search/similarity)\n- [Weaviate BM25 Search](https://weaviate.io/developers/weaviate/search/bm25)\n- [Weaviate Hybrid Search](https://weaviate.io/developers/weaviate/search/hybrid)\n- [Weaviate Schema Configuration](https://weaviate.io/developers/weaviate/configuration/schema-configuration)\n- [Weaviate - How to efficiently add data objects and cross-references to Weaviate](https://weaviate.io/developers/weaviate/manage-data/import)\n- [Get Started with Streamlit Cloud](https://docs.streamlit.io/streamlit-community-cloud/get-started)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Fgithub-semantic-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcarpintero%2Fgithub-semantic-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Fgithub-semantic-search/lists"}