{"id":25602551,"url":"https://github.com/mallickboy/python_search_engine","last_synced_at":"2025-04-13T08:55:10.286Z","repository":{"id":226093365,"uuid":"767730564","full_name":"mallickboy/Python_Search_Engine","owner":"mallickboy","description":"A domain-specific Python search engine leveraging Flask, Pinecone, and Sentence Transformers for semantic search. Deployed on Azure with Gunicorn, Nginx, and SSL for secure and scalable performance.","archived":false,"fork":false,"pushed_at":"2025-03-12T08:16:27.000Z","size":14557,"stargazers_count":5,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"version2.0","last_synced_at":"2025-04-04T08:03:03.282Z","etag":null,"topics":["azure","flask","gunicorn-with-flask-rest-api","huggingface-transformers","nginx","pinecone","pyhon","search-engine","semantic-search-engine","sentence-transformers"],"latest_commit_sha":null,"homepage":"http://pysearch.mallickboy.com/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mallickboy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-05T19:52:57.000Z","updated_at":"2025-02-11T10:11:32.000Z","dependencies_parsed_at":"2024-07-23T22:23:28.377Z","dependency_job_id":null,"html_url":"https://github.com/mallickboy/Python_Search_Engine","commit_stats":null,"previous_names":["mallickboy/domain_specific_search_engine","mallickboy/python_search_engine"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mallickboy%2FPython_Sear
ch_Engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mallickboy%2FPython_Search_Engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mallickboy%2FPython_Search_Engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mallickboy%2FPython_Search_Engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mallickboy","download_url":"https://codeload.github.com/mallickboy/Python_Search_Engine/tar.gz/refs/heads/version2.0","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248688544,"owners_count":21145763,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","flask","gunicorn-with-flask-rest-api","huggingface-transformers","nginx","pinecone","pyhon","search-engine","semantic-search-engine","sentence-transformers"],"created_at":"2025-02-21T17:01:33.798Z","updated_at":"2025-04-13T08:55:10.264Z","avatar_url":"https://github.com/mallickboy.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\nPython Search Engine 2.0 Server setup\n\u003c/h1\u003e\n\n### Pull the code\n\n``` mkdir pysearch  ```\n\n``` cd pysearch  ```\n\n``` git clone https://github.com/mallickboy/Python_Search_Engine.git ```\n\n``` cd Python_Search_Engine ```\n\n``` git checkout version2.0 ```\n\n### Create virtual environment \n\n``` sudo apt install python3.9 python3.9-venv python3.9-distutils ```\n\n``` python3.9 -m venv pysearch ```\n\n### Activate virtual environment (from 
parent folder)\n\n```Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass```   \u0026\n\n``` .\\search_engine\\Scripts\\activate ``` or\n\n``` source pysearch/bin/activate ```\n\n### Install PyTorch (lightweight CPU version only)\n\n``` pip install torch --index-url https://download.pytorch.org/whl/cpu  ```\n\n### Install required libraries\n\n``` pip install -r requirements.txt  ```\n\n### Open firewall \u0026 update inbound port 8000 in Azure\n\n``` sudo ufw enable ```\n\n``` sudo ufw allow 8000 ```\n\n``` sudo ufw status ```\n\n### Run and Test \n\n``` python app.py  ```\n\nVisit http://{your_server_ip}:8000\n\n\n\u003ch2 align=\"center\"\u003e\nAdding SSL, NGINX and deployment using Gunicorn\n\u003c/h2\u003e\n\n### Deploy the site using gunicorn\n\n``` gunicorn -w 4 -b 0.0.0.0:8000 app:app ```\n\n``` sudo nano /etc/systemd/system/pysearch.service ```  (paste the contents of server_setup/pysearch.service)\n\n``` sudo systemctl daemon-reload ```\n\n``` sudo systemctl restart pysearch ```\n\n``` sudo systemctl enable pysearch ```\n\n``` sudo systemctl status pysearch ```\n\n### Setup Domain/ Sub-Domain\n\nAdd a new record under Advanced DNS at your domain provider\n\n``` Type = \"A Record\"   Host= \"pysearch\"    Value = \"server public ip\"  TTL = \"Automatic\" ```  ( Sub-Domain )\n\n``` Type = \"A Record\"   Host= \"@\"    Value = \"server public ip\"  TTL = \"Automatic\" ```         ( Domain )\n\n### Installing SSL certificate\n\n``` sudo apt install certbot python3-certbot-nginx ```\n\n``` sudo certbot --nginx -d pysearch.mallickboy.com ```\n\n``` sudo certbot certificates ```\n\n``` sudo systemctl status certbot.timer ```  (check auto-renewal)\n\n### Set-Up NGINX\n\n``` sudo nano /etc/nginx/sites-available/pysearch.mallickboy.com  ```  (copy contents of server_setup/pysearch.mallickboy.com)\n\n``` sudo nano /etc/nginx/sites-available/default  ```  (copy contents of server_setup/default)\n\n``` sudo ln -s /etc/nginx/sites-available/pysearch.mallickboy.com 
/etc/nginx/sites-enabled/ ```  \n\n``` sudo nginx -t  ```\n\n``` sudo systemctl reload nginx ```\n\n``` sudo systemctl restart nginx ```\n\n### Visit \n\n[ https://pysearch.mallickboy.com ](https://pysearch.mallickboy.com)\n\n\n\n\u003ch1 align=\"center\"\u003e\nOutput View\n\u003c/h1\u003e\n**Pinecone Side Vectors** \n![pinecone](https://github.com/user-attachments/assets/0bd8a37f-510c-471e-9206-76135b905bd1)\n\n**Client Side Search Results :**\n\n![Screenshot 2024-04-02 211102](https://github.com/user-attachments/assets/0919f4ca-4dc4-4ca4-9ccb-b65507d44f09)\n\n![Screenshot 2024-04-02 210855](https://github.com/user-attachments/assets/11d6e250-1818-4ec9-aeaf-4e146a0fcb55)\n\n![Screenshot 2024-04-02 210658](https://github.com/user-attachments/assets/4a87cff9-89ff-42fc-887b-ea1a1aee1765)\n\n\n\n**Server Side Messages :**\n\n![Screenshot 2024-04-02 211726](https://github.com/user-attachments/assets/d5ee1091-24ac-48b5-9253-3120f3f6305d)\n\n![image](https://github.com/user-attachments/assets/750f1b92-2595-4397-9302-3f9d180d5724)\n\n\u003ch1 align=\"center\"\u003e\nDesign and Discussion\n\u003c/h1\u003e\n\n**Group Members:** Tamal Mallick, Sushanta Das, Suvam Manna \n\n**Problem Description**:\n\nWe built a **search engine for a specific domain** (Python) using **web crawling, socket programming, sentence embedding and a vector database** to return results relevant to that domain. The user gets the results of a search query based on a **cosine similarity search** in the vector database. We also implemented and integrated the **Knapsack Cryptosystem** to securely transmit the user query and search results over the network.\n\n**Algorithm and Design:**\n\n**1) Collecting web pages to make our search engine database**\n\ni) We implemented multithreaded web crawling, starting from a few seed links containing the keyword “Python”, to collect webpage links and then metadata (such as the title, heading tags and some paragraphs) to gather valuable information about each link. 
This information is later used to search for the webpage URL.\n\nii) We collect only valid links/webpages, ignoring links that return a status code between 400 and 499.\n\niii) The data is stored in a vector database (Pinecone) as embeddings (high-dimensional vectors, 768 dimensions). When a query is searched, the database returns the most similar results (each with a title, link and description).\n\n**2) Building and running the webserver**\n\ni) The webserver is implemented using socket programming with multithreading, so it can handle multiple HTTP requests from clients.\n\nii) Once the server is running, when anyone opens the server URL in a web client (browser), the server first sends the required HTML, CSS and JavaScript files to run the main client program and display the frontend.\n\niii) Once the dedicated client program (client.js) is running, in secure mode a public key and a private key are generated using the knapsack cryptographic algorithm. The server and client first exchange their public keys. The server then continues to listen for search requests from the client.\n\niv) From then on, whenever the server receives a search request from the client, it performs a cosine similarity search on the vector database/local file and sends the top results to the client. In secure mode, encryption is performed before each send operation and decryption after each receive operation on the socket.\n\n**3) Searching some query**\n\ni) The user visits the link at which the server is running (for example, \u003chttp://192.168.29.37:8080/\u003e). A webpage opens with an input box and a search button. In secure mode the client receives the server’s public key.\n\nii) When the user types a search query and hits the submit button, the client sends the query to the server. The server has an endpoint for accepting POST requests (POST /submit HTTP/1.1). 
The server accepts the request.\n\niii) The server then calls a function that searches the vector database and finally sends the relevant search results to the client.\n\niv) The results are displayed on the client’s webpage.\n\n\n**Future Work**\n\nFuture enhancements to the search engine may include increasing search speed, understanding and collecting more valuable metadata, providing a summary of the top results, refining the search algorithms for better accuracy and relevance, expanding the search database to cover a broader range of Python-related content, integrating advanced security features to mitigate emerging threats, and incorporating user feedback to continuously improve the user experience and functionality.\n\n**Conclusion**\n\nThis domain-specific search engine for Python gives users a platform for finding relevant information within the Python programming ecosystem. By integrating web crawling, a vector database, socket programming and cryptographic protocols, it delivers fast and accurate search results while also securing user interactions. In addition, multithreading and query-based client holding let us save resources and serve a large number of clients at a time.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmallickboy%2Fpython_search_engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmallickboy%2Fpython_search_engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmallickboy%2Fpython_search_engine/lists"}