{"id":23177331,"url":"https://github.com/abdo-reda/searchengineproject","last_synced_at":"2025-10-13T05:07:07.135Z","repository":{"id":150004128,"uuid":"495203430","full_name":"Abdo-reda/SearchEngineProject","owner":"Abdo-reda","description":"This is a backup of the Search Engine Project.","archived":false,"fork":false,"pushed_at":"2022-06-13T12:25:37.000Z","size":5406,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-17T10:57:04.424Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Abdo-reda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-22T23:44:00.000Z","updated_at":"2022-05-22T23:47:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"775571b4-e2a7-4746-939d-b225d008534e","html_url":"https://github.com/Abdo-reda/SearchEngineProject","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abdo-reda%2FSearchEngineProject","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abdo-reda%2FSearchEngineProject/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abdo-reda%2FSearchEngineProject/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abdo-reda%2FSearchEngineProject/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Abdo-reda","downlo
ad_url":"https://codeload.github.com/Abdo-reda/SearchEngineProject/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247276097,"owners_count":20912287,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-18T06:33:06.188Z","updated_at":"2025-10-13T05:07:07.130Z","avatar_url":"https://github.com/Abdo-reda.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SearchEngineProject\n\nThis is a backup of the Search Engine Project. You can find a summary and details of the project in report.pdf.\n\n## Overview\nThis project implements a search engine that indexes and ranks web pages based on user queries. It processes data from CSV files containing website information, links, impressions, and keywords to provide relevant search results.\n\n## Features\n- File indexing and data storage\n- Page ranking using:\n  - Click-Through Rate (CTR)\n  - PageRank algorithm (with damping factor and sink node handling)\n  - Combined page score calculation\n- Search functionality supporting AND, OR, and quoted phrases\n- Results sorting using heap sort\n\n## How to run\nThere are two versions of the project: one with a UI and one that runs in the terminal.\n- You will find executables for both versions.\n\nIn each version, four text/CSV files serve as the main input to the program.\n1. webGraph.csv : Each row represents a direct connection between two websites (hyperlinks).\n2. 
numOfClicks.csv : Each row of this file contains the name of the website and its clicks (default = 0). \n3. numOfImpressions.csv : Each row of this file contains the name of the website and its impressions (default = 0).\n4. keywords.csv : Each row of this file contains the name of the website and its keywords.\n\n\n## Demo\n\n![Demo](./Resources/demo.gif)\n\n\n# Project Report in Markdown\n\n**Author:** Abdelrahman Abdelmonem  \n**ID:** 900192706  \n\n## 1. Pseudo Code\n\n### a. Indexing Algorithm\n\nThe indexing algorithm reads data from CSV files and stores it in suitable data structures.\n\n```c++\nfstream file(\"\")\nif (file.is_open())\n    while (getline(file, myLineString))\n        for (from i=0 to myLineString.size())\n            If (myLineString.at(i) == \",\")\n                Flag = 1;\n                //*do something depending on the file being read\n            else\n                websiteName += myLineString.at(i)\n        //**store the data into suitable data structures\n```\n\n- **Note:** The logic varies depending on the file (e.g., impressions file, graph file). Data is stored in structures like graphs, maps, and arrays.\n\n\n\n### b. Ranking Algorithm\n\nThe ranking algorithm consists of three parts:\n\n#### i. Getting CTR (Click-Through Rate)\n```c++\nfor (int i=0 to numberSites)\n    CTR[i] = clicks[i] / impressions[i]\n```\n- CTR is recalculated during each search to reflect updates.\n\n#### ii. 
Getting PageRank and Normalized PageRank\n- Modified PageRank equation with damping factor (λ = 0.85) and handling for sink nodes:\n  \n![PageRank Equation](./Resources/pageRankEquation.png)\n\n- Iterative algorithm (terminates after 100 iterations or when rank differences \u003c 0.01):\n```c++\nfor (int i=0 to size)\n    prevRnk[i] = 1.0 / size;\n    if (adjList[i].size() == 0)\n        sinkNodes.push_back(i);\n\nfor (int i=0 to 100)\n    // Calculate sinkNodesRnk\n    for (int j=0 to sinkNodes.size())\n        sinkNodesRnk += prevRnk[sinkNodes[j]];\n\n    // Update ranks\n    for (int j=0 to N)\n        for (int k=0 to reverseAdjList[j].size())\n            tempSum += prevRnk[tempNode] / adjList[tempNode].size();\n        tempSum = (1.0 - dampingFactor) + (dampingFactor * tempSum) + (dampingFactor * (sinkNodesRnk / size));\n        difference[j] = abs(prevRnk[j] - tempSum);\n        currRnk[j] = tempSum;\n        tempSum = 0;\n\n    // Break if difference \u003c 0.01\n```\n\n#### iii. Getting Final PageScore\n```c++\nfor (int i=0 to numberSites)\n    double tempOne = (0.1 * impressions[i]) / (1 + (0.1 * impressions[i]));\n    pageScore[i] = 0.4 * normPageRank[i] + ((1 - tempOne) * normPageRank[i] + tempOne * CTR[i]) * 0.6;\n\n// Sort using heapSort\nint* ordPageRank = new int[numSites];\nfor (int i=0 to numSites)\n    ordPageRank[i] = i;\nheap_sort(ordPageRank, numSites, pageScore);\n```\n\n\n\n## 2. Complexity Analysis\n\n### a. Indexing Algorithms\n- **Space Complexity:**\n  - Web graph file: \\(O(n^2)\\) (adjacency list for complete graph).\n  - Impressions/clicks files: \\(O(n)\\) (arrays).\n  - Keywords file: \\(O(nm)\\) (vector of vectors).\n- **Time Complexity:** \\(O(nm)\\) (reading files line by line).\n\n### b. 
Ranking Algorithms\n- **Space Complexity:** \\(O(n)\\) (arrays for CTR, PageRank, PageScore).\n- **Time Complexity:**\n  - CTR: \\(O(n)\\).\n  - PageRank: \\(O(n^2)\\) (worst-case complete graph).\n  - PageScore: \\(O(n \\log n)\\) (heapSort).\n- **Total Time Complexity:** \\(O(n^2)\\).\n\n\n## 3. Main Data Structures\n- **Arrays:** Store clicks, impressions, ranks, etc.\n- **Vectors (STL):** Used for adjacency lists, keywords, and search results.\n- **Unordered_map (Hash Table):** Maps website names to IDs.\n- **Heaps:** Implemented for heapSort.\n- **Graph Class:** Uses vectors for adjacency and reverse adjacency lists.\n- **Edge (Struct):** Represents graph edges.\n\n\n## 4. Design Tradeoffs/Limitations\n- **Assumptions:**\n  - All websites appear in either the web graph or impressions file.\n  - Keywords file does not contain websites missing from the above files.\n- **UI Limitations:**\n  - Search query length is limited for simplicity.\n  - Complex queries (e.g., mixed AND/OR/quotes) may yield unexpected results.\n- **Search Logic:** Words are evaluated based on preceding operators (AND/OR).\n\n\n## 5. References\n- [How Search Engines Operate](https://moz.com/beginners-guide-to-seo/how-search-engines-operate)\n- [PageRank Video](https://www.youtube.com/watch?v=_Wc9OkMKS3g)\n- [Click-Through Rate](https://www.wordstream.com/click-through-rate)\n- [Wikipedia: CTR](https://en.wikipedia.org/wiki/Click-through_rate)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdo-reda%2Fsearchengineproject","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabdo-reda%2Fsearchengineproject","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdo-reda%2Fsearchengineproject/lists"}