{"id":26657033,"url":"https://github.com/shreyas9699/pagerank","last_synced_at":"2026-05-02T14:41:26.657Z","repository":{"id":193237744,"uuid":"292263504","full_name":"Shreyas9699/Pagerank","owner":"Shreyas9699","description":null,"archived":false,"fork":false,"pushed_at":"2020-09-10T11:00:58.000Z","size":8261,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2023-09-07T10:24:30.225Z","etag":null,"topics":["crawling","pagerank","python","sqlite"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Shreyas9699.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-09-02T11:25:47.000Z","updated_at":"2023-09-07T10:24:32.545Z","dependencies_parsed_at":"2023-09-07T10:49:56.654Z","dependency_job_id":null,"html_url":"https://github.com/Shreyas9699/Pagerank","commit_stats":null,"previous_names":["shreyas9699/pagerank"],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shreyas9699%2FPagerank","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shreyas9699%2FPagerank/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shreyas9699%2FPagerank/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shreyas9699%2FPagerank/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Shreyas9699","download_url":"https://codeload.github.com/Shreyas9699/Pagerank/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":245423247,"owners_count":20612749,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","pagerank","python","sqlite"],"created_at":"2025-03-25T08:16:35.077Z","updated_at":"2026-05-02T14:41:26.620Z","avatar_url":"https://github.com/Shreyas9699.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Simple Python Search Spider, Page Ranker, and Visualizer\n\nThis is a set of programs that emulate some of the functions of a \nsearch engine.  They store their data in a SQLite3 database named\n'spider2.sqlite'.  This file can be removed at any time to restart the\nprocess.   \n\nYou should install the SQLite browser to view and modify \nthe databases from:\n\nhttp://sqlitebrowser.org/\n\nThis program crawls a web site and pulls a series of pages into the\ndatabase, recording the links between pages.\n\nNote: If you want to build your own database, first remove spider2.sqlite from the downloaded files.\n\ndel spider2.sqlite\nspider.py\n\nEnter web url or enter: Your choice (make sure the website does not restrict crawling; some websites do not allow it)\nHow many pages: #\n\nIf you restart the program and tell it to crawl more\npages, it will not re-crawl any pages already in the database.  Upon \nrestart it goes to a random non-crawled page and starts there.  So \neach successive run of spider.py is additive.\n\nspider.py\n\nYou can have multiple starting points in the same database - \nwithin the program these are called \"webs\".   
The spider\nchooses randomly amongst all non-visited links across all\nthe webs.\n\nIf you want to dump the contents of the spider2.sqlite file, you can \nrun spdump.py as follows:\n\nspdump.py\n\n(352, 13.965677009585145, 13.796689370187737, 1, 'https://timesofindia.indiatimes.com')\n(190, 7.534892228627917, 6.979403454996037, 293, 'https://timesofindia.indiatimes.com/entertainment/latest-new-movies/hindi-movies')\n(169, 7.943160468934033, 8.248922043159736, 2, 'https://timesofindia.indiatimes.com/rss.cms')\n(70, 6.722738089304296, 7.012401129969752, 592, 'https://timesofindia.indiatimes.com/newsletterhome.cms')\n(67, 5.335018299718141, 4.86370182006613, 765, 'https://timesofindia.indiatimes.com/entertainment/hindi/movie-details/nikamma/movieshow/70326484.cms')\n128 rows.\n\nThis shows the number of incoming links, the old page rank, the new page\nrank, the id of the page, and the url of the page.  The spdump.py program\nonly shows pages that have at least one incoming link to them.\n\nOnce you have a few pages in the database, you can run Page Rank on the\npages using the sprank.py program.  
You simply tell it how many Page\nRank iterations to run.\n\nMac: python3 sprank.py \nWin: sprank.py \n\nHow many iterations:2\n1 0.546848992536\n2 0.226714939664\n[(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)]\n\nYou can dump the database again to see that page rank has been updated:\n\nMac: python3 spdump.py \nWin: spdump.py \n\n(352, 13.965677009585145, 13.796689370187737, 1, 'https://timesofindia.indiatimes.com')\n(190, 7.534892228627917, 6.979403454996037, 293, 'https://timesofindia.indiatimes.com/entertainment/latest-new-movies/hindi-movies')\n(169, 7.943160468934033, 8.248922043159736, 2, 'https://timesofindia.indiatimes.com/rss.cms')\n(70, 6.722738089304296, 7.012401129969752, 592, 'https://timesofindia.indiatimes.com/newsletterhome.cms')\n(67, 5.335018299718141, 4.86370182006613, 765, 'https://timesofindia.indiatimes.com/entertainment/hindi/movie-details/nikamma/movieshow/70326484.cms')\n128 rows.\n\nYou can run sprank.py as many times as you like and it will simply refine\nthe page rank the more times you run it.  You can even run sprank.py a few times\nand then go spider a few more pages with spider.py and then run sprank.py\nto converge the page ranks.\n\nIf you want to restart the Page Rank calculations without re-spidering the \nweb pages, you can use spreset.py\n \nspreset.py \n\nAll pages set to a rank of 1.0\n\nsprank.py \n\nHow many iterations:50\n1 0.546848992536\n2 0.226714939664\n3 0.0659516187242\n4 0.0244199333\n5 0.0102096489546\n6 0.00610244329379\n...\n42 0.000109076928206\n43 9.91987599002e-05\n44 9.02151706798e-05\n45 8.20451504471e-05\n46 7.46150183837e-05\n47 6.7857770908e-05\n48 6.17124694224e-05\n49 5.61236959327e-05\n50 5.10410499467e-05\n[(512, 0.02963718031139026), (1, 12.790786721866658), (2, 28.939418898678284), (3, 6.808468390725946), (4, 13.469889092397006)]\n\nFor each iteration of the page rank algorithm it prints the average\nchange per page of the page rank.   
The network initially is quite \nunbalanced and so the individual page ranks are changing wildly.\nBut in a few short iterations, the page rank converges.  You \nshould run sprank.py long enough that the page ranks converge.\n\nIf you want to visualize the current top pages in terms of page rank,\nrun spjson.py to write the pages out in JSON format to be viewed in a\nweb browser.\n \nspjson.py \n\nCreating JSON output on spider.js...\nHow many nodes? 30\nOpen force.html in a browser to view the visualization\n\nYou can view this data by opening the file force.html in your web browser.  \nThis shows an automatic layout of the nodes and links.  You can click and \ndrag any node and you can also double click on a node to find the URL\nthat is represented by the node.\n\nThis visualization is provided using the force layout from:\n\nhttp://mbostock.github.com/d3/\n\nIf you rerun the other utilities and then re-run spjson.py - you merely\nhave to press refresh in the browser to get the new data from spider.js.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshreyas9699%2Fpagerank","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshreyas9699%2Fpagerank","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshreyas9699%2Fpagerank/lists"}