{"id":19575106,"url":"https://github.com/nitsas/simple-web-search-engine","last_synced_at":"2026-05-17T14:34:30.487Z","repository":{"id":23277493,"uuid":"26636199","full_name":"nitsas/simple-web-search-engine","owner":"nitsas","description":null,"archived":false,"fork":false,"pushed_at":"2014-11-14T15:54:50.000Z","size":132,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-26T11:25:21.724Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nitsas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-14T11:41:59.000Z","updated_at":"2014-11-14T12:47:41.000Z","dependencies_parsed_at":"2022-08-21T20:50:39.649Z","dependency_job_id":null,"html_url":"https://github.com/nitsas/simple-web-search-engine","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nitsas/simple-web-search-engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitsas%2Fsimple-web-search-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitsas%2Fsimple-web-search-engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitsas%2Fsimple-web-search-engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitsas%2Fsimple-web-search-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nitsas","download_url":"https://codeload.github.com/nitsas/simple-web-search-engine/tar.gz/refs
/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitsas%2Fsimple-web-search-engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33142252,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T09:28:26.183Z","status":"ssl_error","status_checked_at":"2026-05-17T09:27:52.702Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T06:45:46.083Z","updated_at":"2026-05-17T14:34:30.457Z","avatar_url":"https://github.com/nitsas.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"INTRO\n=====\n\nA **very** simple web search engine written in Python 2.\n\nOriginally created for my *Language Technology* course project, at the \nComputer Engineering and Informatics Department, University of Patras, around\n2010.\n\n***\n\nUSAGE\n=====\n\nThere are three basic ways to use the software:\n\n- with existing index file, jump to making queries\n- create index file from set of downloaded webpages\n- crawl first and then create index file\n\nWith existing index file, jump to making queries\n------------------------------------------------\n\nPrerequisites:\n\n- an index file (default name `index.xml`)\n- the url-map (default name `urls.pickle`)\n\nJust run:\n\n    python 
evaluate_index.py -i \u003cindex_file\u003e -u \u003curl_map_file\u003e\n\nOr, if you are using the default file names, just:\n\n    python evaluate_index.py\n\nCaution: this last command will actually start at the crawling step if it\ncan't find the index and url-map.\n\n`evaluate_index.py` will load the index file and url-map into memory and give \nyou a prompt to start issuing search queries.\n\nCreate index file from set of downloaded webpages\n-------------------------------------------------\n\nPrerequisites:\n\n- a directory containing '.html' files (default `./html/` directory)\n\nJust run:\n\n    python preprocessor.py\n    python indexer.py\n\nThe preprocessor will clean and tokenize every '.html' file in the given\ndirectory (let's call it `\u003cwebpages_dir\u003e`), and store the tokenized webpages \ninside a `./tokenized/` directory; page `\u003cwebpages_dir\u003e/x.html` will be \nstored as `./tokenized/x.txt` after the tokenization.\n\nCrawl first and then create index file\n--------------------------------------\n\nPrerequisites:\n\n- nothing!\n\nJust run:\n\n    python crawler.py\n    python preprocessor.py\n    python indexer.py\n\nThe crawler will start crawling webpages, starting from a default set of five\n*seed* webpages, and save them inside a `./html/` directory. Each webpage\nmust pass a set of default requirements to be saved. Some of the default\nrequirements are:\n\n- page must be cacheable, i.e. no `no-store` directive in the `cache-control`\n    header\n- page length must be at least 40000 characters, including html tags\n- must be a `text/html` page\n- language must be English, i.e. `content-language` must be `en`\n\nThe crawler extracts links to visit next from every page it crawls, but\nthere are some links it does not follow. Default link requirements are:\n\n- only follow `http://` links, i.e. 
no `ftp://`, `mailto:` etc. links\n- only crawl `.com` and `.co.uk` urls (no `.gov` etc. urls)\n- block `twitter.com`, `facebook.com`, `wikipedia` and `imdb` urls\n- only follow urls ending in `.html`, `.htm` or `/`\n\nThe crawler will by default crawl until it has exactly 1000 pages (or until it runs\nout of links).\n\nI might allow the user to change the defaults via command-line parameters and\nconfiguration files in the future, if I find the time. Don't count on it.\n\nAfter the crawler finishes, the preprocessor and indexer will process all \n`.html` pages inside the `./html/` directory, as described earlier.\n\nOnce the whole process ends, the user can start querying the index by\nrunning the command:\n\n    python evaluate_index.py\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnitsas%2Fsimple-web-search-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnitsas%2Fsimple-web-search-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnitsas%2Fsimple-web-search-engine/lists"}