{"id":22887685,"url":"https://github.com/hybridx/webscraper","last_synced_at":"2025-05-07T11:21:46.179Z","repository":{"id":104227234,"uuid":"134318651","full_name":"hybridx/WebScraper","owner":"hybridx","description":"webcrawler made from Beautiful soup ","archived":false,"fork":false,"pushed_at":"2023-04-05T07:41:22.000Z","size":4968,"stargazers_count":5,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-31T09:37:41.294Z","etag":null,"topics":["crawler","flask","google-dorks","javascript","python3","search-engine"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hybridx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-21T19:57:45.000Z","updated_at":"2023-04-05T07:41:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"f797e01d-3d31-43cd-bca9-f0a9e772843c","html_url":"https://github.com/hybridx/WebScraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hybridx%2FWebScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hybridx%2FWebScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hybridx%2FWebScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hybridx%2FWebScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hybridx","download_url":"ht
tps://codeload.github.com/hybridx/WebScraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252866123,"owners_count":21816397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","flask","google-dorks","javascript","python3","search-engine"],"created_at":"2024-12-13T20:37:51.614Z","updated_at":"2025-05-07T11:21:46.170Z","avatar_url":"https://github.com/hybridx.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WebScraper\n\nPROBLEM DEFINITION\n\nThe web creates challenges for information retrieval. The amount of information on the web is growing rapidly, and so is the number of new users who are inexperienced in web research, since people work in many different domains. People are likely to surf the web using high-quality, human-maintained indices such as Yahoo, or with search engines. Google takes it a step further with crawling and PageRank. Because Google crawls these pages, it also reaches links that give direct access to media files and text documents, which can be found by using Google dorks. But searching in this manner proves difficult for many users.\n\n\nWebScraper  \nWebScraper is an application that crawls these webpages and stores their links, which prove useful later for downloading.\nThis is where the system proves smart: it stores only the links and not the entire files.\n    • Crawler\nRunning a web crawler is a challenging task. 
There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers, which are all beyond the control of the system. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems stopped the system.\n\nExisting system \u0026 need for new system\n\nWe are up against the likes of Google, but Google dorks still aren’t as usable as WebScraper.\n\nNeed For New System:\nMaking the same things easier and more usable is another way of improving ease of access. Storing only the links to these media files has been a great success, since the files behind those links remain easily downloadable. \n\nScope of the System\nThis system focuses only on media links that are hosted on different websites. The system ignores content unless it is a media file or a text-based document (pdf, docx, etc.).\n\n\nFEASIBILITY STUDY\n    1) Technical feasibility:  \nThe technologies used in the project are well documented and tested. The consistency provided by Beautiful Soup (https://pypi.org/project/beautifulsoup4/) has been really helpful for the developers.\nThe technology used (i.e. Python) enables rapid development and has huge community support.  \n2)    Economic Feasibility:\nThe system is ever-growing because the crawler continuously crawls pages and stores them in the database. The application could easily be monetised through advertising, which would help pay for the system’s hardware resources. \n\n3)    Operational feasibility:\n  The developed project is a web application. Basic computer knowledge is enough for the user to use the application. 
The application runs only in the browser as a web page and doesn’t affect the execution of other programs. No special permissions or setup installation are required. No special training is required for the user; hence the proposed system is operationally feasible.\n\n\nHardware \u0026 Software Setup Requirements (User):-\n\n      Software (min):\n          1. Browser (e.g. Firefox, Chrome)\n      Hardware (min):\n          • 1 GB RAM (to support heavy browsers)\n          • 20 GB HDD (to support newer operating systems)\n          • Intel P4 or above\n\nHardware \u0026 Software Setup Requirements (System Development):-\n\n    Software (min):\n          1 Python 3.5 and above\n          2 Beautiful Soup 4\n          3 MongoDB\n\n    Hardware (min):\n        • 1 GB RAM (to support heavy browsers) (4 GB recommended)\n        • 20 GB HDD (to support newer operating systems)\n        • Storage for MongoDB according to how much the system will scale  \n        • Intel Core 2 Duo processor and above\n\n\nTesting\n The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond\n the scope of this paper, our own experience with WebScraper has shown it to produce good results for media search, though it has a lot\n of room for improvement. The number of results is considerably small. \nAside from search quality, WebScraper is designed to scale cost-effectively to the size of the Web as it grows.\nFrom just three URLs, our crawler was able to index approximately 15000 links. It is important for a search engine to crawl and\nindex efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly.\n\n\n\nCONCLUSION\n\nWebScraper is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly\ngrowing World Wide Web. 
\n\n\n\nREFERENCE\n\n\n    • http://stackoverflow.com\n    • https://use-the-index-luke.com/sql/testing-scalability\n    • http://infolab.stanford.edu/~backrub/google.html\n    • https://pypi.org/project/beautifulsoup4/\n    \n## TODO\n- https://typesense.org/downloads/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhybridx%2Fwebscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhybridx%2Fwebscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhybridx%2Fwebscraper/lists"}