https://github.com/rudrakshi99/web_crawler


# Web Crawler 🕸

A web crawler, **spider** 🕷, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it is needed. These bots are called "web crawlers" because crawling is the technical term for automatically accessing a website and obtaining data via a software program.

Web crawlers go by many names, including spiders, robots, and bots, and these descriptive names sum up what they do: they crawl across the World Wide Web to index pages for search engines.

Search engines don't magically know what websites exist on the Internet. Their crawlers have to visit and index pages before the engine can return the right pages for the keywords and phrases people type when they search.
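To make "indexing" concrete, here is a minimal sketch of an inverted index, a common structure that maps each word to the pages containing it. It is not taken from this repository: the names `build_index` and `search` are hypothetical, the pages are assumed to be plain text already, and real search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    `pages` is assumed to be a dict of {url: plain_text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    pages = {
        "http://example.test/a": "Python web crawler tutorial",
        "http://example.test/b": "Cooking with a spider pan",
    }
    index = build_index(pages)
    print(search(index, "web crawler"))
```

Looking a query up in the index is a set intersection, which is why the crawl and index steps have to happen before any search can be answered.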

# How does a web crawler work?

The Internet is constantly changing and expanding. Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
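The seed-and-follow process described above can be sketched as a breadth-first crawl using only the Python standard library. This is an illustrative sketch, not this repository's implementation: `crawl`, `fetch`, `LinkExtractor`, and the `max_pages` limit are all hypothetical names, and a real crawler would also respect `robots.txt`, rate-limit its requests, and handle non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl starting from the seed URLs.

    Returns a dict mapping each visited URL to its HTML.
    """
    seeds = list(seeds)
    frontier = deque(seeds)  # URLs waiting to be crawled
    seen = set(seeds)        # URLs already queued, to avoid repeats
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that cannot be downloaded
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)

    return pages
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and `max_pages` bounds the crawl because the list of discovered URLs can otherwise keep growing. A call such as `crawl(["https://example.com/"], max_pages=10)` would start from a single seed.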

Search engines crawl, or visit, sites by following the links between pages. However, if you have a new website that no other page links to yet, you can ask Google to crawl it by submitting your URL in Google Search Console. Other search engines offer similar webmaster tools.