Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2025-01-24 00:06:40 UTC
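The definition above ("systematically browses the World Wide Web") can be illustrated with a minimal sketch. It is not taken from any project listed on this page. The names `LinkParser`, `crawl`, `fetch`, and `PAGES` are hypothetical, and the `example.test` URLs are placeholders. An in-memory dictionary stands in for real HTTP requests so the example runs offline. A real crawler would also need robots.txt handling, rate limiting, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit pages reachable from start_url and return them in visit order.

    `fetch` is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page cannot be retrieved.
    """
    seen = {start_url}
    frontier = deque([start_url])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# A tiny in-memory "web" (an assumption for illustration only).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and passing `fetch` as a parameter keeps the traversal logic separate from the network layer.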
https://github.com/hudson-newey/user-web-crawler
The Archive.org Crawler relies on volunteer users who install a browser extension. When a user visits a webpage, its URL is anonymously added to the Archive.org database.
Last synced: 10 Jan 2025
https://github.com/geoffreybauduin/website-checker
Performs useful checks against a website, such as 404 error reporting, structured data validation...
crawler seo structured-data web-spider website
Last synced: 25 Dec 2024
https://github.com/dylanhogg/cloud-products
A package for getting cloud products and product descriptions from a cloud provider website.
aws cloud-products crawler data text-processing
Last synced: 23 Jan 2025
https://github.com/schbenedikt/web-crawler
A simple web crawler using Python that stores the metadata of each web page in a database.
crawler database mariadb mysql python python-crawler web
Last synced: 08 Nov 2024
https://github.com/kahsolt/allchan
An image crawler for xChan (4chan/8ch/...) image boards.
4chan 4chan-downloader 8chan crawler image-crawler
Last synced: 03 Jan 2025
https://github.com/christopher-besch/therapy_search
Compute Call Times from arztsuche-bw into a Calendar.
appointments calendar crawler gatsby therapy time-management typescript
Last synced: 28 Dec 2024
https://github.com/soakit/book-download
book-download
crawler html2epub nodejs novel-downloader
Last synced: 28 Dec 2024
https://github.com/yordadev/fenrisjs
A Node.js application that scrapes links from a given input and writes the results to one of two files, external or internal, for further analysis.
analysis crawler link-collection link-crawler nodejs nodejs-application
Last synced: 10 Jan 2025
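As a rough illustration of the internal/external split described in the entry above, here is a minimal sketch. It is written in Python rather than Node.js and is not the fenrisjs implementation. The function name `split_links` and the rule that a link is internal when its host equals the base page's host are assumptions. Writing the two lists to files is left out.

```python
from urllib.parse import urljoin, urlparse


def split_links(base_url, hrefs):
    """Return (internal, external) lists of absolute URLs.

    Assumption: a link is "internal" when its host matches base_url's host.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = split_links(
    "https://example.test/docs/",
    ["intro.html", "/about", "https://other.test/x", "mailto:me@example.test"],
)
print(internal)
print(external)
```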
https://github.com/woorim960/nate.com-comments-crawler
nate.com-comments-crawler
chromedriver crawler python3 selenium
Last synced: 28 Dec 2024
https://github.com/projectx3193275578/prjctxx8264
A simple, open-source, easy to use, and free download manager for malware samples.
crawler downloader malware manager samples
Last synced: 05 Jan 2025
https://github.com/droiddevgeeks/nodelearning
This is a Node learning demo. It covers all the basics of Node.
crawler database ejs ejs-express mcv middleware-nodes mongodb node node-module nodejs nodemailer npm-package router sign
Last synced: 13 Jan 2025
https://github.com/davideferre/covid19-data-crawler-ita
COVID-19 Italian data crawler
coronavirus covid19 crawler hacktoberfest hacktoberfest2021 python
Last synced: 11 Jan 2025
https://github.com/khadkarajesh/aptoide
Aptoide app crawler using beautifulsoup
beautifulsoup4 crawler flask python3 web-application
Last synced: 13 Jan 2025
https://github.com/orafaelfragoso/itunes-crawler
Retrieves information about an artist by crawling the iTunes API and iTunes Page
Last synced: 19 Dec 2024
https://github.com/suddi/fundscraper
Collection of web crawlers to scrape fund data using Scrapy
Last synced: 11 Oct 2024
https://github.com/zabuzard/wslotter
WSlotter is a Selenium driven tool for assigning to events on 'https://www.gruppe-w.de'.
Last synced: 12 Jan 2025
https://github.com/andmerk93/scrapy_parser_pep
A learning project built with Scrapy that parses PEPs and outputs the results in two formats
Last synced: 24 Jan 2025
https://github.com/dangdungcntt/crawl-fb-v2
Simple script to detect emails and phone numbers in Facebook comments.
Last synced: 18 Jan 2025
https://github.com/naveenaidu/google-crawler
Google Crawler - Curates the search results
Last synced: 18 Jan 2025
https://github.com/karantyagi/web-crawler
BFS and DFS implementations for a wikipedia crawler
Last synced: 12 Jan 2025
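The difference between the two strategies named in the entry above can be sketched on a small link graph. This is not code from that repository. The `traverse` function and the `LINKS` graph are hypothetical, and a dictionary of page names stands in for real Wikipedia pages.

```python
from collections import deque


def traverse(graph, start, strategy="bfs"):
    """Return pages in visit order; graph maps a page to its outgoing links."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        links = graph.get(page, [])
        if strategy == "dfs":
            # Push in reverse so the first link on the page is explored first.
            links = reversed(links)
        frontier.extend(link for link in links if link not in seen)
    return order


LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": ["A"],  # back link, ignored once A has been seen
    "E": [],
}

bfs_order = traverse(LINKS, "A", "bfs")
dfs_order = traverse(LINKS, "A", "dfs")
print(bfs_order)  # ['A', 'B', 'C', 'D', 'E']
print(dfs_order)  # ['A', 'B', 'D', 'C', 'E']
```

Breadth-first order visits every page one click from the start before any page two clicks away, while depth-first order follows one chain of links to its end before backtracking.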
https://github.com/par7133/splash-bot-crawler
Splash Bot creates splash on the fly of your websites - GPL License 🔥
bot crawler gallery open-source opensource php splash
Last synced: 12 Jan 2025
https://github.com/hoishing/selenium-crawler
A web crawler written in Python, powered by Selenium and Tesseract OCR
Last synced: 18 Jan 2025
https://github.com/mmqnym/pyppeteer-use-case
Shows how to crawl the web with pyppeteer
crawl crawler pyppeteer python
Last synced: 18 Jan 2025
https://github.com/fa7ad/aiub-notes-dl
Download all notes from AIUB's portal
Last synced: 24 Oct 2024
https://github.com/buren/site_health
Crawl a site and check various health indicators
Last synced: 28 Oct 2024
https://github.com/mahmoudgalalz/pupt
A starter for web crawling using Puppeteer
Last synced: 05 Jan 2025
https://github.com/somnisomni/trawler-csharp
The successor of https://github.com/somnisomni/twitter-account-data-crawler, written in .NET C#
crawler crawling csharp dotnet follower-tracker selenium selenium-csharp twitter twitter-crawler twitter-crawling twitter-scraper
Last synced: 05 Jan 2025
https://github.com/knourian/freelancer.com-category-scrapping
Scraping categories from Freelancer.com using Scrapy, with the number of projects for each category
crawler freelancer python3 scrapy web-crawler
Last synced: 05 Jan 2025
https://github.com/chunkingz/youtubelinks-scraper
A Python script that scrapes YouTube links from a predefined website of choice.
crawler python scraper spider websitescraper youtube
Last synced: 02 Jan 2025
https://github.com/loggerhead/dianping_crawler
A Dianping crawler based on Scrapy (Python 3.5)
Last synced: 24 Jan 2025
https://github.com/amirsorouri00/dsl-se
This is an MVP based on the "Search Engine And Data Mining" course. The idea behind this project is the forked project which its link provided is
container crawler distributed-systems docker docker-compose elasticsearch pagerank search-engine
Last synced: 19 Jan 2025
https://github.com/arshadkazmi42/gh-crawl
Crawler for GitHub repositories. Finds all the broken links in the repositories
bug-bounty-recon crawl crawler gh-crawler github github-crawler githubcrawler python
Last synced: 21 Dec 2024
https://github.com/skylightqp/namu2csv
A Namuwiki crawler that converts headers to a CSV file for the KartRider wiki
Last synced: 08 Dec 2024
https://github.com/richecr/pyhltv
Repository to extract information from the HLTV website.
crawler csgo hacktoberfest hltv hltv-api python3
Last synced: 20 Jan 2025
https://github.com/vietdoo/sg-property-hub
SG Property Hub is a comprehensive platform for managing and analyzing property data.
airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping
Last synced: 13 Dec 2024
https://github.com/yjg30737/pyqt-wikipedia-crawler
Crawls Wikipedia with Python, powered by BeautifulSoup4; supports GUI and CUI
beautifulsoup4 crawler pyqt pyqt5 wikipedia
Last synced: 03 Jan 2025
https://github.com/mc256/node-static-webpage-crawler
Downloads an entire website with its directory structure.
cache-server crawler nodejs static-site
Last synced: 24 Jan 2025
https://github.com/toannd96/chromedp-example-login
chromedp crawler golang goquery
Last synced: 19 Jan 2025
https://github.com/hoanle396/py-iconnect
crawler flask flask-application image-processing python
Last synced: 14 Dec 2024
https://github.com/tsaohucn/crawler_fb_group
A crawler for Facebook groups that uses Selenium
crawler facebook-groups rails ruby
Last synced: 20 Jan 2025
https://github.com/piopi/behatcrawler
A Behat extension that crawls links on a website and executes a user-defined function on each of them.
behat behat-extension crawler php selenium-webdriver
Last synced: 19 Dec 2024
https://github.com/cseas/shares-monitor
Web crawler to fetch and monitor share details.
crawler python python3 scraper scraping-websites shares
Last synced: 27 Dec 2024
https://github.com/liebki/githubnet
This library allows you to retrieve several things from GitHub, things like trending repositories, profiles of users, the repositories of users and related information.
crawler crawling github github-trending htmlagilitypack microsoft
Last synced: 24 Jan 2025
https://github.com/eea/eea-crawler
EEA Crawler contains the tasks (DAGs) used by Apache Airflow to index content from various EEA-Eionet websites into a central Elasticsearch (aka content hub).
airflow-dags crawler elasticsearch etl-pipeline indexing
Last synced: 24 Jan 2025
https://github.com/basemax/jadi-net-blog
This Python script is used to extract posts from a WordPress blog (https://jadi.net/) and save them in HTML format. The script fetches the RSS feed, parses the posts, and saves each post as an individual HTML file.
blog-copier copier crawler crawler-python crawlers jadi-blog jadi-clone jadi-net-blog jadi-net-clone jadinet-blog py python python-crawler wordpress wp
Last synced: 24 Jan 2025
https://github.com/captain-woof/zhi-zhu
Zhi-Zhu is a multithreaded spidering script that recursively searches base webpages, and all URLs appearing in them, for specific (regex) words.
crawler crawler-python crawling-python python3
Last synced: 31 Dec 2024
https://github.com/jovijovi/ether-crawler
A transaction crawler for the Ethereum ecosystem.
blockchain crawler ether ethereum transaction
Last synced: 16 Jan 2025
https://github.com/kangoo13/textbroker-author-article-picker
Bot that automatically locks an order into a Textbroker author account.
author-textbroker automation bot colly crawler go gocolly golang scrapper spider textbroker textbroker-author textbroker-order-picker textbroker-orders textbroker-scrapper
Last synced: 22 Jan 2025
https://github.com/microlinkhq/ua
Simple Redis primitives to incr() and top() user agents
crawler redis user-agent user-agent-parser
Last synced: 12 Jan 2025
https://github.com/willi-dev/dtcapp
dtcapp : distributed twitter crawler.
crawler distributed-systems hazelcast java twitter twitter-api
Last synced: 14 Jan 2025
https://github.com/tanja-4732/od-get
A Rust tool for recursively crawling & downloading data from open directories
cli crawler open-directory open-directory-downloader rust
Last synced: 14 Jan 2025
https://github.com/deployment-helper/api-template-crawler
API interface to crawl the templates
api crawler deployment-helper gcp gcp-cloud-run golang rest
Last synced: 14 Jan 2025
https://github.com/programming-with-love/skyeyesystem
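SkyEye system: crawls trending-search data from various platforms every ten minutes and stores it. Raw trending-search data is saved to MySQL; word-frequency statistics are saved to Redis.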
crawler mysql redis skyeye skyeyewall springboot
Last synced: 16 Jan 2025
https://github.com/mazzasaverio/scrapy-playwright-scrapegraphai
Web crawler using Scrapy + Playwright for dynamic content, featuring YAML-based configuration, PostgreSQL storage via aiosql, structured logging with logfire, and complete Docker/Terraform infrastructure. Built with uv package manager and Python 3.11+.
aiosql crawler docker playwright scrapy scrapy-playwright terraform uv
Last synced: 14 Jan 2025
https://github.com/emarifer/search-engine
A mini Google. Custom web crawler & indexer written in Golang.
crawler dashboard deep-first-search fiber-framework full-text-search golang gorm-orm htmx htmx-go hyperscript indexer inverted-index response-caching search-engine templ worker-pool
Last synced: 17 Jan 2025
https://github.com/princed/specht
Check links found in html or js files by pattern
cli crawler html javascript streams
Last synced: 19 Jan 2025
https://github.com/yuchenq/comp90055-project
This is the latest version of my project for COMP90055.
couchdb crawler data-visualization python3 textblob tweepy
Last synced: 19 Jan 2025
https://github.com/leegeunhyeok/python-gongucrawler
Python 3 crawler for images and detail information from Gongu Madang (공유마당)
Last synced: 22 Dec 2024
https://github.com/aminehsan/datamining-divar.ir
Analyzing and Extracting Insights from Ads on 'divar.ir'
crawler data-mining data-science divar-ir scraping
Last synced: 04 Dec 2024
https://github.com/allotmentandy/socialmedialinkextractor
PHP Laravel package to extract social media links from an array of links, used as part of a spider for checking londinium.com website links
crawler extractor facebook laravel linked-list php social social-network spider twitter url youtube
Last synced: 23 Dec 2024
https://github.com/yosh1/mio-crawler
A crawler that retrieves data usage from IIJmio.
Last synced: 12 Jan 2025
https://github.com/jofaval/open-graph-visualizer
Web scraping showcase of how crawlers retrieve a site's details through the Open Graph protocol
crawler javascript opengraph scraping web web-scraping
Last synced: 09 Dec 2024
https://github.com/hoan02/novel-crawler
A tool that scrapes story data for doctruyen.io.vn
Last synced: 20 Jan 2025
https://github.com/rutopio/crawler-2020-taiwanese-election-results
Crawler for the 2020 Taiwan election results, using the at-large party-list vote as an example
Last synced: 04 Dec 2024
https://github.com/brianbruggeman/vax
A vaccination signup tool
covid-19 crawler signup vaccination
Last synced: 16 Jan 2025
https://github.com/zenixls2/2chpreprocess
Dump messages from 2ch with some preprocessing for ML analysis
Last synced: 04 Dec 2024
https://github.com/jayzhan211/python-crawler-startups
python crawler learning
Last synced: 25 Jan 2025
https://github.com/vivekg13186/lucas
A web crawler
crawler crawler-engine crawling-framework java
Last synced: 09 Dec 2024
https://github.com/fritz-c/itunes-stats
Fetch info on podcasts, etc. from iTunes RSS data
Last synced: 02 Jan 2025
https://github.com/huakunshen/cron-crawler-template
Web crawler cron job template running with GitHub Actions. Capable of sending email notifications.
Last synced: 17 Jan 2025
https://github.com/eklem/vinmonopolet-crawler
Crawling Vinmonopolet-data and indexing it to a norch search index
crawler dataset javascript norch search-engine
Last synced: 04 Dec 2024
https://github.com/tetreum/puppeteer-for-crawling
Everyday crawling methods for Puppeteer
Last synced: 09 Dec 2024
https://github.com/jurooravec/knwldg
Datasets, scrapers, pipelines
companies crawler data dataset non-profit-organizations scraper scrapy
Last synced: 12 Jan 2025
https://github.com/edumucelli/rubybikes
A set of Bike Sharing System parsers in Ruby
Last synced: 24 Dec 2024
https://github.com/mstephen19/apify-click-events
Like TypeScript, but for clicking ;) Manage automated clicks, and ensure your Apify web-crawler is only clicking exactly what you allow it to
apify apify-sdk crawler scraper web-automation
Last synced: 10 Dec 2024