Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2025-01-29 00:06:32 UTC
- JSON Representation
https://github.com/teal33t/base_crawler
Simple scaffold for selenium based crawler bots
crawler scaffold-template selenium selenium-python
Last synced: 23 Jan 2025
https://github.com/arshadkazmi42/gh-crawl
Crawler for Github repositories. Finds all the broken links from the repositories
bug-bounty-recon crawl crawler gh-crawler github github-crawler githubcrawler python
Last synced: 21 Dec 2024
https://github.com/igorbrizack/web-scraper
Aplicação de raspagem de dados HTML, construída em python.
crawler pytest python3 scraper
Last synced: 26 Jan 2025
https://github.com/dylanhogg/cloud-products
A package for getting cloud products and product descriptions from a cloud provider website.
aws cloud-products crawler data text-processing
Last synced: 23 Jan 2025
https://github.com/willi-dev/dtcapp
dtcapp : distributed twitter crawler.
crawler distributed-systems hazelcast java twitter twitter-api
Last synced: 14 Jan 2025
https://github.com/tanja-4732/od-get
A Rust tool for recursively crawling & downloading data from open directories
cli crawler open-directory open-directory-downloader rust
Last synced: 14 Jan 2025
https://github.com/scrwdrv/siege-crawler
This CLI tool will find same domain urls in a web page and requesting them to find even more urls until server crash (or at the end of benchmark). It is used to test maximun capacity of server or finding for glitches that users might encounter.
benchmark cli crawler ddos debug siege tool
Last synced: 18 Dec 2024
https://github.com/dimo414/pycrawl
Simple Python web crawler, primarily designed for inspecting and diagnosing your own website
Last synced: 18 Dec 2024
https://github.com/kahsolt/allchan
An image crawler for xChan(4chan/8ch/...) image board.
4chan 4chan-downloader 8chan crawler image-crawler
Last synced: 03 Jan 2025
https://github.com/bkdev98/ebooks-crawler
Ebooks crawler for personal purpose using ReactJS.
crawler material-ui nodejs reactjs
Last synced: 01 Jan 2025
https://github.com/programming-with-love/skyeyesystem
天眼系统,每隔十分钟爬取各个平台的热搜数据并入库。包括原始热搜数据存入mysql。词频统计存入Redis。
crawler mysql redis skyeye skyeyewall springboot
Last synced: 16 Jan 2025
https://github.com/woorim960/nate.com-comments-crawler
nate.com-comments-crawler
chromedriver crawler python3 selenium
Last synced: 28 Dec 2024
https://github.com/mazzasaverio/scrapy-playwright-scrapegraphai
Web crawler using Scrapy + Playwright for dynamic content, featuring YAML-based configuration, PostgreSQL storage via aiosql, structured logging with logfire, and complete Docker/Terraform infrastructure. Built with uv package manager and Python 3.11+.
aiosql crawler docker playwright scrapy scrapy-playwright terraform uv
Last synced: 14 Jan 2025
https://github.com/dingpingzhang/papermedia
A scrapy-based crawler for crawling paper media.
Last synced: 22 Dec 2024
https://github.com/schbenedikt/web-crawler
A simple web crawler using Python that stores the metadata of each web page in a database.
crawler database mariadb mysql python python-crawler web
Last synced: 08 Nov 2024
https://github.com/ryanchao2012/okbot
A conversation retrieval engine based on PTT corpus
Last synced: 12 Jan 2025
https://github.com/duaraghav8/larry-crawler
Kayako Twitter challenge
crawler fetch-tweets hashtag nodejs pagination tweets twitter-api
Last synced: 22 Jan 2025
https://github.com/dean9703111/ithelp_total_count
計算 IT邦幫忙文章的瀏覽/Like/留言總數
crawler ithelp total-likes total-responses total-views
Last synced: 12 Jan 2025
https://github.com/liyun-li/meh-bot
Just a bot that clicks an image
bot crawler docker headless-firefox meh python python3 selenium twilio-sms-api
Last synced: 25 Jan 2025
https://github.com/m-osource/cassiopeiabot
C++ multithread Linux Web Crawler
algorithm berkeleydb bot cassiopeia cplusplus crawler download engine hashing html-parser information-retrieval link-analysis multithread open-source regex search web web-crawler webcrawler www
Last synced: 08 Jan 2025
https://github.com/piopi/behatcrawler
A Behat extension that crawls links on a website and executes user-defined function on each one of them.
behat behat-extension crawler php selenium-webdriver
Last synced: 19 Dec 2024
https://github.com/afuntw/misc-crawler
some small crawler for specific website
Last synced: 12 Jan 2025
https://github.com/projectx3193275578/prjctxx8264
A simple, open-source, easy to use, and free download manager for malware samples.
crawler downloader malware manager samples
Last synced: 05 Jan 2025
https://github.com/supadata-ai/js
Official TypeScript/JavaScript SDK for the Supadata API.
ai crawler llm markdown scraper transcript web-crawler youtube
Last synced: 27 Jan 2025
https://github.com/supadata-ai/py
Official Python SDK for the Supadata API.
ai api crawler llm markdown scraping sdk transcript web-scraper youtube
Last synced: 27 Jan 2025
https://github.com/opda0887/bahamut-crawler-to-gmail
發想:使用Python爬蟲取得巴哈姆特版面的最新論壇,並用gmail傳送這些訊息給自己。A thought: Use Python crawler to the latest forums in Bahamut, and use gmail to send these messages to myself.
Last synced: 26 Jan 2025
https://github.com/phanikmr/linkcrawler
A LinkCrawler is a Python module that takes a url on the web (ex: http://python.org), fetches the web-page corresponding to that url, and parses all the links on that page into a repository of links. Next, it fetches the contents of any of the url from the repository just created, parses the links from this new content into the repository and continues this process for all links in the repository until stopped or after a given number of links are fetched.
async crawler linkcrawler parse python scrapy spider
Last synced: 27 Jan 2025
https://github.com/victorhuu/amazonmovieintegration
本仓库是同济大学数据仓库的第一个个人作业——利用爬虫与ETL工具整理Amazon的电影数据
crawler data-warehouse movies pandas scrapy xpath
Last synced: 26 Jan 2025
https://github.com/chunkingz/youtubelinks-scraper
A python script that scrapes Youtube links from a predefined website of choice.
crawler python scraper spider websitescraper youtube
Last synced: 02 Jan 2025
https://github.com/joeri-abbo/python-credly-scraper
This project is a set of Python scripts designed to crawl and extract data from the Credly platform, focusing on skills, organizations, and badges. The scripts allow users to perform searches using command-line arguments, predefined search terms, or skills listed in a JSON file. The collected data is then saved to JSON files for further analysis an
badges crawler credly data-extraction json organizations python python3 requests-library skills web-crawling
Last synced: 15 Jan 2025
https://github.com/rogerluo410/gcrawler
Google search crawler for Ruby version. Crawling each links' text and url by keywords on Google.com.
Last synced: 02 Jan 2025
https://github.com/knourian/freelancer.com-category-scrapping
Scrapping Categories from Freelancer.com Using scrapy with number of project for each category
crawler freelancer python3 scrapy web-crawler
Last synced: 05 Jan 2025
https://github.com/pierlauro/mdbubing
From WARC records to MongoDB documents
bubing crawler crawling warc warc-files warc-format warc-record webarchive webarchiving
Last synced: 09 Dec 2024
https://github.com/efishery/wpi-kkp-crawler
This is crawler for fisheries price on wpi.kkp.go.id
Last synced: 02 Jan 2025
https://github.com/jimmy-ly00/dhe-prime-grabber
Grabs Diffie-Hellman primes from certificates using OpenSSL. Uses multiprocessing to collect over 50 million Diffie-Hellman primes.
certificate certificates crawler dhe-prime-grabber diffie-hellman ipv4 multiprocessing openssl prime prime-numbers python python-3
Last synced: 29 Dec 2024
https://github.com/appliedsoul/crawlmatic
Static and Dynamic website crawling library - a common promise based wrapper around node-crawler & hccrawler libraries.
Last synced: 30 Dec 2024
https://github.com/danielemoraschi/go-sitemap-common
Simple GO sitemap generator and crawler.
crawler golang sitemap sitemap-generator
Last synced: 31 Dec 2024
https://github.com/cseas/shares-monitor
Web crawler to fetch and monitor shares details.
crawler python python3 scraper scraping-websites shares
Last synced: 27 Dec 2024
https://github.com/mdazlaanzubair/amazon-scraper-api
A web scraper to crawl on amazon to extract products information and return in JSON format.
amazon crawler expressjs json-api nodejs webscraping
Last synced: 10 Jan 2025
https://github.com/ycrao/some-spider-code
some spider code 财经资讯以及基金股票外汇价格爬虫
crawler economics fin-eco-news finance forex fund-value spider stock-price
Last synced: 19 Nov 2024
https://github.com/droiddevgeeks/nodelearning
This is node learning demo. It has covered all basics of node.
crawler database ejs ejs-express mcv middleware-nodes mongodb node node-module nodejs nodemailer npm-package router sign
Last synced: 13 Jan 2025
https://github.com/basemax/jadi-net-blog
This Python script is used to extract posts from a WordPress blog (https://jadi.net/) and save them in HTML format. The script fetches the RSS feed, parses the posts, and saves each post as an individual HTML file.
blog-copier copier crawler crawler-python crawlers jadi-blog jadi-clone jadi-net-blog jadi-net-clone jadinet-blog py python python-crawler wordpress wp
Last synced: 30 Jan 2025
https://github.com/birdroad1/server-pinger
Server pinger for Minecraft written in C++
cpp crawler make minecraft minecraft-scanner postgres scanner server
Last synced: 21 Jan 2025
https://github.com/vitaee/laravelandcrawlers
php web crawler examples with oop concept and laravel project
Last synced: 26 Dec 2024
https://github.com/thomashirtz/douban-crawler
A simple crawler for retrieving information about movies or TV shows from the famous www.douban.com website.
Last synced: 25 Dec 2024
https://github.com/jorgeparavicini/medalytik-python
Python crawlers for a job mediation firm
Last synced: 07 Dec 2024
https://github.com/donuts-are-good/araknnid
GO GO TINY SPIDER!
crawler hacktoberfest search-engine spider
Last synced: 28 Dec 2024
https://github.com/hctilg/taaghche-dl
Save books purchased from taaghche.com !
crawler downloader pillow-library python3 selenium taaghche
Last synced: 09 Jan 2025
https://github.com/princed/specht
Check links found in html or js files by pattern
cli crawler html javascript streams
Last synced: 19 Jan 2025
https://github.com/khadkarajesh/aptoide
Aptoide app crawler using beautifulsoup
beautifulsoup4 crawler flask python3 web-application
Last synced: 13 Jan 2025
https://github.com/zhanymkanov/marketplace_parser
Products and Reviews Crawler
Last synced: 14 Jan 2025
https://github.com/purrproof/smartcrawl
An adaptable framework for gathering, aggregating and analyzing data, focusing on blockchain and smart contracts.
blockchain cli crawler explorer framework go golang hacktoberfest
Last synced: 27 Jan 2025
https://github.com/jfcherng/wiki-cgroup-crawler
此腳本用於抓取維基百科的公共轉換組詞庫,並將結果儲存為外部檔案。
crawler php-71 wiki-cgroup-crawler wikipedia
Last synced: 22 Jan 2025
https://github.com/h4r5h1t/crawlytics
A Python-based web crawling tool for data extraction and security analysis that supports various arguments for efficient crawling and outputs results in JSON format.
appsec crawler crawler-python mechanicalsoup security security-tools webcrawler
Last synced: 28 Dec 2024
https://github.com/marzzzello/gplaycrawler
(mirror) Discover apps by different mehtods. Mass download app packages and metadata.
crawler google-play google-play-store googleplay googleplaystore playstore playstoreapi scraper
Last synced: 23 Dec 2024
https://github.com/ilsonlasmar/inovamind
Desafio Inovamind - Crawler em Ruby on Rails com Sidekiq + Redis
Last synced: 10 Jan 2025
https://github.com/snuzi/devblogs-aggregator
The backend aggregator project of DevBlogs.net
aggregator blog crawler engineering engineering-blogs tech tech-blogs tech-companies tech-news
Last synced: 09 Nov 2024
https://github.com/iarsham/scrapify
Scrapify is a golang library that automates the process of bypassing CAPTCHAs, enabling efficient web scraping and data acquisition.
403-bypass arkose cloudflare crawler golang http-client scraper
Last synced: 12 Dec 2024
https://github.com/orafaelfragoso/itunes-crawler
Retrieves information about an artist by crawling the iTunes API and iTunes Page
Last synced: 19 Dec 2024
https://github.com/hantang/list-movies-top
豆瓣(douban.com)、IMDb(imdb.com)、时光网(mtime.com)、猫眼(maoyan.com)Top电影定时抓取
Last synced: 07 Jan 2025
https://github.com/bujosa/aldebaran
Example use APP ENGINE with Python3, ThreadPool and webScraping
appengine crawler flask gcp python3 thread-pool
Last synced: 21 Jan 2025
https://github.com/khilnani/spidey.py
Web spiders are usually disliked by websites, but useful for recursive API/page downloads for offline analysis.
cli crawler python scaper web-spider
Last synced: 30 Jan 2025
https://github.com/davideferre/covid19-data-crawler-ita
Covid 19 italian data crawler
coronavirus covid19 crawler hacktoberfest hacktoberfest2021 python
Last synced: 11 Jan 2025
https://github.com/pjullrich/link-crawler
Python Crawler that reports broken links on a given website and its sup-pages
asyncio breadth-first-search broken-links crawler python
Last synced: 23 Jan 2025
https://github.com/wondervictor/spiderman
2017 Software Course Project
crawler distribute-crawler zhihu-crawler
Last synced: 17 Jan 2025
https://github.com/estroz/seekret
Seekret is a sensitive data crawler for GitHub repositories
Last synced: 25 Dec 2024
https://github.com/dhchenx/quick-crawler
A toolkit for quickly performing crawler functions
Last synced: 29 Jan 2025
https://github.com/tcc0lin/magiccrawler
Collect all kinds of interesting crawler scripts and tackle them against the anti-climbing method :bowtie::heavy_check_mark::heavy_check_mark::heavy_check_mark:
Last synced: 18 Jan 2025
https://github.com/ghost---shadow/feature-extractor-from-codebase
Copies the target java file and all its dependencies recursively to another directory
Last synced: 16 Jan 2025
https://github.com/beanwei/zmt-post-crawler
Crawler the ZMT platform site ,put the author id, get the post list.This project is coding for my friend
Last synced: 28 Dec 2024
https://github.com/homuchen/instagram-crawler
Instagram crawler
crawler instagram nodejs-crawler
Last synced: 29 Jan 2025
https://github.com/geoffreybauduin/website-checker
Performs useful checks against a website, such as 404 errors reporting, structured data validation...
crawler seo structured-data web-spider website
Last synced: 25 Dec 2024