Projects in Awesome Lists tagged with webcrawler

https://github.com/crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

crawlab crawler crawling-tasks docker go platform scrapy scrapyd-ui spider spiders-management web-crawler webcrawler webspider

Last synced: 14 May 2025

https://github.com/ssssssss-team/spider-flow

A new-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.

crawler jsoup spider spider-flow web-crawler web-spider webcrawler webspider xpath

Last synced: 14 May 2025

https://github.com/generalnewsextractor/generalnewsextractor

General-purpose extractor for the main text of news web pages (beta version).

python3 webcrawler webspider

Last synced: 14 May 2025

https://github.com/zorlan/skycaiji

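Skycaiji (蓝天采集器) is a free, open-source crawler system. Data is collected simply by pointing and clicking to edit rules. It runs locally, on virtual hosting, or on a cloud server, can collect almost any type of web page, integrates seamlessly with various CMS site builders, and publishes data in real time without login, fully automatically and with no manual intervention. It is a fully cross-platform cloud crawler system for web big-data collection.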

crawler crawling php spider webcrawler

Last synced: 14 May 2025

https://github.com/amirgamil/apollo

A Unix-style personal search engine and web crawler for your digital footprint.

personal-search poseidon search unix-like webcrawler

Last synced: 08 Apr 2025

https://github.com/scrapinghub/scrapyrt

HTTP API for Scrapy spiders

crawler crawling hacktoberfest hacktoberfest2021 python scraper scrapy twisted webcrawler webcrawling

Last synced: 15 May 2025

https://github.com/3nock/spidersuite

Advanced web security spider/crawler

bugbounty cplusplus crawler gui information-gathering osint-tool pentest qt5 recon security-tools spider web-spider webcrawler

Last synced: 29 Oct 2025

https://github.com/z0m31en7/uscrapper

Uscrapper Vanta: Dive deeper into the web with this powerful open-source tool. Extract valuable insights with ease and efficiency, from both surface and deep web sources. Empower your data mining and analysis with Vanta's advanced capabilities. Fast, reliable, and user-friendly, Uscrapper Vanta is the ultimate choice for researchers and analysts.

darkweb darkweb-crawler information-extraction information-gathering osint osint-python osint-tool python reconnaissance selenium selenium-webscraper tor web-scraping webcra webcrawler webscraping website-scraper websites

Last synced: 15 May 2025

https://github.com/jaeksoft/opensearchserver

Open-source Enterprise Grade Search Engine Software

crawler custom-search enterprise indexing java lucene ocr opensearchserver search search-engine synonyms webcrawler webcrawling

Last synced: 04 Apr 2025

https://github.com/kingname/sourcecodeofbook

Companion source code for the book 《Python爬虫开发从入门到实战》 (Python Crawler Development: From Beginner to Practice).

python python3 requests scrapy webcrawler

Last synced: 05 Apr 2025

https://github.com/salimk/rcrawler

An R web crawler and scraper

crawler crawlers r rpackage scraper webcrawler webscraper webscraping webscrapping

Last synced: 12 Apr 2025

https://github.com/adrianosferreira/afrodite.json

The largest cookbook in the Portuguese language

javascript mongodb nodejs webcrawler

Last synced: 12 Apr 2025

https://github.com/mehmetozkaya/dotnetcrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping

Last synced: 11 May 2025

https://github.com/dedsecinside/gotor

This program provides efficient web scraping services for Tor and non-Tor sites. The program has both a CLI and REST API.

cli command-line command-line-tool docker go golang golang-server hacktoberfest http-server information-extraction osint osint-tools rest-api service tor torbot webcrawler webcrawling webscraping

Last synced: 09 Apr 2025

https://github.com/alex-on-ai/WebReaper

AI-native web scraper. Single binary with a bundled Claude Code skill. MIT-licensed alternative to Firecrawl.

ai-agents-automation claude-code crawler dotnet firecrawl-alternative llm markdown mcp parser parsing scraper scraping scraping-api scraping-web scraping-websites webcrawler webscraping

Last synced: 14 Jun 2026

https://github.com/voliveirajr/seleniumcrawler

An example using the Selenium WebDriver for Python and the Scrapy framework to create a web scraper that crawls an ASP site

asp-net python scraper scraping scraping-websites scrapper scrapy selenium selenium-webdriver webcrawler webcrawling

Last synced: 11 Oct 2025

https://github.com/pavlovtech/WebReaper

Web scraper, crawler and parser in C#. Designed as a simple, declarative and scalable web scraping solution.

crawler datamining parser parsing scraper scraping scraping-api scraping-data scraping-tool scraping-web scraping-websites webcrawler webscraping

Last synced: 08 Apr 2025

https://github.com/aavache/llmwebcrawler

A Web Crawler based on LLMs implemented with Ray and Huggingface. The embeddings are saved into a vector database for fast clustering and retrieval. Use it for your RAG.

api distributed-computing fastapi huggingface large-language-models llm machine-learning milvus nlp pydantic python rag ray raylib transformer vector-database webcrawler webcrawling

Last synced: 23 Oct 2025

https://github.com/shenxiangzhuang/pythondataanalysis

The data and code used in my book.

data-science python3 webcrawler

Last synced: 08 Aug 2025

https://github.com/hfreire/browser-as-a-service

A web browser 🌎 hosted as a service, to render your JavaScript web pages as HTML

browser browser-as-a-service crawler docker github-actions javascript puppeteer rest-api scraper server webcrawler

Last synced: 11 Sep 2025

https://github.com/robsonbittencourt/gafanhoto

Bot for monitoring promotions on the Hardmob forum: http://www.hardmob.com.br/promocoes/

chatbot gafanhoto hardmob promocoes telegram webcrawler

Last synced: 10 Apr 2025

https://github.com/deuxhuithuit/algolia-webcrawler

Simple node worker that crawls sitemaps in order to keep an algolia index up-to-date

algolia algolia-webcrawler indexing javascript search-engine webcrawler

Last synced: 09 Jul 2025

https://github.com/Conso1eCowb0y/Deepminer

Deep web crawler and search engine

crawler crawling dark-web data-mining deepminer deepweb github hacking onion osint python-web-scraper python3 search-engine security security-tools spider the-onion-router tor tor-network webcrawler

Last synced: 20 Apr 2025

https://github.com/kshru9/web-crawler

A multithreaded web crawler using two mechanisms: a single lock and thread-safe data structures

concurrency concurrent-data-structure cpp crawler data-structures html-parser lock multithreading openssl pagerank pthread reader-writer-lock search-engine socket threading threadsafe webcrawler website-downloader

Last synced: 23 Mar 2025

https://github.com/opencharles/charles

Java web crawling library

dynamic selenium webcrawler webdriver

Last synced: 08 Apr 2025

https://github.com/parth-vader/fb-spider

Accepts a page name and shows latest posts and comments in a new browser window.

facebook-api graph graph-api spider webcrawler

Last synced: 15 Apr 2025

https://github.com/marcel0024/cococrawler

A declarative and easy-to-use web crawler and scraper in C#

cococrawler crawler crawling-tool csharp dotnet dotnetcore scraper scraping-tool webcrawler webcrawler-csharp webcrawling webscraper

Last synced: 10 Apr 2025

https://github.com/gdgd009xcd/RequestRecorder

A ZAPROXY Add-on that allows testing of web application vulnerabilities by recording complex multi-step sequences. You can test applications that need to access pages in a specific order, such as shopping carts or registration of member information.

activescan addon authentication csrf multistep multistep-form security security-testing security-tools vulnerability-scanners web-security webcrawler websecurity zap-extension zaproxy

Last synced: 31 Oct 2025

https://github.com/waynechang65/ptt-crawler

ptt-crawler is a web crawler module designed to scrape data from Ptt.

api crawl crawler javascript nodejs ptt scrape scraper scraping spider typescript web-crawler webcrawler

Last synced: 08 Oct 2025

https://github.com/biraj21/web-wanderer

A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.

data-extraction multithreading python web-crawler webcrawler

Last synced: 12 Jan 2026

https://github.com/bkeepers/spiderman

your friendly neighborhood web crawler

crawler crawler-engine http httprb nokogiri ruby spider spider-framework web-crawler web-scraping webcrawler webscraping

Last synced: 14 Oct 2025

https://github.com/code-yeongyu/trackpurchase

Scrape your payment history from various shopping platforms with just a few lines of code!

crawlwer puppeteer webcrawler webscraper webscraping

Last synced: 14 Aug 2025

https://github.com/ddayto21/lead-scraper

This repository contains a web crawler that searches for emails in a webpage, along with a web scraping script that collects leads from various webpages online, filters those links based on some criteria, and adds the new links to a queue. All the HTML or some specific information is extracted to be processed by a different pipeline.

beautifulsoup4 python requests webcrawler webscraper yellow-pages

Last synced: 03 Sep 2025

https://github.com/raspi/scrapy-intel-ark

Web crawler for Intel ARK (ark.intel.com)

hardware intel python scrapy spider webcrawler

Last synced: 05 Oct 2025

https://github.com/yufree/scifetch

Web page crawling tools for PubMed, Google Scholar and RSS

google-scholar pubmed r rss webcrawler

Last synced: 18 Mar 2025

https://github.com/deep5050/abosar

অবসর 📚 A collection of short Bengali stories web scraped from various Bengali eMagazines and eNewspapers.

bengali cron-jobs stories web-scraper web-scraping webcrawler

Last synced: 14 Jul 2025

https://github.com/geminidsystems/googlenewsscraper

A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https://pepy.tech/project/GoogleNewsScraper)

crawler googleautomator googlenews googlenewsscraper googlescraper python scraper scraping selenium web-scraping webcrawler webdriver webscraper

Last synced: 13 Aug 2025

https://github.com/jacraig/spidey

A multi-threaded web crawler library that is generic enough to allow different engines to be swapped in.

crawler webcrawler

Last synced: 12 Aug 2025

https://github.com/nomomon/kamernet-puppeteer

🏠 Automatic message sender to new adverts on Kamernet using puppeteer.

automation kamernet message-sender netherlands node nodejs puppeteer webcrawler

Last synced: 12 Apr 2025

https://github.com/cutta/eksiseyler

Sample MVP project uses jsoup-web-crawl like API

android dagger2 dagger2-mvp glide jsoup mvp retrofit2 rxandroid2 rxjava2 webcrawler

Last synced: 24 Jul 2025

https://github.com/bjoern-hempel/php-web-crawler

A PHP class that crawls a given URL and recursively collects some data from it. The final representation is a JSON object.

crawler mit-license php recursive webcrawler webscraper xpath

Last synced: 11 Apr 2025

https://github.com/kingname/crawlerutility

Simplify the development of your webcrawler

python3 requests scrapy webcrawler

Last synced: 11 Jul 2025

https://github.com/michaelradu/web-crawler

A Web Crawler developed in Python.

crawler crawler-python crawlers python python-3 python-script python3 script scripting scripting-language scripts web web-crawler web-crawler-python web-crawlers web-crawling webcrawl webcrawler webcrawling

Last synced: 25 Jul 2025

https://github.com/0memo07/web-crawler

Web Crawler with Python

beautifulsoup4 bs4 crawler crawlers crawling crawling-python web-crawler web-crawler-python web-crawling webcrawler

Last synced: 24 Apr 2025

https://github.com/lewisakura/spiderboi

A web crawling library written in TypeScript.

spider typescript typescript3 web-crawler web-crawling web-spider webcrawler

Last synced: 12 Apr 2025

https://github.com/luizppa/web-crawler

A web crawler that collects and indexes web pages. Made with chilkat and gumbo parser.

chilkat cpp crawler webcrawler

Last synced: 17 Aug 2025

https://github.com/sreesh-mallya/bookmyshow-notify

A Python command-line app that notifies you when a show is available on bookmyshow.com.

beautifulsoup4 bookmyshow cli notifies python webcrawler

Last synced: 28 Jul 2025

https://github.com/madexploits/madrawler

Web crawler for finding easy endpoint

webcrawler webhacking

Last synced: 04 Jul 2025

https://github.com/ahmard/queliwrap

QueryList PHP web scraper wrapper

php querylist webcrawler webscraper

Last synced: 18 Mar 2025

https://github.com/vmarcosp/supervise-crawler

🕵️‍♂️ Supervise crawler

crawler esy ocaml reasonml webcrawler

Last synced: 13 May 2025

https://github.com/mcstreetguy/crawler

An advanced web-crawler written in PHP.

composer composer-library crawler crawler-engine guzzle http-requests php php-7 php-library web-crawler webcrawler

Last synced: 09 Apr 2025

https://github.com/faulander/720dl

720pier phpbb torrent webcrawler

Last synced: 27 Apr 2026

https://github.com/lucasmendesl/mugiwara

🎩 A simple web scraper to extract and download videos from animesproject.com

anime-downloader cli nodejs rxjs webcrawler webscraping

Last synced: 27 Feb 2026

https://github.com/shirokovnv/webcrawler

The service for crawling websites.

cassandra elixir-phoenix parser webcrawler

Last synced: 21 Jul 2025

https://github.com/moehmeni/ezweb

Easy to use web page analyzer

analyzer crawler scraper text-analysis text-classification text-mining webcrawler webcrawling webpage webscraper webscraping www

Last synced: 06 Apr 2025

https://github.com/manigandand/crawler

A simple web crawler in Go.

go golang webcrawler

Last synced: 30 Apr 2025

https://github.com/leelow/nightmare-screenshot-selector

👻 📷 A Nightmare plugin to easily take screenshots.

crawler headless-browsers javascript js nightmare nightmarejs nodejs plugin webcrawler

Last synced: 12 Apr 2025

https://github.com/leonardovff/socialbot

A robot that searches for pictures by hashtag on Facebook and Instagram

facebook hahstags instagram nodejs robot webcrawler

Last synced: 11 Apr 2025

https://github.com/robmch/mindfactory_crawling

A Python 3 Crawler for Mindfactory.de

crawler crawling data webcrawler webcrawling

Last synced: 07 May 2025

https://github.com/farkaskid/webcrawler

Simple and fast web crawler.

crawler go golang goroutines web webcrawler

Last synced: 14 Jan 2026

https://github.com/asabeneh/python

dictionaries loop python python3 regular-expression tuples webcrawler

Last synced: 13 Jun 2025

https://github.com/n3wjack/sitecrawler

A command-line based web crawler

crawler tool webcrawler webcrawling webdevelopment

Last synced: 07 Mar 2026

https://github.com/waynechang65/baha-crawler

baha-crawler is a web crawler module designed to scrape data from the Bahamut Forum.

bahamut crawler javascript nodejs scraper spider webcrawler

Last synced: 22 Apr 2025

https://github.com/datacollectionspecialist/web-crawler-in-python

Learn how to build a web crawler in Python with this step-by-step guide for 2025.

webcrawler webcrawlerpython

Last synced: 09 Mar 2026

https://github.com/simonsdave/cloudfeaster

Cloudfeaster Spider Development

docker python selenium-webdriver spider webcrawler

Last synced: 16 Mar 2026

https://github.com/elektrostudios/fhm-crawler-freehardmusic.com

Crawls download urls of albums from freehardmusic.com website

albums crawl crawler crawling desktop-app desktop-application dotnet music web-crawler web-crawling web-scraper web-scraping webcrawler webcrawling webscraper webscraping windows windows-app windowsapp winforms

Last synced: 19 Jul 2025

https://github.com/havardnyboe/dagenidag

Recreation of NRK's page 199 from Tekst-TV (teletext)

dagenidag nrk tekst-tv webcrawler

Last synced: 04 Aug 2025

https://github.com/odynvolk/bing-me-links

A simple node module for scraping Baidu, Bing, StartPage, Yahoo and Qwant

baidu bing javascript nodejs scraper startpage webcrawler yahoo

Last synced: 09 Oct 2025

https://github.com/victoralessander/smith

A toolkit that makes it easy to web-scrape the world.

beautifulsoup bot extract-information python python3 telegram webcrawler webscraping

Last synced: 15 Apr 2025

https://github.com/sadatrafsanjani/spider-web-crawler

A web crawler that implements the breadth-first search algorithm, built with Maven.

breadth-first-search jsoup webcrawler

Last synced: 15 May 2026

https://github.com/aimlpm/markcrawl

Fast Python web crawler for RAG and AI ingestion. Extracts clean Markdown from any site for LLMs and vector stores.

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python rag sitemap-crawler structured-data supabase vector-database webcrawler

Last synced: 23 Apr 2026

https://github.com/bitebait/curry

🍛 Curry is a web crawler written in Golang that checks the dollar-to-real exchange rate (USDxBRL) at some stores in Paraguay.

api brasil crawler currency-exchange-rates go golang paraguay webcrawler

Last synced: 15 Jan 2026

https://github.com/0000xffff/webgrab

web page: crawler / file scanner / downloader

crawler download downloader scrape scraper webcrawler

Last synced: 17 Apr 2026

https://github.com/mominurr/social-media-scraping

Social Media Scraping – Scrapes data from TikTok, LinkedIn, Facebook, and Twitter (X.com), including user profiles, posts, engagement metrics, and comments.

datascraping facebook-scraper linkedin-scraper pandas python scraper scraping selenium tiktok-scraper twitter-scraper webcrawler webcrawling webscraping

Last synced: 13 Apr 2026

https://github.com/agarwalkaushal/higher-education-recommendation

Higher Education Recommendation system using Python with Selenium API.

education pycharm-ide python recommender-system selenium-webdriver webcrawler

Last synced: 18 Feb 2026

https://github.com/raspi/scrapy-kuntavaalit2021-yle

Fetch YLE kuntavaalit 2021 data

crawler mirror python scrapy spider webcrawler

Last synced: 26 Apr 2025

https://github.com/congcoi123/crawler-sheis

A small crawler for getting data from the website: https://sheis.vn

crawler webcrawler webcrawling webscraper webscraping

Last synced: 25 Feb 2026

https://github.com/dearopen/django-easy-scraper

Django apps to scrape data from web page easily

automation django django-rest-framework python python3 webcrawler webcrawling webscraper webscraping

Last synced: 14 May 2026

https://github.com/moredure/drum

Golang implementation of the disk repository with update management (DRUM) framework as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in the paper "IRLbot: Scaling to 6 Billion Pages and Beyond"

drum golang url webcrawler

Last synced: 24 May 2026

https://github.com/lucasmendesl/cinepolis-movies-extractor

A reactive command-line tool that extracts information from the Cinepolis Peru website

axios nodejs rxjs typescript webcrawler webscraping

Last synced: 17 Apr 2026

https://github.com/codera21/webcrawl-js

A simple web crawler using axios and cheerio

axios cheerio javascript webcrawler webscraping

Last synced: 11 Nov 2025

https://github.com/th3-c0der/web-crawler

A simple WebCrawler for exploring and downloading content from web pages within a given domain/url.

th3-c0der th3-coder th3c0der th3coder tool tools web-tool webcrawl webcrawler webcrawlers webcrawling

Last synced: 19 Mar 2026

https://github.com/ibz-04/hudgent

Official code implementation for my Ready Tensor publication: an AI agent that retrieves data from an Islamic website and uses that data as alignment criteria to answer the user

ai-agent ai-alignment cython islamic-ai-agent open-source python search-agent turkish-nlp webcrawler whoosh

Last synced: 03 Oct 2025

https://github.com/jshyunbin/comment_crawler

Web crawler for online shopping mall comments using python selectolax and requests.

python webcrawler

Last synced: 13 Oct 2025

https://github.com/antoinegagne/treewalker

A web crawler in Erlang that respects `robots.txt`.

crawler erlang webcrawler

Last synced: 11 Feb 2026

https://github.com/sgowdaks/nichirin

RAG and Webcrawler in a single package

llm rag retrieval-augmented-generation scraping webcrawler

Last synced: 26 Jan 2026

https://github.com/nikola352/cirilizator

Web app with tools for using Cyrillic script on the Serbian side of the Internet

flask postgresql python react webcrawler

Last synced: 10 Oct 2025

https://github.com/doomspork/maartz

A refactor of Maartz's web scraper. Context: https://twitter.com/maartz4/status/1248133734760615937

asynchronous-tasks elixir webcrawler

Last synced: 17 Feb 2026

https://github.com/galarzaa90/tibiakt

Kotlin library to fetch and parse Tibia.com pages.

jsoup jvm kotlin ktor tibia webcrawler

Last synced: 13 Jul 2025

https://github.com/nobrainghost/golamv2

Lightweight web crawler for emails, keywords, dead links, and dead domains, written in Go. Suitable for low-resource environments.

golang webcrawler webcrawling

Last synced: 16 Jun 2025

https://github.com/rrmerugu/trawler

A data gathering/trawling framework to search and get information from web sources like Bing

crawler-engine python search webcrawler

Last synced: 14 Jan 2026

https://github.com/elektrostudios/bt4g-torrent-magnet-scraper

Scrapes BT4G magnet links using configurable search and filtering rules.

bt4g command-line console-applications crawler dotnet magnet magnet-link scraper scraping searchengine torrent torrents vbnet web-crawler web-spider webcrawler webspider windows windows-10 windows-app

Last synced: 24 Jun 2026

https://github.com/gappeah/nike_web_crawler

This project involves web scraping Nike's product pages to extract product names, prices and links. The project showcases three different implementations of the web crawler using Selenium and BeautifulSoup. It also includes visualisation of the scraped data using Matplotlib and Seaborn.

beautifulsoup data-analysis data-visualization python selenium web-crawler web-scraper webcrawler webscraper webscraping webscraping-beautifulsoup

Last synced: 04 Jul 2025

https://github.com/je-chen/python_webcrawler_je

python webcrawler

Last synced: 29 Jun 2025