Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

GitHub: https://github.com/topics/crawler
Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
Last updated: 2025-02-07 00:06:39 UTC

https://github.com/spider-rs/spider-clients

Clients to use with the hosted spider service - spider.cloud

ai ai-agents ai-scraping crawler html-to-markdown llm-webcrawler scraper spider web-scraping

Last synced: 05 Nov 2024

https://github.com/brianmacintosh/wikicrawler

Sandbox project for manipulating Wikimedia wikis

c-sharp crawler mediawiki-bot wikipedia-bot

Last synced: 30 Dec 2024

https://github.com/tatamiya/gas-new-books-crawler

Crawls new-book information from Hanmoto.com (版元ドットコム, https://www.hanmoto.com/)

crawler gas

Last synced: 21 Jan 2025

https://github.com/ark930/douban-movie-crawler

Douban movie review crawler

crawler douban movie python

Last synced: 24 Jan 2025

https://github.com/jpleorx/tagblender

A simple Java API to retrieve hashtags from https://www.tagblender.net/

api crawler hashtags java jsoup parser

Last synced: 25 Jan 2025

https://github.com/istador/mediaindexer

Software for a cronjob to crawl the ViMP media center and generate an index for it as a static website.

crawler website

Last synced: 22 Jan 2025

https://github.com/gxjansen/website-to-pdf

Creates a PDF based on the content of a website/subdomain

claude-3-sonnet crawler python3

Last synced: 05 Feb 2025

https://github.com/thejoin95/free-proxies.info

API service for getting anonymous and non-anonymous proxies, filterable by latency, country, update time, and more

api crawler http-proxy proxy proxy-list python scraper

Last synced: 06 Jan 2025

https://github.com/kehiy/prawler

Pactus P2P Network Crawler

crawler crawling metrics networking p2p pactus

Last synced: 28 Dec 2024

https://github.com/xiangronglin/novel2go

Android app that creates a PDF from a website and sends it to your Kindle

android crawler jetpack kotlin pdf-generation readability

Last synced: 21 Dec 2024

https://github.com/eivindarvesen/naive-spider

A minimal web crawler

crawler python spider

Last synced: 17 Jan 2025

https://github.com/lucasfogliarini/minhaentradacrawler.consoleapp

Web crawler in C# that uses the AngleSharp library to extract event details from the site "https://minhaentrada.com.br". It parses the page's HTML and retrieves information such as event title, date, location, and links.

anglesharp crawler minhaentrada

Last synced: 31 Dec 2024

https://github.com/sedrubal/webcrawler

Crawl sites and search for security issues.

crawler script security website-auditing

Last synced: 24 Jan 2025

https://github.com/ahsouza/iquizz-api

RESTful API developed in Node.js with MongoDB

animations cluster crawler docker docker-compose ejs-templates es8 font-awesome grunt-task helmet-detection heroku javascript jquery material-design mongodb nodejs passport-strategy passportjs pusher token-authetication

Last synced: 05 Feb 2025

https://github.com/timpletin/comming-soon

Coming Soon Page - Simple, clean design that is fully responsive on all screens. Counts down the days, hours, minutes, and seconds to an upcoming event

crawler css java javaweb nextjs nextjs-boilerplate nextjs-typescript nextjs14-typescript object-detection paypal python tailwindui tensorflow typescript

Last synced: 21 Jan 2025

https://github.com/bramtenhove/issue-crawler

Crawls Drupal issues and keeps stats

crawler

Last synced: 29 Dec 2024

https://github.com/rcmilan/ex-web-scraping

Web scraping with F#

crawler f-sharp fsharp fsharp-data scraper web-scraping xplot

Last synced: 17 Jan 2025

https://github.com/kaymen99/imdb-scraper

IMDb scraper that collects movie and TV show data from the IMDb website

crawler python scraper scraping scrapy

Last synced: 22 Jan 2025

https://github.com/m1/smap

smap is a site-mapping engine written in Go.

crawler go go-library go-package golang golang-library golang-package golang-tools sitemap sitemap-generator web-crawler web-crawling

Last synced: 05 Feb 2025

https://github.com/yukihirai0505/streamcrawler

akka stream × crawler

akka-streams crawler elasticsearch instagram sbt scala

Last synced: 13 Jan 2025

https://github.com/waived/google-drive-crawler

Proxy-based crawler to expose public (shared) Google Drive links

crawler crawler-python file-crawler google-drive-api shared-folders web-spider

Last synced: 01 Feb 2025

https://github.com/ronniery/crawler.synom

A crawler for the sinonimo.com.br website that saves the words to a MongoDB database.

bot crawler html html5 javascript mongodb nodejs nosql npm scraper thesaurus typescript web website xml

Last synced: 21 Dec 2024

https://github.com/lin-jun-xiang/python-crawler

Uses CloudScraper, Requests, API, Thread, Async... to scrape data

async cloudscraper crawler multithreading python requests scraper selenium

Last synced: 21 Dec 2024

https://github.com/lopins/article-crawler

A simple web article crawling tool that lets you customize which fields to extract; simple and easy to get started with.

article crawler ftp mysql python sqlite3

Last synced: 21 Dec 2024

https://github.com/briangershon/crawlee-playwright

Browser-based automations with Crawlee and Playwright using Vite tooling and TypeScript

crawlee crawler playwright starter-template typescript vite

Last synced: 20 Dec 2024

https://github.com/georgynet/crawler

Web Crawler

crawler go golang web-crawler

Last synced: 04 Jan 2025

https://github.com/hackthedev/botnet

Tool to find IPs on the Web, check SSH availability, and brute-force logins with a wordlist. For educational use only!

botnet bruteforce crawler education educational ip malicious proof-of-concept ssh testing web

Last synced: 23 Jan 2025

https://github.com/kimseogyu/crawling-music-ranks

Music chart ranking crawling code

crawler jest typescript

Last synced: 21 Dec 2024

https://github.com/josepedrodias/naivebot

Attempt to mimic Googlebot behaviour in Node.js with Nightmare.js

crawler googlebot nightmarejs nodejs robots

Last synced: 21 Jan 2025

https://github.com/qqxs/usda_pomological_watercolors

Crawls data from the U.S. Department of Agriculture's pomological watercolors

crawler koa2 nodejs watercolors

Last synced: 18 Jan 2025

https://github.com/shunk031/amebloscraper

Scraper for Ameblo in Scrapy

ameblo crawler scraper scrapy

Last synced: 10 Jan 2025

https://github.com/manikantasanjay/stackoverflow_tag_generator_webcrawler

StackOverFlow Tag Generator Using a WebCrawler.

crawler python

Last synced: 22 Dec 2024

https://github.com/stephanebruckert/gocrawl

Crawl every page and asset of a web domain

crawler python

Last synced: 21 Dec 2024

https://github.com/billy0402/scrapy-tutorial

A learning project from the book 'Scrapy一本就精通'.

course crawler docker mongodb mysql proxy python redis scrapy splash sqlite ubuntu

Last synced: 14 Jan 2025

https://github.com/billy0402/python-application

A learning project from the book 'Python 技術者們'.

course crawler matplotlib opencv pandas python requests selenium sklearn

Last synced: 14 Jan 2025

https://github.com/billy0402/tibame-python-data-analysis

A learning project from TibaMe Python data analysis course.

ai course crawler jupyter-notebook matplotlib pandas python requests

Last synced: 14 Jan 2025

https://github.com/tetreum/xupopter_client

Simple interface to manage Xupopter recipes as well as its runners.

crawler scrapper scrapping webscraper

Last synced: 17 Dec 2024

https://github.com/tetreum/xupopter_runner

Executes crawling recipes coming from Xupopter Chrome Extension.

crawler scrapper scrapping webscraper

Last synced: 17 Dec 2024

https://github.com/mirusu400/berryz-dl

Batch download berryz webshare files recursively!

berryz berryz-webshare crawler downloader scraper

Last synced: 26 Dec 2024

https://github.com/mg98/ipfs-replicate

Replicate IPFS' distributed data structure locally, based on network traces.

crawler dag ipfs redisgraph scraper

Last synced: 29 Jan 2025

https://github.com/discountry/crawler-microservice

crawler microservice

crawler

Last synced: 14 Dec 2024

https://github.com/jonasrenault/pubchem-api-crawler

Python client for PubChem's API to crawl compounds and their properties using a molecular formula search query.

chemistry crawler molecular-formula pubchem python

Last synced: 27 Jan 2025

https://github.com/mawkler/go-web-crawler

Toy web server written in Go

crawler go

Last synced: 31 Jan 2025

https://github.com/pvital/cra-cra

Another web crawler

crawler python

Last synced: 23 Jan 2025

https://github.com/not-raspberry/aio_crawler

AIO single website crawler

asyncio crawler python3

Last synced: 29 Jan 2025

https://github.com/danielvigaru/easyreach

crawler for faster amazon reach

amazon crawler python

Last synced: 01 Jan 2025

https://github.com/joyceannie/moviespider

This project is used to crawl movie data from IMDb. Scrapy framework is used to extract relevant information like movie title, datePublished, summary, genres, director etc.

crawler datascience python scrapy spider webscraper

Last synced: 29 Jan 2025

https://github.com/dnknth/robot.py

Simple web spider

crawler curio python

Last synced: 23 Jan 2025

https://github.com/yuchenq/comp90055-project

This is the latest version of my project for Comp90055.

couchdb crawler data-visualization python3 textblob tweepy

Last synced: 19 Jan 2025

https://github.com/fritz-c/itunes-stats

Fetch info on podcasts, etc. from iTunes RSS data

crawler itunes

Last synced: 02 Jan 2025

https://github.com/juangesino/ah-bonus-crawler

React + Express application that crawls Albert Heijn's promotions.

crawler crawling express expressjs headless-chrome nodejs react reactjs

Last synced: 23 Jan 2025

https://github.com/eneax/web-crawler

A web crawler built in Node.js

crawler javascript nodejs web-crawler

Last synced: 22 Dec 2024

https://github.com/jjpaulo2/crawler-financeiro

Python module that extracts public data on pension plans from the SUSEP portal.

crawler docker ocr python selenium tesseract

Last synced: 21 Nov 2024

https://github.com/sajjadanwar0/booking.com-scraping

Scraping booking.com using Selenium and Beautiful Soup

crawler data python scraping selenium

Last synced: 14 Jan 2025

https://github.com/govau/warcraider

Convert WARC files into Avro for big data processing

avro bigquery crawler rust warc

Last synced: 21 Jan 2025

https://github.com/semoal/pythoncrawler

Python crawler with XML-RPC & BeautifulSoup

beautifulsoup crawler python wordpress xmlrpc

Last synced: 15 Dec 2024

https://github.com/waived/pastebin-ripper

Scrape all pastes from pastebin page + sub-pages

crawler mass-downloader pastebin-ripper pastebin-scraper python3 ripper scraper

Last synced: 29 Jan 2025

https://github.com/radityaharya/sitesweeper

Sitesweeper is a python package to help you automate your web scraping process, outputting pages to a file

crawler pdf python website-crawler

Last synced: 01 Feb 2025

https://github.com/edumucelli/rubybikes

A set of Bike Sharing System parsers in Ruby

bike-sharing crawler ruby

Last synced: 24 Dec 2024

https://github.com/bwh1270/allrecipes-scraper

crawler food-computing scraper scraping scrapy

Last synced: 24 Jan 2025

https://github.com/sanskar107/c-subject-predictor

Predicts topic of a code.

crawler nlp rnn

Last synced: 21 Jan 2025

https://github.com/codegram01/go-ai-crawl

Golang Web Crawl with AI

ai chromedp crawler golang ollama

Last synced: 23 Jan 2025

https://github.com/yaoshanliang/linkedinspider

Crawl job information from LinkedIn for data analysis

big-data crawler python social-network-analysis

Last synced: 05 Feb 2025

https://github.com/anshiii/pixder

🤔 A spider for pixiv.net

crawler pixiv spider

Last synced: 23 Jan 2025

https://github.com/robin98sun/structured-web-data-crawler

crawler multi-thread structured-web-data

Last synced: 23 Jan 2025

https://github.com/lig8t555/ecommerce

MERN Stack Ecommerce Store | Running In Production | MVP

baidu-tieba baotu bootstrap crawler douban-music ecommerce-platform fofa mongoose quanjing redux shopping-cart shopping-cart-solution stripe taobao-spider

Last synced: 29 Jan 2025

https://github.com/xoraus/revieworacle

The proposed system helps users decide which product to buy. It gathers reviews and product details from multiple websites that sell the product. In addition, the system is trained to analyze the polarity of the product.

ai crawler datascience machinelearning scrappy selenium-webdriver

Last synced: 13 Jan 2025

https://github.com/claudio-code/nap-web-crawler

Crawler created to find broken links in the docs of frameworks and languages

crawler

Last synced: 05 Feb 2025

https://github.com/iyowei/fs-deep-walk

Focused on deep scanning of a specified disk location.

crawler directory file folder folder-tooling fs nodejs recursively-search scan scandir scandir-recursive scanner walker

Last synced: 29 Dec 2024

https://github.com/iamtonmoy0/sitemap-crawler

site map crawler with golang and goquery

crawler

Last synced: 05 Jan 2025

https://github.com/tetreum/price-crawler

Article price crawler

crawler nodejs

Last synced: 17 Dec 2024

https://github.com/roc41d/http-web-crawler

Http web crawler with Nodejs + TDD

crawler http javascript jest jest-test nodejs webcrawler

Last synced: 21 Jan 2025

https://github.com/igor-karpukhin/web-crawler

Web site crawler

crawler go website

Last synced: 03 Feb 2025

https://github.com/apexcaptain/allergy-alert

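Crawls the cafeteria information published on a certain university's website for today's date and reports the menus, grouped by hall and menu category, along with the allergy-causing foods in each menu item.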

crawler docker expressjs puppeteer reactjs sqlite typescript

Last synced: 29 Jan 2025

https://github.com/qzcool/uscis-case-status-estimation-system-stat-ez

Estimates time of case results arrival, for applicants who are waiting for their USCIS case results with the receipt numbers at hand.

beautifulsoup crawler immigration web

Last synced: 21 Jan 2025

https://github.com/g-ongenae/morphalou-crawler

A Crawler for CNRTL's Morphologie words

crawler french lexical-databases list-of-words words

Last synced: 15 Oct 2024

https://github.com/eklem/vinmonopolet-crawler

Crawling Vinmonopolet-data and indexing it to a norch search index

crawler dataset javascript norch search-engine

Last synced: 01 Feb 2025

https://github.com/splorg/sage

A scraper to get every quote from a book off of Goodreads.

books crawler datamining goodreads goodreads-data python scraper scrapy webcrawling webscraping

Last synced: 21 Jan 2025

https://github.com/ariefrahmansyah/crawler

Simple website crawler using Go programming language.

crawler go

Last synced: 01 Feb 2025

https://github.com/chenbingwei1201/threads_scraper

A Python package for scraping Threads posts.

chromedriver crawler csv-format pypi pypi-package python python3 scraper scraping-websites

Last synced: 05 Feb 2025

https://github.com/ryu1kn/procedural-page-crawler

Page Crawler. Tell it where to go and what to look for.

crawler npm-package scraper

Last synced: 03 Feb 2025

https://github.com/kyagara/lol-match-crawler

Very simple crawler for League of Legends matches.

crawler golang league-of-legends pgx postgres riot-games sql

Last synced: 29 Jan 2025

https://github.com/vishaalpkumar/skysift

A distributed search engine from scratch

aws crawler css distributed-systems html java search-engine

Last synced: 22 Dec 2024

https://github.com/igapyon/selecrawler

Simple selenium based web crawler

chrome crawler java selenium web

Last synced: 06 Jan 2025

https://github.com/kofj/octopus

Octopus is open source software for collecting data from web pages.

crawler

Last synced: 27 Jan 2025

https://github.com/apurvsikka/mediaverse

MediaVerse is a versatile search engine for various media types such as anime, books and drama

anime anime-api anime-api-free api-rest bun crawler extensions extensions-pack free-manga kdrama lightnovel manga manga-api manga-api-free manga-crawler manga-reader movies netflix ts tv

Last synced: 03 Feb 2025

https://github.com/tomfran/crawler

A web crawler written in Rust

bloom-filter crawler rust simhash

Last synced: 06 Jan 2025

https://github.com/intina47/ee_error

Implementation of a web crawler using C++

cpp crawler curl gumbo libcurl stanford-nlp web

Last synced: 01 Feb 2025

https://github.com/nick121212/crawler.v5

crawler nodejs

Last synced: 27 Jan 2025

https://github.com/joaooliveirapro/trawlergo

Basic HTTP Crawler in Golang

crawler go golang http

Last synced: 13 Jan 2025

https://github.com/agucova/needs-seeding

🌱 A script that downloads a list of .torrent files from a website, checks their health and lists the ones that need more seeding.

crawler sci-hub torrents

Last synced: 09 Jan 2025

https://github.com/phatpham9/scraper.fun

Building, using & sharing HTML scrapers is way more fun!

crawler html-scraper scraper

Last synced: 30 Jan 2025

https://github.com/seanghay/wpget

⚡️wpget - A tool for downloading all posts from a WordPress website via public JSON API

crawler wordpress wp-json

Last synced: 22 Nov 2024

https://github.com/ryoii/hook

A declarative Java crawler framework

crawler declarative java java-crawler-framework jdk11

Last synced: 24 Jan 2025

https://github.com/viper373/xovideos

A personalized video download tool built for users

crawler downloader githubactions m3u8 mongodb mp4 pornhub python

Last synced: 23 Jan 2025

https://github.com/rutopio/crawler-2020-taiwanese-election-results

Crawler for the 2020 Taiwan election results, using the at-large party-list vote as an example

crawler python

Last synced: 31 Jan 2025

https://github.com/allancapistrano/anime-sheets

Crawler that fetches anime information and saves it to a spreadsheet.

anime crawler google-sheets google-sheets-api

Last synced: 23 Jan 2025

https://github.com/allancapistrano/steam.py

An API wrapper for Steam written in Python.

crawler python steam

Last synced: 23 Jan 2025

https://github.com/bradsec/gomine

A Go CLI tool to quickly crawl and mine (download) specific file types from websites.

cli crawler golang terminal-based

Last synced: 22 Dec 2024

https://github.com/pyohei/rirakkuma-crawller

Crawler for my hobby.🐻

crawler python rirakkuma

Last synced: 29 Dec 2024

Crawler Awesome Lists

awesome-crawler (101)
awesome-python-primer (68)
awesome-fingerprinting (48)
awesome-digital-preservation (45)
awesome-web-scraping (115)

Crawler Categories

Core Libraries (61)
2.6 Machine Learning (50)
Research (31)
Python (18)
Replay tools (18)
1.1 Language Basics (16)
Libraries & Projects (13)
Fingerprinting Evasion (13)
Sites (12)
2.4 Web Front End (10)
2.1 Crawler Basics (9)
3. Databases (8)
2.5 Data Analysis (7)
Web archiving (7)
Java (7)
4. Asynchronous I/O (6)
Other digital objects (6)
2.2 Flask Framework (4)
2.3 Django Framework (4)
Standards and specifications (4)