Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

GitHub: https://github.com/topics/crawler
Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
Last updated: 2025-01-16 00:06:25 UTC

https://github.com/raspi/scrapy-corsair

Web crawler for Corsair (corsair.com)

crawler hardware memory scrapy spider

Last synced: 08 Jan 2025

https://github.com/jarircse16/bot_detection_firewall

Detects and blocks generic crawlers on your website.

bot crawler php

Last synced: 30 Dec 2024

https://github.com/raspi/scrapy-amigaremix

amiga crawler music python scrapy spider

Last synced: 08 Jan 2025

https://github.com/dominikrys/web-scraper

🎬 IMDB Web Scraper in Go

crawler go mongodb

Last synced: 10 Jan 2025

https://github.com/raspi/scrapy-crucial

Web crawler for Crucial (crucial.com)

crawler hardware memory scrapy spider

Last synced: 08 Jan 2025

https://github.com/viko16/hatcher

🐣[WIP] Provides APIs by simple configuration.

api api-server cli crawler koa-middleware nodejs spider

Last synced: 01 Oct 2024

https://github.com/jakubboucek/blog.cz-backup-robot

crawler

Last synced: 08 Jan 2025

https://github.com/coding-dream/aspider

A spider run on Android Platform

crawler jsoup spider

Last synced: 11 Jan 2025

https://github.com/zhanziyuan/webdownloader

Download elements from the specified website.

crawler downloader image image-downloader python python-crawler web

Last synced: 08 Jan 2025

https://github.com/kevincolemaninc/mm-crawler

Scrapes meetme user profiles

crawler docker fake-data meetme ruby scraper sidekiq

Last synced: 01 Jan 2025

https://github.com/raspi/scrapy-kuntavaalit2021-almamedia

Fetch Almamedia kuntavaalit 2021 data

crawler mirror python scrapy spider webcrawler

Last synced: 10 Nov 2024

https://github.com/estavadormir/scrappist

A web scraper that takes one or more URLs and converts them into a PDF.

bun cli crawler pdf-generation

Last synced: 11 Jan 2025

https://github.com/smikodanic/dex8-sdk

DEX8 SDK is a software development kit for the DEX8.com platform.

crawler crawler-engine data-extraction dex8 scraper scraping-websites spider

Last synced: 26 Dec 2024

https://github.com/raspi/scrapy-kuntavaalit2021-sanoma

Fetch Sanoma kuntavaalit 2021 data

crawler mirror python scrapy spider webcrawler

Last synced: 10 Nov 2024

https://github.com/raspi/scrapy-transcend

Crawler for transcend (us.transcend-info.com)

crawler hardware memory scrapy spider

Last synced: 08 Jan 2025

https://github.com/raspi/scrapy-kuntavaalit2021-keskisuomalainen

Fetch Keskisuomalainen kuntavaalit 2021 data

crawler mirror python scrapy spider webcrawler

Last synced: 10 Nov 2024

https://github.com/wafflecomposite/yggdrasil-crawler-python

Small Yggdrasil network crawler with CLI, written in Python3

crawler mesh-networks no-dependencies python python3 yggdrasil yggdrasil-api yggdrasil-network

Last synced: 23 Nov 2024

https://github.com/semoal/pythoncrawler

Python crawler with XML-RPC and BeautifulSoup

beautifulsoup crawler python wordpress xmlrpc

Last synced: 15 Dec 2024

https://github.com/tatamiya/gas-new-books-crawler

Crawls new book information from Hanmoto.com (版元ドットコム, https://www.hanmoto.com/)

crawler gas

Last synced: 20 Nov 2024

https://github.com/govau/warcraider

Convert WARC files into Avro for big data processing

avro bigquery crawler rust warc

Last synced: 20 Nov 2024

https://github.com/mindfiredigital/deepscanbot

It allows you to crawl websites with various configurations, including crawl depth, timeout settings, proxy support, and output options.

bot crawl crawler go golang google webcrawler

Last synced: 28 Dec 2024

https://github.com/jyasskin/pbot-crawler

Crawler for PBOT's website to show what has changed.

crawler

Last synced: 30 Nov 2024

https://github.com/terminaldweller/crawley

A creepy crawler that runs as a sleepy daemon.

crawler daemon python3

Last synced: 26 Dec 2024

https://github.com/nblthree/python-url-crawler

Simple web crawler

crawler python3

Last synced: 03 Dec 2024

https://github.com/pmuens/crawler

Multi-threaded Web crawler with support for custom fetching and persisting logic

crawler crawler-engine rust rust-lang web-crawler web-crawling

Last synced: 26 Dec 2024

https://github.com/bujosa/aldebaran

Example of using App Engine with Python 3, ThreadPool, and web scraping

appengine crawler flask gcp python3 thread-pool

Last synced: 20 Nov 2024

https://github.com/bwh1270/allrecipes-scraper

crawler food-computing scraper scraping scrapy

Last synced: 24 Nov 2024

https://github.com/sanskar107/c-subject-predictor

Predicts the topic of a piece of code.

crawler nlp rnn

Last synced: 20 Nov 2024

https://github.com/zigai/crawlwright

Web crawling framework powered by Playwright

crawler crawling playwright python scraping wrighter

Last synced: 07 Dec 2024

https://github.com/mg98/ipfs-replicate

Replicate IPFS' distributed data structure locally, based on network traces.

crawler dag ipfs redisgraph scraper

Last synced: 30 Nov 2024

https://github.com/anshiii/pixder

🤔 A spider for pixiv.net

crawler pixiv spider

Last synced: 22 Nov 2024

https://github.com/amirsorouri00/crawler

Public Python 2 PageRank projects that have been converted to Python 3.

crawler page-rank python

Last synced: 18 Nov 2024

https://github.com/jjpaulo2/crawler-financeiro

Python module that extracts public data on pension plans from the SUSEP portal.

crawler docker ocr python selenium tesseract

Last synced: 21 Nov 2024

https://github.com/ark930/douban-movie-crawler

Douban movie review crawler

crawler douban movie python

Last synced: 24 Nov 2024

https://github.com/gabrielolobo/crawley

This project is designed to run crawlers and process the results based on the specified output format. It takes command-line arguments to select the crawler and output format.

crawler poetry python scrapping

Last synced: 11 Jan 2025

https://github.com/liebki/githubnet

This library retrieves data from GitHub, such as trending repositories, user profiles, users' repositories, and related information.

crawler crawling github github-trending htmlagilitypack microsoft

Last synced: 24 Nov 2024

https://github.com/eneax/web-crawler

A web crawler built in Node.js

crawler javascript nodejs web-crawler

Last synced: 22 Dec 2024

https://github.com/okwilkins/web-crawler

This program will crawl through entire domains, exporting every link it can find into a txt file.

crawler crawling files html htmlparser python python3 reader scraper threading threads web writer

Last synced: 20 Nov 2024

https://github.com/gnehs/twse-financial-ratios-crawler

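Automatically fetches financial ratio data from the Market Observation Post System (公開資訊觀測站) for a specified list of stock symbols and calculates the averages.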

crawler nodejs

Last synced: 26 Dec 2024

https://github.com/yuchenq/comp90055-project

This is the latest version of my project for Comp90055.

couchdb crawler data-visualization python3 textblob tweepy

Last synced: 18 Nov 2024

https://github.com/dpbm/opendatasus-crawler

A simple crawler using puppeteer

brazil chrome crawler csv datasus nodejs opendatasus pdf puppeteer screenshot sus

Last synced: 18 Nov 2024

https://github.com/lilchen96/pokemon-crawler

Crawl JSON-formatted data for Pokémon, based on the PokeAPI.

crawler pokemon

Last synced: 18 Nov 2024

https://github.com/ggteixeira/corpus-cleaner

Linguistic tool (made by a linguist, for linguists) that scrapes corpora, automatically cleans them up, and generates n-grams.

beautifulsoup4 bs4 corpora corpus corpus-linguistics crawler linguistics nlp python scraper web-scraping

Last synced: 11 Jan 2025

https://github.com/not-raspberry/aio_crawler

AIO single website crawler

asyncio crawler python3

Last synced: 01 Dec 2024

https://github.com/grayhat12/grawler

A web-based crawler that takes two inputs (search item, number of sites to search) and currently displays readable content in text format; the code can be modified to display the HTML instead.

crawler scraping scraping-websites scrapper scrapy-crawler

Last synced: 06 Dec 2024

https://github.com/zaneh/ocw-crawler

Crawl MIT OpenCourseWare courses with Kimurai. Not affiliated.

crawler kimurai mit ocw opencourseware spider

Last synced: 15 Jan 2025

https://github.com/beckkramer/puppeteer-traverse

Puppeteer utility to easily run a function you define per route on a set of routes.

crawler crawling nodejs puppeteer

Last synced: 18 Nov 2024

https://github.com/leonardopinho/instagramfeed

Image list based on a tag for the Instagram feed.

crawler instagram python

Last synced: 07 Dec 2024

https://github.com/jlenon7/sef_automation

📑 Crawler that automatically enrols in open vacancies on the SEF website.

athenna crawler esm nodejs playwright portugal residence sef typescript

Last synced: 13 Dec 2024

https://github.com/appliedsoul/headless-screenshot

High-level library for taking screenshots of websites, based on headless Chrome (Puppeteer)

crawler headless-chromium javascript nodejs scrapper screenshot testing

Last synced: 18 Nov 2024

https://github.com/arman-aminian/divar-text-exploring

The first exercise for Dr. Asgari's NLP course: Data Exploration

crawler natural-language-processing nlp preprocessing scrapy

Last synced: 08 Jan 2025

https://github.com/fscotto/noahcrawler

A simple web crawler written in Java to support a database of Italian regions.

crawler java jsoup-library

Last synced: 21 Nov 2024

https://github.com/nick121212/crawler.v5

crawler nodejs

Last synced: 28 Nov 2024

https://github.com/josepedrodias/naivebot

An attempt to mimic Googlebot behaviour in Node.js with Nightmare.js

crawler googlebot nightmarejs nodejs robots

Last synced: 20 Nov 2024

https://github.com/hyancat/netease-music-api

api crawler music netease

Last synced: 06 Jan 2025

https://github.com/curegit/nominium

Crawler system that sends notifications, by email or other means, about newly listed items on consumer-to-consumer marketplace sites

c2c chromium crawler ecommerce firefox selenium shopping webdriver

Last synced: 17 Nov 2024

https://github.com/qqxs/usda_pomological_watercolors

Crawls data from the USDA Pomological Watercolors collection

crawler koa2 nodejs watercolors

Last synced: 17 Nov 2024

https://github.com/ndoolan360/go-crawler

A simple web crawling program written in Go in an afternoon. 🕷️🕸️

afternoon-project crawler scraper

Last synced: 17 Nov 2024

https://github.com/der3318/daily-pixiv

Integrated Flow - Line Notification of Top Ranked Pixiv Illustrations

crawler line-notify pixiv workflow

Last synced: 13 Jan 2025

https://github.com/roc41d/http-web-crawler

Http web crawler with Nodejs + TDD

crawler http javascript jest jest-test nodejs webcrawler

Last synced: 21 Nov 2024

https://github.com/bockstaller/europarl-crawler

Crawler for the documents published by the European Parliament

crawler datamining elasticsearch europarl-crawler european european-parliament opendata parliament union

Last synced: 06 Jan 2025

https://github.com/jackfsuia/chats-crawler

Crawls Discourse forum chat data and parses it on the fly, ready for direct use in LLM instruction fine-tuning.

crawler fine-tuning finetune-llm gpt html-css-javascript instruction-tuning llm llm-training llms nlp nlp-parsing parser

Last synced: 13 Jan 2025

https://github.com/jamesponddotco/wikiextract

[READ-ONLY] A word extractor for Wikipedia articles.

crawler crawling diceware go wikipedia wikipedia-crawler word-extraction

Last synced: 21 Nov 2024

https://github.com/pyohei/rirakkuma-crawller

Crawler for my hobby.🐻

crawler python rirakkuma

Last synced: 29 Dec 2024

https://github.com/qzcool/uscis-case-status-estimation-system-stat-ez

Estimates time of case results arrival, for applicants who are waiting for their USCIS case results with the receipt numbers at hand.

beautifulsoup crawler immigration web

Last synced: 21 Nov 2024

https://github.com/jonasrenault/pubchem-api-crawler

Python client for PubChem's API to crawl compounds and their properties using a molecular formula search query.

chemistry crawler molecular-formula pubchem python

Last synced: 28 Nov 2024

https://github.com/matheusfaustino/jazzmaster_crawler

A crawler for fetching the audio episodes of a radio program called Jazzmaster

crawler python scrapy

Last synced: 28 Dec 2024

https://github.com/pjt3591oo/spider-base_crawler

Building a Scrapy-based crawler

crawler python scrapy spider

Last synced: 26 Dec 2024

https://github.com/homuchen/instagram-crawler

Instagram crawler

crawler instagram nodejs-crawler

Last synced: 01 Dec 2024

https://github.com/joyceannie/moviespider

This project crawls movie data from IMDb. The Scrapy framework is used to extract relevant information such as movie title, datePublished, summary, genres, and director.

crawler datascience python scrapy spider webscraper

Last synced: 01 Dec 2024

https://github.com/ryoii/hook

A declarative Java crawler framework

crawler declarative java java-crawler-framework jdk11

Last synced: 24 Nov 2024

https://github.com/949886/pixiv-crawler

Pixiv illustration info crawler to local MySQL database.

crawler mysql pixiv

Last synced: 28 Dec 2024

https://github.com/tssujt/async-crawler-sample

A simple crawler sample based on asyncio~

aiohttp asyncio crawler

Last synced: 21 Nov 2024

https://github.com/matheusfaustino/phrawl

Phrawl: A web crawling framework in PHP (or it seems so)

crawler crawling crawling-framework php scraper wip

Last synced: 28 Dec 2024

https://github.com/jjeffcaii/ok-spider

a simple web crawler like scrapy

crawler nodejs scrapy spider

Last synced: 25 Dec 2024

https://github.com/humbertodias/go-nie-crawler

Simple crawler that extracts some useful information from sede.administracionespublicas.gob.es.

crawler golang

Last synced: 13 Jan 2025

https://github.com/splorg/sage

A scraper to get every quote from a book off of Goodreads.

books crawler datamining goodreads goodreads-data python scraper scrapy webcrawling webscraping

Last synced: 20 Nov 2024

https://github.com/zawlinnnaing/my-wiki-crawler

A simple program for crawling Burmese Wikipedia using the MediaWiki API.

crawler myanmar-tools python wikipedia-api

Last synced: 25 Dec 2024

https://github.com/nowshad-sust/corona

A simple data endpoint for coronavirus updates

api corona coronavirus-updates crawler dcoker-compose excel nodejs

Last synced: 23 Nov 2024

https://github.com/sahaavi/web-scraping

Learn Web-Scraping using BeautifulSoup, Selenium and Scrapy with hands on projects!

beautifulsoup4 crawler headless-mode pagination scrapy selenium spider splash web-scraper web-scraping

Last synced: 26 Dec 2024

https://github.com/bingxyz/btcethcrawler

Telegram broadcast channel for Bitcoin and Ethereum

bash bash-script crawler telegram-bot

Last synced: 21 Nov 2024

https://github.com/tormol/zenphoto-dl

A script for recursively downloading all pictures from zenphoto-based photo albums.

crawler python-script

Last synced: 03 Dec 2024

https://github.com/miiraak/scrapc

C# WinForms crawler and scraper for web content

crawler csharp html scraper url web windows-forms

Last synced: 13 Oct 2024

https://github.com/daviddavo/blogspot-crawler

Crawler for blogspot and blogger with beautifulsoup

crawler hacktoberfest python

Last synced: 23 Nov 2024

https://github.com/naem1023/comic-crawler

Comic crawler.

beautifulsoup crawler python3

Last synced: 27 Nov 2024

https://github.com/kyagara/lol-match-crawler

Very simple crawler for League of Legends matches.

crawler league-of-legends pgx postgres riot-games sql

Last synced: 01 Dec 2024

https://github.com/vishaalpkumar/skysift

A distributed search engine from scratch

aws crawler css distributed-systems html java search-engine

Last synced: 22 Dec 2024

https://github.com/joaooliveirapro/trawlergo

Basic HTTP Crawler in Golang

crawler go golang http

Last synced: 13 Jan 2025

https://github.com/intina47/ee_error

Implementation of a web crawler using C++

cpp crawler curl gumbo libcurl stanford-nlp web

Last synced: 06 Dec 2024

https://github.com/manikantasanjay/stackoverflow_tag_generator_webcrawler

StackOverFlow Tag Generator Using a WebCrawler.

crawler python

Last synced: 22 Dec 2024

https://github.com/tomfran/crawler

A web crawler written in Rust

bloom-filter crawler rust simhash

Last synced: 06 Jan 2025

https://github.com/alphabs/navercafeclient

.NET library for crawling Naver Cafe post lists

crawler crawling dotnet naver naver-api naver-cafe web-scraper web-scraping

Last synced: 29 Nov 2024

https://github.com/lsongdev/node-crawler

simple crawler

crawler node-crawler

Last synced: 02 Jan 2025

https://github.com/licoy/win4000-images-crawler

Crawls and downloads image wallpapers from win4000.com, based on Scrapy

crawler python scraper

Last synced: 08 Dec 2024

https://github.com/bennettdams/vace-it-crawler

Python (Scrapy) crawler to access data of FACEIT.com

crawler python scrapy

Last synced: 13 Jan 2025

https://github.com/sxoxgxi/webcrawler

A multi-threaded web crawler

crawler python webcrawling

Last synced: 25 Nov 2024

https://github.com/krishpranav/gozap

⚡️ Multiple target ZAP Scanning made in go

cli crawler go go-crawler golang zap

Last synced: 06 Dec 2024

https://github.com/pinpox/go-random-downloader

Download HTML using "Random Page"

crawler golang

Last synced: 29 Nov 2024

https://github.com/guanbinrui/img-crawler

An image crawler.

crawler

Last synced: 26 Dec 2024

https://github.com/istador/mediaindexer

Software for a cronjob to crawl the ViMP media center and generate an index for it as a static website.

crawler website

Last synced: 22 Nov 2024

Crawler Awesome Lists

awesome-crawler (101)
awesome-python-primer (68)
awesome-digital-preservation (45)
awesome-fingerprinting (48)

Crawler Categories

2.6 Machine Learning (50)
Research (31)
Python (18)
Replay tools (18)
1.1 Language Basics (14)
Libraries & Projects (13)
Fingerprinting Evasion (13)
Sites (12)
2.4 Web Front End (10)
2.1 Crawler Basics (9)
3. Databases (8)
Web archiving (7)
Java (7)
2.5 Data Analysis (7)
Other digital objects (6)
4. Asynchronous I/O (6)
Standards and specifications (4)
Social Networks (4)
2.3 Django Framework (4)
2.2 Flask Framework (4)