Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

GitHub: https://github.com/topics/crawler
Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
Last updated: 2024-11-18 00:06:04 UTC

https://github.com/jean-baptiste-camps/iiif-crawler

Interrogate IIIF servers and get images of manuscripts

crawler iiif iiif-image manuscripts

Last synced: 11 Oct 2024

https://github.com/feedeo/youtube-channel-crawler

YouTube Channel :tv: Crawler

crawler youtube youtube-channel

Last synced: 11 Oct 2024

https://github.com/tikazyq/github-crawler

GitHub repositories crawler

crawler scrapy

Last synced: 11 Oct 2024

https://github.com/giscafer/ziroom-crawler

Crawler for Ziroom (自如友家) shared-rental listings that monitors listing status, with the goal of grabbing a room quickly

crawler nodejs

Last synced: 17 Nov 2024

https://github.com/aprilnea/xjtlu

This is how to get all the network resources of XJTLU.

crawler gateway http-auth python spider web-crawler xjtlu

Last synced: 15 Nov 2024

https://github.com/marzzzello/appstore_crawler

(mirror) Download the IDs and metadata of all apps in the Apple App Store

apple appstore crawler metadata scrapy

Last synced: 05 Nov 2024

https://github.com/spencerlepine/readme-crawler

A Node.js web crawler to download README files and follow contained links. Fetch repositories from a valid GitHub URL

crawler javascript node nodejs readme scraper web-crawler webcrawer

Last synced: 13 Nov 2024

https://github.com/mcstreetguy/crawler

An advanced web-crawler written in PHP.

composer composer-library crawler crawler-engine guzzle http-requests php php-7 php-library web-crawler webcrawler

Last synced: 12 Oct 2024

https://github.com/moehmeni/ezweb

Easy to use web page analyzer

analyzer crawler scraper text-analysis text-classification text-mining webcrawler webcrawling webpage webscraper webscraping www

Last synced: 05 Nov 2024

https://github.com/manuel-lang/autonomous-semantic-search-engine

Submission for HackDataKIBots 2018 - Web crawler combined with document analysis

crawler hackathon machine-learning mannheim microsoft natural-language-processing natural-language-understanding nextiteration rnv semantic-search textract

Last synced: 13 Nov 2024

https://github.com/iml1111/toonkor_collector

Toonkor comics collector

crawler python

Last synced: 21 Oct 2024

https://github.com/wenyalintw/job-scraper-bot

A Telegram bot built for fun for a friend, deployed to Heroku

amazon-web-services aws-s3 boto3 crawler google-drive google-drive-api heroku heroku-deployment python-telegram-bot scraper scraping scrapy telegram telegram-bot telegram-bot-api web-scraping

Last synced: 11 Nov 2024

https://github.com/cuerz/douban-top

Golang crawler that scrapes Douban ranking lists

crawler douban golang goquery

Last synced: 08 Nov 2024

https://github.com/robmch/mindfactory_crawling

A Python 3 Crawler for Mindfactory.de

crawler crawling data webcrawler webcrawling

Last synced: 17 Nov 2024

https://github.com/giscafer/airlevel-crawler

A demo crawler for air-level.com

crawler java nodejs

Last synced: 17 Nov 2024

https://github.com/stopka/fedicrawl

Collect feeds to follow on Fediverse nodes.

crawler docker fediverse nodejs prisma typescript

Last synced: 05 Nov 2024

https://github.com/pjt3591oo/golang-crawler

Building a crawler with Golang

crawler golang

Last synced: 06 Nov 2024

https://github.com/pjt3591oo/news-crawler

crawler data python

Last synced: 06 Nov 2024

https://github.com/vinouno/BilibiliDanmuCrawler

A Python project that crawls danmaku (bullet comments) from bilibili.com and generates a word cloud

crawler python

Last synced: 27 Oct 2024

https://github.com/itszeeshan/crawlinit

A web crawler written in python3

appsec bugbounty bugbounty-tool bugbountytips crawler crawler-python enumeration infosec python recon reconnaissance scanner url web

Last synced: 12 Oct 2024

https://github.com/mirocow/yii2-crawler

Concurrent HTTP crawler for Yii2

concurrency crawler guzzle yii2-extension

Last synced: 16 Nov 2024

https://github.com/foolin/scrago

A simple, fast, extensible page-crawling framework for Golang

crawler go scrago scrapy

Last synced: 09 Nov 2024

https://github.com/roccomuso/is-bing

Verify that a request is from Bing crawlers using Bing's DNS verification steps

bing bot check crawler dns ip js nodejs verify

Last synced: 17 Oct 2024

https://github.com/leomaurodesenv/smm-course-search

A package for searching courses in Super Mario Maker

bookmark-site crawler javascript json mario-game mario-maker nodejs

Last synced: 02 Nov 2024

https://github.com/leelow/nightmare-screenshot-selector

👻 📷 A Nightmare plugin to easily take screenshots.

crawler headless-browsers javascript js nightmare nightmarejs nodejs plugin webcrawler

Last synced: 15 Nov 2024

https://github.com/danielmorell/se_bot_checker

Validate search engine user agents and IP addresses.

crawler googlebot python search-engine spider

Last synced: 15 Oct 2024

https://github.com/cr0hn/feed-to-exporter

Get an RSS feed and export it as a WordPress post

crawler feed rss wordpress

Last synced: 07 Nov 2024

https://github.com/floscha/genius-lyrics-crawler

A concurrent crawler to retrieve song lyrics from Genius

celery crawler fluentd genius lyrics mongodb python

Last synced: 09 Nov 2024

https://github.com/holmofy/spring-spider

Spring Spider App Utility Library.

crawler java spider spring spring-spider

Last synced: 27 Oct 2024

https://github.com/coghost/iparse

Extracts HTML/JSON content identified by CSS selectors (with bs4), with YAML config support

crawler parser parser-library python xkcd yaml

Last synced: 09 Nov 2024

https://github.com/licoy/java-crawler

Crawls data in Java using the jsoup crawler framework

crawler java jsoup

Last synced: 19 Oct 2024

https://github.com/sayakie/pixiv-crawler

Crawls images from Pixiv 🚀

crawler nodejs pixiv typescript

Last synced: 28 Oct 2024

https://github.com/mrrfv/webarchive

Crawls websites and saves found URLs to a file.

archive archiveteam archiving crawler crawling ia internet-archive scraper web-archiving web-scraping

Last synced: 27 Oct 2024

https://github.com/ivan-alone/instastories-saver-cpp

Program for saving Instagram Stories, rewritten in C++

api backup crawler grambler gramblr insta instagram instagram-stories instastories-saver instastory stories

Last synced: 31 Oct 2024

https://github.com/kernelerr/pixivsync

Pixiv image download and sync tool

crawler pixiv pixiv-crawler python

Last synced: 12 Oct 2024

https://github.com/code-inside/sloader

Worker that loads and retrieves data from "slow" endpoints.

crawler drop json yml

Last synced: 16 Nov 2024

https://github.com/dist1ll/hltv-rust

A client to fetch and parse data from HLTV.org

api crawler hltv parser rust

Last synced: 14 Oct 2024

https://github.com/liyifeng1994/go-crawler

A distributed crawler project based on Golang

crawler elastic elasticsearch golang

Last synced: 12 Nov 2024

https://github.com/birkhofflee/blizzard_forum.js

An unofficial Node.js API for Blizzard Forums. (works in 2019)

api crawler web

Last synced: 18 Nov 2024

https://github.com/haxzie-xx/crode.js-node-web-crawler

Node.js Crawler built for open FTP sites for movie link collection.

crawler nodejs

Last synced: 01 Nov 2024

https://github.com/vinitkumar/pycrawler

Crawler in Python 3.7, 3.8, 3.9, and PyPy3

crawler python python35 python36 utils

Last synced: 28 Oct 2024

https://github.com/frectonz/rampilo

A telegram crawler

crawler rust telegram telegram-crawler

Last synced: 14 Nov 2024

https://github.com/ernesto-jimenez/crawler

Easily crawl websites in Go.

crawler golang

Last synced: 13 Oct 2024

https://github.com/surelle-ha/dogma

Dogma is a CLI tool that enables interaction with the GitHub API for the purpose of searching .env files with specified keywords. You can configure a GitHub token and use the crawler to search for keys in .env files across public repositories.

cli crawler github nodejs

Last synced: 10 Nov 2024

https://github.com/vmdang/historycrawler

An OOP project that collects historical data about Vietnam and displays it

crawler gson java javafx jsoup

Last synced: 11 Oct 2024

https://github.com/zurdi15/nbz

Bot to automate internet browsing

automation bot browser-automation browsermob-proxy crawler selenium testing web

Last synced: 15 Oct 2024

https://github.com/juliandavidmr/raptor

Lightweight tool for scanning web sites, works as spider. Once executed, starts scanning pages looking for websites to visit, with automatic indexing.

crawler kotlin mysql spider

Last synced: 09 Nov 2024

https://github.com/leo9960/bilibili_live_danmu_crawler

Danmaku (bullet comment) scraper for Bilibili live streams

bilibili crawler danmu live

Last synced: 10 Nov 2024

https://github.com/alishahbazi81/jobcrawler

Job crawler robot that finds jobs on job board platforms such as LinkedIn, Glassdoor, and Indeed based on their post time and sends them to a Telegram channel

asp-net-core crawler jobs jobsearch telegram telegram-bot

Last synced: 11 Nov 2024

https://github.com/vivekg13186/easy_web_crawler

Web crawler built around Puppeteer to crawl AJAX/JavaScript-enabled pages.

crawler spider web

Last synced: 28 Oct 2024

https://github.com/ozansz/github-crawler

A basic utility for crawling users and e-mails of users

crawler github python python3

Last synced: 16 Oct 2024

https://github.com/yakuza8/coronavirus-timeseries-predictor

Timeseries analyzer for coronavirus with recurrent neural network

asyncio beautifulsoup4 corona coronavirus coronavirus-analysis coronavirus-crawler coronavirus-dataset covid covid-19 covid19-data crawler python-3-6 python3 python36 rnn web-scrapper

Last synced: 12 Oct 2024

https://github.com/dylanhogg/legaldata

Provides access to Australian legal data

crawler data law lawtech legal legaltech

Last synced: 27 Oct 2024

https://github.com/roccomuso/is-baidu

Verify that a request is from Baidu crawlers using DNS verification

baidu crawler dns ip js nodejs verification

Last synced: 17 Oct 2024

https://github.com/roccomuso/is-duckduck

Verify that a request is from DuckDuckBot, the Web crawler for DuckDuckGo

crawler duckduck duckduckbot duckduckgo ip js nodejs verify web

Last synced: 17 Oct 2024

https://github.com/arshadkazmi42/github-scanner-local

Locally scan all the repositories of a GitHub organization

bounty bug bug-bounty crawler github local no-api scanner

Last synced: 28 Oct 2024

https://github.com/arshadkazmi42/scraplink

Scraplink library for scraping links and image URLs from a webpage

crawler mongdb nodejs scraplink url web

Last synced: 28 Oct 2024

https://github.com/dnlzrgz/winzig

A tiny search engine for personal use.

async cli crawler feeds lofi python python3 rss-feed rss-reader sqlalchemy sqlite sqlite3

Last synced: 05 Nov 2024

https://github.com/chenmozhijin/mediawikiextractor

A Python script for extracting data from a MediaWiki website and saving it as JSON.

crawler crawler-python crawling extractor json mediawiki python regex web-crawler

Last synced: 09 Oct 2024

https://github.com/xdk78/grabbi

grabbi: a simple web scraper/crawler

crawler html scraper web-scraper

Last synced: 23 Oct 2024

https://github.com/testica/a3hrgo-sdk

a3HRgo SDK to automate your reports

a3hrgo crawler javascript puppeteer

Last synced: 10 Oct 2024

https://github.com/thaddeusjiang/campcat

Campsite reservation monitoring bot

bot crawler telegram

Last synced: 25 Oct 2024

https://github.com/ruedigervoigt/salted

Smart, Asynchronous Link Tester with Database backend: works with HTML, Markdown and TeX files

asyncio crawler html-files hyperlinks latex linkchecker markdown pandoc python

Last synced: 11 Oct 2024

https://github.com/jmkim/stock-crawler

Universal Stock Crawler

crawler stock stock-market yahoo-finance

Last synced: 13 Oct 2024

https://github.com/tokenmill/crawling-framework-example

Demonstration of how to use the Crawling Framework to set up a simple science news crawler and store results in Elasticsearch. Use this configuration to set up your own crawler.

crawler crawling-framework elasticsearch storm-crawler

Last synced: 10 Nov 2024

https://github.com/hrvadl/goweekly

Application for querying top articles from https://golangweekly.com/, translating them to Ukrainian, and sending them to a Telegram channel

article chatgpt crawler go golang openai-api telegram telegram-bot

Last synced: 13 Oct 2024

https://github.com/erikjiang/book_crawler

:lizard: book_crawler

crawler douban golang

Last synced: 14 Oct 2024

https://github.com/gatenlp/wpextract

Create datasets from WordPress sites for research or archiving

corpus crawler nlp text-extraction text-mining web-scraping wordpress

Last synced: 13 Nov 2024

https://github.com/ayusharma/rss-parser

A simple crawler in ReactJS

crawler reactjs rss-parser

Last synced: 13 Oct 2024

https://github.com/hangyan/generate-cs-word-dict

Generate a word dict for CS from stackoverflow/github tags

crawler dict github python word

Last synced: 15 Oct 2024

https://github.com/waynechang65/baha-crawler

baha-crawler is a web crawler module designed to scrape data from the Bahamut Forum.

bahamut crawler javascript nodejs scraper spider webcrawler

Last synced: 19 Oct 2024

https://github.com/glutexo/onigumo

Parallel web scraping framework

crawler

Last synced: 26 Oct 2024

https://github.com/thiiagoms/dict-crawler

Simple crawler on UOL dictionary

beautifulsoup4 crawler dic python pythonic

Last synced: 15 Nov 2024

https://github.com/leo9960/waimai_crawler

Scrapes merchant information from food-delivery platforms

crawler

Last synced: 10 Nov 2024

https://github.com/agmmnn/nis-scraper

Scrapy script to scrape nisanyansozluk.com

cli crawler python scraper

Last synced: 04 Nov 2024

https://github.com/elliotxx/readnewspaper

Automatically fetches electronic editions of newspapers for convenient daily reading

crawler lxml newspaper pypdf2 python requests

Last synced: 06 Nov 2024

https://github.com/hctilg/pinterest-crawler

Downloads all images suitable for search

crawler pinterest

Last synced: 07 Nov 2024

https://github.com/bitebait/curry

🍛 Curry is a web crawler written in Golang that checks the US dollar to Brazilian real (USDxBRL) exchange rate at some stores in Paraguay.

api brasil crawler currency-exchange-rates go golang paraguay webcrawler

Last synced: 14 Nov 2024

https://github.com/sauerbraten/chef

Cube 2: Sauerbraten spy bot: collects IP-name combinations from extinfo and provides a web interface to search them.

crawler extinfo go sauerbraten spy stalker

Last synced: 14 Nov 2024

https://github.com/rodyherrera/cdrake-se

✨ Search the internet for free and without limits, with no APIs involved. Find videos, images, sites, books, and other resources using the engines provided by the library, such as Bing, Google, Yahoo, Wikipedia, and YouTube. Browse safely and privately with the CodexDrake Search Engine =).

bing crawler engine google images javascript metasearch metasearch-engine news nodejs privacy search-engine searx videos webscraping websearch websearchengine whoogle wikipedia youtube

Last synced: 06 Nov 2024

https://github.com/rimiti/ping-urls

🏓 Ping URLs by batch.

cache crawler ping prerender prerendering seo

Last synced: 07 Nov 2024

https://github.com/oldkingcone/pbandj

PasteBin Crawler, crawls the url https://pastebin.com/archive

crawler headless headless-chrome python python-crawler selenium-python selenium-webdriver

Last synced: 16 Nov 2024

https://github.com/vaibhavpandeyvpz/cbse-scraper

This script scrapes information about schools affiliated with CBSE for a given state.

cbse crawler data schools scraper

Last synced: 09 Nov 2024

https://github.com/sieep-coding/web-crawler

A simple web crawler implemented in Go.

crawler go golang web-crawler

Last synced: 08 Nov 2024

https://github.com/pyaesoneaungrgn/2d-crawler

2D crawler for set.or.th

2d 2d-crawler crawler myanmar php

Last synced: 09 Nov 2024

https://github.com/maicss/1024img

1024 image nodejs crawler

1024 crawler nodejs

Last synced: 08 Nov 2024

https://github.com/a-x-/scian

Simple cian stat

cian crawler static-site

Last synced: 12 Nov 2024

https://github.com/restuwahyu13/node-scraper-content

Example Node scraper for programming content, using Puppeteer

crawler nodejs puppeter scrapper

Last synced: 09 Nov 2024

https://github.com/capturr/price-extract

Performant way to extract the price amount and metadata (currency, decimal and thousands separators) from any string.

amount crawler crawling currencies currency extract extractor javascript nodejs parser parsing price scraper scraping spider typescript

Last synced: 10 Nov 2024

https://github.com/1uc1f3r616/dark-net-websites-dataset

Dataset of Onion Websites

crawler darknet data-analysis dataset onion search-engine website

Last synced: 11 Nov 2024

https://github.com/archan937/webhead

An easy-to-use Node web crawler storing cookies, following redirects, traversing pages and submitting forms.

api cookies crawler fetch file-uploads forms headless json node redirects scraper spider traversing

Last synced: 10 Nov 2024

https://github.com/huzecong/film-spider

Spiders crawling for film listing websites.

crawler

Last synced: 12 Nov 2024

https://github.com/qin2dim/istockphoto-go

📸 Gracefully download dataset from iStockPhoto.

colly crawler istockphoto

Last synced: 31 Oct 2024

https://github.com/chenyangguang/hundun

crawler go gocolly

Last synced: 14 Nov 2024

https://github.com/reycn/china-drug-trials-crawler

A web crawler for Chinadrugtrials.org.cn, written in Python 3.6+.

china crawler drug python scraper

Last synced: 14 Nov 2024

https://github.com/spa5k/quick-scraper

An easy, lightweight scraper built using typescript for good developer experience.

crawler dx easy-to-use esbuild scraper typescript

Last synced: 13 Nov 2024

https://github.com/igeligel/TeamFortressOutpostApi

:repeat: An API wrapper for the TF2 Outpost platform. A platform to find great deals for your Team Fortress 2, Counter-Strike: Global Offensive and Dota 2 items with zero hassle.

bot bot-framework crawler steam steam-api steambot teamfortress2

Last synced: 13 Nov 2024

https://github.com/cls1991/gank.io

Scrapes image resources from Gank.io (干货集中营) (http://gank.io)

crawler curl gankio picture

Last synced: 11 Nov 2024

https://github.com/v-braun/hero-scrape

Find the hero (main) image of a URL

crawler fastimage hero hero-image opengraph webscraping

Last synced: 15 Nov 2024

https://github.com/oxylabs/web-crawler

Web Crawler is a tool used to discover target URLs, select the relevant content, and have it delivered in bulk. It crawls websites in real-time and at scale to quickly deliver all content or only the data you need based on your chosen criteria.

api crawler github-python scraper web-crawler web-crawler-python web-scraping web-scraping-api webscraping

Last synced: 17 Nov 2024

Crawler Awesome Lists

awesome-crawler: 101
awesome-python-primer: 68
awesome-digital-preservation: 45

Crawler Categories

2.6 Machine Learning: 50
Replay tools: 18
Python: 18
1.1 Language Basics: 13
2.4 Web Front End: 10
2.1 Crawler Basics: 9
3. Databases: 8
2.5 Data Analysis: 7
Web archiving: 7
Java: 7
4. Asynchronous I/O: 6
Other digital objects: 6
2.3 Django Framework: 4
2.2 Flask Framework: 4
Standards and specifications: 4
Social Networks: 4
C#: 3
1.2 Advanced Language Topics: 3
Organizations: 2
Related lists: 2