Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
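As a rough illustration of "systematically browses", the sketch below walks a link graph breadth-first and never visits a page twice. It is a minimal sketch under stated assumptions, not how any project listed here works: `FAKE_WEB`, `fake_get_links`, and `crawl` are hypothetical names, and the in-memory dictionary stands in for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch each page over HTTP and parse its links instead.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fake_get_links(url):
    """Return the outgoing links of a page (assumption: lookup in FAKE_WEB)."""
    return FAKE_WEB.get(url, [])


def crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first from seed and return URLs in visit order."""
    seen = {seed}              # URLs already queued, so nothing is visited twice
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    for url in crawl("https://example.com/", fake_get_links):
        print(url)
```

Swapping `fake_get_links` for a function that downloads a page and extracts its links would turn the sketch into a real crawler; a production one would also need politeness delays and robots.txt handling.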

GitHub: https://github.com/topics/crawler
Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
Last updated: 2025-01-27 00:06:15 UTC

https://github.com/panagiks/asset

ASynchronous Spidering Essential Tool (ASSET).

async asyncio crawler graph reporting spider

Last synced: 06 Dec 2024

https://github.com/simonrichardson/crwlr

Crawl all the things!

crawler meshuggah

Last synced: 01 Dec 2024

https://github.com/ammirsm/data-grabber-cnn-twitter

Basic setup for fetching data from Twitter and CNN by keyword.

cnn crawler django scrapyd twitter

Last synced: 09 Dec 2024

https://github.com/dean9703111/shopee_find_mac

Find a cheap Mac that matches your required specs as quickly as possible.

argparse crawler mac pip python python2 xlsxwriter

Last synced: 12 Jan 2025

https://github.com/schbenedikt/web-crawler

A simple web crawler using Python that stores the metadata of each web page in a database.

crawler database mariadb mysql python python-crawler web

Last synced: 08 Nov 2024

https://github.com/woorim960/nate.com-comments-crawler

nate.com-comments-crawler

chromedriver crawler python3 selenium

Last synced: 28 Dec 2024

https://github.com/marzzzello/gplaycrawler

(mirror) Discover apps by different methods. Mass download app packages and metadata.

crawler google-play google-play-store googleplay googleplaystore playstore playstoreapi scraper

Last synced: 23 Dec 2024

https://github.com/baerwang/sec_craw

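A crawler that makes it easy for security researchers to get daily security digests. It currently covers 90sec, Kanxue Forum (看雪论坛), v2ex, Jingyi Forum (精易论坛), 52pojie Forum (52破解论坛), and other lab blogs, and is updated continuously.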

crawler security security-tools threat threat-intelligence

Last synced: 21 Jan 2025

https://github.com/beanwei/zmt-post-crawler

Crawls the ZMT platform site: supply an author ID and get the post list. This project was written for a friend.

crawler golang golang-ui

Last synced: 28 Dec 2024

https://github.com/saketh7382/smartcrawler

Package for crawling items from web pages and storing them as a JSON file

crawler crawler-python open-source pip python3 scraper selenium selenium-webdriver webdriver-manager

Last synced: 08 Dec 2024

https://github.com/openpj/manifoldcf-sdk

Apache ManifoldCF SDK is a Maven project focused on helping developers to extend ManifoldCF with new connectors and extensions

apache crawler docker ecm extensions integrations manifoldcf migration sdk search

Last synced: 25 Jan 2025

https://github.com/richecr/pyhltv

Repository to extract information from the HLTV website.

crawler csgo hacktoberfest hltv hltv-api python3

Last synced: 20 Jan 2025

https://github.com/citiususc/polypus

Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis

analytics bigdata crawler scraper sentiment-analysis twitter

Last synced: 02 Dec 2024

https://github.com/davideferre/covid19-data-crawler-ita

COVID-19 Italian data crawler

coronavirus covid19 crawler hacktoberfest hacktoberfest2021 python

Last synced: 11 Jan 2025

https://github.com/snuzi/devblogs-aggregator

The backend aggregator project of DevBlogs.net

aggregator blog crawler engineering engineering-blogs tech tech-blogs tech-companies tech-news

Last synced: 09 Nov 2024

https://github.com/myconsciousness/metis

Metis main repository.

application client crawler crawling crawlwebpage educatable gui lerning logging programming-language python scrape scraping scraping-websites tkinter tkinter-gui tkinter-python

Last synced: 08 Dec 2024

https://github.com/marcinrek/sauron

Basic page crawler written in Node.js

crawler json node-js nodejs requests

Last synced: 29 Nov 2024

https://github.com/leomaurodesenv/smm-maker-profile

A package for fetching the maker profile - Super Mario Maker

crawler javascript json mario-maker nodejs

Last synced: 02 Nov 2024

https://github.com/sebyx07/active_proxy

Ruby proxy fetcher, retries request until completed, provides user agent🚀🚀

crawler http proxy rails ruby

Last synced: 28 Dec 2024

https://github.com/joeri-abbo/python-credly-scraper

This project is a set of Python scripts designed to crawl and extract data from the Credly platform, focusing on skills, organizations, and badges. The scripts allow users to perform searches using command-line arguments, predefined search terms, or skills listed in a JSON file. The collected data is then saved to JSON files for further analysis an

badges crawler credly data-extraction json organizations python python3 requests-library skills web-crawling

Last synced: 15 Jan 2025

https://github.com/amirsorouri00/dsl-se

This is an MVP based on the "Search Engine And Data Mining" course. The idea behind this project is the forked project which its link provided is

container crawler distributed-systems docker docker-compose elasticsearch pagerank search-engine

Last synced: 19 Jan 2025

https://github.com/amirespahbodi/url_crawler

url crawler

crawler fastapi pydantic python3 sqlalchemy

Last synced: 02 Jan 2025

https://github.com/machu-gwu/crawlib-project

Tool set for crawler projects.

crawler framework mongodb python scrapy

Last synced: 18 Jan 2025

https://github.com/pierlauro/mdbubing

From WARC records to MongoDB documents

bubing crawler crawling warc warc-files warc-format warc-record webarchive webarchiving

Last synced: 09 Dec 2024

https://github.com/im-perativa/public_crawler

A collection of crawler projects for Indonesian datasets

crawler indonesia indonesia-api scrapy

Last synced: 25 Jan 2025

https://github.com/buren/site_health

Crawl a site and check various health indicators

crawler rubygem site-health

Last synced: 28 Oct 2024

https://github.com/gnaneshkunal/book-miner

Web crawler for Book reviews (Goodreads)

crawler goodreads typescript

Last synced: 16 Dec 2024

https://github.com/pnguyen215/instagram-crawler

Instagram Crawler is a Python script to download posts from a specified Instagram account.

crawler crawling-python instagram instagram-crawler scraper scraping-python scraping-websites scrapper scrapy-crawler

Last synced: 12 Jan 2025

https://github.com/fa7ad/aiub-notes-dl

Download all notes from AIUB's portal

aiub beautifulsoup4 crawler

Last synced: 24 Oct 2024

https://github.com/codelegant/movie-crawler-api

API for scraping movie ticket information from Taobao, Maoyan, and Gewara

async await crawler mongoose request

Last synced: 18 Dec 2024

https://github.com/vitaee/laravelandcrawlers

PHP web crawler examples using OOP concepts, with a Laravel project

crawler laravel php

Last synced: 26 Dec 2024

https://github.com/ycrao/some-spider-code

some spider code: crawlers for financial news and for fund, stock, and forex prices

crawler economics fin-eco-news finance forex fund-value spider stock-price

Last synced: 19 Nov 2024

https://github.com/soakit/book-download

book-download

crawler html2epub nodejs novel-downloader

Last synced: 28 Dec 2024

https://github.com/geoffreybauduin/website-checker

Performs useful checks against a website, such as 404 error reporting, structured data validation...

crawler seo structured-data web-spider website

Last synced: 25 Dec 2024

https://github.com/vietdoo/sg-property-hub

SG Property Hub is a comprehensive platform for managing and analyzing property data.

airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping

Last synced: 13 Dec 2024

https://github.com/opda0887/bahamut-crawler-to-gmail

Idea: use a Python crawler to fetch the latest forum posts from a Bahamut board, and use Gmail to send them to myself.

crawler crawler-python

Last synced: 26 Jan 2025

https://github.com/excaliburhan/littlenews

A news app via electron

crawler electron rss-feed

Last synced: 29 Nov 2024

https://github.com/loggerhead/dianping_crawler

A Dianping (大众点评) crawler based on Scrapy (Python 3.5)

crawler python-3-5

Last synced: 24 Jan 2025

https://github.com/hoanle396/py-iconnect

crawler flask flask-application image-processing python

Last synced: 14 Dec 2024

https://github.com/shiritai/wallpaper_master

My first individual project!

crawler file-explorer javafx-application maven-shade mini-system wallpaper wallpaper-master

Last synced: 01 Jan 2025

https://github.com/liyun-li/meh-bot

Just a bot that clicks an image

bot crawler docker headless-firefox meh python python3 selenium twilio-sms-api

Last synced: 25 Jan 2025

https://github.com/dean9703111/ithelp_total_count

Counts the total views, likes, and comments of IT邦幫忙 (ithelp) articles

crawler ithelp total-likes total-responses total-views

Last synced: 12 Jan 2025

https://github.com/mahmoudgalalz/pupt

A starter for web crawling using Puppeteer

crawler nodejs scraping

Last synced: 05 Jan 2025

https://github.com/m-osource/cassiopeiabot

C++ multithread Linux Web Crawler

algorithm berkeleydb bot cassiopeia cplusplus crawler download engine hashing html-parser information-retrieval link-analysis multithread open-source regex search web web-crawler webcrawler www

Last synced: 08 Jan 2025

https://github.com/scrwdrv/siege-crawler

This CLI tool finds same-domain URLs in a web page and requests them to find even more URLs until the server crashes (or the benchmark ends). It is used to test the maximum capacity of a server or to find glitches that users might encounter.

benchmark cli crawler ddos debug siege tool

Last synced: 18 Dec 2024

https://github.com/thomashirtz/douban-crawler

A simple crawler for retrieving information about movies or TV shows from the famous www.douban.com website.

crawler douban

Last synced: 25 Dec 2024

https://github.com/dimo414/pycrawl

Simple Python web crawler, primarily designed for inspecting and diagnosing your own website

crawler python

Last synced: 18 Dec 2024

https://github.com/duaraghav8/larry-crawler

Kayako Twitter challenge

crawler fetch-tweets hashtag nodejs pagination tweets twitter-api

Last synced: 22 Jan 2025

https://github.com/ryanchao2012/okbot

A conversation retrieval engine based on PTT corpus

chatbot crawler django ptt

Last synced: 12 Jan 2025

https://github.com/bkdev98/ebooks-crawler

Ebook crawler for personal use, built with ReactJS.

crawler material-ui nodejs reactjs

Last synced: 01 Jan 2025

https://github.com/weaming/simple-crawler

my simple crawler

crawler

Last synced: 12 Jan 2025

https://github.com/vindecodex/automated-crawler-wget

Using wget to crawl site

crawler shell-script

Last synced: 01 Jan 2025

https://github.com/mazzasaverio/scrapy-playwright-scrapegraphai

Web crawler using Scrapy + Playwright for dynamic content, featuring YAML-based configuration, PostgreSQL storage via aiosql, structured logging with logfire, and complete Docker/Terraform infrastructure. Built with uv package manager and Python 3.11+.

aiosql crawler docker playwright scrapy scrapy-playwright terraform uv

Last synced: 14 Jan 2025

https://github.com/zephyrpersonal/github-trending-crawler

Transforms GitHub trending repos into JSON data

cheerio crawler fetch github node repository spider trending

Last synced: 26 Jan 2025

https://github.com/programming-with-love/skyeyesystem

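SkyEye system: every ten minutes it crawls trending-search data from various platforms and stores it in a database. Raw trending-search data is stored in MySQL, and word-frequency statistics are stored in Redis.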

crawler mysql redis skyeye skyeyewall springboot

Last synced: 16 Jan 2025

https://github.com/tanja-4732/od-get

A Rust tool for recursively crawling & downloading data from open directories

cli crawler open-directory open-directory-downloader rust

Last synced: 14 Jan 2025

https://github.com/jorgeparavicini/medalytik-python

Python crawlers for a job mediation firm

crawler python scrapy

Last synced: 07 Dec 2024

https://github.com/willi-dev/dtcapp

dtcapp : distributed twitter crawler.

crawler distributed-systems hazelcast java twitter twitter-api

Last synced: 14 Jan 2025

https://github.com/birdroad1/server-pinger

Server pinger for Minecraft written in C++

cpp crawler make minecraft minecraft-scanner postgres scanner server

Last synced: 21 Jan 2025

https://github.com/hctilg/taaghche-dl

Save books purchased from taaghche.com !

crawler downloader pillow-library python3 selenium taaghche

Last synced: 09 Jan 2025

https://github.com/maxgio92/package-crawler

A package crawler for most known Linux distros

crawler go linux package

Last synced: 26 Jan 2025

https://github.com/linux0hat/cpp-web-crawler

Explore the web.

cpp crawler sqlite3

Last synced: 12 Jan 2025

https://github.com/nelcifranmagalhaes/web_crawler

A web crawler for all Naruto characters

anime beautifulsoup characters crawler naruto python

Last synced: 03 Dec 2024

https://github.com/roccomuso/is-apple

Verify that a request is from Apple crawlers using DNS verification steps

apple bot crawler dns ip js nodejs

Last synced: 22 Jan 2025

https://github.com/hanifdwyputras/se-scraper

Search Engine scraper with PHP

crawler scraper seo seo-crawler

Last synced: 06 Dec 2024

https://github.com/idlesign/gallerycrawler

Generic crawling for galleries

crawler gallery images python3

Last synced: 17 Dec 2024

https://github.com/yjg30737/pyqt-wikipedia-crawler

Crawls Wikipedia with Python, powered by BeautifulSoup4; supports GUI and CUI

beautifulsoup4 crawler pyqt pyqt5 wikipedia

Last synced: 03 Jan 2025

https://github.com/lysagxra/eromedownloader

Erome albums and profile downloader

bulk bulk-downloader concurrent-processing crawler downloader erome erome-download erome-downloader parallel-processing profile-downloader python python3

Last synced: 17 Jan 2025

https://github.com/hudson-newey/user-web-crawler

The Archive.org Crawler works through volunteering users who install an extension on their browsers. When the user visits a webpage, the URL is anonymously added to the Archive.org database.

archive crawler open-internet

Last synced: 10 Jan 2025

https://github.com/jyasskin/pbot-crawler

Crawler for PBOT's website to show what has changed.

crawler

Last synced: 30 Nov 2024

https://github.com/nblthree/python-url-crawler

Simple web crawler

crawler python3

Last synced: 03 Dec 2024

https://github.com/grayhat12/grawler

A web-based crawler that takes two inputs (search item, number of sites to search) and currently displays readable content in text format; the code can be modified to display the HTML instead.

crawler scraping scraping-websites scrapper scrapy-crawler

Last synced: 06 Dec 2024

https://github.com/bockstaller/europarl-crawler

Crawler for the documents published by the European Parliament

crawler datamining elasticsearch europarl-crawler european european-parliament opendata parliament union

Last synced: 06 Jan 2025

https://github.com/pyohei/rirakkuma-crawller

Crawler for my hobby.🐻

crawler python rirakkuma

Last synced: 29 Dec 2024

https://github.com/ryoii/hook

A declarative Java crawler framework

crawler declarative java java-crawler-framework jdk11

Last synced: 24 Jan 2025

https://github.com/joaooliveirapro/trawlergo

Basic HTTP Crawler in Golang

crawler go golang http

Last synced: 13 Jan 2025

https://github.com/nick121212/crawler.v5

crawler nodejs

Last synced: 27 Jan 2025

https://github.com/intina47/ee_error

Implementation of a web crawler using C++

cpp crawler curl gumbo libcurl stanford-nlp web

Last synced: 06 Dec 2024

https://github.com/tomfran/crawler

A web crawler written in Rust

bloom-filter crawler rust simhash

Last synced: 06 Jan 2025

https://github.com/krishpranav/gozap

⚡️ Multi-target ZAP scanning, written in Go

cli crawler go go-crawler golang zap

Last synced: 06 Dec 2024

https://github.com/pinpox/go-random-downloader

Download HTML using "Random Page"

crawler golang

Last synced: 29 Nov 2024

https://github.com/kofj/octopus

Octopus is open source software for collecting data from web pages.

crawler

Last synced: 27 Jan 2025

https://github.com/der3318/daily-pixiv

Integrated Flow - Line Notification of Top Ranked Pixiv Illustrations

crawler line-notify pixiv workflow

Last synced: 13 Jan 2025

https://github.com/vaenow/chromeless-coursera-caption

Chromeless crawler for Coursera video captions / subtitles

caption chromeless coursera crawler crx subtitle

Last synced: 13 Dec 2024

https://github.com/vaenow/crawler-chromeless

A chromeless crawler for coursera

chromeless coursera crawler puppeteer

Last synced: 13 Dec 2024

https://github.com/igapyon/selecrawler

Simple selenium based web crawler

chrome crawler java selenium web

Last synced: 06 Jan 2025

https://github.com/jackfsuia/chats-crawler

Crawls Discourse chat data and parses it on the fly, ready for LLM instruction fine-tuning.

crawler fine-tuning finetune-llm gpt html-css-javascript instruction-tuning llm llm-training llms nlp nlp-parsing parser

Last synced: 13 Jan 2025

https://github.com/splorg/sage

A scraper to get every quote from a book off of Goodreads.

books crawler datamining goodreads goodreads-data python scraper scrapy webcrawling webscraping

Last synced: 21 Jan 2025

https://github.com/apexcaptain/allergy-alert

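Crawls the cafeteria information published on a certain university's website for today's date and reports the menus, grouped by dining hall and menu category, along with the allergenic foods in each menu item.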

crawler docker expressjs puppeteer reactjs sqlite typescript

Last synced: 01 Dec 2024

https://github.com/g-ongenae/morphalou-crawler

A Crawler for CNRTL's Morphologie words

crawler french lexical-databases list-of-words words

Last synced: 15 Oct 2024

https://github.com/machinecyc/lotteryinsight

Uses a crawler to collect Taiwan Lotto data and saves it to a local MySQL server.

crawler data docker lottery mysql-database python3 taiwan

Last synced: 05 Dec 2024

https://github.com/tigercosmos/web-crawler

Web Crawler in Java Maven Project

crawler

Last synced: 05 Dec 2024

https://github.com/mohitk05/drstrange

A simple breadth-first search web crawler

bfs crawler

Last synced: 05 Dec 2024

https://github.com/tetreum/price-crawler

Article price crawler

crawler nodejs

Last synced: 17 Dec 2024

https://github.com/iyowei/fs-deep-walk

Focused on deep-scanning a specified disk location.

crawler directory file folder folder-tooling fs nodejs recursively-search scan scandir scandir-recursive scanner walker

Last synced: 29 Dec 2024

https://github.com/xoraus/revieworacle

The proposed system assists users in deciding which product to buy. It gathers reviews and details from multiple websites that sell the product. In addition, the system is trained to analyze the polarity of the product.

ai crawler datascience machinelearning scrappy selenium-webdriver