Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

GitHub: https://github.com/topics/crawler
Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
Last updated: 2025-02-05 00:06:37 UTC

https://github.com/sirius-mhlee/naver-cafe-crawler

NAVER Cafe Crawler using pandas, tqdm, Selenium, BeautifulSoup4

beautifulsoup4 crawler pandas selenium tqdm

Last synced: 14 Jan 2025

https://github.com/yjg30737/pyqt-wikipedia-crawler

Crawling Wikipedia with Python, powered by BeautifulSoup4; supports GUI/CUI

beautifulsoup4 crawler pyqt pyqt5 wikipedia

Last synced: 03 Jan 2025

https://github.com/baerwang/sec_craw

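A crawler that makes it easy for security researchers to get daily security digests. It currently covers 90sec, the Kanxue forum, v2ex, the Jingyi forum, the 52pojie forum, and other lab blogs, and is continuously updated.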

crawler security security-tools threat threat-intelligence

Last synced: 21 Jan 2025

https://github.com/mkfsn/chronos

A light cron-like container service - create cron job easily.

crawler cron cronjob golang

Last synced: 22 Jan 2025

https://github.com/excaliburhan/littlenews

A news app via electron

crawler electron rss-feed

Last synced: 28 Jan 2025

https://github.com/pierlauro/mdbubing

From WARC records to MongoDB documents

bubing crawler crawling warc warc-files warc-format warc-record webarchive webarchiving

Last synced: 03 Feb 2025

https://github.com/1970mr/link-crawler

Web Link Crawler: A Python script to crawl websites and collect links based on a regex pattern. Efficient and customizable.

clawler crawler crawler-python link-crawler link-crawler-python link-scraper link-scraper-python links python scraper scraper-python website-crawler website-scraper

Last synced: 11 Nov 2024

https://github.com/maxiroellplenty/gs-robot

Node.js tool to scrape gelbe-seiten

axios cheerio crawler gelbe-seiten nodejs scraper yargs

Last synced: 23 Jan 2025

https://github.com/flavien-hugs/scrapy-test

Working with the Scrapy library. A mini script that extracts all the cartoon characters listed on Wikipedia.

crawler python scraping scrapy

Last synced: 03 Feb 2025

https://github.com/moontai0724/auto-notify-pu-courses-quota

A small crawler that fetches the remaining quota for a list of courses at Providence University every 2 to 10 minutes, then sends a webhook when the quota changes.

crawler javascript nodejs

Last synced: 01 Feb 2025

https://github.com/srx-2000/swaiter

A program that waits until a Selenium element has loaded (a Selenium element-wait utility)

crawler selenium selenium-python

Last synced: 22 Jan 2025

https://github.com/comigor/balances

Your checking and savings accounts balances on banks and brokers.

balance bank broker crawler node

Last synced: 03 Feb 2025

https://github.com/hamidrabedi/digikala-crawler

A crawler for Digikala built with the Django framework, Selenium, and a REST API. It also scrapes data from the gathered URLs.

crawler digikala digikala-crawler django python scraper

Last synced: 14 Dec 2024

https://github.com/emarifer/search-engine

A mini Google. Custom web crawler & indexer written in Golang.

crawler dashboard deep-first-search fiber-framework full-text-search golang gorm-orm htmx htmx-go hyperscript indexer inverted-index response-caching search-engine templ worker-pool

Last synced: 17 Jan 2025

https://github.com/greatdrake/contributecounter

Crawl Wikipedia for contributors

crawler python scraping

Last synced: 14 Dec 2024

https://github.com/arihantbansal/cybersec-python

Cybersec/CTF practice problems solved in Python

crawler cryptography ctf cybersecurity sockets webscraping

Last synced: 03 Feb 2025

https://github.com/cryptoc1/earl

Earl is looking for URLs in your area.

crawler middleware nuget webscraping

Last synced: 27 Jan 2025

https://github.com/deployment-helper/api-template-crawler

API interface to crawl the templates

api crawler deployment-helper gcp gcp-cloud-run golang rest

Last synced: 14 Jan 2025

https://github.com/hanifdwyputras/se-scraper

Search Engine scraper with PHP

crawler scraper seo seo-crawler

Last synced: 01 Feb 2025

https://github.com/vietdoo/sg-property-hub

SG Property Hub is a comprehensive platform for managing and analyzing property data.

airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping

Last synced: 13 Dec 2024

https://github.com/ccrashzer0/web_crawler

A python based web crawler

crawler internet python python3 webcrawler

Last synced: 27 Jan 2025

https://github.com/microlinkhq/ua

Simple Redis primitives to incr() and top() user agents

crawler redis user-agent user-agent-parser

Last synced: 12 Jan 2025

https://github.com/kangoo13/textbroker-author-article-picker

Bot that automatically locks an order into a Textbroker author account.

author-textbroker automation bot colly crawler go gocolly golang scrapper spider textbroker textbroker-author textbroker-order-picker textbroker-orders textbroker-scrapper

Last synced: 22 Jan 2025

https://github.com/priyakdey/github-api-crawler

A crawler to crawl and save the APIs found in the Public APIs github repo - https://github.com/public-apis/public-apis. Visit README for details.

api crawler mongo python3

Last synced: 02 Feb 2025

https://github.com/kimi0230/crawlerimage

crawler python python3

Last synced: 16 Jan 2025

https://github.com/bingxyz/blackcat

Look up Black Cat (黑貓) logistics shipments using a Telegram bot

crawler nodejs telegram-bot

Last synced: 22 Jan 2025

https://github.com/dizys/weibo-crawler

A Node.js Weibo crawler

crawler nodejs typescript weibo-spider

Last synced: 27 Dec 2024

https://github.com/jovijovi/ether-crawler

A transaction crawler for the Ethereum ecosystem.

blockchain crawler ether ethereum transaction

Last synced: 16 Jan 2025

https://github.com/captain-woof/zhi-zhu

Zhi-Zhu is a multithreaded spidering script that recursively searches base webpages, and all URLs appearing in them, for specific (regex) words.

crawler crawler-python crawling-python python3

Last synced: 31 Dec 2024

https://github.com/richecr/pyhltv

Repository to extract information from the HLTV website.

crawler csgo hacktoberfest hltv hltv-api python3

Last synced: 20 Jan 2025

https://github.com/anjackson/scrapy-url-frontier

A Scrapy module for URL Frontier integration

crawler frontier scrapy spider

Last synced: 05 Jan 2025

https://github.com/buren/site_health

Crawl a site and check various health indicators

crawler rubygem site-health

Last synced: 28 Oct 2024

https://github.com/saketh7382/smartcrawler

Package for crawling items from webpages and storing them as a JSON file

crawler crawler-python open-source pip python3 scraper selenium selenium-webdriver webdriver-manager

Last synced: 03 Feb 2025

https://github.com/idlesign/gallerycrawler

Generic crawling for galleries

crawler gallery images python3

Last synced: 17 Dec 2024

https://github.com/jonasrenault/cprex

Chemical Properties Relation Extraction

chemistry crawler deep-learning information-extraction machine-learning named-entity-recognition nlp pubchem relation-extraction scientific-articles spacy transformers

Last synced: 14 Oct 2024

https://github.com/marcinrek/sauron

Basic page crawler written in Node.js

crawler json node-js nodejs requests

Last synced: 29 Nov 2024

https://github.com/thecloer/crawler-himym

How I Met Your Mother script PDF generator for learning English

crawler pdf pdf-generation typescript web-scraping webscraping

Last synced: 04 Feb 2025

https://github.com/xcrypt0r/xcrawler

✂️ A crawling example for MapleStory in various languages using multi-threading

crawler crawling multithreading parsing regexp

Last synced: 09 Jan 2025

https://github.com/willi-dev/dtcapp

dtcapp : distributed twitter crawler.

crawler distributed-systems hazelcast java twitter twitter-api

Last synced: 14 Jan 2025

https://github.com/tanja-4732/od-get

A Rust tool for recursively crawling & downloading data from open directories

cli crawler open-directory open-directory-downloader rust

Last synced: 14 Jan 2025

https://github.com/ryanking13/bellorin

Multi-threaded Social Media Crawler 🔍

crawler instagram social-media

Last synced: 02 Feb 2025

https://github.com/jorgeparavicini/medalytik-python

Python crawlers for a job mediation firm

crawler python scrapy

Last synced: 02 Feb 2025

https://github.com/skylightqp/namu2csv

A Namuwiki crawler that converts headers to a CSV file for the KartRider wiki

crawler rust

Last synced: 02 Feb 2025

https://github.com/programming-with-love/skyeyesystem

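Skyeye system: crawls trending-search data from various platforms every ten minutes and stores it in a database. Raw trending-search data is stored in MySQL; word-frequency statistics are stored in Redis.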

crawler mysql redis skyeye skyeyewall springboot

Last synced: 16 Jan 2025

https://github.com/mazzasaverio/scrapy-playwright-scrapegraphai

Web crawler using Scrapy + Playwright for dynamic content, featuring YAML-based configuration, PostgreSQL storage via aiosql, structured logging with logfire, and complete Docker/Terraform infrastructure. Built with uv package manager and Python 3.11+.

aiosql crawler docker playwright scrapy scrapy-playwright terraform uv

Last synced: 14 Jan 2025

https://github.com/konradlinkowski/mailcrawler

Crawler to find emails in websites

crawler scraper

Last synced: 26 Jan 2025

https://github.com/pnguyen215/instagram-crawler

Instagram Crawler is a Python script to download posts from a specified Instagram account.

crawler crawling-python instagram instagram-crawler scraper scraping-python scraping-websites scrapper scrapy-crawler

Last synced: 12 Jan 2025

https://github.com/weaming/simple-crawler

my simple crawler

crawler

Last synced: 12 Jan 2025

https://github.com/ryanchao2012/okbot

A conversation retrieval engine based on PTT corpus

chatbot crawler django ptt

Last synced: 12 Jan 2025

https://github.com/konradlinkowski/wikipediafinder

Find words in a wiki page

crawler scraper wikipedia

Last synced: 26 Jan 2025

https://github.com/duaraghav8/larry-crawler

Kayako Twitter challenge

crawler fetch-tweets hashtag nodejs pagination tweets twitter-api

Last synced: 22 Jan 2025

https://github.com/victorhuu/amazonmovieintegration

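This repository is the first individual assignment for the data warehouse course at Tongji University: organizing Amazon movie data using a crawler and ETL tools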

crawler data-warehouse movies pandas scrapy xpath

Last synced: 26 Jan 2025

https://github.com/dean9703111/ithelp_total_count

Computes the total views, likes, and comments for IT邦幫忙 (ithelp) articles

crawler ithelp total-likes total-responses total-views

Last synced: 12 Jan 2025

https://github.com/liyun-li/meh-bot

Just a bot that clicks an image

bot crawler docker headless-firefox meh python python3 selenium twilio-sms-api

Last synced: 25 Jan 2025

https://github.com/songjiayang/china_repos

GitHub repo crawler

china crawler statistics

Last synced: 01 Feb 2025

https://github.com/pxlrbt/website-diff

Utility tool that bundles a crawler and BackstopJS for visual regression testing.

backstopjs crawler visual-regression-testing

Last synced: 26 Jan 2025

https://github.com/arshadkazmi42/gh-crawl

Crawler for GitHub repositories. Finds all the broken links in the repositories.

bug-bounty-recon crawl crawler gh-crawler github github-crawler githubcrawler python

Last synced: 21 Dec 2024

https://github.com/maxmindlin/swarm

Go crawler that searches and aggregates information relevant to your interests. WIP for learning Go crawling.

crawler golang mongodb

Last synced: 01 Feb 2025

https://github.com/fa7ad/aiub-notes-dl

Download all notes from AIUB's portal

aiub beautifulsoup4 crawler

Last synced: 24 Oct 2024

https://github.com/camilamaia/crawl4us

[WIP] A Python web crawler looking wildly for tables 🕵️‍♀️

beautifulsoup4 crawler crawling pypi python-3 python-module scraper scraping tables web-scraping

Last synced: 02 Feb 2025

https://github.com/bitscoper/bitscoper_crawler

Crawls the titles of webpages in series by number and creates a list of the available links.

crawler lister

Last synced: 01 Feb 2025

https://github.com/hudson-newey/user-web-crawler

The Archive.org Crawler works through volunteering users who install an extension on their browsers. When the user visits a webpage, the URL is anonymously added to the Archive.org database.

archive crawler open-internet

Last synced: 10 Jan 2025

https://github.com/rflcnunes/crawler_email_py

In this project I'm creating a web crawler to check email boxes and handle incoming messages.

aws-bucket aws-bucket-s3 aws-s3 crawler crawler-python email python rabbitmq

Last synced: 01 Feb 2025

https://github.com/joeri-abbo/python-credly-scraper

This project is a set of Python scripts designed to crawl and extract data from the Credly platform, focusing on skills, organizations, and badges. The scripts allow users to perform searches using command-line arguments, predefined search terms, or skills listed in a JSON file. The collected data is then saved to JSON files for further analysis an

badges crawler credly data-extraction json organizations python python3 requests-library skills web-crawling

Last synced: 15 Jan 2025

https://github.com/zephyrpersonal/github-trending-crawler

Transform github-trending repos into JSON data

cheerio crawler fetch github node repository spider trending

Last synced: 26 Jan 2025

https://github.com/openpj/manifoldcf-sdk

Apache ManifoldCF SDK is a Maven project focused on helping developers to extend ManifoldCF with new connectors and extensions

apache crawler docker ecm extensions integrations manifoldcf migration sdk search

Last synced: 25 Jan 2025

https://github.com/geoffreybauduin/website-checker

Performs useful checks against a website, such as 404 errors reporting, structured data validation...

crawler seo structured-data web-spider website

Last synced: 25 Dec 2024

https://github.com/eea/eea-crawler

EEA Crawler contains the tasks (DAGs) used by Apache Airflow to index content from various EEA-Eionet websites into a central Elasticsearch (aka content hub).

airflow-dags crawler elasticsearch etl-pipeline indexing

Last synced: 24 Jan 2025

https://github.com/liebki/githubnet

This library allows you to retrieve several things from GitHub, things like trending repositories, profiles of users, the repositories of users and related information.

crawler crawling github github-trending htmlagilitypack microsoft

Last synced: 24 Jan 2025

https://github.com/tsaohucn/crawler_fb_group

A crawler that uses Selenium for Facebook groups

crawler facebook-groups rails ruby

Last synced: 20 Jan 2025

https://github.com/ycrao/some-spider-code

Some spider code: crawlers for financial news and for fund, stock, and forex prices

crawler economics fin-eco-news finance forex fund-value spider stock-price

Last synced: 19 Nov 2024

https://github.com/gozeon/weibo-crawler

Weibo crawler

crawler web-crawler

Last synced: 26 Jan 2025

https://github.com/jiamingla/mvdis_i18n

Multilingual-friendly version of motorcycle driver's license exam booking (non-official)

crawler jquery koa koajs nodejs supertest

Last synced: 26 Jan 2025

https://github.com/linux0hat/cpp-web-crawler

Explore the web.

cpp crawler sqlite3

Last synced: 12 Jan 2025

https://github.com/toannd96/chromedp-example-login

chromedp crawler golang goquery

Last synced: 19 Jan 2025

https://github.com/mc256/node-static-webpage-crawler

Download an entire website with its directory structure.

cache-server crawler nodejs static-site

Last synced: 24 Jan 2025

https://github.com/chunkingz/youtubelinks-scraper

A Python script that scrapes YouTube links from a predefined website of choice.

crawler python scraper spider websitescraper youtube

Last synced: 02 Jan 2025

https://github.com/knourian/freelancer.com-category-scrapping

Scraping categories from Freelancer.com using Scrapy, with the number of projects for each category

crawler freelancer python3 scrapy web-crawler

Last synced: 05 Jan 2025

https://github.com/victorpre/erlich

Erlich Bachman - Hacker Hostel

chatbot crawler elixir housing umbrella

Last synced: 02 Feb 2025

https://github.com/aleclarson/recrawl

Filesystem crawler

crawler fs nodejs

Last synced: 09 Jan 2025

https://github.com/machu-gwu/crawlib-project

Tool set for crawler projects.

crawler framework mongodb python scrapy

Last synced: 18 Jan 2025

https://github.com/hctilg/taaghche-dl

Save books purchased from taaghche.com !

crawler downloader pillow-library python3 selenium taaghche

Last synced: 09 Jan 2025

https://github.com/roccomuso/is-apple

Verify that a request is from Apple crawlers using DNS verification steps

apple bot crawler dns ip js nodejs

Last synced: 22 Jan 2025

https://github.com/fnkr/gocrawl

Simple web crawler.

crawler http-client

Last synced: 28 Jan 2025

https://github.com/amirsorouri00/dsl-se

This is an MVP based on the "Search Engine And Data Mining" course. The idea behind this project is the forked project which its link provided is

container crawler distributed-systems docker docker-compose elasticsearch pagerank search-engine