Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

https://github.com/f-ca7/movie-cat

A website displaying movies

crawler golang website

Last synced: 03 Jan 2025

https://github.com/hudson-newey/user-web-crawler

The Archive.org Crawler relies on volunteers who install a browser extension. When a user visits a webpage, the URL is anonymously added to the Archive.org database.

archive crawler open-internet

Last synced: 10 Jan 2025

https://github.com/geoffreybauduin/website-checker

Performs useful checks against a website, such as 404 error reporting, structured data validation...

crawler seo structured-data web-spider website

Last synced: 25 Dec 2024

https://github.com/dylanhogg/cloud-products

A package for getting cloud products and product descriptions from a cloud provider website.

aws cloud-products crawler data text-processing

Last synced: 23 Jan 2025

https://github.com/schbenedikt/web-crawler

A simple web crawler using Python that stores the metadata of each web page in a database.

crawler database mariadb mysql python python-crawler web

Last synced: 08 Nov 2024

https://github.com/kahsolt/allchan

An image crawler for xChan (4chan/8ch/...) image boards.

4chan 4chan-downloader 8chan crawler image-crawler

Last synced: 03 Jan 2025

https://github.com/christopher-besch/therapy_search

Compute Call Times from arztsuche-bw into a Calendar.

appointments calendar crawler gatsby therapy time-management typescript

Last synced: 28 Dec 2024

https://github.com/beomi/pycon2017

PyCon 2017 presentation materials: "Web Crawlers from the Ground Up"

crawler pyconkr python

Last synced: 11 Jan 2025

https://github.com/yordadev/fenrisjs

A NodeJS application that scrapes all links from a given input and writes the results to one of two files, external or internal, for further analysis.

analysis crawler link-collection link-crawler nodejs nodejs-application

Last synced: 10 Jan 2025

https://github.com/songjiayang/china_repos

GitHub repo crawler

china crawler statistics

Last synced: 11 Jan 2025

https://github.com/machu-gwu/crawlib-project

Tool set for crawler projects.

crawler framework mongodb python scrapy

Last synced: 18 Jan 2025

https://github.com/projectx3193275578/prjctxx8264

A simple, open-source, easy to use, and free download manager for malware samples.

crawler downloader malware manager samples

Last synced: 05 Jan 2025

https://github.com/droiddevgeeks/nodelearning

This is a Node learning demo. It covers all the basics of Node.

crawler database ejs ejs-express mcv middleware-nodes mongodb node node-module nodejs nodemailer npm-package router sign

Last synced: 13 Jan 2025

https://github.com/khadkarajesh/aptoide

Aptoide app crawler using BeautifulSoup

beautifulsoup4 crawler flask python3 web-application

Last synced: 13 Jan 2025

https://github.com/orafaelfragoso/itunes-crawler

Retrieves information about an artist by crawling the iTunes API and iTunes Page

api crawler itunes itunes-api

Last synced: 19 Dec 2024

https://github.com/suddi/fundscraper

Collection of web crawlers to scrape fund data using Scrapy

crawler funds scraper scrapy

Last synced: 11 Oct 2024

https://github.com/zabuzard/wslotter

WSlotter is a Selenium-driven tool for assigning to events on 'https://www.gruppe-w.de'.

bot crawler gruppe-w

Last synced: 12 Jan 2025

https://github.com/andmerk93/scrapy_parser_pep

A learning project built with Scrapy; parses PEPs and outputs the results in two formats

crawler scrapy

Last synced: 24 Jan 2025

https://github.com/dangdungcntt/crawl-fb-v2

Simple script to detect email addresses and phone numbers in Facebook comments.

crawler facebook

Last synced: 18 Jan 2025

https://github.com/naveenaidu/google-crawler

Google Crawler - Curates the search results

beautifulsoup crawler scraper

Last synced: 18 Jan 2025

https://github.com/karantyagi/web-crawler

BFS and DFS implementations for a Wikipedia crawler

beautifulsoup crawler

Last synced: 12 Jan 2025

https://github.com/par7133/splash-bot-crawler

Splash Bot creates splash on the fly of your websites - GPL License 🔥

bot crawler gallery open-source opensource php splash

Last synced: 12 Jan 2025

https://github.com/hoishing/selenium-crawler

A web crawler written in Python, powered by Selenium and Tesseract OCR

crawler python selenium

Last synced: 18 Jan 2025

https://github.com/mmqnym/pyppeteer-use-case

Shows how to crawl the web with pyppeteer

crawl crawler pyppeteer python

Last synced: 18 Jan 2025

https://github.com/fa7ad/aiub-notes-dl

Download all notes from AIUB's portal

aiub beautifulsoup4 crawler

Last synced: 24 Oct 2024

https://github.com/buren/site_health

Crawl a site and check various health indicators

crawler rubygem site-health

Last synced: 28 Oct 2024

https://github.com/mahmoudgalalz/pupt

A starter for web crawling using Puppeteer

crawler nodejs scraping

Last synced: 05 Jan 2025

https://github.com/somnisomni/trawler-csharp

The successor of https://github.com/somnisomni/twitter-account-data-crawler, written in .NET C#

crawler crawling csharp dotnet follower-tracker selenium selenium-csharp twitter twitter-crawler twitter-crawling twitter-scraper

Last synced: 05 Jan 2025

https://github.com/knourian/freelancer.com-category-scrapping

Scraping categories from Freelancer.com using Scrapy, with the number of projects for each category

crawler freelancer python3 scrapy web-crawler

Last synced: 05 Jan 2025

https://github.com/chunkingz/youtubelinks-scraper

A Python script that scrapes YouTube links from a predefined website of choice.

crawler python scraper spider websitescraper youtube

Last synced: 02 Jan 2025

https://github.com/loggerhead/dianping_crawler

A Dianping crawler based on Scrapy (Python 3.5)

crawler python-3-5

Last synced: 24 Jan 2025

https://github.com/amirsorouri00/dsl-se

This is an MVP based on the "Search Engine And Data Mining" course. The idea behind this project is the forked project which its link provided is

container crawler distributed-systems docker docker-compose elasticsearch pagerank search-engine

Last synced: 19 Jan 2025

https://github.com/arshadkazmi42/gh-crawl

Crawler for GitHub repositories. Finds all the broken links in the repositories

bug-bounty-recon crawl crawler gh-crawler github github-crawler githubcrawler python

Last synced: 21 Dec 2024

https://github.com/skylightqp/namu2csv

A Namuwiki crawler that converts headers to a CSV file for the KartRider wiki

crawler rust

Last synced: 08 Dec 2024

https://github.com/richecr/pyhltv

Repository to extract information from the HLTV website.

crawler csgo hacktoberfest hltv hltv-api python3

Last synced: 20 Jan 2025

https://github.com/vietdoo/sg-property-hub

SG Property Hub is a comprehensive platform for managing and analyzing property data.

airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping

Last synced: 13 Dec 2024

https://github.com/yjg30737/pyqt-wikipedia-crawler

Crawls Wikipedia with Python, powered by BeautifulSoup4; supports GUI/CUI

beautifulsoup4 crawler pyqt pyqt5 wikipedia

Last synced: 03 Jan 2025

https://github.com/mc256/node-static-webpage-crawler

Downloads an entire website with its directory structure.

cache-server crawler nodejs static-site

Last synced: 24 Jan 2025

https://github.com/tsaohucn/crawler_fb_group

A crawler for Facebook groups that uses Selenium

crawler facebook-groups rails ruby

Last synced: 20 Jan 2025

https://github.com/40uf411/sillybot

SillyBot is a wrapper for the selenium library

bot crawler python scraper selenium web wrapper

Last synced: 19 Dec 2024

https://github.com/piopi/behatcrawler

A Behat extension that crawls links on a website and executes a user-defined function on each of them.

behat behat-extension crawler php selenium-webdriver

Last synced: 19 Dec 2024

https://github.com/cseas/shares-monitor

Web crawler to fetch and monitor share details.

crawler python python3 scraper scraping-websites shares

Last synced: 27 Dec 2024

https://github.com/liebki/githubnet

This library allows you to retrieve data from GitHub, such as trending repositories, user profiles, users' repositories, and related information.

crawler crawling github github-trending htmlagilitypack microsoft

Last synced: 24 Jan 2025

https://github.com/eea/eea-crawler

EEA Crawler contains the tasks (DAGs) used by Apache Airflow to index content from various EEA-Eionet websites into a central Elasticsearch (aka content hub).

airflow-dags crawler elasticsearch etl-pipeline indexing

Last synced: 24 Jan 2025

https://github.com/basemax/jadi-net-blog

This Python script is used to extract posts from a WordPress blog (https://jadi.net/) and save them in HTML format. The script fetches the RSS feed, parses the posts, and saves each post as an individual HTML file.

blog-copier copier crawler crawler-python crawlers jadi-blog jadi-clone jadi-net-blog jadi-net-clone jadinet-blog py python python-crawler wordpress wp

Last synced: 24 Jan 2025

https://github.com/captain-woof/zhi-zhu

Zhi-Zhu is a multithreaded spidering script that recursively searches a base webpage and all URLs appearing in it for specific words (regex).

crawler crawler-python crawling-python python3

Last synced: 31 Dec 2024

https://github.com/jovijovi/ether-crawler

A transaction crawler for the Ethereum ecosystem.

blockchain crawler ether ethereum transaction

Last synced: 16 Jan 2025

https://github.com/bingxyz/blackcat

Query Black Cat logistics using a Telegram bot

crawler nodejs telegram-bot

Last synced: 22 Jan 2025

https://github.com/microlinkhq/ua

Simple Redis primitives to incr() and top() user agents

crawler redis user-agent user-agent-parser

Last synced: 12 Jan 2025

https://github.com/willi-dev/dtcapp

dtcapp: distributed Twitter crawler.

crawler distributed-systems hazelcast java twitter twitter-api

Last synced: 14 Jan 2025

https://github.com/tanja-4732/od-get

A Rust tool for recursively crawling & downloading data from open directories

cli crawler open-directory open-directory-downloader rust

Last synced: 14 Jan 2025

https://github.com/programming-with-love/skyeyesystem

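SkyEye system: crawls trending-search data from various platforms every ten minutes and stores it. Raw trending-search data is stored in MySQL; word-frequency statistics are stored in Redis.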

crawler mysql redis skyeye skyeyewall springboot

Last synced: 16 Jan 2025

https://github.com/mazzasaverio/scrapy-playwright-scrapegraphai

Web crawler using Scrapy + Playwright for dynamic content, featuring YAML-based configuration, PostgreSQL storage via aiosql, structured logging with logfire, and complete Docker/Terraform infrastructure. Built with uv package manager and Python 3.11+.

aiosql crawler docker playwright scrapy scrapy-playwright terraform uv

Last synced: 14 Jan 2025

https://github.com/princed/specht

Check links found in HTML or JS files by pattern

cli crawler html javascript streams

Last synced: 19 Jan 2025

https://github.com/alancesar/crawler

HTML crawler

crawler docker spider

Last synced: 03 Dec 2024

https://github.com/s3rgeym/wscrap

Command line web scraping tool.

crawler scraping

Last synced: 23 Dec 2024

https://github.com/discountry/crawler-microservice

crawler microservice

crawler

Last synced: 14 Dec 2024

https://github.com/jpleorx/tagblender

A simple Java API to retrieve hashtags from https://www.tagblender.net/

api crawler hashtags java jsoup parser

Last synced: 25 Jan 2025

https://github.com/yuchenq/comp90055-project

This is the latest version of my project for Comp90055.

couchdb crawler data-visualization python3 textblob tweepy

Last synced: 19 Jan 2025

https://github.com/leegeunhyeok/python-gongucrawler

Python 3 crawler for images and detailed information from Gongu Madang (공유마당)

crawler python

Last synced: 22 Dec 2024

https://github.com/aminehsan/datamining-divar.ir

Analyzing and Extracting Insights from Ads on 'divar.ir'

crawler data-mining data-science divar-ir scraping

Last synced: 04 Dec 2024

https://github.com/allotmentandy/socialmedialinkextractor

PHP Laravel package to extract social media links from an array of links; used as part of a spider for checking londinium.com website links

crawler extractor facebook laravel linked-list php social social-network spider twitter url youtube

Last synced: 23 Dec 2024

https://github.com/danielvigaru/easyreach

Crawler for faster Amazon reach

amazon crawler python

Last synced: 01 Jan 2025

https://github.com/yosh1/mio-crawler

A crawler that retrieves IIJmio data usage.

crawler iijmio mio ruby

Last synced: 12 Jan 2025

https://github.com/dnknth/robot.py

Simple web spider

crawler curio python

Last synced: 23 Jan 2025

https://github.com/raspi/scrapy-amp

Crawler for Amiga Music Preservation (AMP) site

amiga crawler mod module music python s3m scrapy spider tracker

Last synced: 08 Jan 2025

https://github.com/raspi/scrapy-corsair

Web crawler for Corsair (corsair.com)

crawler hardware memory scrapy spider

Last synced: 08 Jan 2025

https://github.com/jofaval/open-graph-visualizer

Web scraping showcase of how crawlers retrieve a site's details through the Open Graph protocol

crawler javascript opengraph scraping web web-scraping

Last synced: 09 Dec 2024

https://github.com/hoan02/novel-crawler

Tool for scraping story data for doctruyen.io.vn

crawler python

Last synced: 20 Jan 2025

https://github.com/mawkler/go-web-crawler

Toy web server written in Go

crawler go

Last synced: 04 Dec 2024

https://github.com/rutopio/crawler-cpbl-player-data

Crawls and organizes player data from the official Chinese Professional Baseball League (CPBL) website.

cpbl crawler crawling python

Last synced: 04 Dec 2024

https://github.com/rutopio/crawler-2020-taiwanese-election-results

Crawler for the 2020 Taiwanese election results, using the at-large party-list vote as an example

crawler python

Last synced: 04 Dec 2024

https://github.com/brianbruggeman/vax

A vaccination signup tool

covid-19 crawler signup vaccination

Last synced: 16 Jan 2025

https://github.com/zenixls2/2chpreprocess

Dump messages from 2ch with some preprocessing for ML analysis

2ch crawler python

Last synced: 04 Dec 2024

https://github.com/jayzhan211/python-crawler-startups

Learning Python crawlers

crawler python

Last synced: 25 Jan 2025

https://github.com/dylancl/sitemap-crawler

Verify the status of each URL in a (hosted) sitemap XML file.

crawler parser scraper sitemap xml

Last synced: 27 Dec 2024

https://github.com/fritz-c/itunes-stats

Fetch info on podcasts, etc. from iTunes RSS data

crawler itunes

Last synced: 02 Jan 2025

https://github.com/huakunshen/cron-crawler-template

Web crawler cron job template that runs with GitHub Actions. Capable of sending email notifications.

crawler github-actions python

Last synced: 17 Jan 2025

https://github.com/raspi/scrapy-crucial

Web crawler for Crucial (crucial.com)

crawler hardware memory scrapy spider

Last synced: 08 Jan 2025

https://github.com/eklem/vinmonopolet-crawler

Crawling Vinmonopolet data and indexing it into a norch search index

crawler dataset javascript norch search-engine

Last synced: 04 Dec 2024

https://github.com/sajjadanwar0/booking.com-scraping

Scraping booking.com using Selenium and Beautiful Soup

crawler data python scraping selenium

Last synced: 14 Jan 2025

https://github.com/ark930/douban-movie-crawler

Douban movie review crawler

crawler douban movie python

Last synced: 24 Jan 2025

https://github.com/tetreum/puppeteer-for-crawling

Everyday crawling methods for Puppeteer

crawler crawling puppeteer

Last synced: 09 Dec 2024

https://github.com/dalthviz/csapp

Crawler-scraper for the Play Store

crawler csapp keyword nlp playstore rating review scrapper

Last synced: 12 Jan 2025

https://github.com/edumucelli/rubybikes

A set of Bike Sharing System parsers in Ruby

bike-sharing crawler ruby

Last synced: 24 Dec 2024

https://github.com/mstephen19/apify-click-events

Like TypeScript, but for clicking ;) Manage automated clicks, and ensure your Apify web-crawler is only clicking exactly what you allow it to

apify apify-sdk crawler scraper web-automation

Last synced: 10 Dec 2024