An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with web-crawling

A curated list of projects in awesome lists tagged with web-crawling.

https://github.com/apifytech/apify-js

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 06 Jul 2025

https://github.com/apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 03 Nov 2025

https://github.com/apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping

Last synced: 06 Mar 2026

https://github.com/turnersoftware/infinitycrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 21 Jun 2025

https://github.com/TurnerSoftware/InfinityCrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 25 Mar 2025

https://github.com/ayakashi-io/ayakashi

⚡ Ayakashi.io - The next generation web scraping framework

automation data-mining headless-chrome web-crawling web-scraping

Last synced: 11 Apr 2025

https://github.com/scrapinghub/scrapy-training

Scrapy Training companion code

python scrapy training web-crawling web-scraping

Last synced: 25 Apr 2025

https://github.com/my8100/scrapyd-cluster-on-heroku

Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks.

cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping

Last synced: 14 Jun 2025

https://github.com/dongweiming/daenerys

Scraping and Web Crawling Framework For Zhihu Live

scraping web-crawling zhihu zhihulive

Last synced: 20 Mar 2025

https://github.com/pinkpixel-dev/deep-research-mcp

A Model Context Protocol (MCP) compliant server designed for comprehensive web research. It uses Tavily's Search and Crawl APIs to gather detailed information on a given topic, then structures this data in a format perfect for LLMs to create high-quality markdown documents.

ai-tools data-aggregation deep-research documentation-generation information-retrieval knowledge-base llm mcp mcp-server model-context-protocol model-context-protocol-servers nodejs research-assistant search-api tavily typescript web-crawling web-research

Last synced: 09 May 2026

https://github.com/leopardslab/crawlerx

CrawlerX - An extensible, distributed, scalable crawler system: a web platform for crawling URLs over different kinds of protocols in a distributed way.

django-backend elasticsearch firebase-auth message-broker mongodb-server vuejs web-crawling

Last synced: 29 Aug 2025

https://github.com/superbrucejia/dynamic-web-crawlering-python

This repo is mainly for crawling dynamic (Ajax-based) websites using Python, taking China's NSTL websites as an example.

dynamic-web-crawler dynamic-website nstl python python-crawler web-crawler-python web-crawling

Last synced: 21 Apr 2025

https://github.com/hubertroy/seen

A lightweight crawling/spider framework for everyone (supports JavaScript!). ✨

easy-to-use javasciprt lightweight-framework python3 spider-framework support-javascript web-crawling

Last synced: 25 Mar 2025

https://github.com/innovinati/microwler

A micro-framework for asynchronous deep crawls and web scraping with Python

aiohttp asyncio micro-framework nuxt parsel python quart web-crawling web-scraping

Last synced: 14 Jan 2026

https://github.com/talaatmagdyx/socials_regex

🪡 Social account detection and extraction in ruby, e.g. for crawling/scraping.

ruby web-crawling web-scraping

Last synced: 07 May 2025

https://github.com/andredarcie/best-games-of-all-time-data-based

🏆 A definitive, data-based ranking of the best games of all time, compiled from multiple sources

best critics data dataset game rank video-game video-games web-crawling web-scraping

Last synced: 28 Apr 2025

https://github.com/pps-22-scooby/pps-22-scooby

A Scala application that crawls and scrapes the web pages given as input, following rules passed to it through a DSL.

crawler crawlers internal-dsl scala scraper scrapers web web-crawler web-crawling web-scraper web-scrapers

Last synced: 24 Oct 2025

https://github.com/joe-stifler/crawler

Crawler is a Python package that crawls web pages and converts their content into Markdown format, making it easy to create documentation, notes, or other text-based representations. It features domain restrictions, flexible output options, and graph visualization.

context-extraction conversion file-system-crawling github-repository-crawling large-language-models llm markdown python web-crawling

Last synced: 24 Jul 2025

https://github.com/lewisakura/spiderboi

A web crawling library written in TypeScript.

spider typescript typescript3 web-crawler web-crawling web-spider webcrawler

Last synced: 12 Apr 2025

https://github.com/thesp0nge/nightcrawler-mitm

A Python program that crawls a website and tries to stress it, polluting forms with bogus data.

crawler offensive-scripts offensive-security stress-test web-crawler web-crawling

Last synced: 30 Apr 2025

https://github.com/lekhmanrus/real-shot-pdf

RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.

ai-assistant angular browser-extension chrome-extension data-preservation gpt-integration knowledge-base knowledgebase link-parsing local-data-processing pdf pdf-downloader pdf-generation pdf-generator pdf-merger web-content-capture web-crawling web-data-extraction web-scraping webpage-to-pdf

Last synced: 14 Sep 2025

https://github.com/ahmed-alnassif/net-spider

Net-Spider is a web scraping tool designed to retrieve the source code for a web page, including front-end elements such as JavaScript, CSS, images, and fonts. It allows you to crawl and download the source code from a target website.

beautifulsoup4 command-line-interface front-end-web-development python3 source-code-extraction web-automation web-crawling web-development-tool web-optimization web-scraping

Last synced: 26 Jun 2025

https://github.com/amienbou121/crawl4ai-mcp-server

🕷️ Enable AI agents to scrape and crawl the web effortlessly with this lightweight Model Context Protocol server, integrating seamlessly into your workflows.

agentic-ai agentic-rag agentic-workflow ai-agents claude-code crawl4ai crawling firecrawl-alternative keyword-search mcp model-context-protocol-servers openai-agents-sdk pydantic-ai reranking semantic-search uvx web-crawling web-scraping

Last synced: 02 May 2026

https://github.com/afrontend/dongnelibrary

A utility that checks whether library books are available to borrow.

public-library web-crawling

Last synced: 26 Jun 2025

https://github.com/mirtia/inappropriate-youtube

This repository contains some of the scripts used to obtain YouTube channel features and analyze potentially disturbing channels.

data-mining web-crawling youtube

Last synced: 16 Sep 2025

https://github.com/osamikoyo/geass

A web crawler with some API functions and configuration options.

docker go golang searching url web-crawler web-crawling web-scraping

Last synced: 14 Jul 2025

https://github.com/fern-aerell/web-crawling-to-txt

A simple web crawling application that can traverse URLs, extract text content, and save the results in TXT format.

beautifulsoup4 crawling python requests scraping txt web-crawling web-scraping

Last synced: 03 May 2025

https://github.com/joeri-abbo/python-credly-scraper

This project is a set of Python scripts designed to crawl and extract data from the Credly platform, focusing on skills, organizations, and badges. The scripts allow users to perform searches using command-line arguments, predefined search terms, or skills listed in a JSON file. The collected data is then saved to JSON files for further analysis an

badges crawler credly data-extraction json organizations python python3 requests-library skills web-crawling

Last synced: 23 Sep 2025

https://github.com/kluhan/kraken

Kraken is a generic, mid-scale web crawler specifically built to crawl vertical data-sources, like Youtube or the Google Play Store.

celery crawler google-play-store python web-crawling

Last synced: 07 Sep 2025

https://github.com/feluelle/pastebin-crawler-lib

A library for web crawling http://pastebin.com

pastebin web-crawling

Last synced: 12 Jun 2025

https://github.com/mstephen19/apify-global-store

Easily access, manage, and manipulate a global state/multiple global states for an Apify actor's run

apify apify-sdk state state-management typescript web-crawling web-scraping

Last synced: 15 Mar 2025

https://github.com/yjg30737/wiki-offline

Converts Wikipedia HTML into plain text so it can be read offline.

beautifulsoup python python3 python37 python38 urllib web-crawler web-crawling wiki wikipedia

Last synced: 14 Dec 2025

https://github.com/ahnsv/go-subscription-commerce

A commerce service for subscription-based products, implemented in Go.

golang subscription-commerce web-crawling

Last synced: 16 Apr 2026

https://github.com/marcuwynu23/bahagi

A scraping tool that extracts table data from websites and converts it into a file.

bs4 console datasets requests scraping tool tools web-crawling

Last synced: 14 May 2025

https://github.com/solangeug/webscraper

A web scraping project in Python using Scrapy, an open source and collaborative framework for extracting data from websites.

data-mining python27 scrapy web-crawling web-scraping

Last synced: 17 Jun 2025

https://github.com/ankush-chander/github-crawler

Crawl information from GitHub in a friendly manner.

human-resource-analytics web-crawling

Last synced: 30 Aug 2025

https://github.com/savinrazvan/pagerank

This project implements the PageRank algorithm to rank web pages by importance using two approaches: a sampling method with the Markov Chain random surfer model and an iterative method with a recursive mathematical expression.

alogrithm convergence data-science graph-theory iterative-methods markov-chain mathematical-modelling pagerank pagerank-algorithm python random-surfer-model recursive-algorithm sampling-methods search-engine simulation web-crawling

Last synced: 06 May 2026

https://github.com/18520339/web-scraping-with-scrapy

Python web scraping with Scrapy

scrapy web-crawling web-scraping

Last synced: 30 Mar 2025

https://github.com/justserpapi/web-html

JustSerpAPI Crawl Webpage HTML API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.

crawler google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api html-api justserpapi python serp-api web-crawling web-html-api web-scraping

Last synced: 08 Jun 2026

https://github.com/justserpapi/web-rendered-html

JustSerpAPI Crawl Webpage Rendered HTML API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.

google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api javascript-rendering justserpapi python rendered-html-api serp-api web-crawling web-rendering web-scraping

Last synced: 08 Jun 2026

https://github.com/kunalpisolkar24/ir_lab

Collection of practical code for Savitribai Phule Pune University's Information Retrieval Lab (410247).

cosine-similarity information-retrieval map-reduce pagerank sppu-computer-engineering text-preprocessing web-crawling

Last synced: 09 Jun 2026

https://github.com/blocklet/snap-kit

Snap Kit is a powerful, Puppeteer-based service designed for seamless web automation. It enables you to effortlessly capture high-fidelity web page screenshots and efficiently scrape web content for precise data extraction.

browser-automation puppeteer self-hosting web-crawling web-screenshot

Last synced: 10 Aug 2025

https://github.com/ffiruzi/website-rag-qa-assistant

Transform any website into an intelligent AI assistant with ONE Docker command. Complete RAG system with web crawling, vector embeddings, and beautiful chat interface. Built with FastAPI, React, and LangChain.

artificial-intelligence chatbot docker fastapi full-stack langchain openai python question-answering rag rag-chatbot react typescript vector-database web-crawling

Last synced: 09 Apr 2026

https://github.com/manu-sh/http_normalizer_parts

HTTP URL normalization utilities for web crawlers

http-url library normalization spiders web-crawling web-scraping

Last synced: 05 Aug 2025

https://github.com/harr1424/go-crawl

A utility to crawl specified domains and download .zip files

go golang web-crawling

Last synced: 19 Aug 2025

https://github.com/pmuens/crawler

Multi-threaded Web crawler with support for custom fetching and persisting logic

crawler crawler-engine rust rust-lang web-crawler web-crawling

Last synced: 15 May 2025

https://github.com/osandadeshan/web-crawler

This is a simple application to crawl your web pages.

app java8 maven selenium-webdriver swing-gui web-crawler web-crawling

Last synced: 01 May 2026

https://github.com/pv-912/scrapy

Crawls data on upcoming and recent movies from the IMDb website using python-scrapy.

python-scrapy scrapy-spider web-crawling

Last synced: 26 May 2026

https://github.com/hasdata/find-urls-from-any-domain

This repository provides practical examples of website link scraping using Python and Node.js.

ai-extraction crawler hasdata-api nodejs python sitemap-parser url-extraction web-crawling web-scraping

Last synced: 06 May 2026

https://github.com/elanora96/occlucrawlee

A Crawlee Web Spider for Horg.com, collects information on known Occlupanids.

cheerio crawlee nix occlupanid typescript web-crawling

Last synced: 20 Apr 2026