Projects in Awesome Lists tagged with web-crawling
A curated list of projects in awesome lists tagged with web-crawling.
https://github.com/apifytech/apify-js
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping
Last synced: 06 Jul 2025
https://github.com/apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping
Last synced: 03 Nov 2025
https://github.com/apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping
Last synced: 06 Mar 2026
https://github.com/brightdata/brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction llm mcp mcp-server modelcontextprotocol scraping scraping-tools structured-data web-crawling web-data web-scraping
Last synced: 16 Jan 2026
https://github.com/omkarcloud/botasaurus
The All in One Framework to Build Undefeatable Scrapers
anti-bot anti-detect anti-detect-browser anti-detection antidetect-browser bot-detection bypass-cloudflare cloudflare-bypass cloudflare-scrape python-scraper python-web-scraper python-web-scraping scraping-framework scraping-python scraping-tool undetectable undetected undetected-chromedriver web-crawling web-scraping-python
Last synced: 13 May 2025
https://github.com/crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping
Last synced: 15 May 2025
https://github.com/jrbadiabo/bet-on-sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
algorithms beautifulsoup machine-learning machine-learning-algorithms machinelearning predictive-analysis python python-2 scikit-learn selenium sports-stats sportsanalytics web-crawling web-scraping
Last synced: 13 Apr 2025
https://github.com/turnersoftware/infinitycrawler
A simple but powerful web crawler library for .NET
crawler robots-txt spider web-crawler web-crawling
Last synced: 21 Jun 2025
https://github.com/TurnerSoftware/InfinityCrawler
A simple but powerful web crawler library for .NET
crawler robots-txt spider web-crawler web-crawling
Last synced: 25 Mar 2025
https://github.com/spyboy-productions/omnisci3nt
Unveiling the Hidden Layers of the Web – A Comprehensive Web Reconnaissance Tool
admin-login-finder admin-panel-finder admin-panel-finder-of-any-website directory-enumeration dmarc-record-examination dns-enumeration ip-lookup osint pentesting-tools port-scanning reconnaissance-tool social-media-and-email-discovery ssl-certificate subdomain-enumeration technology-analysis wayback-machine-access web-crawling web-reconnaissance website-hacking whois
Last synced: 04 Apr 2025
https://github.com/ayakashi-io/ayakashi
:zap: Ayakashi.io - The next generation web scraping framework
automation data-mining headless-chrome web-crawling web-scraping
Last synced: 11 Apr 2025
https://github.com/serpapi/clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping
Last synced: 06 Apr 2025
https://github.com/scrapinghub/scrapy-training
Scrapy Training companion code
python scrapy training web-crawling web-scraping
Last synced: 25 Apr 2025
https://github.com/my8100/scrapyd-cluster-on-heroku
Set up free and scalable Scrapyd cluster for distributed web-crawling with just a few clicks. DEMO :point_right:
cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping
Last synced: 14 Jun 2025
https://github.com/maxvalue/terpene-profile-parser-for-cannabis-strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
analysis aromatherapy bioinformatics biological-data biological-data-analysis cannabis cannabis-strains crawler data-science database health plants python python-3 scrapy terpene-profile terpenes web-crawler web-crawler-python web-crawling
Last synced: 22 Apr 2025
https://github.com/maxmindlin/scout-lang
A web crawling programming language
dsl programming-language scraper scraping scraping-websites web-crawling web-scraping
Last synced: 06 Apr 2025
https://github.com/alyakhtar/Katastrophe
Command Line Tool to download torrents
bittorrent command-line deluge kickass-torrents python screenshot torrent web-crawling
Last synced: 19 Jul 2025
https://github.com/scrapingant/amazon_scraper
Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt
amazon amazon-scraper amazon-scraping-library data-mining js node-js price-scraper price-scraping scrape-products scraper scraping scraping-api scraping-data scraping-python scraping-web scraping-websites web-crawler web-crawlers web-crawling
Last synced: 22 Aug 2025
https://github.com/ScrapingAnt/amazon_scraper
Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt
amazon amazon-scraper amazon-scraping-library data-mining js node-js price-scraper price-scraping scrape-products scraper scraping scraping-api scraping-data scraping-python scraping-web scraping-websites web-crawler web-crawlers web-crawling
Last synced: 06 Apr 2025
https://github.com/spyboy-productions/phantomcrawler
Boost website hits by generating requests from multiple proxy IPs.
ddos-attack-tools proxy proxy-configuration proxy-rotation web-crawling web-scrapping website-analytics website-hits
Last synced: 12 May 2025
https://github.com/dongweiming/daenerys
Scraping and Web Crawling Framework For Zhihu Live
scraping web-crawling zhihu zhihulive
Last synced: 20 Mar 2025
https://github.com/jgujerry/python-frameworks
Another curated list of Python frameworks
api artificial-intelligence cms data-workflow deep-learning devops distributed-computing enterprise-integrations frameworks machine-learning messaging parallel-computing pipeline python task-queue web-crawling webapp
Last synced: 26 Mar 2025
https://github.com/mohamedhmini/tweetsolaping
implementing an end-to-end tweets ETL/Analysis pipeline.
analysis api-client cube-analysis datawarehouse datawarehousing etl-pipeline google-api-client multi-dimensional-analysis multithreading powerbi-report ssas-multidimensional ssis tweets tweets-classification tweets-scraper twitter-api web-crawling
Last synced: 10 Jun 2025
https://github.com/spyboy-productions/PhantomCrawler
Boost website hits by generating requests from multiple proxy IPs.
ddos-attack-tools proxy proxy-configuration proxy-rotation web-crawling web-scrapping website-analytics website-hits
Last synced: 20 Apr 2025
https://github.com/mike-gee/webtranspose
Web scraping API for building AI applications.
chatbots crawling crawling-python python scraping scraping-python web-crawling web-scraping web-scraping-python
Last synced: 29 Apr 2025
https://github.com/cheng-lin-li/knowledgegraph
This repository is for web crawling, information extraction, and knowledge graph construction.
cdr conditional conditional-random-fields crfsuite facebook-crawler facebook-graph-api information-extraction jsonlines knowledge-graph python python3 web-crawling
Last synced: 09 Mar 2026
https://github.com/scrapingant/zoominfo_scraper
ZoomInfo scraper using rotating proxies and headless Chrome from ScrapingAnt
datamining leadgen leadgeneration python scraper scraping scraping-api scraping-data scraping-tool scraping-websites web-crawler web-crawler-python web-crawling web-harvesting zoominfo-client
Last synced: 11 Jun 2025
https://github.com/zytedata/spidyquotes
Example site for web scraping tutorials
crawling playground scraping tutorials web-crawling web-scraping web-scraping-tutorials
Last synced: 01 Apr 2026
https://github.com/pinkpixel-dev/deep-research-mcp
A Model Context Protocol (MCP) compliant server designed for comprehensive web research. It uses Tavily's Search and Crawl APIs to gather detailed information on a given topic, then structures this data in a format perfect for LLMs to create high-quality markdown documents.
ai-tools data-aggregation deep-research documentation-generation information-retrieval knowledge-base llm mcp mcp-server model-context-protocol model-context-protocol-servers nodejs research-assistant search-api tavily typescript web-crawling web-research
Last synced: 09 May 2026
https://github.com/yuis-ice/jseval
Evaluate JavaScript on a URL through headless Chrome browser.
browser-automation cli-utilities cmdline command-line commandline-interface data-scraping datascraping eval evaluator headless-browser headless-browsers pupeteer scrapers scrapper scrapping web-browser web-crawling web-scrapping webscrapping website-scraper
Last synced: 11 Apr 2025
https://github.com/omkarcloud/botasaurus-starter
🚀 OFFICIAL STARTER TEMPLATE FOR BOTASAURUS SCRAPING FRAMEWORK 🤖
beautifulsoup crawler crawling crawling-framework crawling-python crawling-tool headless node-crawler python-crawler scraper scraping scraping-framework scraping-python scraping-tool selenium web-crawler web-crawling web-scraper web-scraping webscraping
Last synced: 23 Apr 2025
https://github.com/leopardslab/crawlerx
CrawlerX - An extensible, distributed, scalable crawler system: a web platform for crawling URLs over different kinds of protocols in a distributed way.
django-backend elasticsearch firebase-auth message-broker mongodb-server vuejs web-crawling
Last synced: 29 Aug 2025
https://github.com/miroshnikov/scrapyteer
Web crawling & scraping framework for Node.js on top of headless Chrome browser
crawer crawling crawling-framework crawling-sites crawling-tool headless scrape scraper scraping scraping-websites scrapy scrapy-crawler spider spider-framework web-crawler web-crawling web-scraping web-scraping-nodejs
Last synced: 26 Oct 2025
https://github.com/superbrucejia/dynamic-web-crawlering-python
This repo is mainly for dynamic web (Ajax Tech) crawling using Python, taking China's NSTL websites as an example.
dynamic-web-crawler dynamic-website nstl python python-crawler web-crawler-python web-crawling
Last synced: 21 Apr 2025
https://github.com/omkarcloud/omkar-temp-mail
🚀 OMKAR TEMP MAIL HELPS YOU USE TEMPORARY EMAILS. 🤖
10minute 10minutemail beautifulsoup crawling disposable-email disposable-email-addresses free-mail mail-api scraper scraping scraping-framework selenium temp-mail tempmail temporary-email web-crawler web-crawling web-scraper web-scraping webscraping
Last synced: 18 Mar 2025
https://github.com/hubertroy/seen
A lightweight crawling/spider framework for everyone (supports JavaScript!). :sparkles:
easy-to-use javasciprt lightweight-framework python3 spider-framework support-javascript web-crawling
Last synced: 25 Mar 2025
https://github.com/innovinati/microwler
A micro-framework for asynchronous deep crawls and web scraping with Python
aiohttp asyncio micro-framework nuxt parsel python quart web-crawling web-scraping
Last synced: 14 Jan 2026
https://github.com/scrapingant/alibaba_scraper
Alibaba scraper using rotating proxies and headless Chrome from ScrapingAnt
alibaba-scraper datamining price-scraper price-scraping python scraper scraping scraping-api scraping-data scraping-tool scraping-web scraping-websites web-crawler web-crawler-python web-crawling
Last synced: 15 Aug 2025
https://github.com/crwlrsoft/robots-txt
Robots Exclusion Standard/Protocol Parser for Web Crawling/Scraping
hacktoberfest robots-exclusion-protocol robots-exclusion-standard robots-txt robots-txt-parser web-crawling web-scraping
Last synced: 13 May 2025
https://github.com/my8100/scrapyd-cluster-on-heroku-scrapyd-app
How to set up Scrapyd cluster on Heroku
cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping
Last synced: 13 Apr 2025
https://github.com/omkarcloud/web-scraping-template
🚀 THIS WEB SCRAPING TEMPLATE PROVIDES YOU WITH A GREAT STARTING POINT WHEN CREATING WEB SCRAPING BOTS. 🤖
beautifulsoup crawler crawling crawling-framework crawling-python crawling-tool headless node-crawler python-crawler scraper scraping scraping-framework scraping-python scraping-tool selenium web-crawler web-crawling web-scraper web-scraping webscraping
Last synced: 24 Oct 2025
https://github.com/talaatmagdyx/socials_regex
🪡 Social account detection and extraction in ruby, e.g. for crawling/scraping.
ruby web-crawling web-scraping
Last synced: 07 May 2025
https://github.com/michaelradu/web-crawler
A Web Crawler developed in Python.
crawler crawler-python crawlers python python-3 python-script python3 script scripting scripting-language scripts web web-crawler web-crawler-python web-crawlers web-crawling webcrawl webcrawler webcrawling
Last synced: 25 Jul 2025
https://github.com/andredarcie/best-games-of-all-time-data-based
🏆 Definite Best Games Of All Time Data Based by multiple sources
best critics data dataset game rank video-game video-games web-crawling web-scraping
Last synced: 28 Apr 2025
https://github.com/pps-22-scooby/pps-22-scooby
Scala application that allows web crawling and web scraping of web pages given as input with the use of special rules passed to it through the use of a DSL.
crawler crawlers internal-dsl scala scraper scrapers web web-crawler web-crawling web-scraper web-scrapers
Last synced: 24 Oct 2025
https://github.com/wangjksjtu/Data-Mining-51Job
Data-mining on 51Job website
51job data-mining machine-learning scikit-learn seaborn web-crawling
Last synced: 09 May 2025
https://github.com/joe-stifler/crawler
Crawler is a Python package that crawls web pages and converts their content into Markdown format, making it easy to create documentation, notes, or other text-based representations. It features domain restrictions, flexible output options, and graph visualization.
context-extraction conversion file-system-crawling github-repository-crawling large-language-models llm markdown python web-crawling
Last synced: 24 Jul 2025
https://github.com/lewisakura/spiderboi
A web crawling library written in TypeScript.
spider typescript typescript3 web-crawler web-crawling web-spider webcrawler
Last synced: 12 Apr 2025
https://github.com/0memo07/web-crawler
Web Crawler with Python
beautifulsoup4 bs4 crawler crawlers crawling crawling-python web-crawler web-crawler-python web-crawling webcrawler
Last synced: 24 Apr 2025
https://github.com/thesp0nge/nightcrawler-mitm
A python program that crawls a website and tries to stress it, polluting forms with bogus data
crawler offensive-scripts offensive-security stress-test web-crawler web-crawling
Last synced: 30 Apr 2025
https://github.com/omkarcloud/puppeteer-captcha-solving-tutorial
🚀 LEARN HOW TO SOLVE CAPTCHA IN PUPPETEER USING CAPSOLVER 🤖
2captcha anticaptcha capmonster capsolver capsolver-captcha capsolver-python captcha captcha-breaker captcha-breaking captcha-bypass captcha-generator captcha-image captcha-library captcha-solver captcha-solving crack-captcha hcaptcha web-crawling webscraping
Last synced: 07 Sep 2025
https://github.com/lekhmanrus/real-shot-pdf
RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.
ai-assistant angular browser-extension chrome-extension data-preservation gpt-integration knowledge-base knowledgebase link-parsing local-data-processing pdf pdf-downloader pdf-generation pdf-generator pdf-merger web-content-capture web-crawling web-data-extraction web-scraping webpage-to-pdf
Last synced: 14 Sep 2025
https://github.com/prakharchoudhary/fun_with_python
My adventures with python!!
automation machine-learning machine-learning-algorithms python regular-expression web-crawling web-scraping
Last synced: 11 Jul 2025
https://github.com/omkarcloud/selenium-2captcha-recaptcha-solver-demo
🚀 FINAL CODE FOR TUTORIAL ON HOW TO SOLVE CAPTCHA IN SELENIUM USING 2CAPTCHA 🤖
2captcha captcha captcha-break captcha-breaker captcha-breaking captcha-bypass captcha-generator captcha-image captcha-library captcha-solver captcha-solving crack-captcha scraping scraping-framework selenium web-crawler web-crawling web-scraper web-scraping webscraping
Last synced: 07 Sep 2025
https://github.com/ahmed-alnassif/net-spider
Net-Spider is a web scraping tool designed to retrieve the source code for a web page, including front-end elements such as JavaScript, CSS, images, and fonts. It allows you to crawl and download the source code from a target website.
beautifulsoup4 command-line-interface front-end-web-development python3 source-code-extraction web-automation web-crawling web-development-tool web-optimization web-scraping
Last synced: 26 Jun 2025
https://github.com/solrikk/datadigger
DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.
business-intelligence content-extraction data-analysis data-collection data-extraction data-mining go golang-api html-parser marketing-tools metadata-extraction research-tools seo-tools web-application web-crawling web-scraping web-tools
Last synced: 15 Apr 2025
https://github.com/crwlrsoft/laravel-crawler
Laravel adapter for the crwlr/crawler package.
crawler crawling crawling-framework hacktoberfest laravel laravel-package php scraper scraping web-crawler web-crawling web-scraping
Last synced: 28 Feb 2025
https://github.com/elektrostudios/fhm-crawler-freehardmusic.com
Crawls download urls of albums from freehardmusic.com website
albums crawl crawler crawling desktop-app desktop-application dotnet music web-crawler web-crawling web-scraper web-scraping webcrawler webcrawling webscraper webscraping windows windows-app windowsapp winforms
Last synced: 19 Jul 2025
https://github.com/amienbou121/crawl4ai-mcp-server
🕷️ Enable AI agents to scrape and crawl the web effortlessly with this lightweight Model Context Protocol server, integrating seamlessly into your workflows.
agentic-ai agentic-rag agentic-workflow ai-agents claude-code crawl4ai crawling firecrawl-alternative keyword-search mcp model-context-protocol-servers openai-agents-sdk pydantic-ai reranking semantic-search uvx web-crawling web-scraping
Last synced: 02 May 2026
https://github.com/saka7/data-science-sandbox
Data Science, Data Analysis, Web Scraping Sandbox
ai artificial-intelligence data-analysis data-science jupyter-notebook machine-learning neural-network sandbox web-crawling web-scraping
Last synced: 02 May 2026
https://github.com/kitsunesemcalda/info-elixir
A web crawler built for recursive crawling
elixir elixir-ecto privategpt web-crawler web-crawling web-crawling-and-scraping
Last synced: 03 Apr 2025
https://github.com/mirtia/inappropriate-youtube
This repository contains some of the scripts used to obtain YouTube channel features and analyze potentially disturbing channels.
data-mining web-crawling youtube
Last synced: 16 Sep 2025
https://github.com/osamikoyo/geass
A web crawler with some API functions and configuration options
docker go golang searching url web-crawler web-crawling web-scraping
Last synced: 14 Jul 2025
https://github.com/frederik-uni/docker-cloudflare-bypasser
A simple api that runs within a container that returns user-agent and cookies
anti-bot anti-detect anti-detection bypass-cloudflare cloudflare-bypass cloudflare-bypasser cloudflare-scraper undetectable web-crawling
Last synced: 11 Jun 2025
https://github.com/fern-aerell/web-crawling-to-txt
A simple web crawling application that can traverse URLs, extract text content, and save the results in TXT format.
beautifulsoup4 crawling python requests scraping txt web-crawling web-scraping
Last synced: 03 May 2025
https://github.com/joeri-abbo/python-credly-scraper
This project is a set of Python scripts designed to crawl and extract data from the Credly platform, focusing on skills, organizations, and badges. The scripts allow users to perform searches using command-line arguments, predefined search terms, or skills listed in a JSON file. The collected data is then saved to JSON files for further analysis an
badges crawler credly data-extraction json organizations python python3 requests-library skills web-crawling
Last synced: 23 Sep 2025
https://github.com/kluhan/kraken
Kraken is a generic, mid-scale web crawler specifically built to crawl vertical data-sources, like Youtube or the Google Play Store.
celery crawler google-play-store python web-crawling
Last synced: 07 Sep 2025
https://github.com/omkarcloud/dentalkart-scraper
🚀 SCRAPE 1000'S OF PRODUCTS FROM DENTALKART 🤖
beautifulsoup crawler crawling crawling-framework crawling-python dentalkart dentalkart-product-scraper dentalkart-scraper dentalkart-scraping node-crawler scraper scraping scraping-framework scraping-python selenium web-crawler web-crawling web-scraper web-scraping webscraping
Last synced: 07 Sep 2025
https://github.com/lightfeed/browser-agent
Serverless AI browser agent
ai ai-agents automation aws-lambda browser browser-agent browser-automation crawling playwright scraping serverless serverless-framework web-crawling web-extraction web-scraping
Last synced: 20 Jun 2025
https://github.com/feluelle/pastebin-crawler-lib
A library for web crawling http://pastebin.com
Last synced: 12 Jun 2025
https://github.com/mstephen19/apify-global-store
Easily access, manage, and manipulate a global state/multiple global states for an Apify actor's run
apify apify-sdk state state-management typescript web-crawling web-scraping
Last synced: 15 Mar 2025
https://github.com/islamhafez0/web-crawler
books-toscrap flask json pagination python restful-api scrapy web-crawling web-scraping
Last synced: 08 May 2026
https://github.com/iamfarrokhnejad/murkmaw
A web crawler using Rust.
functional functional-programming rust rust-lang web-crawler web-crawling webcrawler webcrawling
Last synced: 28 Mar 2025
https://github.com/yjg30737/wiki-offline
Converts Wikipedia HTML into TXT so that it can be read offline
beautifulsoup python python3 python37 python38 urllib web-crawler web-crawling wiki wikipedia
Last synced: 14 Dec 2025
https://github.com/ahnsv/go-subscription-commerce
A commerce service for subscriptionable products implemented by Go
golang subscription-commerce web-crawling
Last synced: 16 Apr 2026
https://github.com/marcuwynu23/bahagi
a scraping tool to get website table data and convert into a file
bs4 console datasets requests scraping tool tools web-crawling
Last synced: 14 May 2025
https://github.com/solangeug/webscraper
A web scraping project in Python using Scrapy, an open source and collaborative framework for extracting data from websites.
data-mining python27 scrapy web-crawling web-scraping
Last synced: 17 Jun 2025
https://github.com/my8100/scrapyd-cluster-on-heroku-scrapydweb-app-git
How to set up Scrapyd cluster on Heroku
cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping
Last synced: 09 Mar 2026
https://github.com/gesistsa/python-web-data-collection-tutorial
Tutorial of Web data collection with Python.
beautifulsoup data-science python web-crawling wikipedia
Last synced: 29 Apr 2026
https://github.com/ankush-chander/github-crawler
Crawl information from GitHub in a friendly manner.
human-resource-analytics web-crawling
Last synced: 30 Aug 2025
https://github.com/savinrazvan/pagerank
This project implements the PageRank algorithm to rank web pages by importance using two approaches: a sampling method with the Markov Chain random surfer model and an iterative method with a recursive mathematical expression.
alogrithm convergence data-science graph-theory iterative-methods markov-chain mathematical-modelling pagerank pagerank-algorithm python random-surfer-model recursive-algorithm sampling-methods search-engine simulation web-crawling
Last synced: 06 May 2026
https://github.com/oxylabs/pricing-data-collection-from-ecommerce-stores
Apache Airflow DAGs for e-commerce pricing collection.
appache e-commerce-scraper ebay-search ebay-searches ecommerce-scraper ecommerce-website pricing-data scraping web-crawling web-scraping
Last synced: 11 Mar 2025
https://github.com/18520339/web-scraping-with-scrapy
Python web scraping with Scrapy
scrapy web-crawling web-scraping
Last synced: 30 Mar 2025
https://github.com/justserpapi/web-html
JustSerpAPI Crawl Webpage HTML API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.
crawler google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api html-api justserpapi python serp-api web-crawling web-html-api web-scraping
Last synced: 08 Jun 2026
https://github.com/justserpapi/web-rendered-html
JustSerpAPI Crawl Webpage Rendered HTML API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.
google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api javascript-rendering justserpapi python rendered-html-api serp-api web-crawling web-rendering web-scraping
Last synced: 08 Jun 2026
https://github.com/kunalpisolkar24/ir_lab
Collection of practical code for Savitribai Phule Pune University's Information Retrieval Lab (410247).
cosine-similarity information-retrieval map-reduce pagerank sppu-computer-engineering text-preprocessing web-crawling
Last synced: 09 Jun 2026
https://github.com/blocklet/snap-kit
Snap Kit is a powerful, Puppeteer-based service designed for seamless web automation. It enables you to effortlessly capture high-fidelity web page screenshots and efficiently scrape web content for precise data extraction.
browser-automation puppeteer self-hosting web-crawling web-screenshot
Last synced: 10 Aug 2025
https://github.com/ffiruzi/website-rag-qa-assistant
Transform any website into an intelligent AI assistant with ONE Docker command. Complete RAG system with web crawling, vector embeddings, and beautiful chat interface. Built with FastAPI, React, and LangChain.
artificial-intelligence chatbot docker fastapi full-stack langchain openai python question-answering rag rag-chatbot react typescript vector-database web-crawling
Last synced: 09 Apr 2026
https://github.com/manu-sh/http_normalizer_parts
http url normalization utilities for web crawlers
http-url library normalization spiders web-crawling web-scraping
Last synced: 05 Aug 2025
https://github.com/hjsblogger/web-crawling-with-python
Demonstration of Web Crawling using Python and Beautiful Soup
beautifulsoup beautifulsoup4 lambdatest python python3 web-crawler web-crawling web-crawling-and-scraping
Last synced: 10 Aug 2025
https://github.com/harr1424/go-crawl
A utility to crawl specified domains and download .zip files
Last synced: 19 Aug 2025
https://github.com/pmuens/crawler
Multi-threaded Web crawler with support for custom fetching and persisting logic
crawler crawler-engine rust rust-lang web-crawler web-crawling
Last synced: 15 May 2025
https://github.com/osandadeshan/web-crawler
This is a simple application to crawl your web pages.
app java8 maven selenium-webdriver swing-gui web-crawler web-crawling
Last synced: 01 May 2026
https://github.com/pv-912/scrapy
Crawling upcoming and recent movies data from IMDB website using python-scrapy.
python-scrapy scrapy-spider web-crawling
Last synced: 26 May 2026
https://github.com/m1/smap
smap is a site-mapping engine written in Go.
crawler go go-library go-package golang golang-library golang-package golang-tools sitemap sitemap-generator web-crawler web-crawling
Last synced: 01 Jul 2025
https://github.com/himudigonda/arxiv.org_crawler
web-crawler web-crawler-python web-crawling
Last synced: 08 May 2025
https://github.com/hasdata/find-urls-from-any-domain
This repository provides practical examples of website link scraping using Python and Node.js.
ai-extraction crawler hasdata-api nodejs python sitemap-parser url-extraction web-crawling web-scraping
Last synced: 06 May 2026
https://github.com/dlr-sc/conference-analyzer
conference-management diversity diversity-measures pycon python web-crawling
Last synced: 18 Jun 2025
https://github.com/elanora96/occlucrawlee
A Crawlee Web Spider for Horg.com, collects information on known Occlupanids.
cheerio crawlee nix occlupanid typescript web-crawling
Last synced: 20 Apr 2026