Projects in Awesome Lists tagged with web-crawling

https://github.com/apifytech/apify-js

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 06 Jul 2025

https://github.com/apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 03 Nov 2025

https://github.com/apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping

Last synced: 06 Mar 2026

https://github.com/brightdata/brightdata-mcp

A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.

ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction llm mcp mcp-server modelcontextprotocol scraping scraping-tools structured-data web-crawling web-data web-scraping

Last synced: 16 Jan 2026

https://github.com/omkarcloud/botasaurus

The All in One Framework to Build Undefeatable Scrapers

anti-bot anti-detect anti-detect-browser anti-detection antidetect-browser bot-detection bypass-cloudflare cloudflare-bypass cloudflare-scrape python-scraper python-web-scraper python-web-scraping scraping-framework scraping-python scraping-tool undetectable undetected undetected-chromedriver web-crawling web-scraping-python

Last synced: 13 May 2025

https://github.com/crwlrsoft/crawler

Library for Rapid (Web) Crawler and Scraper Development

crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping

Last synced: 15 May 2025

https://github.com/jrbadiabo/bet-on-sibyl

Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)

algorithms beautifulsoup machine-learning machine-learning-algorithms machinelearning predictive-analysis python python-2 scikit-learn selenium sports-stats sportsanalytics web-crawling web-scraping

Last synced: 13 Apr 2025

https://github.com/turnersoftware/infinitycrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 21 Jun 2025

https://github.com/TurnerSoftware/InfinityCrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 25 Mar 2025

https://github.com/spyboy-productions/omnisci3nt

Unveiling the Hidden Layers of the Web – A Comprehensive Web Reconnaissance Tool

admin-login-finder admin-panel-finder admin-panel-finder-of-any-website directory-enumeration dmarc-record-examination dns-enumeration ip-lookup osint pentesting-tools port-scanning reconnaissance-tool social-media-and-email-discovery ssl-certificate subdomain-enumeration technology-analysis wayback-machine-access web-crawling web-reconnaissance website-hacking whois

Last synced: 04 Apr 2025

https://github.com/ayakashi-io/ayakashi

:zap: Ayakashi.io - The next generation web scraping framework

automation data-mining headless-chrome web-crawling web-scraping

Last synced: 11 Apr 2025

https://github.com/serpapi/clauneck

A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.

automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping

Last synced: 06 Apr 2025

https://github.com/scrapinghub/scrapy-training

Scrapy Training companion code

python scrapy training web-crawling web-scraping

Last synced: 25 Apr 2025

https://github.com/my8100/scrapyd-cluster-on-heroku

Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks.

cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping

Last synced: 14 Jun 2025

https://github.com/maxvalue/terpene-profile-parser-for-cannabis-strains

Parser and database to index the terpene profile of different strains of Cannabis from online databases

analysis aromatherapy bioinformatics biological-data biological-data-analysis cannabis cannabis-strains crawler data-science database health plants python python-3 scrapy terpene-profile terpenes web-crawler web-crawler-python web-crawling

Last synced: 22 Apr 2025

https://github.com/maxmindlin/scout-lang

A web crawling programming language

dsl programming-language scraper scraping scraping-websites web-crawling web-scraping

Last synced: 06 Apr 2025

https://github.com/alyakhtar/Katastrophe

Command Line Tool to download torrents

bittorrent command-line deluge kickass-torrents python screenshot torrent web-crawling

Last synced: 19 Jul 2025

https://github.com/scrapingant/amazon_scraper

Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt

amazon amazon-scraper amazon-scraping-library data-mining js node-js price-scraper price-scraping scrape-products scraper scraping scraping-api scraping-data scraping-python scraping-web scraping-websites web-crawler web-crawlers web-crawling

Last synced: 22 Aug 2025

https://github.com/ScrapingAnt/amazon_scraper

Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt

amazon amazon-scraper amazon-scraping-library data-mining js node-js price-scraper price-scraping scrape-products scraper scraping scraping-api scraping-data scraping-python scraping-web scraping-websites web-crawler web-crawlers web-crawling

Last synced: 06 Apr 2025

https://github.com/spyboy-productions/phantomcrawler

Boost website hits by generating requests from multiple proxy IPs.

ddos-attack-tools proxy proxy-configuration proxy-rotation web-crawling web-scrapping website-analytics website-hits

Last synced: 12 May 2025

https://github.com/dongweiming/daenerys

Scraping and Web Crawling Framework For Zhihu Live

scraping web-crawling zhihu zhihulive

Last synced: 20 Mar 2025

https://github.com/jgujerry/python-frameworks

Another curated list of Python frameworks

api artificial-intelligence cms data-workflow deep-learning devops distributed-computing enterprise-integrations frameworks machine-learning messaging parallel-computing pipeline python task-queue web-crawling webapp

Last synced: 26 Mar 2025

https://github.com/mohamedhmini/tweetsolaping

Implementing an end-to-end tweets ETL/analysis pipeline.

analysis api-client cube-analysis datawarehouse datawarehousing etl-pipeline google-api-client multi-dimensional-analysis multithreading powerbi-report ssas-multidimensional ssis tweets tweets-classification tweets-scraper twitter-api web-crawling

Last synced: 10 Jun 2025

https://github.com/spyboy-productions/PhantomCrawler

Boost website hits by generating requests from multiple proxy IPs.

ddos-attack-tools proxy proxy-configuration proxy-rotation web-crawling web-scrapping website-analytics website-hits

Last synced: 20 Apr 2025

https://github.com/mike-gee/webtranspose

Web scraping API for building AI applications.

chatbots crawling crawling-python python scraping scraping-python web-crawling web-scraping web-scraping-python

Last synced: 29 Apr 2025

https://github.com/cheng-lin-li/knowledgegraph

This repository is for web crawling, information extraction, and knowledge graph construction.

cdr conditional conditional-random-fields crfsuite facebook-crawler facebook-graph-api information-extraction jsonlines knowledge-graph python python3 web-crawling

Last synced: 09 Mar 2026

https://github.com/scrapingant/zoominfo_scraper

ZoomInfo scraper using rotating proxies and headless Chrome from ScrapingAnt

datamining leadgen leadgeneration python scraper scraping scraping-api scraping-data scraping-tool scraping-websites web-crawler web-crawler-python web-crawling web-harvesting zoominfo-client

Last synced: 11 Jun 2025

https://github.com/zytedata/spidyquotes

Example site for web scraping tutorials

crawling playground scraping tutorials web-crawling web-scraping web-scraping-tutorials

Last synced: 01 Apr 2026

https://github.com/pinkpixel-dev/deep-research-mcp

A Model Context Protocol (MCP) compliant server designed for comprehensive web research. It uses Tavily's Search and Crawl APIs to gather detailed information on a given topic, then structures this data in a format perfect for LLMs to create high-quality markdown documents.

ai-tools data-aggregation deep-research documentation-generation information-retrieval knowledge-base llm mcp mcp-server model-context-protocol model-context-protocol-servers nodejs research-assistant search-api tavily typescript web-crawling web-research

Last synced: 09 May 2026

https://github.com/yuis-ice/jseval

Evaluate JavaScript on a URL through headless Chrome browser.

browser-automation cli-utilities cmdline command-line commandline-interface data-scraping datascraping eval evaluator headless-browser headless-browsers pupeteer scrapers scrapper scrapping web-browser web-crawling web-scrapping webscrapping website-scraper

Last synced: 11 Apr 2025

https://github.com/omkarcloud/botasaurus-starter

🚀 OFFICIAL STARTER TEMPLATE FOR BOTASAURUS SCRAPING FRAMEWORK 🤖

beautifulsoup crawler crawling crawling-framework crawling-python crawling-tool headless node-crawler python-crawler scraper scraping scraping-framework scraping-python scraping-tool selenium web-crawler web-crawling web-scraper web-scraping webscraping

Last synced: 23 Apr 2025

https://github.com/leopardslab/crawlerx

CrawlerX - An extensible, distributed, scalable crawler system: a web platform for crawling URLs over different kinds of protocols in a distributed way.

django-backend elasticsearch firebase-auth message-broker mongodb-server vuejs web-crawling

Last synced: 29 Aug 2025

https://github.com/miroshnikov/scrapyteer

Web crawling & scraping framework for Node.js on top of headless Chrome browser

crawer crawling crawling-framework crawling-sites crawling-tool headless scrape scraper scraping scraping-websites scrapy scrapy-crawler spider spider-framework web-crawler web-crawling web-scraping web-scraping-nodejs

Last synced: 26 Oct 2025

https://github.com/superbrucejia/dynamic-web-crawlering-python

This repo is mainly for dynamic web (Ajax Tech) crawling using Python, taking China's NSTL websites as an example.

dynamic-web-crawler dynamic-website nstl python python-crawler web-crawler-python web-crawling

Last synced: 21 Apr 2025

https://github.com/omkarcloud/omkar-temp-mail

🚀 OMKAR TEMP MAIL HELPS YOU USE TEMPORARY EMAILS. 🤖

10minute 10minutemail beautifulsoup crawling disposable-email disposable-email-addresses free-mail mail-api scraper scraping scraping-framework selenium temp-mail tempmail temporary-email web-crawler web-crawling web-scraper web-scraping webscraping

Last synced: 18 Mar 2025

https://github.com/hubertroy/seen

A lightweight crawling/spider framework for everyone (supports JavaScript!). :sparkles:

easy-to-use javasciprt lightweight-framework python3 spider-framework support-javascript web-crawling

Last synced: 25 Mar 2025

https://github.com/innovinati/microwler

A micro-framework for asynchronous deep crawls and web scraping with Python

aiohttp asyncio micro-framework nuxt parsel python quart web-crawling web-scraping

Last synced: 14 Jan 2026

https://github.com/scrapingant/alibaba_scraper

Alibaba scraper using rotating proxies and headless Chrome from ScrapingAnt

alibaba-scraper datamining price-scraper price-scraping python scraper scraping scraping-api scraping-data scraping-tool scraping-web scraping-websites web-crawler web-crawler-python web-crawling

Last synced: 15 Aug 2025

https://github.com/crwlrsoft/robots-txt

Robots Exclusion Standard/Protocol Parser for Web Crawling/Scraping

hacktoberfest robots-exclusion-protocol robots-exclusion-standard robots-txt robots-txt-parser web-crawling web-scraping

Last synced: 13 May 2025

https://github.com/my8100/scrapyd-cluster-on-heroku-scrapyd-app

How to set up Scrapyd cluster on Heroku

cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping

Last synced: 13 Apr 2025

https://github.com/omkarcloud/web-scraping-template

🚀 THIS WEB SCRAPING TEMPLATE PROVIDES YOU WITH A GREAT STARTING POINT WHEN CREATING WEB SCRAPING BOTS. 🤖

beautifulsoup crawler crawling crawling-framework crawling-python crawling-tool headless node-crawler python-crawler scraper scraping scraping-framework scraping-python scraping-tool selenium web-crawler web-crawling web-scraper web-scraping webscraping

Last synced: 24 Oct 2025

https://github.com/talaatmagdyx/socials_regex

🪡 Social account detection and extraction in ruby, e.g. for crawling/scraping.

ruby web-crawling web-scraping

Last synced: 07 May 2025

https://github.com/michaelradu/web-crawler

A Web Crawler developed in Python.

crawler crawler-python crawlers python python-3 python-script python3 script scripting scripting-language scripts web web-crawler web-crawler-python web-crawlers web-crawling webcrawl webcrawler webcrawling

Last synced: 25 Jul 2025

https://github.com/andredarcie/best-games-of-all-time-data-based

🏆 Definitive best games of all time, based on data from multiple sources

best critics data dataset game rank video-game video-games web-crawling web-scraping

Last synced: 28 Apr 2025

https://github.com/pps-22-scooby/pps-22-scooby

Scala application that allows web crawling and web scraping of web pages given as input with the use of special rules passed to it through the use of a DSL.

crawler crawlers internal-dsl scala scraper scrapers web web-crawler web-crawling web-scraper web-scrapers

Last synced: 24 Oct 2025

https://github.com/wangjksjtu/Data-Mining-51Job

Data-mining on 51Job website

51job data-mining machine-learning scikit-learn seaborn web-crawling

Last synced: 09 May 2025

https://github.com/joe-stifler/crawler

Crawler is a Python package that crawls web pages and converts their content into Markdown format, making it easy to create documentation, notes, or other text-based representations. It features domain restrictions, flexible output options, and graph visualization.

context-extraction conversion file-system-crawling github-repository-crawling large-language-models llm markdown python web-crawling

Last synced: 24 Jul 2025

https://github.com/lewisakura/spiderboi

A web crawling library written in TypeScript.

spider typescript typescript3 web-crawler web-crawling web-spider webcrawler

Last synced: 12 Apr 2025

https://github.com/0memo07/web-crawler

Web Crawler with Python

beautifulsoup4 bs4 crawler crawlers crawling crawling-python web-crawler web-crawler-python web-crawling webcrawler

Last synced: 24 Apr 2025

https://github.com/thesp0nge/nightcrawler-mitm

A Python program that crawls a website and tries to stress it by polluting forms with bogus data

crawler offensive-scripts offensive-security stress-test web-crawler web-crawling

Last synced: 30 Apr 2025

https://github.com/omkarcloud/puppeteer-captcha-solving-tutorial

🚀 LEARN HOW TO SOLVE CAPTCHA IN PUPPETEER USING CAPSOLVER 🤖

2captcha anticaptcha capmonster capsolver capsolver-captcha capsolver-python captcha captcha-breaker captcha-breaking captcha-bypass captcha-generator captcha-image captcha-library captcha-solver captcha-solving crack-captcha hcaptcha web-crawling webscraping

Last synced: 07 Sep 2025

https://github.com/lekhmanrus/real-shot-pdf

RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.

ai-assistant angular browser-extension chrome-extension data-preservation gpt-integration knowledge-base knowledgebase link-parsing local-data-processing pdf pdf-downloader pdf-generation pdf-generator pdf-merger web-content-capture web-crawling web-data-extraction web-scraping webpage-to-pdf

Last synced: 14 Sep 2025

https://github.com/prakharchoudhary/fun_with_python

My adventures with python!!

automation machine-learning machine-learning-algorithms python regular-expression web-crawling web-scraping

Last synced: 11 Jul 2025

https://github.com/omkarcloud/selenium-2captcha-recaptcha-solver-demo

🚀 FINAL CODE FOR TUTORIAL ON HOW TO SOLVE CAPTCHA IN SELENIUM USING 2CAPTCHA 🤖

2captcha captcha captcha-break captcha-breaker captcha-breaking captcha-bypass captcha-generator captcha-image captcha-library captcha-solver captcha-solving crack-captcha scraping scraping-framework selenium web-crawler web-crawling web-scraper web-scraping webscraping

Last synced: 07 Sep 2025

https://github.com/ahmed-alnassif/net-spider

Net-Spider is a web scraping tool designed to retrieve the source code for a web page, including front-end elements such as JavaScript, CSS, images, and fonts. It allows you to crawl and download the source code from a target website.

beautifulsoup4 command-line-interface front-end-web-development python3 source-code-extraction web-automation web-crawling web-development-tool web-optimization web-scraping

Last synced: 26 Jun 2025

https://github.com/solrikk/datadigger

DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.

business-intelligence content-extraction data-analysis data-collection data-extraction data-mining go golang-api html-parser marketing-tools metadata-extraction research-tools seo-tools web-application web-crawling web-scraping web-tools

Last synced: 15 Apr 2025

https://github.com/crwlrsoft/laravel-crawler

Laravel adapter for the crwlr/crawler package.

crawler crawling crawling-framework hacktoberfest laravel laravel-package php scraper scraping web-crawler web-crawling web-scraping

Last synced: 28 Feb 2025

https://github.com/elektrostudios/fhm-crawler-freehardmusic.com

Crawls album download URLs from the freehardmusic.com website

albums crawl crawler crawling desktop-app desktop-application dotnet music web-crawler web-crawling web-scraper web-scraping webcrawler webcrawling webscraper webscraping windows windows-app windowsapp winforms

Last synced: 19 Jul 2025

https://github.com/amienbou121/crawl4ai-mcp-server

🕷️ Enable AI agents to scrape and crawl the web effortlessly with this lightweight Model Context Protocol server, integrating seamlessly into your workflows.

agentic-ai agentic-rag agentic-workflow ai-agents claude-code crawl4ai crawling firecrawl-alternative keyword-search mcp model-context-protocol-servers openai-agents-sdk pydantic-ai reranking semantic-search uvx web-crawling web-scraping

Last synced: 02 May 2026

https://github.com/saka7/data-science-sandbox

Data Science, Data Analysis, Web Scraping Sandbox

ai artificial-intelligence data-analysis data-science jupyter-notebook machine-learning neural-network sandbox web-crawling web-scraping

Last synced: 02 May 2026

https://github.com/afrontend/dongnelibrary

A utility that checks whether library books are available to borrow

public-library web-crawling

Last synced: 26 Jun 2025

https://github.com/kitsunesemcalda/info-elixir

A web crawler built for recursive crawling

elixir elixir-ecto privategpt web-crawler web-crawling web-crawling-and-scraping

Last synced: 03 Apr 2025

https://github.com/mirtia/inappropriate-youtube

This repository contains some of the scripts used to obtain YouTube channel features and analyze potentially disturbing channels.

data-mining web-crawling youtube

Last synced: 16 Sep 2025

https://github.com/osamikoyo/geass

A web crawler with some API functions and configuration

docker go golang searching url web-crawler web-crawling web-scraping

Last synced: 14 Jul 2025

https://github.com/frederik-uni/docker-cloudflare-bypasser

A simple API that runs in a container and returns a user agent and cookies

anti-bot anti-detect anti-detection bypass-cloudflare cloudflare-bypass cloudflare-bypasser cloudflare-scraper undetectable web-crawling

Last synced: 11 Jun 2025

https://github.com/fern-aerell/web-crawling-to-txt

A simple web crawling application that can traverse URLs, extract text content, and save the results in TXT format.

beautifulsoup4 crawling python requests scraping txt web-crawling web-scraping

Last synced: 03 May 2025

https://github.com/joeri-abbo/python-credly-scraper

This project is a set of Python scripts designed to crawl and extract data from the Credly platform, focusing on skills, organizations, and badges. The scripts allow users to perform searches using command-line arguments, predefined search terms, or skills listed in a JSON file. The collected data is then saved to JSON files for further analysis an

badges crawler credly data-extraction json organizations python python3 requests-library skills web-crawling

Last synced: 23 Sep 2025

https://github.com/kluhan/kraken

Kraken is a generic, mid-scale web crawler specifically built to crawl vertical data-sources, like Youtube or the Google Play Store.

celery crawler google-play-store python web-crawling

Last synced: 07 Sep 2025

https://github.com/omkarcloud/dentalkart-scraper

🚀 SCRAPE 1000'S OF PRODUCTS FROM DENTALKART 🤖

beautifulsoup crawler crawling crawling-framework crawling-python dentalkart dentalkart-product-scraper dentalkart-scraper dentalkart-scraping node-crawler scraper scraping scraping-framework scraping-python selenium web-crawler web-crawling web-scraper web-scraping webscraping

Last synced: 07 Sep 2025

https://github.com/lightfeed/browser-agent

Serverless AI browser agent

ai ai-agents automation aws-lambda browser browser-agent browser-automation crawling playwright scraping serverless serverless-framework web-crawling web-extraction web-scraping

Last synced: 20 Jun 2025

https://github.com/feluelle/pastebin-crawler-lib

A library for web crawling http://pastebin.com

pastebin web-crawling

Last synced: 12 Jun 2025

https://github.com/mstephen19/apify-global-store

Easily access, manage, and manipulate a global state/multiple global states for an Apify actor's run

apify apify-sdk state state-management typescript web-crawling web-scraping

Last synced: 15 Mar 2025

https://github.com/islamhafez0/web-crawler

books-toscrap flask json pagination python restful-api scrapy web-crawling web-scraping

Last synced: 08 May 2026

https://github.com/iamfarrokhnejad/murkmaw

A web crawler using Rust.

functional functional-programming rust rust-lang web-crawler web-crawling webcrawler webcrawling

Last synced: 28 Mar 2025

https://github.com/yjg30737/wiki-offline

Converts Wikipedia HTML into TXT so it can be read offline

beautifulsoup python python3 python37 python38 urllib web-crawler web-crawling wiki wikipedia

Last synced: 14 Dec 2025

https://github.com/ahnsv/go-subscription-commerce

A commerce service for subscription products, implemented in Go

golang subscription-commerce web-crawling

Last synced: 16 Apr 2026

https://github.com/marcuwynu23/bahagi

A scraping tool that extracts table data from a website and converts it into a file

bs4 console datasets requests scraping tool tools web-crawling

Last synced: 14 May 2025

https://github.com/solangeug/webscraper

A web scraping project in Python using Scrapy, an open source and collaborative framework for extracting data from websites.

data-mining python27 scrapy web-crawling web-scraping

Last synced: 17 Jun 2025

https://github.com/my8100/scrapyd-cluster-on-heroku-scrapydweb-app-git

How to set up Scrapyd cluster on Heroku

cluster heroku logparser python scrapy scrapyd scrapydweb web-crawling web-scraping

Last synced: 09 Mar 2026

https://github.com/gesistsa/python-web-data-collection-tutorial

Tutorial on web data collection with Python.

beautifulsoup data-science python web-crawling wikipedia

Last synced: 29 Apr 2026

https://github.com/ankush-chander/github-crawler

Crawl information from GitHub in a friendly manner.

human-resource-analytics web-crawling

Last synced: 30 Aug 2025

https://github.com/savinrazvan/pagerank

This project implements the PageRank algorithm to rank web pages by importance using two approaches: a sampling method with the Markov Chain random surfer model and an iterative method with a recursive mathematical expression.

alogrithm convergence data-science graph-theory iterative-methods markov-chain mathematical-modelling pagerank pagerank-algorithm python random-surfer-model recursive-algorithm sampling-methods search-engine simulation web-crawling

Last synced: 06 May 2026

https://github.com/oxylabs/pricing-data-collection-from-ecommerce-stores

Apache Airflow DAGs for e-commerce pricing collection.

appache e-commerce-scraper ebay-search ebay-searches ecommerce-scraper ecommerce-website pricing-data scraping web-crawling web-scraping

Last synced: 11 Mar 2025

https://github.com/18520339/web-scraping-with-scrapy

Python web scraping with Scrapy

scrapy web-crawling web-scraping

Last synced: 30 Mar 2025

https://github.com/justserpapi/web-html

JustSerpAPI Crawl Webpage HTML API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.

crawler google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api html-api justserpapi python serp-api web-crawling web-html-api web-scraping

Last synced: 08 Jun 2026

https://github.com/justserpapi/web-rendered-html

JustSerpAPI Crawl Webpage Rendered HTML API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.

google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api javascript-rendering justserpapi python rendered-html-api serp-api web-crawling web-rendering web-scraping

Last synced: 08 Jun 2026

https://github.com/kunalpisolkar24/ir_lab

Collection of practical code for Savitribai Phule Pune University's Information Retrieval Lab (410247).

Last synced: 09 Jun 2026

https://github.com/blocklet/snap-kit

Snap Kit is a powerful, Puppeteer-based service designed for seamless web automation. It enables you to effortlessly capture high-fidelity web page screenshots and efficiently scrape web content for precise data extraction.

browser-automation puppeteer self-hosting web-crawling web-screenshot

Last synced: 10 Aug 2025

https://github.com/ffiruzi/website-rag-qa-assistant

Transform any website into an intelligent AI assistant with ONE Docker command. Complete RAG system with web crawling, vector embeddings, and beautiful chat interface. Built with FastAPI, React, and LangChain.

artificial-intelligence chatbot docker fastapi full-stack langchain openai python question-answering rag rag-chatbot react typescript vector-database web-crawling