Crawler | Ecosyste.ms: Awesome

https://github.com/rajatomar788/pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.

archive-tool crawler html html-parser mirror python web webpage

Last synced: 04 Aug 2024

https://github.com/crawljax/crawljax

Crawljax

crawler crawling dom dynamic event-driven-crawling javascript test-generation web-analysis web-testing

Last synced: 29 Oct 2024

https://github.com/nanshihui/scan-t

A new Python-based crawler with additional features, including network fingerprint search

crawler netfingerprint python sybersecurity

Last synced: 03 Nov 2024

https://github.com/nanshihui/Scan-T

A new Python-based crawler with additional features, including network fingerprint search

crawler netfingerprint python sybersecurity

Last synced: 13 Nov 2024

https://github.com/zhuyingda/webster

a reliable high-level web crawling & scraping framework for Node.js.

automation-test automation-ui chromium crawler crawling headless-chrome javascript javascript-framework nodejs nodejs-framework puppeteer scraping-framework spider

Last synced: 10 Oct 2024

https://github.com/abhisharma404/vault

swiss army knife for hackers

crawler fuzzing hacking hacking-tool information-gathering lfi networking offensive-security osint pentesting port-scanner python rfi scanner scrapy security sqlite ssl-inspection vault xss-vulnerability

Last synced: 03 Nov 2024

https://github.com/jaeksoft/opensearchserver

Open-source Enterprise Grade Search Engine Software

crawler custom-search enterprise indexing java lucene ocr opensearchserver search search-engine synonyms webcrawler webcrawling

Last synced: 29 Oct 2024

https://github.com/dirtyfilthy/freshonions-torscraper

Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion

crawler darknet hidden-services onion scraper spider tor

Last synced: 06 Nov 2024

https://github.com/AlexMathew/scrapple

A framework for creating semi-automatic web content extractors

beautifulsoup crawler css-selector extractor lxml python scrapers scraping scrapy selector selector-expression tutorial web-scraper web-scraping xpath-expression

Last synced: 31 Oct 2024

https://github.com/chushuai/wscan

Wscan is a web security scanner that focuses on web security, dedicated to making web security accessible to everyone.

cel-go chromedp crawler headless martian passive-vulnerability-scanner poc sql-injection subdomains testwaf vulnerability-scanner waf webscan wscan xss

Last synced: 04 Aug 2024

https://github.com/stanzhai/html2article

Main-content (article body) extraction from HTML web pages

article content crawler html spider topic

Last synced: 14 Nov 2024

https://github.com/stanzhai/Html2Article

Main-content (article body) extraction from HTML web pages

article content crawler html spider topic

Last synced: 04 Aug 2024

https://github.com/ChenZixinn/spider_reverse

crawler python requests spider

Last synced: 31 Oct 2024

https://github.com/chenjiandongx/mmjpg

👩 Crawler for photo sets of female models (part 1)

crawler meinv

Last synced: 09 Nov 2024

https://github.com/yhy0/Jie

Jie stands out as a comprehensive security assessment and exploitation tool meticulously crafted for web applications. Its robust suite of features encompasses vulnerability scanning, information gathering, and exploitation, elevating it to an indispensable toolkit for both security professionals and penetration testers.(expectations)

apollo-exp crawler jie scan scanner security-copilot shiro-exp vul vulnerability vulnerability-detection vulnerability-exploitation vulnerability-scanners

Last synced: 10 Sep 2024

https://github.com/yhy0/jie

Jie stands out as a comprehensive security assessment and exploitation tool meticulously crafted for web applications. Its robust suite of features encompasses vulnerability scanning, information gathering, and exploitation, elevating it to an indispensable toolkit for both security professionals and penetration testers.(expectations)

apollo-exp crawler jie scan scanner security-copilot shiro-exp vul vulnerability vulnerability-detection vulnerability-exploitation vulnerability-scanners

Last synced: 15 Nov 2024

https://github.com/shaohua0116/ICLR2020-OpenReviewData

Script that crawls meta data from ICLR OpenReview webpage. Tutorials on installing and using Selenium and ChromeDriver on Ubuntu.

conference crawler data-analysis iclr iclr2020 machine-learning visualization

Last synced: 07 Aug 2024

https://github.com/hect0x7/jmcomic-crawler-python

Python API for JMComic | Provides a Python API for accessing JMComic (禁漫天堂), supporting both the web and mobile clients | GitHub Actions downloader for JMComic 🚀

18comic crawler downloader github-actions jmcomic pypi python readthedocs

Last synced: 15 Nov 2024

https://github.com/AndyTheFactory/newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.

articles articles-data crawler datasets-preparation news newspaper3k python requests scraper scraping

Last synced: 26 Oct 2024

https://github.com/andythefactory/newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.

articles articles-data crawler datasets-preparation news newspaper3k python requests scraper scraping

Last synced: 14 Nov 2024

https://github.com/tasos-py/Search-Engines-Scraper

Search Google, Bing, Yahoo, and other search engines with Python

bing crawler google python scraper search-engine yahoo

Last synced: 04 Aug 2024

https://github.com/lixi5338619/lxbook

Code repository for the book 《爬虫逆向进阶实战》 (Advanced Crawler Reverse Engineering in Practice)

android-resever crawler frida java javascript python smali spiders unidbg xposed

Last synced: 05 Nov 2024

https://github.com/gadfly0x/signature_algorithm

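Request-signing and encryption algorithms for various apps, mini programs, and websites. Currently covers: Ziroom (自如), Xiaohongshu (小红书), Danke Apartment (蛋壳公寓), Luckin Coffee (瑞幸咖啡), and Bangkok Airways (曼谷航空)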

crawler reverse-engineering spider

Last synced: 11 Nov 2024

https://github.com/roniemartinez/dude

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

async beautifulsoup4 crawler css framework lxml parsel playwright python scraper scraping selenium sync web-scraping webscraping xpath

Last synced: 11 Oct 2024

https://github.com/heqin-zhu/music-recover

🎵 Convert cached files to MP3 files

crawler mp3 python regex

Last synced: 06 Aug 2024

https://github.com/platonai/PulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.

crawler data-mining data-science rpa scraper scraping web-automation web-crawler web-mining web-scraping web-sql

Last synced: 05 Nov 2024

https://github.com/lgraubner/sitemap-generator

Easily create XML sitemaps for your website.

crawler google seo sitemap sitemap-generator xml-sitemap

Last synced: 08 Aug 2024

https://github.com/cyubuchen/free_proxy_website

A collection of websites for obtaining free SOCKS/HTTPS/HTTP proxies

crawler free-proxy-list ip proxy proxy-checker spider

Last synced: 03 Aug 2024

https://github.com/shaohua0116/ICLR2019-OpenReviewData

Script that crawls meta data from ICLR OpenReview webpage. Tutorials on installing and using Selenium and ChromeDriver on Ubuntu.

crawler crawling-python openreview tutorial

Last synced: 07 Aug 2024

https://github.com/smuyyh/crawlerforreader

A local web-novel crawler for Android, based on jsoup and XPath

android bookreader crawler jsoup xpath

Last synced: 10 Nov 2024

https://github.com/brendonboshell/supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

crawler distributed-crawler robot sitemap web-crawler

Last synced: 25 Oct 2024

https://github.com/microsoft/ghcrawler

Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

crawler data github github-api github-webhooks ospo

Last synced: 25 Sep 2024

https://github.com/mhmdiaa/second-order

Second-order subdomain takeover scanner

crawler crawling infosec mapping penetration-testing penetration-testing-tools pentesting recon reconnaissance security security-tools web-application-security wordlist wordlist-generator

Last synced: 03 Nov 2024

https://github.com/elvisyjlin/media-scraper

Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok

crawler instagram pixiv reddit scraper tiktok tumblr twitter

Last synced: 04 Nov 2024

https://github.com/Josue87/EmailFinder

Search emails from a domain through search engines

crawler osint

Last synced: 13 Nov 2024

https://github.com/scrapy-plugins/scrapy-crawlera

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy

crawler crawler-detection plugin proxy scraping scrapy

Last synced: 05 Sep 2024

https://github.com/scrapy-plugins/scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy

crawler crawler-detection plugin proxy scraping scrapy

Last synced: 12 Nov 2024

https://github.com/salimk/Rcrawler

An R web crawler and scraper

crawler crawlers r rpackage scraper webcrawler webscraper webscraping webscrapping

Last synced: 25 Oct 2024

https://github.com/Malwarize/webpalm

🕸️ Crawl in the web network

crawler crawling data data-science datamining go golang hack mining osint redteam spider tool

Last synced: 08 Nov 2024

https://github.com/crwlrsoft/crawler

Library for Rapid (Web) Crawler and Scraper Development

crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping

Last synced: 25 Oct 2024

https://github.com/xiyuan-fengyu/ppspider

A web spider framework built on Puppeteer. Provides flexible task-queue management and task scheduling via decorators, convenient data storage (nedb / mongodb), and data visualization with user interaction.

angular cheerio crawler headless mongodb nedb node node-spider nodejs nodejs-spider proxy puppeteer spider task-queue task-scheduling typescript

Last synced: 10 Oct 2024

https://github.com/rivermont/spidy

The simple, easy to use command line web crawler.

crawler crawling python python3 web-crawler web-spider

Last synced: 29 Oct 2024

https://github.com/dmi3kno/polite

Be nice on the web

crawler memoise r r-package rate-limiter robotstxt rstats rvest scraper webscraping

Last synced: 25 Oct 2024

https://github.com/jackluson/chinese-fund-crawler

Crawling and aggregate analysis of data on Chinese off-exchange (OTC) funds

crawler fund morningstar

Last synced: 11 Nov 2024

https://github.com/yangjianxin1/qqmusicspider

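A Scrapy-based QQ Music spider that crawls song information, lyrics, featured comments, and more; also shares a music corpus of 490,000+ items from the top 6,400 mainland Chinese, Hong Kong, and Taiwanese artists on QQ Music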

crawler music musicspider qqmusic scrapy

Last synced: 14 Nov 2024

https://github.com/snakem982/Pandora-Box

A Simple Mihomo GUI.

crawler gui linux mac mihomo windows

Last synced: 28 Oct 2024

https://github.com/MikeMeliz/TorCrawl.py

Crawl and extract (regular or onion) webpages through TOR network

crawler extractor onion osint python tor

Last synced: 06 Nov 2024

https://github.com/infinilabs/crawler

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider

Last synced: 09 Nov 2024

https://github.com/krypton-byte/tiktok-downloader

Tiktok Downloader/Scraper using requests & bs4

asynchronous asyncio beautifulsoup bs4 crawler downloader flask krypton-byte lightweight nowm python python3 requests tiktok watermark web without

Last synced: 11 Nov 2024

https://github.com/dennis-tra/nebula

🌌 A network agnostic DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.

cid crawler filecoin golang hacktoberfest ipfs libp2p

Last synced: 06 Nov 2024

https://github.com/TikHubIO/TikHub-API-Python-SDK

High-performance asynchronous API for Douyin (抖音), TikTok, Xiaohongshu (小红书), Kuaishou (快手), Weibo (微博), Instagram, YouTube, and Twitter (X), plus a captcha solver and temp mail.

api captcha-solver crawler data-api douyin douyin-tiktok-api instagram kuaishou netease-cloud-music private-api scrapy tiktok twitter weibo xiaohongshu xiguashipin

Last synced: 29 Oct 2024

https://github.com/jaybizzle/laravel-crawler-detect

A Laravel wrapper for CrawlerDetect - the web crawler detection library

bot crawler detect laravel php spider

Last synced: 29 Oct 2024

https://github.com/lgraubner/sitemap-generator-cli

Creates an XML-Sitemap by crawling a given site.

cli crawler google seo sitemap xml-sitemap

Last synced: 11 Nov 2024

https://github.com/twtrubiks/line-bot-tutorial

LINE bot tutorial using Python and Flask

bot crawler heroku line ptt python-flask tutorial

Last synced: 16 Nov 2024

https://github.com/snakem982/pandora-box

A Simple Mihomo GUI.

crawler gui linux mac mihomo windows

Last synced: 09 Oct 2024

https://github.com/yaroslaff/nudecrawler

Crawl telegra.ph searching for nudes!

crawl crawler find nsfw nsfw-recognition nude nudes nudity-detection onlyfans python python3 scrape scraper scraping search spider telegra-ph tits web-scraping webscraping

Last synced: 14 Nov 2024

https://github.com/mustafadalga/instagram-bot

An Instagram bot developed using the Selenium Framework

automation automation-selenium bot bulk-comments bulk-unfollow crawler crawling download-stories instagram instagram-api instagram-bot instagram-downloader instagram-without-api mass-liking python python3 selenium selenium-framework selenium-python selenium-webdriver

Last synced: 28 Sep 2024

https://github.com/oppsec/pinkerton

🕵️ Pinkerton is a JavaScript file crawler and secret finder tool developed in Python

crawl crawler hacktoberfest javascript pentest python python3 redteam secrets

Last synced: 16 Nov 2024

https://github.com/GraySilver/wencai

This is a wencai crawler. (A Pythonic toolkit for the strategy backtesting API of iWencai, i问财.)

crawler finance pandas quant quantitative-finance tushare wencai

Last synced: 30 Oct 2024

https://github.com/mikemeliz/torcrawl.py

Crawl and extract (regular or onion) webpages through TOR network

crawler extractor onion osint python tor

Last synced: 03 Aug 2024

https://github.com/devanshbatham/Gorecon

Gorecon is an all-in-one reconnaissance tool, a.k.a. a Swiss Army knife for reconnaissance: a tool that every pentester or bug hunter might want to consider adding to their arsenal

admin-panel-finder backups-finder cmsdetecter configurationfiles crawler directory-bruteforce dns dnsrecon email-hunter geo-ip nameserver recon reconaissance reverse-dns scanner subdomain-enumeration subdomain-scanner subnet-lookup whois-lookup wordpress-scanner

Last synced: 04 Nov 2024

https://github.com/zhupingqi/RuiJi.Net

crawler framework, distributed crawler extractor

crawler extractor headless-chrome netcore owin scraper scrapy

Last synced: 13 Nov 2024

https://github.com/eight04/comiccrawler

An image crawler written in Python.

cli crawler gui image-crawler python tkinter

Last synced: 13 Nov 2024

https://github.com/Jasonnor/th-music-video-generator

Touhou Project random music video generator/player, crawling image and video from websites to generate MV.

crawler javascript music-video touhou web

Last synced: 11 Nov 2024

https://github.com/marshalx/telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Last synced: 12 Nov 2024

https://github.com/eight04/ComicCrawler

An image crawler written in Python.

cli crawler gui image-crawler python tkinter

Last synced: 15 Aug 2024

https://github.com/chenjiandongx/Github-spider

A crawler for analyzing GitHub repositories and users

crawler github scrapy

Last synced: 12 Nov 2024

https://github.com/chenjiandongx/github-spider

A crawler for analyzing GitHub repositories and users

crawler github scrapy

Last synced: 09 Nov 2024

https://github.com/algolia/algoliasearch-netlify

Official Algolia Plugin for Netlify. Index your website to Algolia when deploying your project to Netlify with the Algolia Crawler

algolia algolia-crawler algoliasearch crawler jamstack netlify netlify-plugin search

Last synced: 12 Oct 2024

https://github.com/glaucocustodio/tanakai

Tanakai is a modern web scraping framework written in Ruby. A fork of Kimurai.

chrome-headless crawler kimurai scraper scrapy webscraping

Last synced: 31 Oct 2024

https://github.com/lucasjinreal/weibo_terminator_workflow

Updated version of weibo_terminator. This is the workflow version, aimed at getting the job done!

crawler nlp scraper sentiment-analysis weibo-terminator

Last synced: 06 Nov 2024

https://github.com/hezhizheng/go-movies

Golang spider / crawler for movies

colly crawler docker fasthttp go gocolly golang movies redis spider

Last synced: 12 Nov 2024

https://github.com/antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

crawler crawling framework golang scraping web-crawler web-spider

Last synced: 26 Oct 2024

https://github.com/zrashwani/arachnid

Crawl all unique internal links found on a given website and extract SEO-related information. Supports JavaScript-based sites.

crawler php scraping seo

Last synced: 29 Oct 2024

https://github.com/xyntax/filesensor

Crawler-based dynamic detection tool for sensitive files

crawler fuzzing pentesting scrapy

Last synced: 14 Nov 2024

https://github.com/zntfdr/selenops

A Swift Web Crawler 🕷

command-line-tool crawler scripting swift web

Last synced: 15 Nov 2024

https://github.com/zntfdr/Selenops

A Swift Web Crawler 🕷

command-line-tool crawler scripting swift web

Last synced: 06 Aug 2024

https://github.com/dwisiswant0/galer

A fast tool to fetch URLs from HTML attributes by crawl-in.

crawler devtool extractor galer go golang spider url-extractor url-parser waybackurls

Last synced: 28 Oct 2024

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 03 Aug 2024

https://github.com/myvyang/chromium_for_spider

dynamic crawler for web vulnerability scanner

chromium crawler puppeteer security spider

Last synced: 04 Aug 2024

https://github.com/s0rg/crawley

The unix-way web crawler

cli crawler go golang golang-application pentest pentest-tool pentesting unix-way web-crawler web-scraping web-spider

Last synced: 02 Nov 2024

https://github.com/cwjokaka/ok_ip_proxy_pool

🍿 Proxy IP pool for crawlers, in Python 🍟 A pretty OK IP proxy pool

aiohttp async beautifulsoup4 crawler flask http ip pool proxy proxypool py python python3 spider sqlite

Last synced: 12 Oct 2024

https://github.com/vitorfs/woid

Simple news aggregator displaying top stories in real time

crawler django news

Last synced: 15 Nov 2024

https://github.com/kong36088/ZhihuSpider

Multi-threaded Zhihu user crawler, based on Python 3

crawler multi-threading python python3 spider zhihu

Last synced: 07 Aug 2024

https://github.com/MarshalX/telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Last synced: 04 Aug 2024

https://github.com/ScottSloan/Bili23-Downloader

Download Bilibili videos, anime series, movies, documentaries, and other resources

bilibili crawler linux macos python videodownloader windows wxpython

Last synced: 27 Oct 2024

https://github.com/lgh06/web-page-monitor

Website page change monitor, with alerts when pages are updated or changed.

change-alert change-detection change-monitor crawler monitor website-change-monitor website-monitoring

Last synced: 04 Aug 2024

https://github.com/dwisiswant0/gf-secrets

Secret and/or credential patterns used for gf.

alienvault-otx bugbounty crawler gau gf gitleaks infosec open-threat-exchange secrets-detection trufflehog trufflehog3 wayback wayback-machine waybackurl

Last synced: 28 Oct 2024

https://github.com/python3spiders/allnewsspider

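The Paper (澎湃新闻), Sina News, Tencent News, Sohu News, Xinwen Lianbo, The Times, The New York Times, BBC News: aims to crawl news from all news portal sites. Commercial use of the collected data is prohibited!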

bbc-news crawler newsapi nytimes sina sohu spider tencent thetimes xwlb

Last synced: 10 Nov 2024

https://github.com/R4yGM/dorkscout

DorkScout - Golang tool to automate Google dork scans against the entire internet or specific targets

bug-bounty crawler ghdb golang google-dorks osint scraper security

Last synced: 04 Aug 2024

https://github.com/redco/goose-parser

Universal scraping tool, which allows you to extract data using multiple environments

browser crawler docker goose jsdom nodejs parser parsing phantomjs scraper scraping

Last synced: 05 Nov 2024

https://github.com/kirralabs/indonesian-NLP-resources

Data resources for Indonesian-language NLP

corpus corpus-linguistics crawler dataset dependency-parser indonesian indonesian-language named-entity-recognition nlp parallel-corpus pos-tagging sentiment-analysis

Last synced: 08 Nov 2024

https://github.com/spatie/robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers

crawler php robots-txt

Last synced: 10 Nov 2024

https://github.com/icy/google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

bash cookie crawler curl google ownership wget

Last synced: 14 Nov 2024

https://github.com/linkedtales/scrapedin-linkedin-crawler

Crawler for LinkedIn full profiles 2019

crawler linkedin linkedin-crawler

Last synced: 06 Nov 2024

https://github.com/crypto-crawler/crypto-crawler-rs

A rock-solid cryptocurrency crawler library.

crawler cryptocurrency websocket

Last synced: 28 Oct 2024

https://github.com/gaussic/weibo_wordcloud

Fetch Weibo data by keyword, then generate a word cloud

crawler keyword search weibo wordcloud

Last synced: 13 Nov 2024

https://github.com/macacajs/NoSmoke

A cross-platform UI crawler that scans view trees, then generates and executes UI test cases.

android crawler ios macaca smoke-tests test-automation webdriver

Last synced: 08 Nov 2024

https://github.com/mgleon08/instagram-crawler

Crawl Instagram photos, posts, and videos for download.

crawler gem instagram instagram-crawler instagram-scraper ruby rubygems scraper

Last synced: 14 Aug 2024

https://github.com/zhaotianff/csharpcrawler

Sample C# crawler programs for anyone who wants to learn the basics of crawling. More crawler-related material will be added gradually.

crawler csharp wpf

Last synced: 15 Nov 2024