Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/hrbrmstr/spiderbar

Lightweight R wrapper around rep-cpp for robots.txt (Robots Exclusion Protocol) parsing and path testing

r r-cyber robots-exclusion-protocol robots-txt rstats

Last synced: 21 Jun 2024

https://github.com/PhrozenByte/pico-robots

This is Pico's official robots plugin to add a robots.txt and sitemap.xml to your website. Pico is a stupidly simple, blazing fast, flat file CMS.

pico pico-robots picocms picocms-plugin robots robots-txt sitemap sitemap-xml

Last synced: 07 Jun 2024

https://github.com/kyr0/astro-launchpad

An Astro project template for decent projects: auth, i18next, Bootstrap, sitemap, webworker, robots.txt, preact, react, endpoints, endpoint clients, OAuth, various Astro features and data loading preconfigured

astro authentication bootstrap i18next microservices preact robots-txt scaffold sitemap-xml template

Last synced: 07 Jun 2024

https://github.com/ameygawade/streamlit-robots_txt_generator

This Streamlit app lets users generate and customize a robots.txt file by selecting user-agents, specifying disallowed paths, enabling a crawl delay, and providing a sitemap URL.

config data-science front generative generator google robots-txt search-algorithm search-engine seo seo-optimization stream streamlit txt-files web webapp webapplication

Last synced: 02 Jun 2024

https://github.com/LuXDAmore/nuxt-humans-txt

🧑🏻👩🏻 "We are people, not machines" - An initiative for getting to know the creators of a website. A Nuxt module that statically generates and integrates a humans.txt author file with information about the people who built the site. Based on the HumansTxt project.

author humans humans-txt modules nuxt nuxt-module nuxtjs robots robots-txt static vuejs

Last synced: 01 Jun 2024

https://github.com/mdreizin/gatsby-plugin-robots-txt

Gatsby plugin that automatically creates robots.txt for your site

gatsby gatsby-plugin robots-txt

Last synced: 11 May 2024

https://github.com/beb7/gflare-tk

Open-source, Python-based SEO web crawler

crawler python robots-txt scraper seo seo-crawler tkinter

Last synced: 10 May 2024

https://github.com/adileo/MicroFrontier

A lightweight crawler frontier implementation in TypeScript using Redis.

crawler frontier microservice redis robots-txt spider

Last synced: 07 May 2024

https://github.com/TurnerSoftware/InfinityCrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 05 May 2024

https://github.com/emacs-php/robots-txt-mode

Emacs major mode for editing robots.txt

emacs major-mode melpa robots-txt

Last synced: 13 Apr 2024

https://github.com/LexiestLeszek/scrapeGPT

ScrapeGPT is a Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot uses Retrieval Augmented Generation and web scraping to return natural language answers to the user's queries.

crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper

Last synced: 11 Apr 2024

https://github.com/stovv/next-strapi-sitemap

Generate a sitemap and robots.txt for Next.js using a webhook from Strapi

nextjs robots-txt sitemap strapi

Last synced: 09 Apr 2024

https://github.com/PuerkitoBio/fetchbot

A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

crawler robots-txt

Last synced: 27 Mar 2024

https://github.com/PuerkitoBio/gocrawl

Polite, slim and concurrent web crawler.

crawler robots-txt

Last synced: 27 Mar 2024

https://github.com/php-middleware/block-robots

Middleware to avoid search engine indexing with PSR-7 using robots.txt and X-Robots-Tag

google middleware psr-15 psr-7 robots-txt seo

Last synced: 25 Mar 2024

https://github.com/cyb3r3x3r/chanakya

Scan websites for multiple things such as honeypots, whois records, open ports, etc.

honeypot nmap portscan robots-txt scan-tool webscanner website whois whois-lookup

Last synced: 23 Mar 2024

https://github.com/spatie/robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers

crawler php robots-txt

Last synced: 16 Mar 2024
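
Several of the projects listed above parse robots.txt to decide whether a path may be crawled and how long to wait between requests. As a rough illustration of that idea only, here is a minimal sketch using Python's standard-library urllib.robotparser. It does not use any of the listed projects; the robots.txt content, the "ExampleBot" user-agent, and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Path testing: may the (hypothetical) ExampleBot fetch these URLs?
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))
print(parser.can_fetch("ExampleBot", "https://example.com/private/data"))

# Crawl delay in seconds, and any sitemap URLs declared in the file
# (site_maps() requires Python 3.8 or newer).
print(parser.crawl_delay("ExampleBot"))
print(parser.site_maps())
```

A polite crawler would typically run a check like can_fetch before each request and sleep for the reported crawl delay between requests to the same host.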