{"id":29113953,"url":"https://github.com/ashjha75/crawlforge","last_synced_at":"2026-04-04T22:32:50.233Z","repository":{"id":298165243,"uuid":"999049474","full_name":"Ashjha75/CrawlForge","owner":"Ashjha75","description":"⚙️ CrawlForge — A high-performance, multithreaded web crawler and analytics engine built with Java Servlets, JSP,, and MySQL.","archived":false,"fork":false,"pushed_at":"2025-06-26T13:49:29.000Z","size":50809,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"dev_v1","last_synced_at":"2025-06-26T14:44:27.364Z","etag":null,"topics":["aws","collections-framework","design-patterns","docker","hibernate","java","maven","multithreading","mysql","oops","real-time","render","servlet-jsp"],"latest_commit_sha":null,"homepage":"https://crawlforge.onrender.com/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ashjha75.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-09T16:56:24.000Z","updated_at":"2025-06-26T13:49:34.000Z","dependencies_parsed_at":"2025-06-26T14:54:35.264Z","dependency_job_id":null,"html_url":"https://github.com/Ashjha75/CrawlForge","commit_stats":null,"previous_names":["ashjha75/webcrawler_analytics","ashjha75/crawlforge"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Ashjha75/CrawlForge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashjha75%2FCrawlForge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashjha75%2FCrawlForge/tags","releases_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashjha75%2FCrawlForge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashjha75%2FCrawlForge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ashjha75","download_url":"https://codeload.github.com/Ashjha75/CrawlForge/tar.gz/refs/heads/dev_v1","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ashjha75%2FCrawlForge/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262581390,"owners_count":23331914,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","collections-framework","design-patterns","docker","hibernate","java","maven","multithreading","mysql","oops","real-time","render","servlet-jsp"],"created_at":"2025-06-29T11:05:57.953Z","updated_at":"2025-12-30T22:21:21.993Z","avatar_url":"https://github.com/Ashjha75.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌐 CrawlForge - Enterprise Web Crawler Analytics Platform\n\n![System Architecture](./src/main/webapp/img/Flowchart.jpg)\n\u003e CrawlForge is a robust Java Servlet + JSP enterprise application that performs intelligent web crawling, real-time analytics, and comprehensive content indexing at scale. 
Built with modern MVC architecture and enterprise-grade dashboard analytics.\n\n## 🏗️ Enterprise Architecture Overview\n\n### 🔄 **Core System Components**\n\n#### **🟦 URL Management Layer**\n- **URL Seeder** → Multi-threaded URL input processing with validation\n- **URL Populator** → Advanced deduplication using bloom filters and hash-based storage\n- **Seen URL Storage** → High-performance Redis/MySQL hybrid storage for processed URLs\n- **URL Queue Management** → Priority-based queue with depth and domain-aware scheduling\n- **URL Supplier Service** → Thread-safe URL distribution with load balancing\n\n#### **🟨 Intelligent Fetch \u0026 Render Engine**\n- **Multi-threaded HTML Fetcher** → Configurable thread pools with connection pooling\n- **Content Renderer** → JavaScript-aware rendering with headless browser support\n- **Robots.txt Compliance** → Automatic robots.txt parsing and politeness policies\n- **Rate Limiting** → Domain-specific delays and request throttling\n\n#### **🧩 Advanced URL Orchestration**\n- **Smart URL Extractor** → Machine learning-enhanced link discovery\n- **Duplicate Detection** → SHA-256 hash-based deduplication with bloom filters\n- **Intelligent URL Filtering** → Pattern matching, domain whitelisting, and content-type filtering\n- **Depth Management** → Configurable crawl depth with breadth-first traversal\n\n#### **🟩 Enterprise Storage Solutions**\n- **Analytics Database** → Real-time metrics aggregation with time-series data\n- **Content Storage** → MySQL with full-text indexing and compression\n- **Redis Cache Layer** → LRU eviction with TTL-based cache management\n- **File Storage** → Distributed file system for large content assets\n\n## 📊 **Real-Time Dashboard Analytics**\n\n### **Dashboard Features**\n- **📈 Live Crawl Monitoring** → Real-time session tracking with WebSocket updates\n- **🎯 Performance Metrics** → Success rates, load times, and error analytics\n- **📋 Content Analysis** → Keyword extraction, SEO metrics, and link 
analysis\n- **👥 User Management** → Role-based access control with JWT authentication\n- **📤 Data Export** → CSV/JSON export with scheduled reporting\n- **🔍 Advanced Search** → Full-text search across crawled content with relevance scoring\n\n### **Chart \u0026 Visualization Types**\n- **Domain Distribution** → Interactive pie charts showing crawl coverage\n- **Keyword Frequency** → Word clouds and bar charts for content analysis\n- **Status Code Analytics** → HTTP response code distribution and error tracking\n- **Performance Trends** → Time-series charts for load times and success rates\n- **Link Analysis** → Network graphs showing internal/external link relationships\n\n## 🧱 **Enterprise Project Structure**\n\n```\nCrawlForge/\n│\n├── src/main/java/\n│   ├── com/controller/          # Servlet Controllers\n│   │   ├── DashboardServlet.java\n│   │   ├── CrawlController.java\n│   │   └── AuthController.java\n│   │\n│   ├── com/service/             # Business Logic Layer\n│   │   ├── DashboardService.java\n│   │   ├── AnalyticsService.java\n│   │   ├── CrawlerService.java\n│   │   └── StatisticsCalculator.java\n│   │\n│   ├── com/dao/                 # Data Access Layer\n│   │   ├── DashboardDataDAO.java\n│   │   ├── CrawlStatisticsDAO.java\n│   │   ├── UserActivityDAO.java\n│   │   └── ChartDataDAO.java\n│   │\n│   ├── com/entity/              # JPA Entities\n│   │   ├── User.java\n│   │   ├── CrawlSession.java\n│   │   ├── Page.java\n│   │   ├── DashboardData.java\n│   │   ├── CrawlStatistics.java\n│   │   ├── ChartData.java\n│   │   └── UserActivity.java\n│   │\n│   ├── com/utils/               # Utility Classes\n│   │   ├── HibernateUtil.java\n│   │   ├── JsonUtil.java\n│   │   ├── CacheManager.java\n│   │   └── ExportUtil.java\n│   │\n│   └── com/middleware/          # Security \u0026 Filters\n│       ├── DashboardAuthFilter.java\n│       └── SecurityFilter.java\n│\n├── src/main/webapp/\n│   ├── jsp/                     # JSP Views\n│   │   ├── dashboard/\n│   
│   ├── crawler/\n│   │   └── auth/\n│   │\n│   ├── css/                     # Stylesheets\n│   │   ├── dashboard.css\n│   │   ├── charts.css\n│   │   └── style.css\n│   │\n│   ├── js/                      # JavaScript\n│   │   ├── dashboard.js\n│   │   ├── charts.js\n│   │   └── crawler.js\n│   │\n│   └── index.jsp               # Main Layout\n│\n├── src/main/resources/\n│   ├── hibernate.cfg.xml\n│   ├── log4j2.xml\n│   └── application.properties\n│\n├── pom.xml                     # Maven Dependencies\n└── README.md\n```\n\n## 🎨 **Modern UI Design System**\n\n### **Color Palette (Scrapy-Inspired)**\n| Element | Color Code | Usage | Accessibility |\n|---------|------------|-------|---------------|\n| Primary Accent | `#10a37f` | Buttons, CTAs, Success states | WCAG AA Compliant |\n| Background Dark | `#1f1f2b` | Main background, Hero sections | High contrast |\n| Section Background | `#2b2c39` | Card containers, Panels | Optimal readability |\n| Card Background | `#40414f` | Dashboard cards, Forms | Glass morphism ready |\n| Text Primary | `#ffffff` | Headings, Primary content | Maximum contrast |\n| Text Muted | `#c5c5d2` | Subtext, Form labels | Subtle emphasis |\n| Border Soft | `#5c5f6e` | Input borders, Dividers | Clean separation |\n\n### **Advanced CSS Features**\n- **🌟 Glass Morphism Effects** → Backdrop blur with transparency layers\n- **🎭 Gradient Animations** → Smooth color transitions and hover states\n- **📱 Responsive Grid System** → CSS Grid with mobile-first approach\n- **🎯 Interactive Components** → Hover effects with cubic-bezier easing\n- **🔄 Loading Animations** → Skeleton screens and progress indicators\n\n## ⚙️ **Enterprise Technology Stack**\n\n| Layer | Technology | Version | Purpose |\n|-------|------------|---------|---------|\n| **Frontend** | JSP + Bootstrap 5 | 5.3.0 | Responsive UI framework |\n| **Backend** | Java Servlet | Java 21 | Enterprise web layer |\n| **Crawler Engine** | JSoup + Multithreading | 1.17.2 | HTML parsing \u0026 
extraction |\n| **Database** | MySQL + Hibernate | 8.0.33 | Data persistence \u0026 ORM |\n| **Caching** | Redis + ConcurrentHashMap | 7.0 | High-performance caching |\n| **Security** | JWT + BCrypt | Latest | Authentication \u0026 authorization |\n| **Build Tool** | Maven | 3.9.0 | Dependency management |\n| **Testing** | JUnit 5 + Testcontainers | 5.10.0 | Unit \u0026 integration testing |\n| **Logging** | SLF4J + Logback | 2.0.9 | Structured logging |\n| **JSON Processing** | Jackson + Gson | 2.17.2 | Data serialization |\n\n## 🚀 **Advanced Features \u0026 Capabilities**\n\n### **🔧 Core Crawler Features**\n- ✅ **Multi-threaded Architecture** → Configurable thread pools with work-stealing\n- ✅ **Intelligent URL Discovery** → Machine learning-based link prioritization\n- ✅ **Content Analysis** → NLP-powered keyword extraction and sentiment analysis\n- ✅ **SEO Metrics** → Meta tag analysis, heading structure, and content quality scoring\n- ✅ **Performance Monitoring** → Real-time load time tracking and bottleneck detection\n- ✅ **Error Recovery** → Automatic retry logic with exponential backoff\n- ✅ **Compliance Engine** → Robots.txt parsing and rate limiting per domain\n\n### **📊 Dashboard Analytics**\n- ✅ **Real-time Metrics** → Live session monitoring with WebSocket updates\n- ✅ **Interactive Charts** → Chart.js integration with drill-down capabilities\n- ✅ **User Activity Tracking** → Comprehensive user behavior analytics\n- ✅ **Performance Insights** → Success rate trends and optimization recommendations\n- ✅ **Export Capabilities** → Scheduled reports in CSV/JSON/PDF formats\n- ✅ **Search \u0026 Filtering** → Advanced query builder with full-text search\n\n### **🔐 Security \u0026 Compliance**\n- ✅ **JWT Authentication** → Stateless token-based security\n- ✅ **Role-based Access Control** → Granular permission management\n- ✅ **Input Sanitization** → XSS and SQL injection prevention\n- ✅ **HTTPS Enforcement** → SSL/TLS encryption for all communications\n- ✅ 
**GDPR Compliance** → Data anonymization and user consent management\n- ✅ **Audit Logging** → Comprehensive activity tracking and compliance reporting\n\n## 📈 **Performance \u0026 Scalability**\n\n### **🎯 Performance Metrics**\n- **Throughput**: 1000+ pages/minute with optimal configuration\n- **Concurrency**: 50+ simultaneous crawl sessions\n- **Response Time**: 95% with JaCoCo\n- **Code Quality**: SonarQube integration with quality gates\n- **Security Scanning**: Automated dependency vulnerability checks\n- **Performance Monitoring**: Application metrics with Micrometer\n\n[//]: # ()\n[//]: # (## 📚 **API Documentation**)\n\n[//]: # ()\n[//]: # (### **Dashboard API Endpoints**)\n\n[//]: # (```)\n\n[//]: # (GET  /api/dashboard?timeRange=24h\u0026type=all)\n\n[//]: # (POST /api/dashboard/refresh)\n\n[//]: # (GET  /api/statistics?userId=123\u0026timeRange=7d)\n\n[//]: # (GET  /api/charts?type=domain-distribution)\n\n[//]: # (POST /api/export?format=csv\u0026timeRange=30d)\n\n[//]: # (```)\n\n[//]: # ()\n[//]: # (### **Crawler API Endpoints**)\n\n[//]: # (```)\n\n[//]: # (POST /api/crawl/start)\n\n[//]: # (GET  /api/crawl/status/{sessionId})\n\n[//]: # (POST /api/crawl/pause/{sessionId})\n\n[//]: # (POST /api/crawl/resume/{sessionId})\n\n[//]: # (DELETE /api/crawl/cancel/{sessionId})\n\n[//]: # (```)\n\n\n## 📄 **License**\n\nThis project is licensed under the MIT License.\n\n## 👨‍💻 **Author \u0026 Maintainer**\n\n**Ashish Kumar Jha**  \n*Full-Stack Developer | MERN Developer \u0026 Spring Boot Aficionado 🌱 | Crafting seamless, scalable web apps with modern tech and passion for clean, efficient code*\n\n- 🌐 **Portfolio**: [your-portfolio.com](https://ashish5jha.github.io/portfolio/)\n- 💼 **LinkedIn**: [linkedin.com/in/ashish-jha](https://www.linkedin.com/in/ashish5jha)\n- 📧 **Email**: network.ashishjha@gmail.com\n- 🐙 **GitHub**: [@ashish-jha](https://github.com/Ashjha75)\n\n## 🙏 **Acknowledgments**\n\n- **Scrapy.org** for design inspiration\n- **Spring Framework** for 
architectural patterns\n- **Apache Software Foundation** for excellent open-source tools\n- **Java Community** for continuous innovation and support\n\n### 🛠️ Installation \u0026 Build\n\nClone the repository and build it with Maven:\n\n```bash\ngit clone https://github.com/Ashjha75/CrawlForge.git\ncd CrawlForge\nmvn clean install\n```\n\n### 🚀 Running the Application\n\n* Deploy the generated `.war` file from `target/` to your local Tomcat server.\n* Or run it with the Jetty Maven plugin for quick testing:\n\n```bash\nmvn jetty:run\n```\n\n*(Ensure that `HibernateUtil.java` and `application.properties` are configured for your environment.)*\n\n### 🐳 Docker (MySQL \u0026 Redis)\n\nIf you prefer to run the dependencies in Docker, start the MySQL \u0026 Redis services:\n\n```bash\ndocker-compose up -d\n```\n\n*(See `docker-compose.yml` in the project root for configuration.)*\n\n### 🔐 Environment Variables\n\n| Variable      | Description        | Default Value |\n| ------------- | ------------------ | ------------- |\n| `DB_HOST`     | MySQL Host Address | `localhost`   |\n| `DB_USER`     | MySQL Username     | `root`        |\n| `DB_PASSWORD` | MySQL Password     | `root`        |\n| `DB_NAME`     | Database Name      | `mydb`        |\n\n---\n\n**⭐ Star this repository if you find it helpful!**\n\n*Built with ❤️ using Java, dedication, and countless cups of coffee ☕*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashjha75%2Fcrawlforge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashjha75%2Fcrawlforge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashjha75%2Fcrawlforge/lists"}