https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2
Cloud-based service for uploading, scraping, and managing PDF/DOCX documents. Features include title sorting, content search with highlights, rule-based classification, and storage stats. Integrated with cloud platforms for scalable document analytics.
cloud-computing document-analysis document-classification document-management full-stack machine-learning react serverless supabase text-analysis typescript vite
- Host: GitHub
- URL: https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2
- Owner: Yosef-AlSabbah
- Created: 2025-05-01T09:59:23.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-06-05T12:24:56.000Z (12 months ago)
- Last Synced: 2025-06-20T08:43:12.252Z (11 months ago)
- Topics: cloud-computing, document-analysis, document-classification, document-management, full-stack, machine-learning, react, serverless, supabase, text-analysis, typescript, vite
- Language: TypeScript
- Homepage:
- Size: 269 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Cloud Document Analytics Platform
> Cloud-based document analytics system for searching, sorting, and classifying documents using rule-based classification

## 📑 Abstract
The Cloud Document Analytics Platform is a comprehensive solution designed for efficient document management, analysis, and classification in cloud environments. Built with modern web technologies and leveraging cloud-native services, this platform enables users to extract meaningful insights from various document formats through advanced search capabilities, intelligent classification, and detailed analytics.
This project demonstrates the practical application of cloud computing concepts by implementing a serverless architecture with Supabase as the backend service provider, coupled with a responsive React frontend. The system showcases how distributed document processing can be achieved efficiently in the cloud while maintaining security, scalability, and performance.
## 1. Introduction
The Cloud Document Analytics Platform addresses the growing need for intelligent document management systems in educational, corporate, and research environments. As digital content continues to grow exponentially, traditional document management systems struggle with organizing, searching, and extracting insights from large document collections.
This platform adopts a cloud-first development methodology, leveraging serverless architecture patterns to minimize operational overhead while maximizing scalability. By utilizing Supabase's Backend-as-a-Service (BaaS) capabilities, the system achieves a separation of concerns between frontend and backend components while maintaining robust data security through Row Level Security policies. The development process followed an iterative approach with continuous integration and deployment practices, enabling rapid feature development and refinement.
## 2. Cloud Software Program/Service Requirements
### User Stories
- As a researcher, I want to upload multiple document formats so that I can analyze documents regardless of their source.
- As a student, I want to search within document content so that I can quickly find relevant information.
- As a teacher, I want to categorize documents automatically so that I can maintain an organized collection.
- As an analyst, I want to visualize document metrics so that I can understand the composition of my document collection.
- As a content collector, I want to scrape web pages for content so that I can build my document repository efficiently.
- As a mobile user, I want a responsive interface so that I can access my documents from any device.
### Use Cases
1. **Document Management**:
   - Upload documents (PDF, DOC, DOCX)
   - View document metadata
   - Delete documents
   - Download documents
2. **Content Analysis**:
   - Full-text search with relevance scoring
   - Metadata extraction
   - Content summarization
   - Classification by content type
3. **Web Scraping**:
   - URL input for content extraction
   - Automatic document generation from web content
   - Scheduled scraping of specified sources
4. **Analytics**:
   - Document type distribution
   - Upload frequency trends
   - Search term analytics
   - Classification accuracy metrics
## 3. Software Architecture and Design
### Architecture Diagram
```
┌─────────────────┐     ┌───────────────────────┐     ┌────────────────────┐
│                 │     │                       │     │                    │
│ React Frontend  ├─────┤  Supabase BaaS Layer  ├─────┤   Edge Functions   │
│  (Vite + TS)    │     │  (Auth, DB, Storage)  │     │   (Web Scraper)    │
│                 │     │                       │     │                    │
└────────┬────────┘     └───────────┬───────────┘     └────────────────────┘
         │                          │
         │                          │
┌────────▼──────────┐     ┌─────────▼───────────┐
│                   │     │                     │
│   UI Components   │     │    PostgreSQL DB    │
│   (Shadcn/UI)     │     │  (Document Store)   │
│                   │     │                     │
└───────────────────┘     └─────────────────────┘
```
### Component Design
The system is designed with several key functional components:
1. **Document Processing Pipeline**:
   - File upload handler with type validation
   - Content extraction module for different document formats
   - Metadata parser for capturing document properties
2. **Search Algorithm**:
   - Tokenization and normalization of document content
   - Inverted index for efficient term matching
   - TF-IDF based relevance scoring system
   - Fuzzy matching capabilities for handling typos
3. **Classification System**:
   - Feature extraction from document content and metadata
   - Rule-based classification for common document types
   - Category matching based on content analysis
   - User feedback loop for improving classification accuracy
4. **Web Scraping Service**:
   - URL validation and sanitization
   - HTML parsing and content extraction
   - Document conversion from web content
   - Rate limiting to prevent abuse
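The URL validation and rate limiting steps of the web scraping service could look roughly like the sketch below. It is an illustration under assumptions, not the project's edge function: `validateScrapeUrl` and `RateLimiter` are hypothetical names, the private-host check is deliberately non-exhaustive, and the limiter keeps its state in memory, so it would only apply per function instance.

```typescript
/** Parse and validate a user-supplied URL before it is fetched (hypothetical helper). */
function validateScrapeUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw.trim());
  } catch {
    throw new Error('Invalid URL');
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error('Only http and https URLs are allowed');
  }
  const host = url.hostname.toLowerCase();
  // Block obvious loopback and private-network targets (not an exhaustive list).
  const isPrivate =
    host === 'localhost' ||
    /^127\./.test(host) ||
    /^10\./.test(host) ||
    /^192\.168\./.test(host) ||
    /^172\.(1[6-9]|2\d|3[01])\./.test(host);
  if (isPrivate) throw new Error('Private hosts are not allowed');
  url.hash = ''; // fragments are never sent to the server
  return url;
}

/** Fixed-window limiter: at most `limit` calls per `windowMs` for each key (hypothetical helper). */
class RateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.start >= this.windowMs) {
      // First call for this key, or the previous window has expired.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A caller would typically key the limiter by user ID and reject the scrape request when `allow` returns `false`.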
### Database Design
The system uses a PostgreSQL database (provided by Supabase) with the following schema:
```sql
-- Documents table
CREATE TABLE public.documents (
  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users NOT NULL,
  name TEXT NOT NULL,
  title TEXT,
  content TEXT,
  file_path TEXT,
  size BIGINT,
  type TEXT,
  classification TEXT,
  relevance_score DECIMAL,
  upload_time TIMESTAMP WITH TIME ZONE DEFAULT now(),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);
```
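On the client side, the table above can be mirrored by a row type so that inserts are checked at compile time. This is a sketch under assumptions: `DocumentRow`, `DocumentInsert`, and `newDocumentInsert` are hypothetical names, and mapping `BIGINT` and `DECIMAL` to `number` assumes the values stay within JavaScript's safe numeric range.

```typescript
// Row shape mirroring public.documents; columns without NOT NULL map to `| null`.
interface DocumentRow {
  id: string; // UUID
  user_id: string; // UUID, references auth.users
  name: string;
  title: string | null;
  content: string | null;
  file_path: string | null;
  size: number | null; // BIGINT (assumed to fit in a JS number)
  type: string | null;
  classification: string | null;
  relevance_score: number | null; // DECIMAL (assumed to fit in a JS number)
  upload_time: string | null; // timestamps arrive as ISO 8601 strings over JSON
  created_at: string | null;
  updated_at: string | null;
}

// Only the NOT NULL columns without defaults are required on insert.
type DocumentInsert = Pick<DocumentRow, 'user_id' | 'name'> &
  Partial<Omit<DocumentRow, 'user_id' | 'name'>>;

// Build an insert payload from an uploaded file's basic properties.
function newDocumentInsert(
  userId: string,
  file: { name: string; size: number; type: string },
  filePath: string
): DocumentInsert {
  return {
    user_id: userId,
    name: file.name,
    size: file.size,
    type: file.type,
    file_path: filePath,
  };
}
```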
### User Interface Design
The UI is designed with a focus on usability and aesthetics:
- **Dashboard Layout**: Card-based interface showing document statistics and recent uploads
- **Search Interface**: Prominent search bar with filtering options and result highlighting
- **Upload Component**: Drag-and-drop interface with progress indication
- **Document List**: Sortable and filterable list with action buttons
- **Classification Panel**: Visual representation of document categories with distribution charts
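Result highlighting in the search interface can be implemented by splitting the text into matched and unmatched segments before rendering. The following is a minimal sketch rather than the project's actual code; `Segment`, `escapeRegExp`, and `highlightSegments` are hypothetical names.

```typescript
interface Segment {
  text: string;
  match: boolean;
}

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Split `text` into segments, flagging the parts that match any search term
// (case-insensitive).
function highlightSegments(text: string, terms: string[]): Segment[] {
  const cleaned = terms.map(t => t.trim()).filter(t => t.length > 0);
  if (cleaned.length === 0) return text ? [{ text, match: false }] : [];
  // Longer terms first so a phrase wins over one of its own prefixes.
  cleaned.sort((a, b) => b.length - a.length);
  const pattern = new RegExp(`(${cleaned.map(escapeRegExp).join('|')})`, 'gi');
  // split() with a capturing group keeps the matches at odd indices.
  return text
    .split(pattern)
    .map((part, i) => ({ text: part, match: i % 2 === 1 }))
    .filter(segment => segment.text.length > 0);
}
```

In a React component, each segment with `match: true` could be wrapped in a `<mark>` element, which avoids injecting raw HTML into the page.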
## 4. Used Cloud Services and Interfaces
The platform leverages the following cloud services:
1. **Supabase**:
   - **Authentication**: User registration and login management
   - **Database**: PostgreSQL database for document metadata storage
   - **Storage**: Object storage for document files
   - **Edge Functions**: Serverless functions for web scraping and processing
2. **Vercel**:
   - **Hosting**: Frontend application deployment
   - **CI/CD Pipeline**: Automated build and deployment
   - **CDN**: Global content delivery network
   - **Analytics**: Usage and performance monitoring
3. **Integration Services**:
   - **PDF Processing**: PDF-lib for document manipulation
   - **Word Processing**: Mammoth for DOCX conversion (Mammoth reads `.docx` only, so legacy `.doc` files need a separate conversion step)
## 5. Implementation
### Frontend Implementation
The frontend is built using React with TypeScript, utilizing the Vite build tool for optimal developer experience and build performance. Key implementation details include:
```tsx
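// Document Upload Component
import React, { useState } from 'react';
import { v4 as uuidv4 } from 'uuid';
// Assumed project-local modules; adjust the paths to the actual project layout.
import { supabase } from './supabaseClient';
import { extractMetadata } from './extractMetadata';

const DocumentUpload: React.FC = () => {
  const [files, setFiles] = useState<File[]>([]);
  const [uploading, setUploading] = useState(false);

  const handleUpload = async () => {
    setUploading(true);
    try {
      // documents.user_id is NOT NULL, so resolve the signed-in user first
      const { data: { user } } = await supabase.auth.getUser();
      if (!user) throw new Error('Not authenticated');

      // Process each file
      for (const file of files) {
        // Extract metadata
        const metadata = await extractMetadata(file);

        // Upload to Supabase storage under the user's folder
        const { data, error } = await supabase.storage
          .from('documents')
          .upload(`${user.id}/${uuidv4()}-${file.name}`, file);
        if (error) throw error;

        // Store document record
        const { error: insertError } = await supabase.from('documents').insert({
          user_id: user.id,
          name: file.name,
          title: metadata.title,
          size: file.size,
          type: file.type,
          file_path: data.path,
        });
        if (insertError) throw insertError;
      }
    } catch (error) {
      console.error('Error uploading:', error);
    } finally {
      setUploading(false);
      setFiles([]);
    }
  };

  // Render upload interface
  return (
    <>{/* Upload interface implementation */}</>
  );
};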
```
### Document Processing Implementation
```typescript
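// Document Processing Utility
// The extract* and classifyDocument helpers are defined elsewhere in the project;
// ProcessedDocument is an assumed name for the result type.
export interface ProcessedDocument {
  content: string;
  metadata: ReturnType<typeof extractMetadata>;
  classification: ReturnType<typeof classifyDocument>;
}

export const processDocument = async (file: File): Promise<ProcessedDocument> => {
  const fileType = file.name.split('.').pop()?.toLowerCase();
  let content = '';

  switch (fileType) {
    case 'pdf':
      content = await extractPdfContent(file);
      break;
    case 'docx':
    case 'doc': // legacy .doc support depends on what extractWordContent can read
      content = await extractWordContent(file);
      break;
    default:
      throw new Error('Unsupported file type');
  }

  // Extract metadata
  const metadata = extractMetadata(content);

  // Classify document
  const classification = classifyDocument(content, metadata);

  return {
    content,
    metadata,
    classification,
  };
};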
```
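The `classifyDocument` helper called above is not shown in this README. A rule-based classifier in the spirit of the Classification System component could be sketched as follows. The categories and keywords are purely illustrative, and `Rule`, `EXAMPLE_RULES`, and `classifyByRules` are hypothetical names.

```typescript
interface Rule {
  category: string;
  keywords: string[];
}

// Illustrative rules only; the project's real categories are not listed here.
const EXAMPLE_RULES: Rule[] = [
  { category: 'invoice', keywords: ['invoice', 'amount due', 'bill to'] },
  { category: 'research-paper', keywords: ['abstract', 'references', 'methodology'] },
  { category: 'resume', keywords: ['experience', 'education', 'skills'] },
];

// Return the category whose rule matches the most keywords, or
// 'uncategorized' when no keyword is found at all.
function classifyByRules(content: string, rules: Rule[] = EXAMPLE_RULES): string {
  const text = content.toLowerCase();
  let best = { category: 'uncategorized', hits: 0 };
  for (const rule of rules) {
    const hits = rule.keywords.filter(keyword => text.includes(keyword)).length;
    // Strict comparison: on a tie, the rule listed first wins.
    if (hits > best.hits) best = { category: rule.category, hits };
  }
  return best.category;
}
```

Keeping the rules in a plain data array means new categories can be added without touching the matching logic.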
### Search Implementation
```typescript
// Search functionality
// SearchFilters and SearchResult are assumed to be defined elsewhere in the project.
export const searchDocuments = async (
  query: string,
  filters: SearchFilters
): Promise<SearchResult[]> => {
  // Tokenize search query
  const tokens = tokenizeQuery(query);

  let request = supabase.from('documents').select('*');

  // Match any token in content or title. supabase-js expects PostgREST filter
  // syntax here, not raw SQL. Tokens are assumed to contain no PostgREST
  // reserved characters (commas, parentheses, quotes).
  if (tokens.length > 0) {
    const conditions = tokens.flatMap(token => [
      `content.ilike.%${token}%`,
      `title.ilike.%${token}%`,
    ]);
    request = request.or(conditions.join(','));
  }

  // Apply filters as equality checks instead of concatenated SQL strings
  if (filters.type) request = request.eq('type', filters.type);
  if (filters.classification) {
    request = request.eq('classification', filters.classification);
  }

  // Execute search
  const { data, error } = await request;
  if (error) throw error;

  // Calculate relevance scores
  return data.map(doc => ({
    ...doc,
    relevance_score: calculateRelevanceScore(doc, query)
  }));
};
```
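`calculateRelevanceScore` is not shown in this README. The Component Design section names tokenization, an inverted index, and TF-IDF scoring, which could be combined as in the sketch below. This is one common TF-IDF variant (length-normalized term frequency with a smoothed `log(1 + N/df)` inverse document frequency), not necessarily the formula the project uses. `Doc`, `tokenize`, `buildIndex`, and `tfIdfSearch` are hypothetical names, and the tokenizer only handles ASCII letters and digits.

```typescript
interface Doc {
  id: string;
  content: string;
}

// Lowercase and split on anything that is not an ASCII letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// term -> (document id -> number of occurrences)
type InvertedIndex = Map<string, Map<string, number>>;

function buildIndex(docs: Doc[]): { index: InvertedIndex; lengths: Map<string, number> } {
  const index: InvertedIndex = new Map();
  const lengths = new Map<string, number>();
  for (const doc of docs) {
    const tokens = tokenize(doc.content);
    lengths.set(doc.id, tokens.length);
    for (const token of tokens) {
      const postings = index.get(token) ?? new Map<string, number>();
      postings.set(doc.id, (postings.get(doc.id) ?? 0) + 1);
      index.set(token, postings);
    }
  }
  return { index, lengths };
}

// Score every document that contains at least one query term; best match first.
function tfIdfSearch(docs: Doc[], query: string): Array<{ id: string; score: number }> {
  const { index, lengths } = buildIndex(docs);
  const scores = new Map<string, number>();
  const terms = Array.from(new Set(tokenize(query)));
  for (const term of terms) {
    const postings = index.get(term);
    if (!postings) continue;
    const idf = Math.log(1 + docs.length / postings.size);
    postings.forEach((count, id) => {
      const tf = count / (lengths.get(id) || 1);
      scores.set(id, (scores.get(id) ?? 0) + tf * idf);
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

In practice the index would be built once and reused across queries rather than rebuilt on each call as it is here.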
## 6. Data
The platform utilizes Supabase's PostgreSQL database for structured data storage with the following model:
### Document Data Model
- **documents**: Stores metadata and content for uploaded documents
  - Primary identification (UUID)
  - User ownership reference
  - Document attributes (name, title, size, type)
  - Content and classification data
  - Temporal metadata (upload time, update time)
### Storage Model
Document files are stored in Supabase Storage buckets with the following structure:
- **documents/**: Root bucket for all document files
  - **{user_id}/**: Segregated by user for security
    - **{document_id}-{filename}**: Individual document files
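A small helper can build object keys that follow this layout. The sketch below is an assumption about how the path might be assembled; `buildStoragePath` is a hypothetical name, and the filename sanitization rule is only one reasonable choice.

```typescript
// Build "{user_id}/{document_id}-{filename}" inside the documents bucket.
function buildStoragePath(userId: string, documentId: string, filename: string): string {
  // Keep only the base name so directory components cannot leak into the key.
  const base = filename.split(/[\\/]/).pop() ?? '';
  // Replace runs of characters that are awkward in object keys.
  const safe = base.replace(/[^A-Za-z0-9._-]+/g, '_') || 'file';
  return `${userId}/${documentId}-${safe}`;
}
```

Putting the user ID in the first path segment lets storage policies restrict access by folder name.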
### Security Implementation
- Row Level Security (RLS) policies ensure users can only access their own documents
- Storage bucket policies restrict file access based on user authentication
- JWT-based authentication for secure API access
## 7. The Used Cloud Platform
### Supabase Platform Architecture
Supabase provides a comprehensive Backend-as-a-Service platform with the following components:
1. **PostgreSQL Database**:
   - High-performance relational database
   - Full-text search capabilities
   - Real-time subscriptions
   - Row-level security policies
2. **Authentication Service**:
   - User management
   - Multiple auth providers
   - JWT token handling
   - Secure password storage
3. **Storage Service**:
   - S3-compatible object storage
   - Public and private buckets
   - Access control policies
   - Image transformations
4. **Edge Functions**:
   - Deno-based serverless functions
   - Globally distributed execution
   - Low-latency responses
   - Secure environment variables
### Vercel Deployment Platform
Vercel provides a seamless frontend deployment platform with:
1. **Build System**:
   - Optimized for modern JavaScript frameworks
   - Automatic dependency installation
   - Environment variable management
2. **Edge Network**:
   - Global CDN distribution
   - Automatic SSL/TLS
   - High-performance edge caching
   - Instant cache invalidation
## 8. Deployment on the Platform
### Deployment Process
The application deployment follows a streamlined process:
1. **Code Repository Setup**:
   - GitHub repository for version control
   - Branch protection rules for the default branch (`master`)
   - Pre-commit hooks for code quality
2. **Supabase Configuration**:
   - Database schema initialization
   - RLS policy setup
   - Storage bucket creation
   - Edge function deployment
3. **Vercel Deployment**:
   - Connection to GitHub repository
   - Build configuration:
     ```
     Framework Preset: Vite
     Build Command: npm run build
     Output Directory: dist
     Install Command: npm install
     ```
   - Environment variable setup
   - Domain configuration
4. **Continuous Integration/Deployment**:
   - Automatic builds on push to the default branch
   - Preview deployments for pull requests
   - Rollback capability for failed deployments
## 9. User Support
### User Documentation
#### Getting Started
1. **Account Creation**:
   - Navigate to the application URL
   - Click "Sign Up" and enter your details
   - Verify your email address
2. **Document Upload**:
   - Click the "Upload" button on the dashboard
   - Select files or drag and drop documents
   - Wait for processing to complete
3. **Searching Documents**:
   - Use the search bar at the top of the interface
   - Enter keywords related to your document
   - Apply filters to narrow results
4. **Document Classification**:
   - Navigate to the Classification panel
   - View automatic document categorization
   - Manually adjust categories if needed
#### Troubleshooting
Common issues and their solutions are documented in the [Troubleshooting](#troubleshooting) section.
### Source Code and Live Application
- **Source Code**: [GitHub Repository](https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2)
- **Live Application**: [Cloud Document Analytics Platform](https://cloud-docu-analyzer-nexus.vercel.app)
## 10. Conclusion
The Cloud Document Analytics Platform demonstrates the power of modern cloud-native development for creating efficient document management and analysis systems. By leveraging serverless architecture and BaaS platforms like Supabase, the application achieves high performance, scalability, and security without requiring extensive backend infrastructure management.
### Current Limitations
- Document processing is limited to specific formats (PDF, DOC, DOCX)
- Classification accuracy depends on document content quality
- Web scraper may not handle all website structures efficiently
- Storage limitations based on free-tier constraints
### Future Enhancements
1. **Enhanced AI Classification**:
   - Integration with machine learning models for improved categorization
   - Document similarity detection
   - Content summarization
2. **Advanced Analytics**:
   - Sentiment analysis of document content
   - Topic modeling and clustering
   - Trend identification across document collections
3. **Collaboration Features**:
   - Document sharing capabilities
   - Comment and annotation tools
   - Version control for documents
4. **Performance Optimizations**:
   - Parallel processing for large document batches
   - Caching strategies for frequently accessed content
   - Progressive loading for large documents
---
**Made with ❤️ by Yousef M. Y. Al Sabbah**
*Islamic University of Gaza - Faculty of Information Technology*
---
*Last updated: June 5, 2025*