https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2

Cloud-based service for uploading, scraping, and managing PDF/DOCX documents. Features include title sorting, content search with highlights, rule-based classification, and storage stats. Integrated with cloud platforms for scalable document analytics.
https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2
cloud-computing document-analysis document-classification document-management full-stack machine-learning react serverless supabase text-analysis typescript vite
Last synced: 21 days ago
JSON representation
Host: GitHub
URL: https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2
Owner: Yosef-AlSabbah
Created: 2025-05-01T09:59:23.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2025-06-05T12:24:56.000Z (12 months ago)
Last Synced: 2025-06-20T08:43:12.252Z (11 months ago)
Topics: cloud-computing, document-analysis, document-classification, document-management, full-stack, machine-learning, react, serverless, supabase, text-analysis, typescript, vite
Language: TypeScript
Homepage:
Size: 269 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Cloud Document Analytics Platform

> Advanced cloud-based document analytics system for searching, sorting, and classifying documents using machine learning algorithms

![Platform Preview](https://via.placeholder.com/800x400/3B82F6/FFFFFF?text=Cloud+Document+Analytics)

## 📑 Abstract

The Cloud Document Analytics Platform is a comprehensive solution designed for efficient document management, analysis, and classification in cloud environments. Built with modern web technologies and leveraging cloud-native services, this platform enables users to extract meaningful insights from various document formats through advanced search capabilities, intelligent classification, and detailed analytics.

This project demonstrates the practical application of cloud computing concepts by implementing a serverless architecture with Supabase as the backend service provider, coupled with a responsive React frontend. The system showcases how distributed document processing can be achieved efficiently in the cloud while maintaining security, scalability, and performance.

## 1. Introduction

The Cloud Document Analytics Platform addresses the growing need for intelligent document management systems in educational, corporate, and research environments. As digital content continues to grow exponentially, traditional document management systems struggle with organizing, searching, and extracting insights from large document collections.

This platform adopts a cloud-first development methodology, leveraging serverless architecture patterns to minimize operational overhead while maximizing scalability. By utilizing Supabase's Backend-as-a-Service (BaaS) capabilities, the system achieves a separation of concerns between frontend and backend components while maintaining robust data security through Row Level Security policies. The development process followed an iterative approach with continuous integration and deployment practices, enabling rapid feature development and refinement.

## 2. Cloud Software Program/Service Requirements

### User Stories

- As a researcher, I want to upload multiple document formats so that I can analyze documents regardless of their source.

- As a student, I want to search within document content so that I can quickly find relevant information.

- As a teacher, I want to categorize documents automatically so that I can maintain an organized collection.

- As an analyst, I want to visualize document metrics so that I can understand the composition of my document collection.

- As a content collector, I want to scrape web pages for content so that I can build my document repository efficiently.

- As a mobile user, I want a responsive interface so that I can access my documents from any device.

### Use Cases

1. **Document Management**:

   - Upload documents (PDF, DOC, DOCX)

   - View document metadata

   - Delete documents

   - Download documents

2. **Content Analysis**:

   - Full-text search with relevance scoring

   - Metadata extraction

   - Content summarization

   - Classification by content type

3. **Web Scraping**:

   - URL input for content extraction

   - Automatic document generation from web content

   - Scheduled scraping of specified sources

4. **Analytics**:

   - Document type distribution

   - Upload frequency trends

   - Search term analytics

   - Classification accuracy metrics

## 3. Software Architecture and Design

### Architecture Diagram

```

┌─────────────────┐     ┌───────────────────────┐     ┌────────────────────┐

│                 │     │                       │     │                    │

│  React Frontend ├─────┤ Supabase BaaS Layer   ├─────┤ Edge Functions    │

│  (Vite + TS)    │     │ (Auth, DB, Storage)   │     │ (Web Scraper)     │

│                 │     │                       │     │                    │

└────────┬────────┘     └───────────┬───────────┘     └────────────────────┘

         │                          │

         │                          │

┌────────▼──────────┐     ┌─────────▼───────────┐

│                   │     │                     │

│  UI Components    │     │  PostgreSQL DB      │

│  (Shadcn/UI)      │     │  (Document Store)   │

│                   │     │                     │

└───────────────────┘     └─────────────────────┘

```

### Component Design

The system is designed with several key functional components:

1. **Document Processing Pipeline**:

   - File upload handler with type validation

   - Content extraction module for different document formats

   - Metadata parser for capturing document properties

2. **Search Algorithm**:

   - Tokenization and normalization of document content

   - Inverted index for efficient term matching

   - TF-IDF based relevance scoring system

   - Fuzzy matching capabilities for handling typos

3. **Classification System**:

   - Feature extraction from document content and metadata

   - Rule-based classification for common document types

   - Category matching based on content analysis

   - User feedback loop for improving classification accuracy

4. **Web Scraping Service**:

   - URL validation and sanitization

   - HTML parsing and content extraction

   - Document conversion from web content

   - Rate limiting to prevent abuse

### Database Design

The system uses a PostgreSQL database (provided by Supabase) with the following schema:

```sql

-- Documents table

CREATE TABLE public.documents (

  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,

  user_id UUID REFERENCES auth.users NOT NULL,

  name TEXT NOT NULL,

  title TEXT,

  content TEXT,

  file_path TEXT,

  size BIGINT,

  type TEXT,

  classification TEXT,

  relevance_score DECIMAL,

  upload_time TIMESTAMP WITH TIME ZONE DEFAULT now(),

  created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),

  updated_at TIMESTAMP WITH TIME ZONE DEFAULT now()

);

```

### User Interface Design

The UI is designed with a focus on usability and aesthetics:

- **Dashboard Layout**: Card-based interface showing document statistics and recent uploads

- **Search Interface**: Prominent search bar with filtering options and result highlighting

- **Upload Component**: Drag-and-drop interface with progress indication

- **Document List**: Sortable and filterable list with action buttons

- **Classification Panel**: Visual representation of document categories with distribution charts

## 4. Used Cloud Services and Interfaces

The platform leverages the following cloud services:

1. **Supabase**:

   - **Authentication**: User registration and login management

   - **Database**: PostgreSQL database for document metadata storage

   - **Storage**: Object storage for document files

   - **Edge Functions**: Serverless functions for web scraping and processing

2. **Vercel**:

   - **Hosting**: Frontend application deployment

   - **CI/CD Pipeline**: Automated build and deployment

   - **CDN**: Global content delivery network

   - **Analytics**: Usage and performance monitoring

3. **Integration Services**:

   - **PDF Processing**: PDF-lib for document manipulation

   - **Word Processing**: Mammoth for DOCX/DOC conversion

## 5. Implementation

### Frontend Implementation

The frontend is built using React with TypeScript, utilizing the Vite build tool for optimal developer experience and build performance. Key implementation details include:

```tsx

// Document Upload Component

const DocumentUpload: React.FC = () => {

  const [files, setFiles] = useState([]);

  const [uploading, setUploading] = useState(false);

  

  const handleUpload = async () => {

    setUploading(true);

    try {

      // Process each file

      for (const file of files) {

        // Extract metadata

        const metadata = await extractMetadata(file);

        

        // Upload to Supabase storage

        const { data, error } = await supabase.storage

          .from('documents')

          .upload(`${uuidv4()}-${file.name}`, file);

          

        if (error) throw error;

        

        // Store document record

        await supabase.from('documents').insert({

          name: file.name,

          title: metadata.title,

          size: file.size,

          type: file.type,

          file_path: data.path,

        });

      }

    } catch (error) {

      console.error('Error uploading:', error);

    } finally {

      setUploading(false);

      setFiles([]);

    }

  };

  

  // Render upload interface

  return (

    // Upload interface implementation

  );

};

```

### Document Processing Implementation

```typescript

// Document Processing Utility

export const processDocument = async (file: File): Promise => {

  const fileType = file.name.split('.').pop()?.toLowerCase();

  let content = '';

  

  switch (fileType) {

    case 'pdf':

      content = await extractPdfContent(file);

      break;

    case 'docx':

    case 'doc':

      content = await extractWordContent(file);

      break;

    default:

      throw new Error('Unsupported file type');

  }

  

  // Extract metadata

  const metadata = extractMetadata(content);

  

  // Classify document

  const classification = classifyDocument(content, metadata);

  

  return {

    content,

    metadata,

    classification,

  };

};

```

### Search Implementation

```typescript

// Search functionality

export const searchDocuments = async (

  query: string,

  filters: SearchFilters

): Promise => {

  // Tokenize search query

  const tokens = tokenizeQuery(query);

  

  // Build SQL search conditions

  let searchCondition = '';

  if (tokens.length > 0) {

    searchCondition = tokens.map(token => 

      `content ILIKE '%${token}%' OR title ILIKE '%${token}%'`

    ).join(' OR ');

  }

  

  // Apply filters

  let filterConditions = [];

  if (filters.type) filterConditions.push(`type = '${filters.type}'`);

  if (filters.classification) filterConditions.push(`classification = '${filters.classification}'`);

  

  // Execute search

  const { data, error } = await supabase

    .from('documents')

    .select('*')

    .or(searchCondition)

    .and(filterConditions.join(' AND '));

    

  if (error) throw error;

  

  // Calculate relevance scores

  return data.map(doc => ({

    ...doc,

    relevance_score: calculateRelevanceScore(doc, query)

  }));

};

```

## 6. Data

The platform utilizes Supabase's PostgreSQL database for structured data storage with the following model:

### Document Data Model

- **documents**: Stores metadata and content for uploaded documents

  - Primary identification (UUID)

  - User ownership reference

  - Document attributes (name, title, size, type)

  - Content and classification data

  - Temporal metadata (upload time, update time)

### Storage Model

Document files are stored in Supabase Storage buckets with the following structure:

- **documents/**: Root bucket for all document files

  - **{user_id}/**: Segregated by user for security

    - **{document_id}-{filename}**: Individual document files

### Security Implementation

- Row Level Security (RLS) policies ensure users can only access their own documents

- Storage bucket policies restrict file access based on user authentication

- JWT-based authentication for secure API access

## 7. The Used Cloud Platform

### Supabase Platform Architecture

Supabase provides a comprehensive Backend-as-a-Service platform with the following components:

1. **PostgreSQL Database**:

   - High-performance relational database

   - Full-text search capabilities

   - Real-time subscriptions

   - Row-level security policies

2. **Authentication Service**:

   - User management

   - Multiple auth providers

   - JWT token handling

   - Secure password storage

3. **Storage Service**:

   - S3-compatible object storage

   - Public and private buckets

   - Access control policies

   - Image transformations

4. **Edge Functions**:

   - Deno-based serverless functions

   - Globally distributed execution

   - Low-latency responses

   - Secure environment variables

### Vercel Deployment Platform

Vercel provides a seamless frontend deployment platform with:

1. **Build System**:

   - Optimized for modern JavaScript frameworks

   - Automatic dependency installation

   - Environment variable management

2. **Edge Network**:

   - Global CDN distribution

   - Automatic SSL/TLS

   - High-performance edge caching

   - Instant cache invalidation

## 8. Deployment on the Platform

### Deployment Process

The application deployment follows a streamlined process:

1. **Code Repository Setup**:

   - GitHub repository for version control

   - Branch protection rules for main branch

   - Pre-commit hooks for code quality

2. **Supabase Configuration**:

   - Database schema initialization

   - RLS policy setup

   - Storage bucket creation

   - Edge function deployment

3. **Vercel Deployment**:

   - Connection to GitHub repository

   - Build configuration:

     ```

     Framework Preset: Vite

     Build Command: npm run build

     Output Directory: dist

     Install Command: npm install

     ```

   - Environment variable setup

   - Domain configuration

4. **Continuous Integration/Deployment**:

   - Automatic builds on push to main branch

   - Preview deployments for pull requests

   - Rollback capability for failed deployments

## 9. User Support

### User Documentation

#### Getting Started

1. **Account Creation**:

   - Navigate to the application URL

   - Click "Sign Up" and enter your details

   - Verify your email address

2. **Document Upload**:

   - Click "Upload" button on the dashboard

   - Select files or drag and drop documents

   - Wait for processing to complete

3. **Searching Documents**:

   - Use the search bar at the top of the interface

   - Enter keywords related to your document

   - Apply filters to narrow results

4. **Document Classification**:

   - Navigate to the Classification panel

   - View automatic document categorization

   - Manually adjust categories if needed

#### Troubleshooting

Common issues and their solutions are documented in the [Troubleshooting](#troubleshooting) section.

### Source Code and Live Application

- **Source Code**: [GitHub Repository](https://github.com/yourusername/cloud-docu-analyzer-nexus)

- **Live Application**: [Cloud Document Analytics Platform](https://cloud-docu-analyzer-nexus.vercel.app)

## 10. Conclusion

The Cloud Document Analytics Platform demonstrates the power of modern cloud-native development for creating efficient document management and analysis systems. By leveraging serverless architecture and BaaS platforms like Supabase, the application achieves high performance, scalability, and security without requiring extensive backend infrastructure management.

### Current Limitations

- Document processing is limited to specific formats (PDF, DOC, DOCX)

- Classification accuracy depends on document content quality

- Web scraper may not handle all website structures efficiently

- Storage limitations based on free-tier constraints

### Future Enhancements

1. **Enhanced AI Classification**:

   - Integration with machine learning models for improved categorization

   - Document similarity detection

   - Content summarization

2. **Advanced Analytics**:

   - Sentiment analysis of document content

   - Topic modeling and clustering

   - Trend identification across document collections

3. **Collaboration Features**:

   - Document sharing capabilities

   - Comment and annotation tools

   - Version control for documents

4. **Performance Optimizations**:

   - Parallel processing for large document batches

   - Caching strategies for frequently accessed content

   - Progressive loading for large documents

## References

1. Supabase Documentation. (2023). *Authentication*. Retrieved from https://supabase.com/docs/guides/auth

2. Vercel Documentation. (2023). *Deployment*. Retrieved from https://vercel.com/docs/concepts/deployments/overview

3. React Documentation. (2023). *React Hooks*. Retrieved from https://reactjs.org/docs/hooks-intro.html

4. Vite Documentation. (2023). *Features*. Retrieved from https://vitejs.dev/guide/features.html

5. Mozilla Developer Network. (2023). *Using the Fetch API*. Retrieved from https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch

6. PostgreSQL Documentation. (2023). *Full Text Search*. Retrieved from https://www.postgresql.org/docs/current/textsearch.html

---

**Made with ❤️ by Yousef M. Y. Al Sabbah**

*Islamic University of Gaza - Faculty of Information Technology*

---

*Last updated: June 5, 2025*
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/yosef-alsabbah/cloud-based-document-analytics-service-2

Awesome Lists containing this project

README