An open API service indexing awesome lists of open source software.

https://github.com/jonaskahn/azure-invoice

Simple app for simple life
https://github.com/jonaskahn/azure-invoice

Last synced: 7 days ago
JSON representation

Simple app for simple life

Awesome Lists containing this project

README

          

# Azure Invoice Analyzer Pro - Complete Strategy Document

## Executive Summary

Azure Invoice Analyzer Pro is an advanced Streamlit application that transforms Azure billing data into actionable business intelligence. The system provides comprehensive cost analysis through 9 business categories, interactive drill-down capabilities, and complete financial reconciliation with PDF export functionality.

## Quick Chart Reference

### Chart Calculation Summary

| Chart Type | Calculation Method | Key Metrics |
| ----------------------------- | --------------------------------------------- | ----------------------------------- |
| **Cost Category Pie/Bar** | `df.groupby('CostCategory')['Cost'].sum()` | Category percentages of total cost |
| **Service Provider Analysis** | `df.groupby('ConsumedService')['Cost'].sum()` | Total cost per Azure service |
| **Resource Group Costs** | `df.groupby('ResourceGroup')['Cost'].sum()` | Total cost per resource group |
| **Machine Efficiency** | `Cost / Quantity` per resource | Cost per unit/hour analysis |
| **Interactive Drill-Down** | Multi-level grouping + pattern matching | Resource Group → Machine → Category |
| **Cost Reconciliation** | `abs(original_total - categorized_total)` | Financial validation & accuracy |

### Core Calculations

**Cost Reconciliation:**

```python
original_total = df['Cost'].sum()
categorized_total = classified_df['Cost'].sum()
coverage = (categorized_cost / original_total) × 100
```

**Category Classification:**

```python
# 9 business categories based on Azure service patterns
if MeterSubcategory in ['Premium SSD Managed Disks']:
return 'Managed Disks'
elif ConsumedService == 'Microsoft.Compute' and not disk_category:
return 'VM Compute'
```

**Machine Cost Breakdown:**

```python
# Includes related resources (disks, NICs, etc.)
machine_data = df[df['ResourceName'] == machine_name]
related_data = df[df['ResourceName'].str.contains(machine_name)]
total_cost = combined_data['Cost'].sum()
```

### Chart Types and Business Value

| Chart | Purpose | Key Insight | Optimization Action |
| ----------------------------- | -------------------------------------- | ----------------------------------- | ----------------------------------------- |
| **Cost Category Pie** | Budget allocation by business function | Which category consumes most budget | Focus optimization on largest segments |
| **Service Provider Bar** | Azure service spending distribution | Which Microsoft services cost most | Negotiate better rates, optimize usage |
| **Resource Group Costs** | Environment/team cost allocation | Which groups spend most | Budget allocation, cost accountability |
| **Machine Efficiency** | Resource utilization analysis | Cost per hour performance | Right-size over/under-utilized resources |
| **Drill-Down Analysis** | Granular cost investigation | Machine-level cost breakdown | Target specific machines for optimization |
| **Reconciliation Validation** | Financial accuracy verification | Data integrity and completeness | Ensure 100% cost accountability |

## Core Architecture

### Data Processing Pipeline

1. **Data Ingestion**: CSV file upload with validation and error handling
2. **Cost Classification**: 9-category business intelligence classification system
3. **Cost Reconciliation**: Mathematical validation ensuring 100% cost accountability
4. **Interactive Analysis**: Multi-level drill-down from resource groups to individual machines
5. **Export Generation**: Professional PDF reports and individual chart downloads

### Classification Engine

#### 9 Business Cost Categories

1. **Managed Disks** (typically 60-80% of costs)

- Premium SSD Managed Disks
- Standard HDD Managed Disks
- Standard SSD Managed Disks
- Ultra SSD Managed Disks

2. **VM Compute** (10-25% of costs)

- Virtual machine runtime costs
- CPU and memory consumption
- Excludes storage components

3. **CDN** (5-15% of costs)

- Content Delivery Network services
- Data transfer and caching

4. **Network/IP** (3-10% of costs)

- Virtual Network services
- Public IP addresses
- Network security groups

5. **Backup** (2-8% of costs)

- Recovery Services vault
- Backup storage and operations

6. **Load Balancer** (1-5% of costs)

- Standard and Basic load balancers
- Application gateways

7. **Other Storage** (<2% of costs)

- Blob storage, File storage
- Table storage, Queue storage

8. **Bandwidth** (<1% of costs)

- Data transfer charges
- Inter-region communication

9. **Key Vault** (<1% of costs)
- Secret management services
- Certificate operations

### Interactive Drill-Down System

#### Three-Level Analysis Hierarchy

1. **Resource Group Level**

- Total cost, machine count, usage hours
- High-level resource allocation view

2. **Machine Level**

- Individual machine costs and usage
- Service provider breakdown
- Related resource identification

3. **Category Level**
- Detailed cost breakdown by business category
- Service-level attribution
- Optimization recommendations

### Enhanced Machine Analysis

#### Related Resource Detection

- **Pattern Matching**: Identifies associated resources using naming conventions
- **Resource Relationships**: Links VMs with their disks, NICs, and other components
- **Complete Cost Attribution**: Ensures all machine-related costs are captured

#### Categories Include:

```
vm-name → Primary virtual machine
vm-name-disk → Managed disks
vm-name-nic → Network interfaces
vm-name_OsDisk → Operating system disks
vm-name-backup → Backup services
```

## Chart Calculations and Analysis

### 1. Executive Summary & Validation

#### Cost Reconciliation Engine

**Mathematical Foundation:**

```python
# Core reconciliation calculation
original_total = df['Cost'].sum()
categorized_total = classified_df['Cost'].sum()
difference = abs(original_total - categorized_total)
reconciliation_success = difference < 0.01

# Coverage calculation
categorized_cost = total_cost - uncategorized_cost
coverage_percentage = (categorized_cost / original_total) × 100
```

**Key Metrics Explained:**

- **Original Invoice Total**: Sum of all Cost column values from uploaded CSV
- **Categorized Total**: Sum of all costs after classification processing
- **Difference**: Mathematical variance between original and processed totals
- **Coverage Percentage**: (Categorized costs / Original total) × 100
- **Reconciliation Status**: ✅ Success if difference < $0.01, ❌ Failed otherwise

**Business Interpretation:**

- **100% Coverage**: All costs properly classified, complete financial accountability
- **95-99% Coverage**: Good classification, minor uncategorized items need review
- **<95% Coverage**: Significant classification gaps, immediate attention required

#### Executive Metrics Dashboard

**Calculation Method:**

```python
total_cost = df['Cost'].sum()
total_quantity = df['Quantity'].sum()
unique_resource_groups = df['ResourceGroup'].nunique()
unique_machines = df['ResourceName'].nunique()
```

**Metrics Meaning:**

- **Total Cost**: Complete Azure spending for the invoice period
- **Total Usage**: Sum of all quantity values (typically hours)
- **Top Category**: Highest cost category from 9-category classification
- **Resource Groups**: Number of distinct Azure resource groups
- **Services**: Number of unique Azure service providers (Microsoft.Compute, etc.)

### 2. Cost Category Analysis

#### Cost Category Pie Chart

**Calculation Method:**

```python
category_summary = classified_df.groupby('CostCategory').agg({
'Cost': 'sum',
'Quantity': 'sum'
}).round(4)

category_percentage = (category_summary['Cost'] / total_cost * 100).round(2)
```

**Chart Elements:**

- **Pie Segments**: Each represents one of 9 business categories
- **Percentage Labels**: Category cost as percentage of total invoice
- **Dollar Values**: Absolute cost amount for each category
- **Color Coding**: Consistent colors across all charts for category identification

**Business Interpretation:**

- **Dominant Categories**: Largest segments indicate primary cost drivers
- **Distribution Balance**: Even distribution suggests diversified infrastructure
- **Anomalies**: Unusually large percentages may indicate optimization opportunities

#### Cost Category Bar Chart (Detailed View)

**Calculation Method:**

```python
category_breakdown = classified_df.groupby('CostCategory').agg({
'Cost': ['sum', 'count', 'mean'],
'Quantity': 'sum'
})

record_count = category_breakdown['Cost']['count']
avg_cost = category_breakdown['Cost']['mean']
```

**Data Points Explained:**

- **Horizontal Bars**: Length represents total cost for each category
- **Cost Values**: Dollar amount displayed on each bar
- **Percentage Labels**: Category percentage of total costs
- **Record Count**: Number of invoice line items in each category (hover data)

**Key Insights:**

- **Bar Length Comparison**: Visual representation of spending distribution
- **High Record Count**: Many small transactions vs few large transactions
- **Cost Concentration**: Identifies which categories drive majority of spending

### 3. Service Provider Analysis

#### Service Provider Chart

**Calculation Method:**

```python
provider_summary = df.groupby('ConsumedService').agg({
'Cost': ['sum', 'count'],
'Quantity': 'sum'
}).round(4)

provider_percentage = (provider_summary['Cost'] / total_cost * 100).round(2)
```

**Chart Components:**

- **X-Axis**: Azure service providers (Microsoft.Compute, Microsoft.Storage, etc.)
- **Y-Axis**: Total cost in USD for each provider
- **Bar Height**: Proportional to spending on each service
- **Text Labels**: Dollar amounts displayed above bars

**Service Provider Breakdown:**

- **Microsoft.Compute**: Virtual machines, disks, compute resources
- **Microsoft.Network**: Virtual networks, load balancers, IP addresses
- **Microsoft.Storage**: Blob storage, file storage, queues
- **Microsoft.RecoveryServices**: Backup and disaster recovery
- **Microsoft.Cdn**: Content delivery network services

**Business Analysis:**

- **Compute Dominance**: High Microsoft.Compute costs indicate VM-heavy workloads
- **Storage Intensity**: High Microsoft.Storage suggests data-intensive applications
- **Network Costs**: Significant Microsoft.Network indicates complex networking requirements

### 4. Interactive Drill-Down Analysis

#### Resource Group Selection Metrics

**Calculation Method:**

```python
rg_data = df[df['ResourceGroup'] == selected_rg]
rg_cost = rg_data['Cost'].sum()
rg_machines = rg_data['ResourceName'].nunique()
rg_quantity = rg_data['Quantity'].sum()
```

**Metrics Explanation:**

- **Total Cost**: Sum of all costs within selected resource group
- **Machines**: Count of unique ResourceName values in the group
- **Total Usage**: Sum of quantity values (usually compute hours)

#### Machine Analysis Table

**Calculation Method:**

```python
machine_summary = rg_data.groupby('ResourceName').agg({
'Cost': 'sum',
'Quantity': 'sum',
'ConsumedService': lambda x: ', '.join(x.unique()),
'MeterCategory': lambda x: ', '.join(x.unique())
})

cost_percentage = (machine_summary['Cost'] / rg_cost * 100).round(2)
```

**Table Columns:**

- **Machine Name**: ResourceName from Azure invoice
- **Total Cost**: Sum of all costs for this specific machine
- **Total Usage**: Sum of quantity values for this machine
- **Services Used**: List of Azure services consumed by this machine
- **Meter Categories**: Types of Azure meters associated with this machine
- **Cost %**: This machine's percentage of total resource group costs

#### Individual Machine Analysis

**Calculation Method:**

```python
# Enhanced resource detection
machine_data = classified_df[classified_df['ResourceName'] == selected_machine]
related_data = classified_df[
(classified_df['ResourceName'].str.contains(selected_machine, case=False)) |
(classified_df['ResourceName'].str.startswith(selected_machine + '-')) |
(classified_df['ResourceName'].str.startswith(selected_machine + '_'))
]

combined_data = pd.concat([machine_data, related_data]).drop_duplicates()
```

**Machine Metrics:**

- **Total Cost**: Sum of costs for machine and all related resources
- **Total Usage**: Sum of quantity values across all related resources
- **Categories**: Number of cost categories this machine uses
- **Cost/Hour**: Total cost divided by total usage hours

**Related Resources Detection:**

- **Primary Resource**: Exact match of machine name
- **Associated Disks**: Resources with patterns like "vm-name-disk", "vm-name_OsDisk"
- **Network Components**: Resources like "vm-name-nic" (network interfaces)
- **Backup Resources**: Resources like "vm-name-backup"

#### Machine Cost Breakdown Charts

**Pie Chart (Cost Distribution):**

```python
category_breakdown = combined_data.groupby('CostCategory').agg({
'Cost': 'sum',
'Quantity': 'sum'
})

machine_percentage = (category_breakdown['Cost'] / machine_cost * 100).round(2)
```

**Elements:**

- **Pie Segments**: Proportional to category costs for this machine
- **Percentage Labels**: Category percentage of machine's total cost
- **Dollar Values**: Absolute cost for each category

**Bar Chart (Category Details):**

- **Horizontal Bars**: Cost amount for each category
- **Category Colors**: Consistent color scheme matching pie chart
- **Hover Data**: Additional details including quantity used

### 5. Resource Efficiency Analysis

#### Efficiency Metrics Chart

**Calculation Method:**

```python
efficiency_df = df.groupby('ResourceName').agg({
'Cost': 'sum',
'Quantity': 'sum'
})

efficiency_df['CostPerUnit'] = efficiency_df['Cost'] / efficiency_df['Quantity']
efficiency_df['EfficiencyScore'] = efficiency_df['Cost'] / efficiency_df['Quantity']
```

**Dual-Axis Chart Components:**

- **Left Y-Axis**: Total cost in USD (bars)
- **Right Y-Axis**: Cost per unit/hour (line)
- **X-Axis**: Resource names (top 15 by cost)
- **Bar Height**: Total spending on each resource
- **Line Points**: Efficiency score (cost per hour)

**Efficiency Interpretation:**

- **High Cost + Low Efficiency**: Expensive resources with high cost per hour
- **High Cost + High Efficiency**: Expensive but efficient resources
- **Low Cost + Low Efficiency**: Cheap resources but poor cost per hour ratio
- **Low Cost + High Efficiency**: Cost-effective resources

**Optimization Targets:**

- **High Efficiency Score**: Resources with cost per hour above average
- **Right-Sizing Candidates**: Resources with consistently high efficiency scores
- **Optimization Opportunities**: Resources showing poor cost per unit ratios

### 6. Traditional Resource Analysis

#### Cost by Resource Group Chart

**Calculation Method:**

```python
cost_by_rg = df.groupby('ResourceGroup')['Cost'].sum().sort_values(ascending=False)
```

**Chart Elements:**

- **X-Axis**: Resource group names
- **Y-Axis**: Total cost in USD
- **Bar Height**: Proportional to total spending per resource group
- **Text Labels**: Dollar amounts displayed above each bar

**Business Usage:**

- **Environment Comparison**: Compare prod, dev, test environment costs
- **Department Allocation**: Understand which teams/projects drive costs
- **Budget Attribution**: Assign costs to specific business units

#### Top Machines by Cost Chart

**Calculation Method:**

```python
cost_by_machine = df.groupby('ResourceName')['Cost'].sum().sort_values(ascending=False)
top_machines = cost_by_machine.head(Config.TOP_ITEMS_COUNT)
```

**Analysis Focus:**

- **Top N Machines**: Configurable from 5-50 most expensive resources
- **Cost Ranking**: Machines ordered by total spending
- **Optimization Targets**: Highest-cost machines for immediate attention

#### Cost vs Usage Comparison Chart

**Calculation Method:**

```python
agg_data = df.groupby('ResourceGroup').agg({
'Cost': 'sum',
'Quantity': 'sum'
})
```

**Dual-Axis Visualization:**

- **Primary Y-Axis (Bars)**: Total cost per resource group
- **Secondary Y-Axis (Line)**: Total usage hours per resource group
- **Correlation Analysis**: Relationship between cost and usage

**Business Insights:**

- **High Cost + Low Usage**: Potentially over-provisioned resources
- **Low Cost + High Usage**: Efficient resource utilization
- **Correlation Patterns**: Expected vs unexpected cost-usage relationships

### 7. Uncategorized Items Analysis

#### Uncategorized Cost Metrics

**Calculation Method:**

```python
uncategorized_items = classified_df[classified_df['CostCategory'] == 'Other']
uncategorized_cost = uncategorized_items['Cost'].sum()
uncategorized_percentage = (uncategorized_cost / total_cost * 100)
```

**Key Indicators:**

- **Uncategorized Cost**: Dollar amount not classified into business categories
- **Percentage of Total**: Uncategorized amount as percentage of total invoice
- **Number of Items**: Count of line items that couldn't be classified
- **Status Assessment**: ✅ Excellent (<1%), ⚠️ Good (1-5%), ❌ Needs Review (>5%)

#### Service Type Breakdown

**Calculation Method:**

```python
service_breakdown = uncategorized_items.groupby([
'ConsumedService', 'MeterCategory', 'MeterSubcategory'
]).agg({
'Cost': 'sum',
'ResourceName': 'count'
})
```

**Table Analysis:**

- **Service Provider**: Azure service generating uncategorized costs
- **Meter Category**: Type of Azure service meter
- **Meter Subcategory**: Specific service subtype
- **Total Cost**: Dollar amount for this service combination
- **Item Count**: Number of invoice lines for this service type
- **% of Uncategorized**: Percentage of uncategorized costs from this service

### 8. Enhanced Service Breakdown (Machine Level)

#### Comprehensive Service Analysis

**Calculation Method:**

```python
enhanced_breakdown = all_machine_data.groupby([
'ConsumedService', 'MeterCategory', 'MeterSubcategory'
]).agg({
'Cost': 'sum',
'Quantity': 'sum',
'ResourceName': lambda x: ', '.join(x.unique())
})
```

**Detailed Breakdown Components:**

- **Service Provider**: Azure service (Microsoft.Compute, Microsoft.Storage)
- **Meter Category**: Service category (Virtual Machines, Storage, Network)
- **Meter Subcategory**: Specific service type (Premium SSD, D2s v3, etc.)
- **Cost**: Total spending for this service type
- **Quantity**: Total usage amount (hours, GB, transactions)
- **Resource Names**: Which specific resources use this service
- **Suggested Category**: What business category this should be classified as

#### Storage Analysis for Machines

**Calculation Method:**

```python
storage_data = all_machine_data[
all_machine_data['MeterCategory'].str.contains('Storage|Disk', case=False)
]

storage_breakdown = storage_data.groupby(['MeterSubcategory', 'ResourceName']).agg({
'Cost': 'sum',
'Quantity': 'sum'
})

storage_percentage = (storage_cost / total_machine_cost * 100)
```

**Storage Metrics:**

- **Storage Type**: Specific disk type (Premium SSD, Standard HDD)
- **Resource Name**: Individual disk or storage resource
- **Cost**: Dollar amount for this storage component
- **Quantity**: Storage usage (typically GB-hours or disk-hours)
- **Storage Percentage**: Storage costs as percentage of total machine costs

**Optimization Insights:**

- **Disk Tier Analysis**: Premium vs Standard SSD usage
- **Storage Efficiency**: Cost per GB analysis
- **Right-Sizing Opportunities**: Over-provisioned storage identification

## Technical Implementation

### Required CSV Structure

```
Date,Cost,Quantity,ResourceGroup,ResourceName,ConsumedService,MeterCategory,MeterSubcategory
2024-01-01,45.67,720,prod-rg,web-server-01,Microsoft.Compute,Virtual Machines,D2s v3
2024-01-01,89.23,744,prod-rg,web-server-01-disk,Microsoft.Compute,Storage,Premium SSD Managed Disks
```

### Classification Logic Implementation

```python
# Managed Disks Classification
if MeterSubcategory in ['Premium SSD Managed Disks', 'Standard HDD Managed Disks']:
return 'Managed Disks'

# VM Compute Classification
if ConsumedService == 'Microsoft.Compute' and MeterSubcategory not in disk_categories:
return 'VM Compute'

# Network Classification
if MeterCategory == 'Virtual Network':
return 'Network/IP'
```

### Cost Reconciliation Engine

- **Mathematical Validation**: Original total = Sum of categorized costs
- **Coverage Tracking**: Percentage of costs successfully categorized
- **Gap Identification**: Uncategorized items analysis and recommendations
- **Error Detection**: Data quality issues and missing classifications

This comprehensive strategy document provides the technical foundation and business rationale for implementing Azure Invoice Analyzer Pro as a critical tool for Azure cost management and financial governance.