https://github.com/jonaskahn/azure-invoice
Simple app for simple life
- Host: GitHub
- URL: https://github.com/jonaskahn/azure-invoice
- Owner: jonaskahn
- License: apache-2.0
- Created: 2025-06-04T08:26:09.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-08T15:35:59.000Z (about 1 year ago)
- Last Synced: 2026-01-17T05:10:29.240Z (5 months ago)
- Language: Python
- Homepage: https://azure-invoice.streamlit.app
- Size: 293 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Azure Invoice Analyzer Pro - Complete Strategy Document
## Executive Summary
Azure Invoice Analyzer Pro is an advanced Streamlit application that transforms Azure billing data into actionable business intelligence. The system provides comprehensive cost analysis through 9 business categories, interactive drill-down capabilities, and complete financial reconciliation with PDF export functionality.
## Quick Chart Reference
### Chart Calculation Summary
| Chart Type | Calculation Method | Key Metrics |
| ----------------------------- | --------------------------------------------- | ----------------------------------- |
| **Cost Category Pie/Bar** | `df.groupby('CostCategory')['Cost'].sum()` | Category percentages of total cost |
| **Service Provider Analysis** | `df.groupby('ConsumedService')['Cost'].sum()` | Total cost per Azure service |
| **Resource Group Costs** | `df.groupby('ResourceGroup')['Cost'].sum()` | Total cost per resource group |
| **Machine Efficiency** | `Cost / Quantity` per resource | Cost per unit/hour analysis |
| **Interactive Drill-Down** | Multi-level grouping + pattern matching | Resource Group → Machine → Category |
| **Cost Reconciliation** | `abs(original_total - categorized_total)` | Financial validation & accuracy |
### Core Calculations
**Cost Reconciliation:**
```python
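original_total = df['Cost'].sum()
categorized_total = classified_df['Cost'].sum()
# Rows left in the 'Other' bucket count as uncategorized
uncategorized_cost = classified_df.loc[classified_df['CostCategory'] == 'Other', 'Cost'].sum()
coverage = (original_total - uncategorized_cost) / original_total * 100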
```
**Category Classification:**
```python
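# 9 business categories based on Azure service patterns (two rules shown)
def classify_row(row):  # illustrative function name
    disk_category = row['MeterSubcategory'] in ['Premium SSD Managed Disks']
    if disk_category:
        return 'Managed Disks'
    elif row['ConsumedService'] == 'Microsoft.Compute':
        return 'VM Compute'
    return 'Other'  # anything unmatched stays uncategorized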
```
**Machine Cost Breakdown:**
```python
# Includes related resources (disks, NICs, etc.)
is_machine = df['ResourceName'] == machine_name
is_related = df['ResourceName'].str.contains(machine_name, regex=False)
# One combined mask counts each invoice row exactly once
combined_data = df[is_machine | is_related]
total_cost = combined_data['Cost'].sum()
```
### Chart Types and Business Value
| Chart | Purpose | Key Insight | Optimization Action |
| ----------------------------- | -------------------------------------- | ----------------------------------- | ----------------------------------------- |
| **Cost Category Pie** | Budget allocation by business function | Which category consumes most budget | Focus optimization on largest segments |
| **Service Provider Bar** | Azure service spending distribution | Which Microsoft services cost most | Negotiate better rates, optimize usage |
| **Resource Group Costs** | Environment/team cost allocation | Which groups spend most | Budget allocation, cost accountability |
| **Machine Efficiency** | Resource utilization analysis | Cost per hour performance | Right-size over/under-utilized resources |
| **Drill-Down Analysis** | Granular cost investigation | Machine-level cost breakdown | Target specific machines for optimization |
| **Reconciliation Validation** | Financial accuracy verification | Data integrity and completeness | Ensure 100% cost accountability |
## Core Architecture
### Data Processing Pipeline
1. **Data Ingestion**: CSV file upload with validation and error handling
2. **Cost Classification**: 9-category business intelligence classification system
3. **Cost Reconciliation**: Mathematical validation ensuring 100% cost accountability
4. **Interactive Analysis**: Multi-level drill-down from resource groups to individual machines
5. **Export Generation**: Professional PDF reports and individual chart downloads
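The ingestion step can be sketched with the standard library alone. This is a minimal sketch, assuming the column names from the required CSV structure in this document; `validate_invoice_csv` and its error messages are hypothetical and not taken from the application.

```python
import csv
import io

# Column names from the "Required CSV Structure" section of this README.
REQUIRED_COLUMNS = [
    "Date", "Cost", "Quantity", "ResourceGroup", "ResourceName",
    "ConsumedService", "MeterCategory", "MeterSubcategory",
]


def validate_invoice_csv(text):
    """Return (rows, errors) for CSV text; hypothetical helper for illustration."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [], ["missing columns: " + ", ".join(missing)]

    rows, errors = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            row["Cost"] = float(row["Cost"])
            row["Quantity"] = float(row["Quantity"])
        except (TypeError, ValueError):
            errors.append(f"line {line_no}: Cost and Quantity must be numeric")
            continue
        rows.append(row)
    return rows, errors
```

Collecting errors per line, rather than failing on the first bad row, lets an upload report every problem at once.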
### Classification Engine
#### 9 Business Cost Categories
1. **Managed Disks** (typically 60-80% of costs)
- Premium SSD Managed Disks
- Standard HDD Managed Disks
- Standard SSD Managed Disks
- Ultra SSD Managed Disks
2. **VM Compute** (10-25% of costs)
- Virtual machine runtime costs
- CPU and memory consumption
- Excludes storage components
3. **CDN** (5-15% of costs)
- Content Delivery Network services
- Data transfer and caching
4. **Network/IP** (3-10% of costs)
- Virtual Network services
- Public IP addresses
- Network security groups
5. **Backup** (2-8% of costs)
- Recovery Services vault
- Backup storage and operations
6. **Load Balancer** (1-5% of costs)
- Standard and Basic load balancers
- Application gateways
7. **Other Storage** (<2% of costs)
- Blob storage, File storage
- Table storage, Queue storage
8. **Bandwidth** (<1% of costs)
- Data transfer charges
- Inter-region communication
9. **Key Vault** (<1% of costs)
- Secret management services
- Certificate operations
### Interactive Drill-Down System
#### Three-Level Analysis Hierarchy
1. **Resource Group Level**
- Total cost, machine count, usage hours
- High-level resource allocation view
2. **Machine Level**
- Individual machine costs and usage
- Service provider breakdown
- Related resource identification
3. **Category Level**
- Detailed cost breakdown by business category
- Service-level attribution
- Optimization recommendations
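The three levels above can be pictured as a nested mapping from resource group to machine to category. This is a minimal sketch, assuming rows are plain dictionaries keyed by the column names used in this document; `build_drilldown` is a hypothetical helper, not the application's code.

```python
from collections import defaultdict


def build_drilldown(rows):
    """Sum Cost into tree[resource_group][machine][category]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for row in rows:
        group = row["ResourceGroup"]
        machine = row["ResourceName"]
        category = row["CostCategory"]
        tree[group][machine][category] += row["Cost"]
    return tree


sample_rows = [
    {"ResourceGroup": "prod-rg", "ResourceName": "web-server-01",
     "CostCategory": "VM Compute", "Cost": 45.0},
    {"ResourceGroup": "prod-rg", "ResourceName": "web-server-01",
     "CostCategory": "VM Compute", "Cost": 5.0},
    {"ResourceGroup": "prod-rg", "ResourceName": "web-server-01-disk",
     "CostCategory": "Managed Disks", "Cost": 89.0},
]
drilldown = build_drilldown(sample_rows)
```

Each level of the mapping corresponds to one selection in the drill-down: pick a resource group, then a machine, then read off its category totals. The sketch groups on the exact `ResourceName` and does not fold related resources into their parent machine.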
### Enhanced Machine Analysis
#### Related Resource Detection
- **Pattern Matching**: Identifies associated resources using naming conventions
- **Resource Relationships**: Links VMs with their disks, NICs, and other components
- **Complete Cost Attribution**: Ensures all machine-related costs are captured
#### Naming Patterns Include:
```
vm-name → Primary virtual machine
vm-name-disk → Managed disks
vm-name-nic → Network interfaces
vm-name_OsDisk → Operating system disks
vm-name-backup → Backup services
```
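The naming conventions above can be checked with a small predicate. This is a minimal sketch, assuming related resources always start with the machine name followed by `-` or `_`; `is_related` is a hypothetical helper, and it is deliberately stricter than a plain substring match.

```python
def is_related(resource_name, machine_name):
    """True if resource_name is the machine or follows its naming convention."""
    resource = resource_name.lower()
    machine = machine_name.lower()
    if resource == machine:
        return True
    # Require a separator after the machine name so that "web-server-01"
    # does not also claim resources of "web-server-010".
    return resource.startswith((machine + "-", machine + "_"))


names = [
    "web-server-01", "web-server-01-disk", "web-server-01_OsDisk",
    "web-server-010", "db-01-nic",
]
related = [n for n in names if is_related(n, "web-server-01")]
```

Here `related` keeps the machine, its disk, and its OS disk, and leaves out `web-server-010`, which a substring match would have included.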
## Chart Calculations and Analysis
### 1. Executive Summary & Validation
#### Cost Reconciliation Engine
**Mathematical Foundation:**
```python
# Core reconciliation calculation
original_total = df['Cost'].sum()
categorized_total = classified_df['Cost'].sum()
difference = abs(original_total - categorized_total)
reconciliation_success = difference < 0.01
# Coverage calculation ('Other' holds the uncategorized rows)
uncategorized_cost = classified_df.loc[classified_df['CostCategory'] == 'Other', 'Cost'].sum()
categorized_cost = original_total - uncategorized_cost
coverage_percentage = (categorized_cost / original_total) * 100
```
**Key Metrics Explained:**
- **Original Invoice Total**: Sum of all Cost column values from uploaded CSV
- **Categorized Total**: Sum of all costs after classification processing
- **Difference**: Mathematical variance between original and processed totals
- **Coverage Percentage**: (Categorized costs / Original total) × 100
- **Reconciliation Status**: ✅ Success if difference < $0.01, ❌ Failed otherwise
**Business Interpretation:**
- **100% Coverage**: All costs properly classified, complete financial accountability
- **95-99% Coverage**: Good classification, minor uncategorized items need review
- **<95% Coverage**: Significant classification gaps, immediate attention required
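The coverage tiers above translate into a simple threshold check. This is a minimal sketch; `coverage_status` and its label strings are hypothetical, and it assumes that coverage between 99% and 100% falls into the middle tier, which the list above leaves unspecified.

```python
def coverage_status(categorized_cost, original_total):
    """Return (coverage_percentage, label) using the tiers listed above."""
    if original_total == 0:
        # Nothing was billed, so there is nothing to classify.
        return 0.0, "No costs to reconcile"
    coverage = categorized_cost / original_total * 100
    if coverage >= 100:
        label = "Complete"
    elif coverage >= 95:
        label = "Good, review uncategorized items"
    else:
        label = "Significant gaps"
    return coverage, label
```

For example, `coverage_status(990.0, 1000.0)` lands in the middle tier, while `coverage_status(800.0, 1000.0)` reports significant gaps.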
#### Executive Metrics Dashboard
**Calculation Method:**
```python
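total_cost = df['Cost'].sum()
total_quantity = df['Quantity'].sum()
unique_resource_groups = df['ResourceGroup'].nunique()
unique_machines = df['ResourceName'].nunique()
unique_services = df['ConsumedService'].nunique()
top_category = classified_df.groupby('CostCategory')['Cost'].sum().idxmax()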
```
**Metrics Meaning:**
- **Total Cost**: Complete Azure spending for the invoice period
- **Total Usage**: Sum of all quantity values (typically hours)
- **Top Category**: Highest cost category from 9-category classification
- **Resource Groups**: Number of distinct Azure resource groups
- **Machines**: Number of distinct ResourceName values
- **Services**: Number of unique Azure service providers (Microsoft.Compute, etc.)
### 2. Cost Category Analysis
#### Cost Category Pie Chart
**Calculation Method:**
```python
category_summary = classified_df.groupby('CostCategory').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
}).round(4)
category_percentage = (category_summary['Cost'] / total_cost * 100).round(2)
```
**Chart Elements:**
- **Pie Segments**: Each represents one of 9 business categories
- **Percentage Labels**: Category cost as percentage of total invoice
- **Dollar Values**: Absolute cost amount for each category
- **Color Coding**: Consistent colors across all charts for category identification
**Business Interpretation:**
- **Dominant Categories**: Largest segments indicate primary cost drivers
- **Distribution Balance**: Even distribution suggests diversified infrastructure
- **Anomalies**: Unusually large percentages may indicate optimization opportunities
#### Cost Category Bar Chart (Detailed View)
**Calculation Method:**
```python
category_breakdown = classified_df.groupby('CostCategory').agg({
    'Cost': ['sum', 'count', 'mean'],
    'Quantity': 'sum'
})
record_count = category_breakdown['Cost']['count']
avg_cost = category_breakdown['Cost']['mean']
```
**Data Points Explained:**
- **Horizontal Bars**: Length represents total cost for each category
- **Cost Values**: Dollar amount displayed on each bar
- **Percentage Labels**: Category percentage of total costs
- **Record Count**: Number of invoice line items in each category (hover data)
**Key Insights:**
- **Bar Length Comparison**: Visual representation of spending distribution
- **High Record Count**: Many small transactions vs few large transactions
- **Cost Concentration**: Identifies which categories drive majority of spending
### 3. Service Provider Analysis
#### Service Provider Chart
**Calculation Method:**
```python
provider_summary = df.groupby('ConsumedService').agg({
    'Cost': ['sum', 'count'],
    'Quantity': 'sum'
}).round(4)
# Select the summed cost column; provider_summary['Cost'] alone also holds the count
provider_percentage = (provider_summary[('Cost', 'sum')] / total_cost * 100).round(2)
```
**Chart Components:**
- **X-Axis**: Azure service providers (Microsoft.Compute, Microsoft.Storage, etc.)
- **Y-Axis**: Total cost in USD for each provider
- **Bar Height**: Proportional to spending on each service
- **Text Labels**: Dollar amounts displayed above bars
**Service Provider Breakdown:**
- **Microsoft.Compute**: Virtual machines, disks, compute resources
- **Microsoft.Network**: Virtual networks, load balancers, IP addresses
- **Microsoft.Storage**: Blob storage, file storage, queues
- **Microsoft.RecoveryServices**: Backup and disaster recovery
- **Microsoft.Cdn**: Content delivery network services
**Business Analysis:**
- **Compute Dominance**: High Microsoft.Compute costs indicate VM-heavy workloads
- **Storage Intensity**: High Microsoft.Storage suggests data-intensive applications
- **Network Costs**: Significant Microsoft.Network indicates complex networking requirements
### 4. Interactive Drill-Down Analysis
#### Resource Group Selection Metrics
**Calculation Method:**
```python
rg_data = df[df['ResourceGroup'] == selected_rg]
rg_cost = rg_data['Cost'].sum()
rg_machines = rg_data['ResourceName'].nunique()
rg_quantity = rg_data['Quantity'].sum()
```
**Metrics Explanation:**
- **Total Cost**: Sum of all costs within selected resource group
- **Machines**: Count of unique ResourceName values in the group
- **Total Usage**: Sum of quantity values (usually compute hours)
#### Machine Analysis Table
**Calculation Method:**
```python
machine_summary = rg_data.groupby('ResourceName').agg({
    'Cost': 'sum',
    'Quantity': 'sum',
    'ConsumedService': lambda x: ', '.join(x.unique()),
    'MeterCategory': lambda x: ', '.join(x.unique())
})
cost_percentage = (machine_summary['Cost'] / rg_cost * 100).round(2)
```
**Table Columns:**
- **Machine Name**: ResourceName from Azure invoice
- **Total Cost**: Sum of all costs for this specific machine
- **Total Usage**: Sum of quantity values for this machine
- **Services Used**: List of Azure services consumed by this machine
- **Meter Categories**: Types of Azure meters associated with this machine
- **Cost %**: This machine's percentage of total resource group costs
#### Individual Machine Analysis
**Calculation Method:**
```python
# Enhanced resource detection
name = classified_df['ResourceName']
is_machine = name == selected_machine
is_related = (
    name.str.contains(selected_machine, case=False, regex=False)
    | name.str.startswith(selected_machine + '-')
    | name.str.startswith(selected_machine + '_')
)
machine_data = classified_df[is_machine]
related_data = classified_df[is_related]
# Combine with one boolean mask so each invoice row is counted exactly once,
# without dropping distinct line items that happen to have identical values
combined_data = classified_df[is_machine | is_related]
```
**Machine Metrics:**
- **Total Cost**: Sum of costs for machine and all related resources
- **Total Usage**: Sum of quantity values across all related resources
- **Categories**: Number of cost categories this machine uses
- **Cost/Hour**: Total cost divided by total usage hours
**Related Resources Detection:**
- **Primary Resource**: Exact match of machine name
- **Associated Disks**: Resources with patterns like "vm-name-disk", "vm-name_OsDisk"
- **Network Components**: Resources like "vm-name-nic" (network interfaces)
- **Backup Resources**: Resources like "vm-name-backup"
#### Machine Cost Breakdown Charts
**Pie Chart (Cost Distribution):**
```python
# Total for the machine and its related resources
machine_cost = combined_data['Cost'].sum()
category_breakdown = combined_data.groupby('CostCategory').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})
machine_percentage = (category_breakdown['Cost'] / machine_cost * 100).round(2)
```
**Elements:**
- **Pie Segments**: Proportional to category costs for this machine
- **Percentage Labels**: Category percentage of machine's total cost
- **Dollar Values**: Absolute cost for each category
**Bar Chart (Category Details):**
- **Horizontal Bars**: Cost amount for each category
- **Category Colors**: Consistent color scheme matching pie chart
- **Hover Data**: Additional details including quantity used
### 5. Resource Efficiency Analysis
#### Efficiency Metrics Chart
**Calculation Method:**
```python
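efficiency_df = df.groupby('ResourceName').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})
# Treat zero usage as missing so the ratio is NaN rather than inf
quantity = efficiency_df['Quantity'].replace(0, float('nan'))
efficiency_df['CostPerUnit'] = efficiency_df['Cost'] / quantity
# The efficiency score is the same cost-per-unit ratio under another name
efficiency_df['EfficiencyScore'] = efficiency_df['CostPerUnit']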
```
**Dual-Axis Chart Components:**
- **Left Y-Axis**: Total cost in USD (bars)
- **Right Y-Axis**: Cost per unit/hour (line)
- **X-Axis**: Resource names (top 15 by cost)
- **Bar Height**: Total spending on each resource
- **Line Points**: Efficiency score (cost per hour)
**Efficiency Interpretation:**
The efficiency score is cost per unit, so a higher score means a less efficient resource.
- **High Cost + High Score**: Expensive resources that also cost a lot per hour
- **High Cost + Low Score**: Expensive in total but cheap per hour; the cost comes from heavy usage
- **Low Cost + High Score**: Cheap in total but with a poor cost per hour ratio
- **Low Cost + Low Score**: Cost-effective resources
**Optimization Targets:**
- **High Efficiency Score**: Resources with cost per hour above average
- **Right-Sizing Candidates**: Resources whose efficiency score stays high across invoices
- **Optimization Opportunities**: Resources showing poor cost per unit ratios
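The "cost per hour above average" target can be sketched without pandas. This is a minimal sketch, assuming per-resource totals are already available as `(cost, quantity)` pairs; `above_average_cost_per_unit` is a hypothetical helper, and the average is taken across resources rather than weighted by usage.

```python
def above_average_cost_per_unit(resources):
    """resources: dict of name -> (cost, quantity). Returns names to review."""
    rates = {
        name: cost / quantity
        for name, (cost, quantity) in resources.items()
        if quantity > 0  # skip zero-usage resources to avoid division by zero
    }
    if not rates:
        return []
    average = sum(rates.values()) / len(rates)
    flagged = [name for name, rate in rates.items() if rate > average]
    # Most expensive per unit first.
    return sorted(flagged, key=lambda name: rates[name], reverse=True)
```

Resources with zero recorded usage are skipped here; they may deserve a separate look, since they incur cost without usage.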
### 6. Traditional Resource Analysis
#### Cost by Resource Group Chart
**Calculation Method:**
```python
cost_by_rg = df.groupby('ResourceGroup')['Cost'].sum().sort_values(ascending=False)
```
**Chart Elements:**
- **X-Axis**: Resource group names
- **Y-Axis**: Total cost in USD
- **Bar Height**: Proportional to total spending per resource group
- **Text Labels**: Dollar amounts displayed above each bar
**Business Usage:**
- **Environment Comparison**: Compare prod, dev, test environment costs
- **Department Allocation**: Understand which teams/projects drive costs
- **Budget Attribution**: Assign costs to specific business units
#### Top Machines by Cost Chart
**Calculation Method:**
```python
cost_by_machine = df.groupby('ResourceName')['Cost'].sum().sort_values(ascending=False)
top_machines = cost_by_machine.head(Config.TOP_ITEMS_COUNT)
```
**Analysis Focus:**
- **Top N Machines**: Configurable from 5-50 most expensive resources
- **Cost Ranking**: Machines ordered by total spending
- **Optimization Targets**: Highest-cost machines for immediate attention
#### Cost vs Usage Comparison Chart
**Calculation Method:**
```python
agg_data = df.groupby('ResourceGroup').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})
```
**Dual-Axis Visualization:**
- **Primary Y-Axis (Bars)**: Total cost per resource group
- **Secondary Y-Axis (Line)**: Total usage hours per resource group
- **Correlation Analysis**: Relationship between cost and usage
**Business Insights:**
- **High Cost + Low Usage**: Potentially over-provisioned resources
- **Low Cost + High Usage**: Efficient resource utilization
- **Correlation Patterns**: Expected vs unexpected cost-usage relationships
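The cost and usage combinations above need a definition of "high" and "low". This is a minimal sketch, assuming a median split across resource groups, which is one possible choice and not something this document specifies; `cost_usage_quadrants` is a hypothetical helper.

```python
from statistics import median


def cost_usage_quadrants(groups):
    """groups: dict of name -> (cost, usage). Label each group by median split."""
    cost_mid = median(cost for cost, _ in groups.values())
    usage_mid = median(usage for _, usage in groups.values())
    labels = {}
    for name, (cost, usage) in groups.items():
        cost_level = "High Cost" if cost > cost_mid else "Low Cost"
        usage_level = "High Usage" if usage > usage_mid else "Low Usage"
        labels[name] = f"{cost_level} + {usage_level}"
    return labels
```

A group labelled "High Cost + Low Usage" matches the over-provisioning pattern described above. Groups sitting exactly on a median are treated as "low" in this sketch.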
### 7. Uncategorized Items Analysis
#### Uncategorized Cost Metrics
**Calculation Method:**
```python
uncategorized_items = classified_df[classified_df['CostCategory'] == 'Other']
uncategorized_cost = uncategorized_items['Cost'].sum()
uncategorized_percentage = (uncategorized_cost / total_cost * 100)
```
**Key Indicators:**
- **Uncategorized Cost**: Dollar amount not classified into business categories
- **Percentage of Total**: Uncategorized amount as percentage of total invoice
- **Number of Items**: Count of line items that couldn't be classified
- **Status Assessment**: ✅ Excellent (<1%), ⚠️ Good (1-5%), ❌ Needs Review (>5%)
#### Service Type Breakdown
**Calculation Method:**
```python
service_breakdown = uncategorized_items.groupby([
    'ConsumedService', 'MeterCategory', 'MeterSubcategory'
]).agg({
    'Cost': 'sum',
    'ResourceName': 'count'
})
```
**Table Analysis:**
- **Service Provider**: Azure service generating uncategorized costs
- **Meter Category**: Type of Azure service meter
- **Meter Subcategory**: Specific service subtype
- **Total Cost**: Dollar amount for this service combination
- **Item Count**: Number of invoice lines for this service type
- **% of Uncategorized**: Percentage of uncategorized costs from this service
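The "% of Uncategorized" column is not part of the aggregation shown above. This is a minimal sketch of how it could be derived, assuming uncategorized items arrive as `(service, meter_category, meter_subcategory, cost)` tuples; `uncategorized_shares` is a hypothetical helper and the sample values are invented for illustration.

```python
from collections import defaultdict


def uncategorized_shares(items):
    """Return rows of (service, category, subcategory, cost, count, share_pct)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for service, category, subcategory, cost in items:
        key = (service, category, subcategory)
        totals[key] += cost
        counts[key] += 1
    grand_total = sum(totals.values())
    rows = []
    # Largest contributors first.
    for key, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        share = cost / grand_total * 100 if grand_total else 0.0
        rows.append((*key, cost, counts[key], round(share, 2)))
    return rows


sample_items = [
    ("Microsoft.Sql", "SQL Database", "Single", 30.0),
    ("Microsoft.Sql", "SQL Database", "Single", 30.0),
    ("Microsoft.Web", "App Service", "Basic", 40.0),
]
share_rows = uncategorized_shares(sample_items)
```

Sorting by cost puts the service combinations most worth a new classification rule at the top of the table.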
### 8. Enhanced Service Breakdown (Machine Level)
#### Comprehensive Service Analysis
**Calculation Method:**
```python
enhanced_breakdown = all_machine_data.groupby([
    'ConsumedService', 'MeterCategory', 'MeterSubcategory'
]).agg({
    'Cost': 'sum',
    'Quantity': 'sum',
    'ResourceName': lambda x: ', '.join(x.unique())
})
```
**Detailed Breakdown Components:**
- **Service Provider**: Azure service (Microsoft.Compute, Microsoft.Storage)
- **Meter Category**: Service category (Virtual Machines, Storage, Network)
- **Meter Subcategory**: Specific service type (Premium SSD, D2s v3, etc.)
- **Cost**: Total spending for this service type
- **Quantity**: Total usage amount (hours, GB, transactions)
- **Resource Names**: Which specific resources use this service
- **Suggested Category**: What business category this should be classified as
#### Storage Analysis for Machines
**Calculation Method:**
```python
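storage_data = all_machine_data[
    all_machine_data['MeterCategory'].str.contains('Storage|Disk', case=False, na=False)
]
storage_breakdown = storage_data.groupby(['MeterSubcategory', 'ResourceName']).agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})
# Storage share of everything attributed to the machine
storage_cost = storage_data['Cost'].sum()
total_machine_cost = all_machine_data['Cost'].sum()
storage_percentage = (storage_cost / total_machine_cost * 100)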
```
**Storage Metrics:**
- **Storage Type**: Specific disk type (Premium SSD, Standard HDD)
- **Resource Name**: Individual disk or storage resource
- **Cost**: Dollar amount for this storage component
- **Quantity**: Storage usage (typically GB-hours or disk-hours)
- **Storage Percentage**: Storage costs as percentage of total machine costs
**Optimization Insights:**
- **Disk Tier Analysis**: Premium vs Standard SSD usage
- **Storage Efficiency**: Cost per GB analysis
- **Right-Sizing Opportunities**: Over-provisioned storage identification
## Technical Implementation
### Required CSV Structure
```
Date,Cost,Quantity,ResourceGroup,ResourceName,ConsumedService,MeterCategory,MeterSubcategory
2024-01-01,45.67,720,prod-rg,web-server-01,Microsoft.Compute,Virtual Machines,D2s v3
2024-01-01,89.23,744,prod-rg,web-server-01-disk,Microsoft.Compute,Storage,Premium SSD Managed Disks
```
### Classification Logic Implementation
```python
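# Disk subcategories listed under "Managed Disks" in the category list
disk_categories = [
    'Premium SSD Managed Disks', 'Standard HDD Managed Disks',
    'Standard SSD Managed Disks', 'Ultra SSD Managed Disks',
]

def classify_row(row):  # illustrative function name
    # Managed Disks Classification
    if row['MeterSubcategory'] in disk_categories:
        return 'Managed Disks'
    # VM Compute Classification
    if row['ConsumedService'] == 'Microsoft.Compute' and row['MeterSubcategory'] not in disk_categories:
        return 'VM Compute'
    # Network Classification
    if row['MeterCategory'] == 'Virtual Network':
        return 'Network/IP'
    # Anything unmatched is reported as uncategorized
    return 'Other'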
```
### Cost Reconciliation Engine
- **Mathematical Validation**: Original total = Sum of categorized costs
- **Coverage Tracking**: Percentage of costs successfully categorized
- **Gap Identification**: Uncategorized items analysis and recommendations
- **Error Detection**: Data quality issues and missing classifications
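The validation and gap checks above can be combined into one result. This is a minimal sketch, assuming the $0.01 tolerance stated earlier and the `'Other'` bucket for uncategorized costs; `reconcile` and the keys of its result are hypothetical. It uses `math.fsum` so that summing many small line items does not accumulate floating-point error.

```python
import math


def reconcile(line_costs, category_totals, tolerance=0.01):
    """Compare the invoice total with the sum of per-category totals."""
    original_total = math.fsum(line_costs)
    categorized_total = math.fsum(category_totals.values())
    difference = abs(original_total - categorized_total)
    return {
        "original_total": original_total,
        "categorized_total": categorized_total,
        "difference": difference,
        "reconciled": difference < tolerance,
        # Costs in 'Other' are accounted for but still need a category.
        "uncategorized_cost": category_totals.get("Other", 0.0),
    }
```

A result with `reconciled` set to `False` means some line items were lost or double counted during classification, which is a different problem from a large `uncategorized_cost`.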
This comprehensive strategy document provides the technical foundation and business rationale for implementing Azure Invoice Analyzer Pro as a critical tool for Azure cost management and financial governance.