https://github.com/microsoftcloudessentials-learninghub/pdfs-invoice-processing-fapp-openframework

Example of how to extract PDFs from an Azure Storage Account, process them using Azure resources and an Open Framework, and store the results in Cosmos DB for further analysis.

Host: GitHub
URL: https://github.com/microsoftcloudessentials-learninghub/pdfs-invoice-processing-fapp-openframework
Owner: MicrosoftCloudEssentials-LearningHub
License: mit
Created: 2025-05-19T19:46:52.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-10-29T18:35:29.000Z (4 months ago)
Last Synced: 2025-10-29T20:45:36.839Z (4 months ago)
Topics: ai-use-cases-strategy, automated-extraction, cosmos-db, full-code, function-app, open-frameworks, pdf-invoice, storage-account
Language: HCL
Homepage:
Size: 120 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Demo: Automated PDF Invoice Processing with Open Framework (full-code approach)

`Azure Storage + Function App + Open Framework + Cosmos DB`

Costa Rica

[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/)
[brown9804](https://github.com/brown9804)

Last updated: 2025-09-17

----------

> [!IMPORTANT]
> This example is based on a `public network site and is intended for demonstration purposes only`. It showcases how several Azure resources can work together to achieve the desired result. Consider the section below about [Important Considerations for Production Environment](#important-considerations-for-production-environment). Please note that `these demos are intended as a guide and are based on my personal experiences. For official guidance, support, or more detailed information, please refer to Microsoft's official documentation or contact Microsoft directly`: [Microsoft Sales and Support](https://support.microsoft.com/contactus?ContactUsExperienceEntryPointAssetId=S.HP.SMC-HOME)

List of References (Click to expand)

- [Power Apps pricing](https://www.microsoft.com/en-us/power-platform/products/power-apps/)
- [Create a Fabric data agent (preview)](https://learn.microsoft.com/en-us/fabric/data-science/how-to-create-data-agent) `enables users to interact with data stored in lakehouses, warehouses, Power BI semantic models, and KQL databases using natural language queries`. Key prerequisites include having a paid Fabric capacity, enabling specific tenant settings, and ensuring data sources are accessible.
  - To create a data agent, users navigate to their workspace, select the data agent option, and configure it by adding `up to five data sources`. Users can ask questions in plain English, and the system translates these into structured queries (SQL, DAX, KQL) to retrieve data. The document outlines the process for creating, configuring, and sharing the data agent, including providing instructions and example queries to enhance performance.
  - The data agent operates under the user’s Microsoft Entra ID permissions, ensuring secure access to data. `It does not support complex reasoning or advanced analytics, focusing instead on retrieving structured data based on user queries`. Users can publish their data agent for colleagues to use after testing its performance. This can be extended with [Extend your agent with Model Context Protocol](https://learn.microsoft.com/en-us/microsoft-copilot-studio/agent-extend-action-mcp), which allows users to connect to existing knowledge servers and data sources, providing access to resources, tools, and predefined prompts for specific tasks.

- [Document Processor](https://learn.microsoft.com/en-us/microsoft-copilot-studio/template-managed-document-processor) managed agent in Microsoft Copilot Studio. `E2E solution for document processing, including extraction, validation, human monitoring, and exporting to downstream applications. Users can upload a sample document and configure extraction without needing to label data or train custom models. The agent informs users of the processing status and allows for manual verification of extracted data in the Validation Station`. Key prerequisites include licenses for Copilot Studio and Power Platform, enabling the Power Apps component framework, and specific security roles and permissions. Limitations on document processing include file size (less than 25 MB), file types (PNG, JPG, JPEG, PDF), and a maximum of 50 pages. Users can set up the agent by creating connections with required services, configuring data fields for extraction, and creating validation rules. The agent can notify reviewers when validation fails, and users can interact with the agent via Microsoft Teams. Here is more about `setting up the agent, including uploading sample documents, defining validation rules, and selecting document sources.` [Use an autonomous agent in Copilot Studio for document processing](https://learn.microsoft.com/en-us/power-platform/architecture/reference-architectures/document-processing-agent)

- [Solution Accelerator for AI Document Processor (ADP)](https://github.com/azure/ai-document-processor) - AI Factory

| **Category** | **Details** |
|---|---|
| **Purpose** | Automate document processing using Azure services and LLMs. Extracts data from files (PDF, Word, MP3), processes via Azure OpenAI, and outputs structured insights (JSON/CSV) to blob storage. |
| **Infrastructure Provisioning** | Uses Bicep templates to deploy all required Azure resources. Automates setup of networking, identity, and access controls using RBAC and managed identities to ensure secure and scalable deployment. |
| **Main Azure Resources** | - **Azure Function App**: Hosts the orchestrator and activity functions using Durable Functions to manage the document processing workflow.<br>- **Azure Storage Account**: Stores input documents (bronze container) and output results (gold container). Also holds prompt configurations if UI is not deployed.<br>- **Azure Static Web App**: Provides a user-friendly interface for uploading files, editing prompts, and triggering workflows.<br>- **Azure OpenAI**: Processes extracted text using LLMs to generate structured summaries or insights.<br>- **Azure Cognitive Services**: Specifically uses Document Intelligence for OCR and text extraction from uploaded files.<br>- **Cosmos DB**: Stores prompt configurations when the frontend UI is enabled, allowing dynamic updates from the web interface.<br>- **Key Vault**: Securely stores secrets, keys, and credentials used by the Function App and other services.<br>- **Application Insights**: Enables monitoring, logging, and diagnostics for the Function App and other components.<br>- **App Service Plan**: Provides the compute resources for running the Function App. |
| **Pipeline Components** | - `function_app.py`: Main orchestrator using Durable Functions chaining pattern.<br>- `activities/runDocIntel.py`: Extracts text from documents using Azure Document Intelligence.<br>- `activities/callAoai.py`: Sends extracted text and prompt to Azure OpenAI and receives structured JSON.<br>- `activities/writeToBlob.py`: Writes the final output to the gold container in blob storage. |
| **Data Flow** | 1. Upload document to bronze container<br>2. OCR via Document Intelligence<br>3. Send extracted text + prompt to Azure OpenAI<br>4. Receive structured JSON<br>5. Write output to gold container |
| **Frontend UI (Optional)** | - Allows business users to upload files and edit prompts<br>- Prompts are stored in Cosmos DB<br>- Users can trigger workflows and view job status directly from the interface |
| **Prompt Configuration** | - Without UI: Prompts are stored in `prompts.yaml` file in blob storage<br>- With UI: Prompts are stored and managed in Cosmos DB via the web interface |
| **Deployment Steps** | 1. Fork and clone the GitHub repo<br>2. Run `az login`, `azd auth login`, `azd up`<br>3. Provide User Principal ID for RBAC setup<br>4. Choose whether to deploy frontend UI |
| **Execution (Without UI)** | - Update `prompts.yaml` with desired instructions<br>- Send POST request to `http_start` endpoint with blob metadata<br>- Monitor pipeline execution via Log Stream |
| **Execution (With UI)** | - Upload files via web interface<br>- Edit system and user prompts<br>- Click "Start Workflow" to trigger pipeline<br>- View success/failure messages and job status |
| **Monitoring & Troubleshooting** | - Use Log Stream for real-time logs<br>- Use Log Analytics Workspace to query exceptions and performance metrics<br>- Use SSH console in Development Tools to inspect deployment logs and file system |
| **Pre-Requisites** | - Azure CLI<br>- Azure Developer CLI (azd)<br>- Node.js 18.x.x<br>- npm 9.x.x<br>- Python 3.11 |
| **License** | MIT License – Free to use, modify, and distribute with attribution. No warranty provided. |

> Data flow:

> ZTA:

- [Use Azure AI services with SynapseML in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-science/how-to-use-ai-services-with-synapseml)
- [Plan and manage costs for Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/costs-plan-manage)
- [Azure Cosmos DB - Database for the AI Era](https://learn.microsoft.com/en-us/azure/cosmos-db/introduction)
- [What is Azure SQL Database?](https://learn.microsoft.com/en-us/azure/azure-sql/database/sql-database-paas-overview?view=azuresql)
- [App settings reference for Azure Functions](https://learn.microsoft.com/en-us/azure/azure-functions/functions-app-settings)
- [Storage considerations for Azure Functions](https://learn.microsoft.com/en-us/azure/azure-functions/storage-considerations?tabs=azure-cli)

Table of Contents (Click to expand)

- [Prerequisites](#prerequisites)
- [Where to start?](#where-to-start)
- [Important Considerations for Production Environment](#important-considerations-for-production-environment)
- [Overview](#overview)
- [Function App Hosting Options](#function-app-hosting-options)
- [Step 1: Set Up Your Azure Environment](#step-1-set-up-your-azure-environment)
- [Step 2: Set Up Azure Blob Storage for PDF Ingestion](#step-2-set-up-azure-blob-storage-for-pdf-ingestion)
- [Step 3: Set Up Azure Cosmos DB](#step-3-set-up-azure-cosmos-db)
- [Step 4: Set Up Azure Functions for Document Ingestion and Processing](#step-4-set-up-azure-functions-for-document-ingestion-and-processing)
  - [Create a Function App](#create-a-function-app)
  - [Configure/Validate the Environment variables](#configurevalidate-the-environment-variables)
  - [Develop the Function](#develop-the-function)
- [Step 5: Test the solution](#step-5-test-the-solution)

> `How do we move from basic coding all the way to AI agents?`

```mermaid
flowchart LR
A[Scripting: Line-by-line instructions] --> B[Machine Learning: Packages + statistical foundations]
B --> C[LLMs: Reasoning, understanding, human-like responses]
C --> D[Agents: LLMs with ability to act]

%% Styling
classDef step fill:#4a90e2,stroke:#333,stroke-width:2px,color:#fff,font-weight:bold;
class A,B,C,D step;

%% Extra notes
A:::step
B:::step
C:::step
D:::step
```

More details about it here (Click to expand)

> - We all `start with scripting`, no matter the language, it’s the first step. `Simple/complex instructions, written line by line`, to get something done
> - Then comes `machine learning`. At this stage, we’re not reinventing the math, we’re `leveraging powerful packages built on deep statistical and mathematical foundations.` These tools let us `automate smarter processes, like reviewing claims with predictive analytics. You’re not just coding anymore; you’re building systems that learn and adapt.`
> - `LLMs`. This is what most people mean when they say `AI.` Think of `yourself as the architect, and the LLM as your strategic engine. You can plug into it via an API, a key, or through integrated services. It’s not just about automation, it’s about reasoning, understanding, and generating human-like responses.`
> - And finally, `agents`. These are LLMs with the `ability to act`. They don’t just respond, `they take initiative. They can create code, trigger workflows, make decisions, and interact with tools and with other agents. It’s where intelligence meets execution.`

> [!NOTE]
> A landing zone is a general `cloud framework that sets up the core structure for all workloads`. Each use case (like an app, data pipeline, or API) then builds on top of this framework, using the `same environments (Dev → Test → UAT → Prod) and CI/CD pipelines to move code safely into production.` It’s general by design, but `applied per use case.`

> How to parse PDFs from an Azure Storage Account, process them using an Open Framework (needs manual configuration), and store the results in Cosmos DB for further analysis.
>
> 1. Upload your PDFs to an Azure Blob Storage container.
> 2. An Azure Function is triggered by the upload and uses an Open Framework, with multiple customizations, as part of an API call to analyze the PDFs.
> 3. The extracted data is parsed and then stored in a Cosmos DB database, giving an automated workflow from document upload to data storage.
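
The parsing step in the flow above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: it assumes the Open Framework has already returned the invoice as plain text, and the function name `parse_invoice_text`, the field names, and the regular expressions are all hypothetical and would need to be adapted to real invoice layouts.

```python
import re

# Hypothetical patterns for a simple, text-based invoice layout.
# Real invoices vary widely, so treat these as placeholders.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Za-z0-9-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def parse_invoice_text(text: str) -> dict:
    """Extract a few fields from invoice text; missing fields become None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    if fields["total"] is not None:
        # Drop thousands separators before converting to a number.
        fields["total"] = float(fields["total"].replace(",", ""))
    return fields


sample = "Invoice Number: INV-1001\nDate: 2025-01-15\nSubtotal: $1,000.00\nTotal: $1,234.50"
print(parse_invoice_text(sample))
```

Returning `None` for fields that are not found, instead of raising, lets the caller decide whether a partially parsed invoice should still be stored or flagged for manual review.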

> [!NOTE]
> Limitations of this approach:
>
> - Requires significant manual effort to structure and format extracted data.
> - Limited in handling complex layouts and non-text elements like images and charts.

## Important Considerations for Production Environment

Private Network Configuration

> For enhanced security, consider configuring your Azure resources to operate within a private network. This can be achieved using Azure Virtual Network (VNet) to isolate your resources and control inbound and outbound traffic. Implementing private endpoints for services like Azure Blob Storage and Azure Functions can further secure your data by restricting access to your VNet.

Security

> Ensure that you implement appropriate security measures when deploying this solution in a production environment. This includes:
>
> - Securing Access: Use Microsoft Entra ID (formerly known as Azure Active Directory or Azure AD) for authentication and role-based access control (RBAC) to manage permissions.
> - Managing Secrets: Store sensitive information such as connection strings and API keys in Azure Key Vault.
> - Data Encryption: Enable encryption for data at rest and in transit to protect sensitive information.

Scalability

> While this example provides a basic setup, you may need to scale the resources based on your specific requirements. Azure services offer various scaling options to handle increased workloads. Consider using:
>
> - Auto-scaling: Configure auto-scaling for Azure Functions and other services to automatically adjust based on demand.
> - Load Balancing: Use Azure Load Balancer or Application Gateway to distribute traffic and ensure high availability.

Cost Management

> Monitor and manage the costs associated with your Azure resources. Use Azure Cost Management and Billing to track usage and optimize resource allocation.

Compliance

> Ensure that your deployment complies with relevant regulations and standards. Use Azure Policy to enforce compliance and governance policies across your resources.

Disaster Recovery

> Implement a disaster recovery plan to ensure business continuity in case of failures. Use Azure Site Recovery and backup solutions to protect your data and applications.

## Prerequisites

- An `Azure subscription is required`. All other resources, including instructions for creating a Resource Group, are provided in this workshop.
- The `Contributor` role, or any custom role that allows you to manage all resources and deploy resources within the subscription.
- If you choose to use the Terraform approach, please ensure that:
  - [Terraform is installed on your local machine](https://developer.hashicorp.com/terraform/tutorials/azure-get-started/install-cli#install-terraform).
  - [The Azure CLI is installed](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) to work with both Terraform and Azure commands.

## Where to start?

> Please follow as described below.

- If you're choosing the `Infrastructure via Azure Portal`, please start [here in this section](#step-1-set-up-your-azure-environment).
- If you're choosing the `Infrastructure via Terraform` approach:
  1. Please follow the [Terraform guide](./terraform-infrastructure/) to deploy the necessary Azure resources for the workshop.
  2. Then, follow [each section](#step-1-set-up-your-azure-environment) but `skip the creation of each resource`.

> [!IMPORTANT]
> Regarding `Networking`, this example will cover `Public access configuration` and a `system-assigned managed identity`. However, please ensure you `review your privacy requirements and adjust network and access settings as necessary for your specific case`.

## Overview

> Using Cosmos DB provides you with a flexible, scalable, and globally distributed database solution that can handle both structured and semi-structured data efficiently.
>
> - `Azure Blob Storage`: Store the PDF invoices.
> - `Azure Functions`: Trigger on new PDF uploads, extract data, and process it.
> - `Azure SQL Database or Cosmos DB`: Store the extracted data for querying and analytics.

| Resource | Recommendation |
|---------------------------|----------------------------------------------------------------------------------------------------------------------|
| **Azure Blob Storage** | Use for storing the PDF files. This keeps your file storage separate from your data storage, which is a common best practice. |
| **Azure SQL Database** | Use if your data is highly structured and you need complex queries and transactions. |
| **Azure Cosmos DB** | Use if you need a globally distributed database with low latency and the ability to handle semi-structured data. |

## Function App Hosting Options

> In the context of Azure Function Apps, a `hosting option refers to the plan you choose to run your function app`. This choice affects how your function app is scaled, the resources available to each function app instance, and the support for advanced functionalities like virtual network connectivity and container support.

> [!TIP]
>
> - `Scale to Zero`: Indicates whether the service can automatically scale down to zero instances when idle, that is, when the application is not actively handling requests or events (it's in a low-activity or paused state).
> - `Scale Behavior`: Describes how the service scales (e.g., `event-driven`, `dedicated`, or `containerized`).
> - `Virtual Networking`: Whether the service supports integration with virtual networks for secure communication.
> - `Dedicated Compute & Reserved Cold Start`: Availability of always-on compute to avoid cold starts and ensure low latency.
> - `Max Scale Out (Instances)`: Maximum number of instances the service can scale out to.
> - `Example AI Use Cases`: Real-world scenarios where each plan excels.

Flex Consumption

| Feature | Description |
|--------|-------------|
| **Scale to Zero** | `Yes` |
| **Scale Behavior** | `Fast event-driven` |
| **Virtual Networking** | `Optional` |
| **Dedicated Compute & Reserved Cold Start** | `Optional (Always Ready)` |
| **Max Scale Out (Instances)** | `1000` |
| **Example AI Use Cases** | `Real-time data processing` for AI models, `high-traffic AI-powered APIs`, `event-driven AI microservices`. Ideal for fraud detection, real-time recommendations, NLP, and computer vision services. |

Consumption

| Feature | Description |
|--------|-------------|
| **Scale to Zero** | `Yes` |
| **Scale Behavior** | `Event-driven` |
| **Virtual Networking** | `No` (virtual network integration is not available on this plan) |
| **Dedicated Compute & Reserved Cold Start** | `No` |
| **Max Scale Out (Instances)** | `200` |
| **Example AI Use Cases** | `Lightweight AI APIs`, `scheduled AI tasks`, `low-traffic AI event processing`. Great for sentiment analysis, simple image recognition, and batch ML tasks. |

Functions Premium

| Feature | Description |
|--------|-------------|
| **Scale to Zero** | `No` |
| **Scale Behavior** | `Event-driven with premium options` |
| **Virtual Networking** | `Yes` |
| **Dedicated Compute & Reserved Cold Start** | `Yes` |
| **Max Scale Out (Instances)** | `100` |
| **Example AI Use Cases** | `Enterprise AI applications`, `low-latency AI APIs`, `VNet integration`. Ideal for secure, high-performance AI services like customer support and analytics. |

App Service

| Feature | Description |
|--------|-------------|
| **Scale to Zero** | `No` |
| **Scale Behavior** | `Dedicated VMs` |
| **Virtual Networking** | `Yes` |
| **Dedicated Compute & Reserved Cold Start** | `Yes` |
| **Max Scale Out (Instances)** | `Varies` |
| **Example AI Use Cases** | `AI-powered web applications`, `dedicated resources`. Great for chatbots, personalized content, and intensive AI inference. |

Container Apps Env.

| Feature | Description |
|--------|-------------|
| **Scale to Zero** | `No` |
| **Scale Behavior** | `Containerized microservices environment` |
| **Virtual Networking** | `Yes` |
| **Dedicated Compute & Reserved Cold Start** | `Yes` |
| **Max Scale Out (Instances)** | `Varies` |
| **Example AI Use Cases** | `AI microservices architecture`, `containerized AI workloads`, `complex AI workflows`. Ideal for orchestrating AI services like image processing, text analysis, and real-time analytics. |

## Step 1: Set Up Your Azure Environment

> An Azure `Resource Group` is a `container that holds related resources for an Azure solution`.
> It can include all the resources for the solution or only those you want to manage as a group.
> Typically, resources that share the same lifecycle are added to the same resource group, allowing for easier deployment, updating, and deletion as a unit.
> Resource groups also store metadata about the resources, and you can apply access control, locks, and tags to them for better management and organization.

1. **Create an Azure Account**: If you don't have one, sign up for an Azure account.
2. **Create a Resource Group**:
   - Go to the Azure portal.
   - Navigate to **Resource groups**.
   - Click **+ Create**.
   - Enter the Resource Group name (e.g., `RGContosoAI`) and select a region (e.g., `East US 2`). You can add tags if needed.
   - Click **Review + create** and then **Create**.

## Step 2: Set Up Azure Blob Storage for PDF Ingestion

> An `Azure Storage Account` provides a `unique namespace in Azure for your data, allowing you to store and manage various types of data such as blobs, files, queues, and tables`. It serves as the foundation for all Azure Storage services, ensuring high availability, scalability, and security for your data.

> A `Blob Container` is a `logical grouping of blobs within an Azure Storage Account, similar to a directory in a file system`. Containers help organize and manage blobs, which can be any type of unstructured data like text or binary data. Each container can store an unlimited number of blobs, and you must create a container before uploading any blobs.

1. **Create a Storage Account**:
   - In the Azure portal, navigate to your **Resource Group**.
   - Click **+ Create**.
   - Search for `Storage Account`.
   - Select the Resource Group you created.
   - Enter a Storage Account name (e.g., `contosostorageaidemo`).
   - Choose the region and performance options, and click `Next` to continue.
   - If you need to modify anything related to `Security, Access protocols, Blob Storage Tier`, you can do that in the `Advanced` tab.
   - Regarding `Networking`, this example will cover `Public access` configuration. However, please ensure you review your privacy requirements and adjust network and access settings as necessary for your specific case.
   - Click **Review + create** and then **Create**. Once it is done, you'll be able to see it in your Resource Group.

2. **Create a Blob Container**:
   - Go to your Storage Account.
   - Under **Data storage**, select **Containers**.
   - Click **+ Container**.
   - Enter a name for the container (e.g., `pdfinvoices`) and set the public access level to **Private**.
   - Click **Create**.
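
Both names entered in the steps above must follow Azure Storage naming rules, and the portal rejects names that do not. As a minimal sketch, the two checks below encode the documented rules (storage accounts: 3 to 24 lowercase letters and digits; containers: 3 to 63 characters, lowercase letters, digits, and single hyphens that are not at the start or end). The function names are illustrative only, and special containers such as `$root` are deliberately not covered.

```python
import re

# Storage account: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Container: lowercase alphanumeric groups separated by single hyphens.
_CONTAINER_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the storage account naming rules."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Return True if the name satisfies the blob container naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None


print(is_valid_storage_account_name("contosostorageaidemo"))
print(is_valid_container_name("pdfinvoices"))
print(is_valid_container_name("PDF--Invoices"))
```

Running a check like this before provisioning (for example in a deployment script) surfaces naming problems earlier than a failed portal or Terraform deployment would.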

## Step 3: Set Up Azure Cosmos DB

> `Azure Cosmos DB` is a globally distributed,`multi-model database service provided by Microsoft Azure`. It is designed to offer high availability, scalability, and low-latency access to data for modern applications. Unlike traditional relational databases, Cosmos DB is a `NoSQL database, meaning it can handle unstructured, semi-structured, and structured data types`. `It supports multiple data models, including document, key-value, graph, and column-family, making it versatile for various use cases.`

> An `Azure Cosmos DB container` is a `logical unit` within a Cosmos DB database where data is stored. `Containers are schema-agnostic, meaning they can store items with different structures. Each container is automatically partitioned to scale out across multiple servers, providing virtually unlimited throughput and storage`. Containers are the primary scalability unit in Cosmos DB, and they use a partition key to distribute data efficiently across partitions.

1. **Create a Cosmos DB Account**:
   - In the Azure portal, navigate to your **Resource Group**.
   - Click **+ Create**.
   - Search for `Cosmos DB` and click on `Create`.
   - Choose your desired API type; for this example we will use `Azure Cosmos DB for NoSQL`. This option supports a SQL-like query language, which is familiar and powerful for querying and analyzing your invoice data. It also integrates well with various client libraries, making development easier and more flexible.
   - Enter an account name (e.g., `contosocosmosdbaidemo`). As with the previously configured resources, we will use the `Public network` for this example. Ensure that you adjust the architecture to include your networking requirements.
   - Select the region and other settings.
   - Click **Review + create** and then **Create**.

2. **Create a Database and Container**:
   - Go to your Cosmos DB account.
   - Under **Data Explorer**, click **New Database**.
   - Enter a database name (e.g., `ContosoDBAIDemo`) and click **OK**.
   - Click **New Container**.
   - Enter a container name (e.g., `Invoices`) and set the partition key (e.g., `/invoice_number`).
   - Click **OK**.
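
The partition key path chosen above (`/invoice_number`) determines which property every stored item must carry. The sketch below shows one way to shape an item for the `Invoices` container and to check that the partition key value can be resolved before the item is sent to Cosmos DB. It is an illustration under stated assumptions: `build_invoice_item` and `partition_key_value` are hypothetical helper names, the `id` is generated as a random UUID, and no Cosmos DB SDK call is made here.

```python
import uuid

# Matches the partition key path used in the example container above.
PARTITION_KEY_PATH = "/invoice_number"


def partition_key_value(item: dict, path: str = PARTITION_KEY_PATH):
    """Resolve a partition key path such as '/invoice_number' against an item."""
    value = item
    for part in path.strip("/").split("/"):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(f"item has no value for partition key path {path!r}")
        value = value[part]
    return value


def build_invoice_item(invoice_number: str, **fields) -> dict:
    """Build an item with the string 'id' and the partition key property set."""
    item = {"id": str(uuid.uuid4()), "invoice_number": invoice_number, **fields}
    partition_key_value(item)  # fail early if the key cannot be resolved
    return item


item = build_invoice_item("INV-1001", total=1234.5, vendor="Contoso")
print(item["invoice_number"], partition_key_value(item))
```

Using the invoice number as the partition key spreads invoices across logical partitions; whether that suits a given workload depends on its query patterns, so treat the choice as an example rather than a recommendation.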

## Step 4: Set Up Azure Functions for Document Ingestion and Processing

> An `Azure Function App` is a `container for hosting individual Azure Functions`. It provides the execution context for your functions, allowing you to manage, deploy, and scale them together. `Each function app can host multiple functions, which are small pieces of code that run in response to various triggers or events, such as HTTP requests, timers, or messages from other Azure services`.

> Azure Functions are designed to be lightweight and event-driven, enabling you to build scalable and serverless applications. `You only pay for the resources your functions consume while they are running, making it a cost-effective solution for many scenarios`.

### Create a Function App

- In the Azure portal, go to your **Resource Group**.
- Click **+ Create**.

- Search for `Function App`, click on `Create`:

- Choose a `hosting option`; for this example, we will use `Consumption`. Click [here for a quick overview of hosting options](https://github.com/brown9804/MicrosoftCloudEssentialsHub/tree/main/0_Azure/3_AzureAI/14_AIUseCases/0_PDFProcessingFAOF#function-app-hosting-options):

- Enter a name for the Function App (e.g., `ContosoFunctionAppAI`).
- Choose your runtime stack (e.g., `.NET` or `Python`).
- Select the region and other settings.

- Select **Review + create** and then **Create**. Verify the resources created in your `Resource Group`.

> [!IMPORTANT]
> This example uses a system-assigned managed identity to assign RBAC (role-based access control) roles.

- Assign the `Storage Blob Data Contributor` and `Storage File Data SMB Share Contributor` roles to the `Function App` on the runtime `Storage Account` (the one created with the Function App).

- Assign `Storage Blob Data Reader` to the `Function App` on the `Storage Account` that contains the invoices: select the role and click `Next`, click `Select members` and search for your `Function App` identity, then click `Review + assign`.

- Also assign the `Cosmos DB Operator`, `DocumentDB Account Contributor`, `Cosmos DB Account Reader Role`, and `Contributor` roles to the `Function App`.

- To assign the `Microsoft.DocumentDB/databaseAccounts/readMetadata` permission, you need to create a custom role in Azure Cosmos DB. This permission is required for accessing metadata in Cosmos DB. Click [here to understand more about it](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/security/how-to-grant-data-plane-role-based-access?tabs=built-in-definition%2Ccsharp&pivots=azure-interface-cli#prepare-role-definition).

| **Aspect** | **Data Plane Access** | **Control Plane Access** |
|---|---|---|
| **Scope** | Focuses on `data operations` within databases and containers. This includes actions such as reading, writing, and querying data in your databases and containers. | Focuses on `management operations` at the account level. This includes actions such as creating, deleting, and configuring databases and containers. |
| **Roles** | - `Cosmos DB Built-in Data Reader`: Provides read-only access to data within the databases and containers.<br>- `Cosmos DB Built-in Data Contributor`: Allows read and write access to data within the databases and containers.<br>- `Cosmos DB Built-in Data Owner`: Grants full access to manage data within the databases and containers. | - `Contributor`: Grants full access to manage all Azure resources, including Cosmos DB.<br>- `Owner`: Grants full access to manage all resources, including the ability to assign roles in Azure RBAC.<br>- `Cosmos DB Account Contributor`: Allows management of Cosmos DB accounts, including creating and deleting databases and containers.<br>- `Cosmos DB Account Reader`: Provides read-only access to Cosmos DB account metadata. |
| **Permissions** | - `Reading documents`<br>- `Writing documents`<br>- Managing data within containers. | - `Creating or deleting databases and containers`<br>- Configuring settings<br>- Managing account-level configurations. |
| **Authentication** | Uses `Azure Active Directory (AAD) tokens` or `resource tokens` for authentication. | Uses `Azure Active Directory (AAD)` for authentication. |

> Steps to assign it:

1. **Open Azure CLI**: Go to the [Azure portal](https://portal.azure.com) and click on the Cloud Shell icon to open the Azure CLI. The commands below use Bash line continuations (`\`).

2. **List Role Definitions**: Run the following command to list all of the role definitions associated with your Azure Cosmos DB for NoSQL account. Review the output and locate the role definition named `Cosmos DB Built-in Data Contributor`.

```bash
az cosmosdb sql role definition list \
    --resource-group "<resource-group-name>" \
    --account-name "<cosmos-account-name>"
```

3. **Get Cosmos DB Account ID**: Run this command to get the ID of your Cosmos DB account. Record the value of the `id` property as it is required for the next step.

```bash
az cosmosdb show --resource-group "<resource-group-name>" --name "<cosmos-account-name>" --query "{id:id}"
```

Example output:

```json
{
"id": "/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.DocumentDB/databaseAccounts/{cosmos-account-name}"
}
```
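
The `id` value follows the usual Azure resource ID layout, and the same string is later passed as `--scope`. As a minimal sketch, assuming you pasted the CLI output into a Python string, the ID can be split into its segments to confirm it points at the account you expect. The function name `parse_cosmos_account_id` and the sample values (`rg-demo`, `cosmos-demo`, the all-zero subscription ID) are hypothetical and used only for illustration.

```python
import json


def parse_cosmos_account_id(resource_id: str) -> dict:
    """Split a Cosmos DB account resource ID into its named segments."""
    parts = resource_id.strip("/").split("/")
    # Expected layout (8 segments):
    # subscriptions/<id>/resourceGroups/<rg>/providers/<namespace>/databaseAccounts/<name>
    if (
        len(parts) != 8
        or parts[0].lower() != "subscriptions"
        or parts[2].lower() != "resourcegroups"
        or parts[4].lower() != "providers"
    ):
        raise ValueError(f"Unexpected resource ID format: {resource_id!r}")
    return {
        "subscription_id": parts[1],
        "resource_group": parts[3],
        "provider": parts[5],
        "account_name": parts[7],
    }


# Hypothetical CLI output, shaped like the example above.
cli_output = (
    '{"id": "/subscriptions/00000000-0000-0000-0000-000000000000'
    "/resourceGroups/rg-demo/providers/Microsoft.DocumentDB"
    '/databaseAccounts/cosmos-demo"}'
)

account_id = json.loads(cli_output)["id"]
info = parse_cosmos_account_id(account_id)
print(info["resource_group"], info["account_name"])
```

The unmodified `account_id` string is what goes into `--scope` in the next step.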

4. **Assign the Role**: Assign the role using `az cosmosdb sql role assignment create`. Use the ID of the role definition located in step 2 for the `--role-definition-id` argument, the unique identifier of the identity for the `--principal-id` argument, and the Cosmos DB account ID recorded in step 3 for the `--scope` argument. Run the command twice: once for the Function App's identity, so it can read metadata from Cosmos DB, and once for your own identity, so you can access and view the information.

> You can find the `principal-id` of the Function App under `Identity` in the `Function App`.

```bash
az cosmosdb sql role assignment create \
    --resource-group "" \
    --account-name "" \
    --role-definition-id "" \
    --principal-id "" \
    --scope "/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.DocumentDB/databaseAccounts/{cosmos-account-name}"
```

> After a few minutes, you will see something like this:

5. **Verify Role Assignment**: Use `az cosmosdb sql role assignment list` to list all role assignments for your Azure Cosmos DB for NoSQL account. Review the output to ensure your role assignment was created.

```bash
az cosmosdb sql role assignment list \
    --resource-group "" \
    --account-name ""
```

### Configure/Validate the Environment variables

- Under `Settings`, go to `Environment variables` and click `+ Add` to create the following variables:

- `COSMOS_DB_ENDPOINT`: Your Cosmos DB account endpoint.
- `COSMOS_DB_KEY`: Your Cosmos DB account key.
- `contosostorageaidemo_STORAGE`: Your Storage Account connection string.

- Click on `Apply` to save your configuration.
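
A missing or empty setting usually only surfaces later as a connection error in the logs. The following is a minimal sketch, assuming the three variable names listed above, of a check that reports which settings are absent. The helper `missing_settings` and the sample values are hypothetical and not part of the repository code.

```python
import os

# Setting names taken from the list above.
REQUIRED_SETTINGS = (
    "COSMOS_DB_ENDPOINT",
    "COSMOS_DB_KEY",
    "contosostorageaidemo_STORAGE",
)


def missing_settings(env=None):
    """Return the required setting names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Hypothetical configuration with an empty key and no storage connection string.
sample_env = {
    "COSMOS_DB_ENDPOINT": "https://example.documents.azure.com:443/",
    "COSMOS_DB_KEY": "",
}
print(missing_settings(sample_env))
```

Calling `missing_settings()` with no argument checks the real process environment, which is where the Function App exposes these values at runtime.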

### Develop the Function

- Install [VSCode](https://code.visualstudio.com/download).
- Install Python from the Microsoft Store:

- Open VSCode, and install some extensions: `python`, and `Azure Tools`.

- Click on the `Azure` icon and `sign in` to your account. Allow the `Azure Resources` extension to sign in using Microsoft; it will open a browser window. After doing so, you will be able to see your subscription and resources.

- Under Workspace, click on `Create Function Project`, and choose a path in your local computer to develop your function.

- Choose the language, in this case `python`:

- Select the model version, for this example let's use `v2`:

- For the python interpreter, let's use the one installed via `Microsoft Store`:

- Choose a template (e.g., **Blob trigger**) and configure it to trigger on new PDF uploads in your Blob container.

- Provide a function name, like `BlobTriggerContosoPDFInvoicesRaw`:

- Next, it will prompt you for the path of the blob container that should trigger the function when a file is uploaded. In this case, it is `pdfinvoices`, which was created previously.

- Click on `Create new local app settings`, and then choose your subscription.

- Choose `Azure Storage Account for remote storage`, and select one. I'll be using the `contosostorageaidemo`.

- Then click on `Open in the current window`. You will see something like this:

- Now update the function code to extract data from PDFs and store it in Cosmos DB. Use this as an example:

> 1. **Blob Trigger**: The function is triggered when a new PDF file is uploaded to the `pdfinvoices` container.

> 2. **PDF Processing**: The `read_pdf_content` function uses `pdfminer.six` to read and extract text from the PDF.

> 3. **Data Extraction**: The extracted text is processed to extract invoice data. The `generate_id` function generates a unique ID for each invoice.

> 4. **Data Storage**: The processed invoice data is saved to Azure Cosmos DB in the `ContosoDBAIDemo` database and `Invoices` container.

> `pdfminer.six` is an open-source framework. It is a community-maintained fork of the original PDFMiner, `designed for extracting and analyzing text data from PDF documents`. The framework is built in a modular way, allowing each component to be easily replaced or extended for various purposes.

- Update the `function_app.py`:

```python
import azure.functions as func
import logging
import json
import os
import uuid
import io
from pdfminer.high_level import extract_text
from azure.cosmos import CosmosClient, PartitionKey

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

def read_pdf_content(myblob):
    # Read the blob content into a BytesIO stream
    blob_bytes = myblob.read()
    pdf_stream = io.BytesIO(blob_bytes)

    # Extract text from the PDF stream
    text = extract_text(pdf_stream)
    return text

def extract_invoice_data(text):
    lines = text.split('\n')
    invoice_data = {
        "id": generate_id(),
        "customer_name": "",
        "customer_email": "",
        "customer_address": "",
        "company_name": "",
        "company_phone": "",
        "company_address": "",
        "rentals": []
    }

    for i, line in enumerate(lines):
        if "BILL TO:" in line:
            invoice_data["customer_name"] = lines[i + 1].strip()
            invoice_data["customer_email"] = lines[i + 2].strip()
            invoice_data["customer_address"] = lines[i + 3].strip()
        elif "Company Information:" in line:
            invoice_data["company_name"] = lines[i + 1].strip()
            invoice_data["company_phone"] = lines[i + 2].strip()
            invoice_data["company_address"] = lines[i + 3].strip()
        elif "Rental Date" in line:
            for j in range(i + 1, len(lines)):
                if lines[j].strip() == "":
                    break
                rental_details = lines[j].split()
                rental_date = rental_details[0]
                title = " ".join(rental_details[1:-3])
                description = rental_details[-3]
                quantity = rental_details[-2]
                total_price = rental_details[-1]
                invoice_data["rentals"].append({
                    "rental_date": rental_date,
                    "title": title,
                    "description": description,
                    "quantity": quantity,
                    "total_price": total_price
                })

    logging.info("Successfully extracted invoice data.")
    return invoice_data

def save_invoice_data_to_cosmos(invoice_data, blob_name):
    try:
        endpoint = os.getenv("COSMOS_DB_ENDPOINT")
        key = os.getenv("COSMOS_DB_KEY")
        client = CosmosClient(endpoint, key)
        logging.info("Successfully connected to Cosmos DB.")
    except Exception as e:
        logging.error(f"Error connecting to Cosmos DB: {e}")
        return

    database_name = 'ContosoDBAIDemo'
    container_name = 'Invoices'

    try:
        database = client.create_database_if_not_exists(id=database_name)
        container = database.create_container_if_not_exists(
            id=container_name,
            partition_key=PartitionKey(path="/invoice_number"),
            offer_throughput=400
        )
        logging.info("Successfully ensured database and container exist.")
    except Exception as e:
        logging.error(f"Error creating database or container: {e}")
        return

    try:
        response = container.upsert_item(invoice_data)
        logging.info(f"Saved processed invoice data to Cosmos DB: {response}")
    except Exception as e:
        logging.error(f"Error inserting item into Cosmos DB: {e}")

def generate_id():
    return str(uuid.uuid4())

@app.blob_trigger(arg_name="myblob", path="pdfinvoices/{name}",
                  connection="contosostorageaidemo_STORAGE")
def BlobTriggerContosoPDFInvoicesRaw(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob\n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")

    try:
        text = read_pdf_content(myblob)
        logging.info("Successfully read and extracted text from PDF.")
    except Exception as e:
        logging.error(f"Error reading PDF: {e}")
        return

    logging.info(f"Extracted text from PDF: {text}")

    try:
        invoice_data = extract_invoice_data(text)
        logging.info(f"Extracted invoice data: {invoice_data}")
    except Exception as e:
        logging.error(f"Error extracting invoice data: {e}")
        return

    try:
        save_invoice_data_to_cosmos(invoice_data, myblob.name)
        logging.info("Successfully saved invoice data to Cosmos DB.")
    except Exception as e:
        logging.error(f"Error saving invoice data to Cosmos DB: {e}")
```
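
The rental rows are split on whitespace and read from both ends: the first token is the date, the last three tokens are the description, quantity, and total price, and everything in between becomes the title. The sketch below isolates that slicing so it can be tried on a single line without a PDF or any Azure dependency. The helper `parse_rental_line`, the five-token minimum, and the sample row are assumptions for illustration and are not part of the function code above.

```python
def parse_rental_line(line):
    """Parse one rental row using the same token positions as the function code."""
    parts = line.split()
    # Assumption: a usable row needs a date, at least one title word,
    # and the three trailing fields, so five tokens minimum.
    if len(parts) < 5:
        return None
    return {
        "rental_date": parts[0],
        "title": " ".join(parts[1:-3]),
        "description": parts[-3],
        "quantity": parts[-2],
        "total_price": parts[-1],
    }


# Hypothetical row in the layout the parser expects.
sample = "2024-01-15 The Great Adventure DVD 2 $9.98"
row = parse_rental_line(sample)
print(row["title"], "|", row["description"], "|", row["quantity"])
```

Because the description is taken as a single token, a multi-word description would spill into the title. If your invoices use a different layout, the slices need to change accordingly.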

- Now, let's update the `requirements.txt`:

```text
azure-functions
pdfminer.six
azure-cosmos==4.3.0
```

- Since this function has already been tested, you can deploy your code to the function app in your subscription. If you want to test it first, you can run your function locally.
- Click on the `Azure` icon.
- Under `workspace`, click on the `Function App` icon.
- Click on `Deploy to Azure`.

- Select your `subscription`, your `function app`, and accept the prompt to overwrite:

- Once the deployment completes, you will see the status in your terminal:

> [!IMPORTANT]
> If you need further assistance with the code, please click [here to view all the function code](./src/).

## Step 5: Test the solution

> Upload sample PDF invoices to the Blob container and verify that data is correctly ingested and stored in Cosmos DB.
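
When you inspect a stored document, it helps to know which fields the function is expected to fill. As a rough sketch, assuming you copied a document from the `Invoices` container into a Python dictionary, the check below lists the fields that came back empty, which usually means the matching marker (`BILL TO:`, `Company Information:`, or `Rental Date`) was not found in the extracted text. The helper `empty_fields` and the sample document are hypothetical.

```python
# Field names taken from the invoice_data dictionary built by the function.
EXPECTED_FIELDS = (
    "id",
    "customer_name",
    "customer_email",
    "customer_address",
    "company_name",
    "company_phone",
    "company_address",
    "rentals",
)


def empty_fields(document):
    """Return the expected fields that are missing or empty in a stored document."""
    return [name for name in EXPECTED_FIELDS if not document.get(name)]


# Hypothetical document where the company block was not detected.
sample_doc = {
    "id": "3f2b8c1e-0000-0000-0000-000000000000",
    "customer_name": "Jane Doe",
    "customer_email": "jane@example.com",
    "customer_address": "123 Main St",
    "company_name": "",
    "company_phone": "",
    "company_address": "",
    "rentals": [{"rental_date": "2024-01-15", "title": "The Great Adventure"}],
}
print(empty_fields(sample_doc))
```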

- Click on `Upload`, then select `Browse for files` and choose your PDF invoices to be stored in the blob container, which will trigger the function app to parse them.

- Check the logs and traces from your function with `Application Insights`:

- Under `Investigate`, click on `Performance`. Filter by time range, and `drill into the samples`. Sort the results by date (if you have many, like in my case) and click on the last one.