https://github.com/olafwrieden/batch-transcription-python-azure-databricks

Referencing a Blob Storage (in Azure) containing voice recordings, we want to be able to call the Speech to Text Cognitive Service in Azure to transcribe the respective recordings in parallel (batch), and save the transcript into another Blob Storage container.

Host: GitHub
URL: https://github.com/olafwrieden/batch-transcription-python-azure-databricks
Owner: olafwrieden
Created: 2021-03-04T03:12:38.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2021-04-13T05:19:52.000Z (about 5 years ago)
Last Synced: 2025-03-25T12:02:24.854Z (about 1 year ago)
Topics: azure, batch-processing, databricks, python, speech-to-text, transcription
Language: Python
Homepage:
Size: 163 KB
Stars: 5
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md


# Azure Batch Transcription using Python in Databricks

## Outcome & Overview

Referencing a Blob Storage (in Azure) containing voice recordings, we want to be able to call the [Speech to Text Cognitive Service](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/) in Azure to transcribe the respective recordings in parallel (batch), and save the transcript into another Blob Storage container.

Adapted from: [How to use the Speech Services Batch Transcription API from Python](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/batch/python)

## Setup

![Azure Resources](https://i.imgur.com/juD5Z2g.png)

__Navigate to Azure Portal and create the following services under a common Resource Group.__

- Storage Account (with two containers)

  - Container 1: name: `recordings` access level: `Container` (for voice recording files to be transcribed)  

    Note: We use the `Container` access level so that our Python code can enumerate the blobs (i.e. recordings) within it for batch transcription.

  - Container 2: name: `transcriptions` access level: `Blob` (for output files)

- Cognitive Services (Speech Service) with S0 Tier (not Free Tier)

- Azure Databricks Service

__Collect the following details:__

1. Copy the `Subscription Key` and `ServiceRegion (Location)` from your Speech Service's Keys and Endpoint tab.

2. Navigate to the recordings container and create a Shared Access Signature (SAS) with `Read` and `List` permissions. Copy the Blob SAS URL.

3. Navigate to the transcriptions container and create a Shared Access Signature (SAS) with `Write` permissions. Copy the Blob SAS URL.
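Before pasting the SAS URLs into the notebook, it can help to confirm they carry the permissions listed above. The following is a minimal sketch, assuming the permissions appear as single letters in the `sp` query parameter of the SAS URL (`r` for Read, `l` for List, `w` for Write). The helper names and the example URLs are hypothetical, and the signatures are placeholders.

```python
from urllib.parse import urlparse, parse_qs


def sas_permissions(sas_url):
    """Return the set of permission letters found in the `sp` query parameter."""
    query = parse_qs(urlparse(sas_url).query)
    return set(query.get("sp", [""])[0])


def missing_permissions(sas_url, required):
    """Return the required permission letters that the SAS URL does not grant."""
    return set(required) - sas_permissions(sas_url)


# Hypothetical URLs for illustration only; the signatures are not real.
recordings_sas = "https://example.blob.core.windows.net/recordings?sv=2020-08-04&sr=c&sp=rl&sig=PLACEHOLDER"
transcriptions_sas = "https://example.blob.core.windows.net/transcriptions?sv=2020-08-04&sr=c&sp=r&sig=PLACEHOLDER"

print(missing_permissions(recordings_sas, "rl"))     # set()
print(missing_permissions(transcriptions_sas, "w"))  # {'w'}
```

An empty set means the URL grants everything that was asked for. The second example URL is deliberately read-only to show what a missing `Write` permission looks like.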

__Create a new Databricks Workspace__

1. In a new Databricks Notebook, create a new cell and paste:

```python
# Install requests library
%pip install requests

# Install Speech to Text API
%pip install -e "git+https://github.com/olafwrieden/batch-transcription-python-azure-databricks/#egg=swagger-client&subdirectory=python-client"
```

2. Create a new cell underneath the above and paste the below, carefully adding the keys and URLs copied earlier, into their respective variables:

```python
import logging
import sys
import time

import requests
import swagger_client as cris_client

# Configure logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format="%(asctime)s %(message)s", datefmt="%d/%m/%Y %I:%M:%S %p %Z")

# TODO: Paste your keys and URLs into their respective variables

# Your subscription key and region for the speech service
SUBSCRIPTION_KEY = "YourSubscriptionKey"
SERVICE_REGION = "YourServiceRegion"

NAME = "Phone Call Transcriptions"
DESCRIPTION = "Demo of container-based phone call transcriptions."

LOCALE = "en-US"

# URI of a single recording (only used by transcribe_from_single_blob)
RECORDINGS_BLOB_URI = ""

# Provide the URI of a container with audio files for transcribing all of them with a single request
RECORDINGS_CONTAINER_URI = ""
```

3. Create a third cell beneath the above, then paste the following methods:

```python
def transcribe_from_single_blob(uri, properties):
    """
    Transcribe a single audio file located at `uri` using the settings specified in `properties`
    using the base model for the specified locale.
    """
    transcription_definition = cris_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_urls=[uri],
        properties=properties
    )
    return transcription_definition


def transcribe_from_container(uri, properties):
    """
    Transcribe all files in the container located at `uri` using the settings specified in `properties`
    using the base model for the specified locale.
    """
    transcription_definition = cris_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_container_url=uri,
        properties=properties
    )
    return transcription_definition


def _paginate(api, paginated_object):
    """
    The autogenerated client does not support pagination. This function returns a generator over
    all items of the array that the paginated object `paginated_object` is part of.
    """
    yield from paginated_object.values
    typename = type(paginated_object).__name__
    auth_settings = ["apiKeyHeader", "apiKeyQuery"]
    while paginated_object.next_link:
        # Strip the host prefix so only the relative path is passed to call_api
        link = paginated_object.next_link[len(api.api_client.configuration.host):]
        paginated_object, status, headers = api.api_client.call_api(
            link, "GET", response_type=typename, auth_settings=auth_settings)
        if status == 200:
            yield from paginated_object.values
        else:
            raise Exception(f"could not receive paginated data: status {status}")
```
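The pagination pattern used by `_paginate` can be tried out without calling the service. The sketch below is a simplified stand-in, not the client's API: `paginate`, `fetch_page`, and the in-memory `PAGES` table are hypothetical names, and the only assumption carried over from the code above is that each page exposes `values` and `next_link`.

```python
from types import SimpleNamespace


def paginate(first_page, fetch_page):
    """Yield every item across all pages, following `next_link` until it is empty."""
    page = first_page
    while True:
        yield from page.values
        if not page.next_link:
            break
        page = fetch_page(page.next_link)


# Hypothetical in-memory "server": maps a link to the page it would return.
PAGES = {
    "/files?skip=2": SimpleNamespace(values=["c.json", "d.json"], next_link="/files?skip=4"),
    "/files?skip=4": SimpleNamespace(values=["e.json"], next_link=None),
}
first = SimpleNamespace(values=["a.json", "b.json"], next_link="/files?skip=2")

all_files = list(paginate(first, PAGES.__getitem__))
print(all_files)  # ['a.json', 'b.json', 'c.json', 'd.json', 'e.json']
```

Because the function is a generator, callers can stop iterating early without fetching the remaining pages.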

4. Create a fourth cell underneath the above, and paste the batch transcription function:

```python
def transcribe():
    logging.info("Starting transcription client...")

    # configure API key authorization: subscription_key
    configuration = cris_client.Configuration()
    configuration.api_key["Ocp-Apim-Subscription-Key"] = SUBSCRIPTION_KEY
    configuration.host = f"https://{SERVICE_REGION}.api.cognitive.microsoft.com/speechtotext/v3.0"

    # create the client object and authenticate
    client = cris_client.ApiClient(configuration)

    # create an instance of the transcription api class
    api = cris_client.DefaultApi(api_client=client)

    # Specify transcription properties by passing a dict to the properties parameter. See
    # https://docs.microsoft.com/azure/cognitive-services/speech-service/batch-transcription#configuration-properties
    # for supported parameters.
    properties = {
        "punctuationMode": "DictatedAndAutomatic",
        "profanityFilterMode": "Masked",
        "wordLevelTimestampsEnabled": True,
        "diarizationEnabled": True,
        "destinationContainerUrl": "",  # TODO: Supply SAS URI
        "timeToLive": "PT1H"
    }

    # =========================================================================================
    # CHOOSE YOUR TRANSCRIPTION METHOD (leave exactly one option uncommented)
    # =========================================================================================
    # Option 1: transcribe a single blob using the base model.
    # transcription_definition = transcribe_from_single_blob(RECORDINGS_BLOB_URI, properties)

    # Option 2: transcribe using a custom model (transcribe_with_custom_model is not defined in this demo).
    # transcription_definition = transcribe_with_custom_model(api, RECORDINGS_BLOB_URI, properties)

    # Option 3 (active): transcribe all files from a container.
    transcription_definition = transcribe_from_container(RECORDINGS_CONTAINER_URI, properties)
    # =========================================================================================

    created_transcription, status, headers = api.create_transcription_with_http_info(
        transcription=transcription_definition)

    # get the transcription Id from the location URI
    transcription_id = headers["location"].split("/")[-1]

    # Log information about the created transcription. If you should ask for support, please
    # include this information.
    logging.info(f"Created new transcription with id '{transcription_id}' in region {SERVICE_REGION}")
    logging.info("Checking status.")

    completed = False
    while not completed:
        # wait for 5 seconds before refreshing the transcription status
        time.sleep(5)

        transcription = api.get_transcription(transcription_id)
        logging.info(f"Transcriptions status: {transcription.status}")

        if transcription.status in ("Failed", "Succeeded"):
            completed = True

        if transcription.status == "Succeeded":
            pag_files = api.get_transcription_files(transcription_id)
            for file_data in _paginate(api, pag_files):
                # skip report files; only transcription results are logged
                if file_data.kind != "Transcription":
                    continue

                audiofilename = file_data.name
                results_url = file_data.links.content_url
                results = requests.get(results_url)
                logging.info(f"Results for {audiofilename}:\n{results.content.decode('utf-8')}")
        elif transcription.status == "Failed":
            logging.info(f"Transcription failed: {transcription.properties.error.message}")


if __name__ == "__main__":
    transcribe()
```
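The status loop in `transcribe()` polls indefinitely. If you would rather give up after a fixed time, the loop can be factored into a helper like the one below. This is a sketch under stated assumptions, not part of the original sample: `wait_for_terminal_status` and its parameters are hypothetical, and the `sleep` and `clock` arguments exist only so the helper can be exercised without real waiting.

```python
import time


def wait_for_terminal_status(get_status, interval=5.0, timeout=3600.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Call `get_status` until it returns "Succeeded" or "Failed", or raise on timeout."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        sleep(interval)


# Simulated status sequence for illustration; no service call is made here.
statuses = iter(["NotStarted", "Running", "Running", "Succeeded"])
final_status = wait_for_terminal_status(lambda: next(statuses), interval=0, sleep=lambda _: None)
print(final_status)  # Succeeded
```

Inside `transcribe()`, the callable could be `lambda: api.get_transcription(transcription_id).status`, with the result handling kept as it is.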

5. Run each of the four cells in order at least once to install the libraries and register the functions. Running the last cell calls `transcribe()`, and progress is logged to the cell output as the job runs. Once the job succeeds, you should see the transcriptions in your output container (`transcriptions`) in Azure.
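The `transcribe()` function logs each result file as raw JSON. To pull out just the readable text, something like the following may be enough. It is a sketch that assumes the result file contains a `combinedRecognizedPhrases` array whose entries carry `channel` and `display` fields. The helper name and the sample document are hypothetical and heavily trimmed, so check them against a real result file before relying on them.

```python
import json


def display_text_by_channel(result_json):
    """Map each audio channel to its combined display text."""
    result = json.loads(result_json)
    return {
        phrase["channel"]: phrase["display"]
        for phrase in result.get("combinedRecognizedPhrases", [])
    }


# Hypothetical, heavily trimmed result file for illustration only.
sample = json.dumps({
    "source": "https://example.blob.core.windows.net/recordings/call-001.wav",
    "combinedRecognizedPhrases": [
        {"channel": 0, "display": "Hello, thanks for calling."},
        {"channel": 1, "display": "Hi, I have a question about my bill."},
    ],
})

for channel, text in display_text_by_channel(sample).items():
    print(f"channel {channel}: {text}")
```

In the notebook, the input would be `results.content.decode('utf-8')` from the loop over transcription files.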

### Not seeing the transcription saved in your container?

__Note:__ If you want the transcription saved into the `transcriptions` container, make sure you set `"destinationContainerUrl"` in the `properties` dictionary inside the `transcribe()` function to the Blob SAS URL of that container (the one with `Write` permissions that you copied during setup).
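One way to avoid sending an empty `destinationContainerUrl` by accident is to build the properties in a small function and include the key only when a URL is supplied. This is a sketch, assuming that the service falls back to its own managed storage when the key is absent. `build_properties` is a hypothetical helper and the example URL is a placeholder.

```python
def build_properties(destination_sas_url=None):
    """Return the transcription properties, adding the destination only when one is given."""
    properties = {
        "punctuationMode": "DictatedAndAutomatic",
        "profanityFilterMode": "Masked",
        "wordLevelTimestampsEnabled": True,
        "diarizationEnabled": True,
        "timeToLive": "PT1H",
    }
    if destination_sas_url:
        properties["destinationContainerUrl"] = destination_sas_url
    return properties


# Placeholder URL for illustration only; the signature is not real.
with_destination = build_properties(
    "https://example.blob.core.windows.net/transcriptions?sp=w&sig=PLACEHOLDER")
without_destination = build_properties()

print("destinationContainerUrl" in with_destination)     # True
print("destinationContainerUrl" in without_destination)  # False
```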

## Custom Models

This demo does not cover custom models. To use one, implement the remaining Python functions from [this script](https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/batch/python/python-client/main.py) (such as `transcribe_with_custom_model`) and set the appropriate variables as described there.

## Looking for an event-driven architecture?

My colleagues have created an [Event-Driven Batch Transcription](https://github.com/timleyden/speechtotextdemo) demo, which is focused on autonomous transcriptions when new audio files are added to a Blob storage container.

## Resources

- [Azure Batch Transcription using Python](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/batch/python)

- [Installing pip library from GitHub (inline)](https://docs.databricks.com/libraries/notebooks-python-libraries.html#pip-install-vcs)

- [Installing pip library from GitHub Subdirectory Example (#6)](https://pip.pypa.io/en/stable/reference/pip_install/#examples)

- [Question: Installing Client Library in Databricks](https://forums.databricks.com/questions/39492/install-api-client-library-in-databricks.html)
