
# Introduction

This tool uses the OpenAI API to
convert [Azure Mapping Dataflow](https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview) code
to [Microsoft Fabric PySpark notebook](https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook) code.

The tool uses the [ADF Get REST API](https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/get?tabs=HTTP) to
fetch the dataflow script code (or reads it from a local file) and OpenAI to convert the script code into a PySpark notebook.
You need to pass a few input parameters depending on the source of the Mapping Dataflow.
You can also pass target Fabric resources such as the workspace ID, lakehouse name, and lakehouse ID so that these parameters
are set in the notebook metadata.
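For orientation, the end-to-end flow looks roughly like the sketch below. This is a minimal illustration, not the tool's actual code: the helper functions are hypothetical, the endpoint follows the public ADF "Data Flows - Get" pattern, and the response field names (`scriptLines`/`script`) can vary by API version.

```python
import os

import openai
import requests
from azure.identity import DefaultAzureCredential


def get_dataflow_script(subscription_id: str, rg: str, factory_name: str, dataflow_name: str) -> str:
    """Fetch the Mapping Dataflow script via the ADF REST API (hypothetical helper)."""
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{rg}/providers/Microsoft.DataFactory"
        f"/factories/{factory_name}/dataflows/{dataflow_name}?api-version=2018-06-01"
    )
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    props = resp.json()["properties"]["typeProperties"]
    # The script may be exposed as a list of lines or as a single string.
    if "scriptLines" in props:
        return "\n".join(props["scriptLines"])
    return props.get("script", "")


def convert_to_pyspark(script: str) -> str:
    """Ask OpenAI (openai-python 0.x style) to translate the script into PySpark."""
    openai.api_key = os.environ["OPENAI_API_KEY"]
    completion = openai.ChatCompletion.create(
        model=os.environ.get("OPENAI_MODEL", "gpt-4"),
        messages=[
            {"role": "system",
             "content": "Convert Azure Mapping Dataflow script code to Microsoft Fabric PySpark code."},
            {"role": "user", "content": script},
        ],
    )
    return completion.choices[0].message["content"]
```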

**The tool has not been tested with all transformations supported by Azure Mapping Dataflow.**

See [OpenAI Privacy](https://openai.com/enterprise-privacy).

:information_desk_person: You can also try converting Azure Mapping Dataflow to a Fabric notebook using Scala parser-combinator custom parsers: [mapping-data-flow-to-spark](https://github.com/sethiaarun/mapping-data-flow-to-spark)

## Design
![PlantUML sequence diagram](plantuml%2Fdiagram%2FPlantUmlSequeneDiagram.png)
## Installation

- Python 3.10.11 or later
- `pip install -r requirements.txt`

## Usage

Set the following environment variables:

### Mandatory

- OPENAI_API_KEY - your OpenAI API key

### Optional

- LOG_LEVEL - optional, `debug` or `info`; the application default is `info`
- OPENAI_MODEL - [OpenAI model name](https://platform.openai.com/docs/models); the application default
  is `gpt-4`
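
For illustration, this is how the application might resolve these variables (a minimal sketch, not the tool's actual code):

```python
import os

api_key = os.environ["OPENAI_API_KEY"]           # mandatory; raises KeyError if unset
log_level = os.environ.get("LOG_LEVEL", "info")  # optional; defaults to "info"
model = os.environ.get("OPENAI_MODEL", "gpt-4")  # optional; defaults to "gpt-4"
```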

:warning: Don't forget to read the [Limitation](#limitation) section before you run a large conversion.

## Get DataFlow Script Lines from API

You need to pass the following parameters:

- source=api
- rg - resource group name
- dataFlowName - data flow name
- factoryName - Azure data factory name
- lakeHouseId - Existing target Microsoft Fabric lakehouse Id
- lakeHouseName - Existing target Microsoft Fabric lakehouse name
- workSpaceId - Existing target Microsoft Fabric workspace Id
- subscriptionId - subscription id

```
python.exe main.py --kwargs source=api rg=<resourceGroup> dataFlowName=<dataFlowName> \
    factoryName=<factoryName> lakeHouseId=<lakeHouseId> lakeHouseName=<lakeHouseName> \
    workSpaceId=<workSpaceId> subscriptionId=<subscriptionId>
```

## Get DataFlow Script Lines from local file

You need to pass the following parameters:

- source=file
- sourceFile - path to the local dataflow script file
- dataFlowName - data flow name
- lakeHouseId - Existing target Microsoft Fabric lakehouse Id
- lakeHouseName - Existing target Microsoft Fabric lakehouse name
- workSpaceId - Existing target Microsoft Fabric workspace Id

```
python.exe main.py --kwargs source=file sourceFile=<sourceFile> dataFlowName=<dataFlowName> \
    lakeHouseId=<lakeHouseId> lakeHouseName=<lakeHouseName> workSpaceId=<workSpaceId>
```
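
For reference, the local source file is expected to contain Mapping Dataflow script code (the same script the ADF API returns). A trimmed, illustrative example with invented column names:

```
source(output(
        movieId as string,
        title as string
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1
```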

Two output files are generated:

1. A notebook named after the dataflow
2. The PySpark code in a `.py` file

# Limitation

Since we use the ChatCompletion API from OpenAI to generate the desired output, we need to consider **the length of the input text** against the token limit of the model we choose.
For example, `gpt-4` can handle up to 8,192 tokens per request (prompt plus completion), while `gpt-3.5-turbo` can only handle up to 4,096.
Text longer than the model's token limit will not fit and may be cut off or ignored.
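
You can estimate input size up front with `tiktoken` (see reference 3 below). This is a rough pre-flight check, not an exact accounting of chat-message overhead; the file name is just a placeholder:

```python
import tiktoken


def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count how many tokens `text` occupies for the given model."""
    return len(tiktoken.encoding_for_model(model).encode(text))


script = open("dataflow_script.txt").read()  # placeholder path
if count_tokens(script) > 8192:
    print("Script alone already exceeds the gpt-4 context window")
```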

The `max_tokens` parameter in the ChatCompletion API limits the number of tokens the model may generate in its response.
Tokens are chunks of text that language models read and write,
and they can be as short as one character or as long as one word, depending on the language and context.

Setting a very low value for `max_tokens` can result in the response being cut off abruptly, potentially leading to
an output that doesn't make sense or lacks context. The `max_tokens` parameter is a useful tool to control response
length, but setting it too low can negatively impact the quality and coherence of the responses.
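
Truncation is detectable after the fact: the ChatCompletion API reports `finish_reason == "length"` when the model stopped because it hit `max_tokens`. A minimal sketch (openai-python 0.x style; the prompt content is a placeholder):

```python
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "...dataflow script here..."}],  # placeholder prompt
    max_tokens=4096,  # caps only the *generated* tokens
)
if response.choices[0].finish_reason == "length":
    print("Warning: response was cut off; the generated code is incomplete.")
```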

**What does this mean for the user?** If your prompt plus generated code exceeds the model's token limit (8,192 tokens for `gpt-4`), the output will be truncated and the result
will not be complete code.

OpenAI's [gpt-4-32k](https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4) has been around for a while,
but its rollout has been extremely limited.

# Future Scope of Work

1. Integration with [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service-b)
2. How to extend this when the mapping dataflow script code is > 8,192 tokens?

# References

1. [OpenAI API](https://platform.openai.com/docs/introduction)
2. [AI Model tokens](https://learn.microsoft.com/en-us/semantic-kernel/memories/#why-are-embeddings-important-with-llm-ai)
3. [How to count tokens?](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)