Microsoft Fabric Updates Blog

Use Fabric Data Factory Data Pipelines to Orchestrate Notebook-based Workflows

Microsoft Fabric Data Factory’s data pipelines enable data engineers to build complex workflows that can orchestrate many different types of data processing, data movement, data transformation, and other activity types. In this post, I want to focus on some good practices when building Fabric Spark Notebook workflows using Data Factory in Fabric with data pipelines.

Execute Notebook activities in data pipelines using parameter settings
Data pipeline example in Fabric Data Factory executing Notebooks

In the sample pipeline above, I have 2 Notebook activities, 1 Teams activity, and 1 If Condition with 2 more activities inside the container. Both Notebook activities are at the start of the pipeline, while only the activity named “Notebook Double” has a connector to the If Condition. This means that the Notebook activities will execute in parallel, while the If Condition will execute in sequence after the notebook from the “Notebook Double” activity completes. The connector between the Notebook Double activity and the If Condition uses the “On Success” port from the first activity, meaning that the If Condition will only execute following a successful signal from the notebook. Notice there is also a red connector attached to the Notebook Double “On Failure” port. If the notebook execution fails, the pipeline will call the Teams activity and send a Teams message. Using the output ports of your notebook activity to take different paths for success and failure is a common pattern.

You may notice that the “Another Notebook” activity is grey, indicating that its activity state has been set to “Deactivated”. This is another good practice when building and debugging your pipelines. I’ve not yet completed my configurations for this activity, but I need to test the pipeline with the first notebook path. Setting the activity to “Deactivated” essentially comments out that activity so that the pipeline can be saved and run as a test. To run the debug test, just click the Run button on the ribbon as shown below.

Screenshot of activity output ports to enable workflow redirection
Use activity output ports to redirect your workflow

Another good practice when orchestrating your Fabric Notebooks is to set the “Retry” property on the activity to a value greater than 0. Because running many Spark notebooks in an automated pipeline environment is transient and ephemeral by nature, there may be occasions when the execution of a notebook fails because the pool, cluster, or session is busy or unavailable. Using the retry property inside the pipeline notebook activity allows Data Factory to try the execution again, based on the number of retries you have set. In my sample below, I’ve set retry to 2, a very common practice.

The general tab of the notebook activity showing the name "Another Notebook" and the activity state set to deactivated
Use the activity settings panel to set activity state and retries
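
To make the retry behavior concrete, here is a minimal Python sketch of the semantics described above. It is only an illustration and not how Data Factory is implemented. The helper names `run_with_retries` and `make_flaky_notebook` are hypothetical, and the sketch assumes that a retry value of 2 means one initial attempt plus up to two more.

```python
import time


def run_with_retries(action, retry=2, retry_interval_seconds=0):
    """Call action(); on failure, try again up to `retry` more times.

    Returns a (result, attempts_used) tuple. Hypothetical helper for illustration.
    """
    for attempt in range(1, retry + 2):  # 1 initial attempt + `retry` retries
        try:
            return action(), attempt
        except Exception:
            if attempt == retry + 1:
                raise  # retries exhausted: surface the failure to the caller
            time.sleep(retry_interval_seconds)


def make_flaky_notebook(failures_before_success):
    """Build a stand-in for a notebook run that fails a fixed number of times."""
    state = {"calls": 0}

    def run():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("session busy")  # simulated transient failure
        return "succeeded"

    return run


# Two transient failures are absorbed by retry=2; the third attempt succeeds.
result, attempts = run_with_retries(make_flaky_notebook(2), retry=2)
print(result, attempts)
```

With retry left at 0, the first transient failure in this sketch would propagate immediately, which in the pipeline corresponds to the activity failing and the On Failure path being taken.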

Let’s circle back to the original activity called “Notebook Double”. This activity executes a very simple notebook that I authored in my workspace. The notebook takes a parameter value and exits by returning the incoming numeric value doubled. Inside the pipeline activity, I am sending a value to that parameter in the notebook, making the execution of this notebook dynamic based upon that parameter value. To make the automated execution of this pipeline even more dynamic, you can use a pipeline parameter and set the value as a parameter or expression using the “add dynamic content” link.
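
The notebook itself is not shown in this post, so the following is only a rough sketch of what a notebook like “Notebook Double” could contain. The parameter name `inputval` and its default are assumptions. The sketch also assumes that `mssparkutils` can be imported from `notebookutils` inside the Fabric runtime, and it falls back to printing the value when run anywhere else.

```python
# Parameter cell: assumed to be overwritten by the value passed in from the
# pipeline's Notebook activity. The name `inputval` is hypothetical.
inputval = 5


def double(value):
    """Return the incoming numeric value doubled."""
    return int(value) * 2


outputval = double(inputval)

try:
    # Assumption: this import resolves only inside the Fabric Spark runtime.
    from notebookutils import mssparkutils

    # Hand the result back to the pipeline as the notebook's exit value.
    mssparkutils.notebook.exit(outputval)
except ImportError:
    # Outside Fabric (for example, a local test run) just show the value.
    print(outputval)
```

Keeping the doubling logic in a small function makes it easy to test separately from the exit call, which only makes sense when the notebook runs under a pipeline or interactively in Fabric.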

I’ll finish this article with the If Condition. This is a very common Data Factory pattern for pipelines where I’ve taken the On Success workflow from the Notebook Activity and from there, I will examine the output value that I am returning to the pipeline execution from my notebook.

You can use the pipeline If Condition activity for branching and conditional execution
Use the pipeline expression builder to configure the If condition

Inside the notebook, you must send the output value back to the pipeline using:

mssparkutils.notebook.exit(outputval)

This will allow you to examine the output of the notebook execution for branching and conditional execution inside of your pipeline. In my example, the If Condition uses this expression to extract the output value from the notebook:

@equals(activity('Notebook Double').output.result.exitValue,0)

The syntax used is the Data Factory pipeline control flow expression language, and I am checking for a value of 0. If the value is 0, I will treat that as a bad result and send an automated email indicating an error return from the notebook. Otherwise, the False path of my If Condition can execute a Script activity that will log the results of the notebook. The updated If Condition will look like this:

Screenshot of the if condition inside of data pipeline with the True and False paths
Use If Then control flow to either send an email or log the results
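
To reason about the branching outside of the pipeline designer, the If Condition can be mimicked in a few lines of Python. This is a sketch under assumptions: the dictionary shape is inferred only from the `output.result.exitValue` path in the expression above, the function and branch names are hypothetical, and the sketch converts the exit value to an integer as a convenience, which is not something the pipeline expression is shown doing.

```python
def is_bad_result(activity_output):
    """Mimic @equals(activity('Notebook Double').output.result.exitValue, 0)."""
    exit_value = activity_output["result"]["exitValue"]
    # Assumption: the exit value may arrive as a string or a number,
    # so this sketch normalizes it to an int before comparing.
    return int(exit_value) == 0


def choose_branch(activity_output):
    """Return which path of the If Condition would run (names are hypothetical)."""
    if is_bad_result(activity_output):
        return "send_email"   # True path: report the bad result
    return "log_results"      # False path: log the notebook output


# A doubled value of 10 is treated as a good result and gets logged.
print(choose_branch({"result": {"exitValue": "10"}}))
# An exit value of 0 is treated as a bad result and triggers the email.
print(choose_branch({"result": {"exitValue": "0"}}))
```

If the real exit value turns out to be a string in your pipeline, it is worth confirming in a test run how the expression's comparison against the number 0 behaves.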
