
Use Fabric Data Factory Data Pipelines to Orchestrate Notebook-based Workflows

Microsoft Fabric Data Factory’s data pipelines enable data engineers to build complex workflows that can orchestrate many different types of data processing, data movement, data transformation, and other activity types. In this post, I want to focus on some good practices for building Fabric Spark notebook workflows with Data Factory data pipelines in Fabric.

Execute Notebook activities in data pipelines using parameter settings
Data pipeline example in Fabric Data Factory executing Notebooks

In the sample pipeline above, I have 2 Notebook activities, 1 Teams activity, and 1 If Condition with 2 more activities inside the container. Both Notebook activities sit at the start of the pipeline, but only the activity named “Notebook Double” has a connector to the If Condition. This means the Notebook activities will execute in parallel, while the If Condition will execute in sequence after the notebook behind the “Notebook Double” activity completes. The connector between the Notebook Double activity and the If Condition uses the “On Success” port of the first activity, so the If Condition will only execute following a successful signal from the notebook. Notice there is also a red connector attached to the Notebook Double “On Failure” port. If the notebook execution fails, the pipeline will call the Teams activity and send a Teams message. Using the output ports of your notebook execution to follow different paths for success and failure is a common pattern.
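If you open the pipeline’s JSON view, these success and failure paths appear as dependency conditions on the downstream activities. The sketch below is illustrative only: the activity names match the example above, but the type names, the “Send Teams Message” activity name, and the omitted portions of each activity definition are assumptions rather than the actual pipeline JSON.

"activities": [
  {
    "name": "If Condition",
    "type": "IfCondition",
    "dependsOn": [
      { "activity": "Notebook Double", "dependencyConditions": [ "Succeeded" ] }
    ]
  },
  {
    "name": "Send Teams Message",
    "type": "Teams",
    "dependsOn": [
      { "activity": "Notebook Double", "dependencyConditions": [ "Failed" ] }
    ]
  }
]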

You may notice that the “Another Notebook” activity is grey, indicating that it has been set to “Deactivated”. This is another good practice when building and debugging your pipelines. I’ve not yet completed the configuration for this activity, but I need to test the pipeline with the first notebook path. Setting the activity to “Deactivated” essentially comments it out so that the pipeline can still be saved and run as a test. To run the debug test, just click the Run button on the ribbon as shown below.

Screenshot of activity output ports to enable workflow redirection
Use activity output ports to redirect your workflow

Another good practice when orchestrating your Fabric notebooks is to set the “Retry” property on the activity to a value greater than 0. Because of the transient and ephemeral nature of running many Spark notebooks in an automated pipeline environment, there may be occasions when a notebook execution fails because the pool, cluster, or session is busy or unavailable. Setting the retry property on the pipeline’s Notebook activity allows Data Factory to attempt the execution again, up to the number of retries you have set. In my sample below, I’ve set the retry to 2, a very common practice.
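If you prefer to see where that setting lands, the retry configuration is part of the activity’s policy block in the pipeline JSON. The snippet below is a rough sketch under that assumption; the retry interval and timeout values shown here are placeholders, not values taken from the example pipeline.

"policy": {
  "retry": 2,
  "retryIntervalInSeconds": 30,
  "timeout": "0.12:00:00"
}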

The general tab of the notebook activity showing the name "Another Notebook" and the activity state set to deactivated
Use the activity settings panel to set activity state and retries

Let’s circle back to the original activity called “Notebook Double”. This activity executes a very simple notebook that I authored in my workspace: it takes a parameter value and exits by returning the incoming numeric value doubled. Inside the pipeline activity, I am sending a value in to that notebook parameter, making the execution of the notebook dynamic based upon that parameter value. To make the automated execution of this pipeline even more dynamic, you can use a pipeline parameter and set the value as a parameter or expression using the “Add dynamic content” link.
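The doubling notebook itself isn’t shown in this post, so here is a minimal sketch of what it might look like. The parameter name input_value is hypothetical; it would live in a cell toggled as a parameter cell so the pipeline’s base parameters can override it at run time.

# Hypothetical parameter cell; the pipeline's base parameter overrides this default at run time.
input_value = 1

# Double the incoming value and hand it back to the pipeline as the notebook's exit value.
mssparkutils.notebook.exit(str(input_value * 2))

On the pipeline side, the base parameter for input_value could then be set with dynamic content such as @pipeline().parameters.StartValue, where StartValue is a hypothetical pipeline parameter.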

I’ll finish this article with the If Condition. This is a very common Data Factory pattern for pipelines: I take the On Success path from the Notebook activity and, from there, examine the output value that the notebook returns to the pipeline execution.

You can use the pipeline If Condition activity for branching and conditional execution
Use the pipeline expression builder to configure the If condition

Inside the notebook, you must send the output value back to the pipeline using:

mssparkutils.notebook.exit(outputval)

This will allow you to examine the output of the notebook execution for branching and conditional execution inside of your pipeline. In my example, the If Condition uses this expression to extract the output value from the notebook:

@equals(activity('Notebook Double').output.result.exitValue,0)

The syntax used is the Data Factory pipeline control flow expression language, and I am checking for a value of 0. If the value is 0, I will treat that as a bad result and send an automated email indicating an error returned from the notebook. The False path of my If Condition can then execute a Script activity that will log the results of the notebook. The updated If Condition will look like this:
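As one illustration of that logging step, the Script activity on the False path could run a statement like the one below, built with the pipeline expression builder’s string interpolation. The pipeline_notebook_log table and its columns are hypothetical, not objects created in the post.

INSERT INTO pipeline_notebook_log (pipeline_run_id, exit_value)
VALUES ('@{pipeline().RunId}', '@{activity('Notebook Double').output.result.exitValue}')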

Screenshot of the if condition inside of data pipeline with the True and False paths
Use the If Condition control flow to either send an email or log the results
