Use Fabric Data Factory Data Pipelines to Orchestrate Notebook-based Workflows
Microsoft Fabric Data Factory’s data pipelines enable data engineers to build complex workflows that can orchestrate many different types of data processing, data movement, data transformation, and other activity types. In this post, I want to focus on some good practices when building Fabric Spark Notebook workflows using Data Factory in Fabric with data pipelines.
In the sample pipeline above, I have 2 Notebook activities, 1 Teams activity, and 1 If Condition with 2 more activities inside the container. Both Notebook activities sit at the start of the pipeline, but only the activity named “Notebook Double” has a connector to the If Condition. This means that the two Notebook activities will execute in parallel, while the If Condition will execute in sequence after the notebook from the “Notebook Double” activity completes. The connector between the Notebook Double activity and the If Condition uses the “On Success” port of the first activity, meaning that the If Condition will only execute following a successful signal from the notebook. Notice that there is also a red connector attached to the Notebook Double “On Failure” port. If the notebook execution fails, the pipeline will call the Teams activity and send a Teams message. Using the output ports of your notebook activity to route success and failure down different paths is a common pattern.
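For readers who prefer code to diagrams, the success and failure routing described above can be sketched in plain Python. This is only an illustration of the control flow, not anything Data Factory runs; the function name `route_notebook_double` and the returned labels are hypothetical.

```python
from typing import Callable, Tuple


def route_notebook_double(notebook: Callable[[], object]) -> Tuple[str, object]:
    """Mimic the two output ports of the 'Notebook Double' activity.

    Returns the name of the next activity plus the value handed to it.
    """
    try:
        result = notebook()
    except Exception as err:
        # On Failure port: the pipeline calls the Teams activity.
        return ("Teams", str(err))
    # On Success port: the pipeline continues to the If Condition.
    return ("If Condition", result)


def failing_notebook() -> object:
    raise RuntimeError("notebook failed")


success_route = route_notebook_double(lambda: 4)
failure_route = route_notebook_double(failing_notebook)
```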
You may notice that the “Another Notebook” activity is grey, indicating that it has been set to “Deactivated”. This is another good practice when building and debugging your pipelines. I’ve not yet completed my configuration for this activity, but I need to test the pipeline with the first notebook path. Setting the activity to “Deactivated” essentially comments out that activity so that the pipeline can be saved and run as a test. To run the debug test, just click the Run button on the ribbon as shown below.
Another good practice when orchestrating your Fabric Notebooks is to set the “Retry” property on the activity to a value greater than 0. When many Spark notebooks run in an automated pipeline environment, an execution will occasionally fail for transient reasons, for example because the pool, cluster, or session is busy or unavailable. The Retry property on the pipeline Notebook activity lets Data Factory try the execution again, up to the number of retries you have set. In my sample below, I’ve set Retry to 2, a very common practice.
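To make the effect of the Retry setting concrete, here is a minimal Python sketch of the behavior as I understand it: one initial run plus up to `retry` further attempts. It is a model of the idea only. The helper `run_with_retry`, the `interval_seconds` argument, and the simulated `flaky_notebook` are all hypothetical names for illustration, not Data Factory APIs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(activity: Callable[[], T], retry: int = 2,
                   interval_seconds: float = 0.0) -> T:
    """Run `activity`; if it raises, run it again up to `retry` more times."""
    if retry < 0:
        raise ValueError("retry must be 0 or greater")
    attempts = retry + 1  # the first run plus the configured retries
    for attempt in range(1, attempts + 1):
        try:
            return activity()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted, so the failure is surfaced
            time.sleep(interval_seconds)
    raise AssertionError("unreachable")


# Simulated notebook that fails twice (busy session) and then succeeds.
calls = {"count": 0}


def flaky_notebook() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("session unavailable")
    return "ok"


result = run_with_retry(flaky_notebook, retry=2)
```

With Retry left at 0, the same simulated notebook would fail on its first attempt and the pipeline would take the failure path straight away.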
Let’s circle back to the original activity called “Notebook Double”. This activity executes a very simple notebook that I authored in my workspace: it takes a parameter value and exits by returning the incoming numeric value doubled. Inside the pipeline activity, I am sending in a value for that notebook parameter, which makes the execution of the notebook dynamic based on that parameter value. To make the automated execution of this pipeline even more dynamic, you can use a pipeline parameter and set the value as a parameter or expression using the “add dynamic content” link.
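A notebook like the one described could look roughly like the sketch below. This is not the actual notebook from my workspace. The parameter name `input_value` and the helper `double` are hypothetical, and I am assuming that the first assignment lives in a cell marked as the parameter cell so that the pipeline can override it. Inside a Fabric notebook `mssparkutils` is normally already available; the guarded import is only there so the sketch also runs outside Fabric, where it prints the value instead of exiting.

```python
# Parameter cell (assumption: toggled as a parameter cell in the notebook).
# The Notebook activity's base parameter of the same name overrides this default.
input_value = 1  # hypothetical parameter name


def double(value: int) -> int:
    """Return the incoming numeric value doubled."""
    return value * 2


outputval = double(input_value)

try:
    # Available in the Fabric Spark runtime; absent on a local machine.
    from notebookutils import mssparkutils
except ImportError:
    mssparkutils = None

if mssparkutils is not None:
    # Hand the value back to the pipeline. Converting to a string is a
    # precaution on my part, not something the pipeline requires me to state.
    mssparkutils.notebook.exit(str(outputval))
else:
    print(outputval)
```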
I’ll finish this article with the If Condition. This is a very common Data Factory pipeline pattern: I follow the On Success path from the Notebook activity, and from there I examine the output value that my notebook returns to the pipeline execution.
Inside the notebook, you must send the output value back to the pipeline using:
mssparkutils.notebook.exit(outputval)
This will allow you to examine the output of the notebook execution for branching and conditional execution inside of your pipeline. In my example, the If Condition uses this expression to extract the output value from the notebook:
@equals(activity('Notebook Double').output.result.exitValue,0)
The syntax used is the Data Factory pipeline control flow expression language, and I am checking for a value of 0. If the value is 0, I will treat that as a bad result and send an automated email indicating that the notebook returned an error. When the value is not 0, the False path of my If Condition can execute a Script activity that will log the results of the notebook. The updated If Condition will look like this:
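For readers who think in code, the check that the If Condition expression performs can be mimicked in plain Python. This is a hedged sketch rather than what Data Factory executes. The dictionary stands in for the output of the “Notebook Double” activity, the function name `if_condition_branch` and the branch labels are hypothetical, and because I am not assuming whether the exit value arrives as a number or as a string, the sketch accepts both.

```python
from typing import Any, Dict


def if_condition_branch(activity_output: Dict[str, Any]) -> str:
    """Pick a branch from output.result.exitValue, as the expression does."""
    exit_value = activity_output["result"]["exitValue"]
    try:
        is_zero = float(exit_value) == 0
    except (TypeError, ValueError):
        is_zero = False  # a non-numeric exit value is not treated as 0
    # True path: 0 is the bad result, so send the error email.
    # False path: run the Script activity that logs the notebook result.
    return "send_email" if is_zero else "run_script"


bad_run = if_condition_branch({"result": {"exitValue": "0"}})
good_run = if_condition_branch({"result": {"exitValue": "84"}})
```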