Building a Custom Sparklens JAR for Microsoft Fabric
Problem Statement
In the previous blog, Profiling Microsoft Fabric Spark Notebooks with Sparklens, we covered how to run Sparklens to profile and tune the performance of your Spark notebooks in Microsoft Fabric. That blog used a custom Sparklens JAR, because the Sparklens JARs available in the Maven Central repo support only Spark 2.X, which is not compatible with Microsoft Fabric. In this blog, you will learn how to build the Sparklens JAR for Spark 3.X so that it can be used in Microsoft Fabric.
Prerequisite Reading
To learn what Sparklens is, how to run it on a Microsoft Fabric Spark notebook, and how to use it to optimize performance, please check out this blog: Profiling Microsoft Fabric Spark Notebooks with Sparklens
Discussion
Sparklens is an open-source tool for profiling Spark jobs and notebooks. The latest JARs in the Maven Central repo support Spark 2.X and do not work with Spark 3.X. The steps below describe the modifications you need to make to run Sparklens on Spark 3.X.
Note: Sparklens is not owned or maintained by Microsoft, so it is crucial that you implement all necessary security measures, as you would when using any third-party package or library. Please check out the Sparklens license details here.
Steps to run Sparklens on Spark 3.X:
1. Set Up the Build Tool:
Sparklens is developed in Scala. To package a Scala project, you can use a build tool such as sbt (simple build tool). Ensure you have sbt installed on your local machine. This blog uses sbt version 0.13.18.
2. Clone the Repository:
Clone the Sparklens GitHub repository to your local machine from the following link: qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com).
git clone https://github.com/qubole/sparklens.git
3. Prepare Your Development Environment:
Use your preferred IDE to make the necessary changes. For this blog, Visual Studio Code is used. Open the terminal and navigate to the cloned sparklens directory:
cd sparklens
4. Modify plugins.sbt:
Update the plugins.sbt file to comment out the existing addSbtPlugin line for sbt-spark-package (addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")):
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
resolvers += "Spark Package Main Repo" at "https://dl.bintray.com/spark-packages/maven"
// addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")
5. Update build.sbt:
Make the following changes to the build.sbt file:
- Comment out spName, sparkVersion, and spAppendScalaVersion. These setting keys are defined by the sbt-spark-package plugin, which is commented out in plugins.sbt, so assigning them with the := operator no longer compiles. Instead, declare these three as plain variables (val).
- Comment out the line that uses sparkVersion.version and replace it with sparkVersion, since sparkVersion is now a plain String and does not have a version property.
- Change the Scala version to 2.12.0 and the Spark version to 3.0.0. Add the spark-sql 3.0.0 library dependency.
Here are the updated sections of build.sbt:
name := "sparklens"
organization := "com.qubole"
scalaVersion := "2.12.0"
crossScalaVersions := Seq("2.10.6", "2.12.0")
// spName := "qubole/sparklens"
// sparkVersion := "2.0.0"
// spAppendScalaVersion := true
val spName = "qubole/sparklens"
val sparkVersion = "3.0.0"
val spAppendScalaVersion = true
// libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion.version % "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"
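To see why the Scala version has to change together with the Spark version, it may help to spell out what the %% operator does: sbt appends the Scala binary version to the artifact name, so the dependencies above resolve to spark-core_2.12 and spark-sql_2.12, and Spark 3.0.0 artifacts are published for Scala 2.12 only. The sketch below mimics that naming rule in plain Scala for stable release versions. The names CrossVersionSketch, binaryVersion, and crossArtifact are hypothetical helpers used only for illustration; they are not part of sbt or Sparklens.

```scala
object CrossVersionSketch {
  // Scala binary version: the first two segments of a stable release version,
  // e.g. "2.12.0" -> "2.12". (Assumption: pre-release versions are out of scope.)
  def binaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  // Artifact name that a %% dependency resolves to for the given Scala version.
  def crossArtifact(name: String, scalaVersion: String): String =
    s"${name}_${binaryVersion(scalaVersion)}"

  def main(args: Array[String]): Unit = {
    println(crossArtifact("spark-core", "2.12.0"))
    println(crossArtifact("sparklens", "2.12.0"))
  }
}
```

The same suffix shows up in the name of the packaged JAR, sparklens_2.12-0.3.2.jar, and in the target/scala-2.12 output directory.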
6. Update QuboleJobListener.scala:
In QuboleJobListener.scala (src/main/scala/com/qubole/sparklens/QuboleJobListener.scala), change attemptId to attemptNumber() as shown in this code snippet:
override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
  val stageTimeSpan = stageMap(stageCompleted.stageInfo.stageId)
  if (stageCompleted.stageInfo.completionTime.isDefined) {
    stageTimeSpan.setEndTime(stageCompleted.stageInfo.completionTime.get)
  }
  if (stageCompleted.stageInfo.submissionTime.isDefined) {
    stageTimeSpan.setStartTime(stageCompleted.stageInfo.submissionTime.get)
  }
  if (stageCompleted.stageInfo.failureReason.isDefined) {
    // stage failed
    val si = stageCompleted.stageInfo
    failedStages += s""" Stage ${si.stageId} attempt ${si.attemptNumber()} in job ${stageIDToJobID(si.stageId)} failed.
                      Stage tasks: ${si.numTasks}
                      """
    stageTimeSpan.finalUpdate()
  } else {
    val jobID = stageIDToJobID(stageCompleted.stageInfo.stageId)
    val jobTimeSpan = jobMap(jobID)
    jobTimeSpan.addStage(stageTimeSpan)
    stageTimeSpan.finalUpdate()
  }
}
7. Update HDFSConfigHelper.scala:
In HDFSConfigHelper.scala (src/main/scala/com/qubole/sparklens/helper/HDFSConfigHelper.scala), the code relies on the SparkHadoopUtil class, which has been changed to a private class in Spark 3. Obtain the Hadoop configuration from a SparkSession instead, as shown below:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HDFSConfigHelper {
  // Reads the Hadoop configuration from the active (or newly created) SparkSession,
  // so SparkHadoopUtil is no longer needed and its import has been dropped.
  def getHadoopConf(sparkConfOptional: Option[SparkConf]): Configuration = {
    if (sparkConfOptional.isDefined) {
      val spark = SparkSession.builder.config(sparkConfOptional.get).getOrCreate()
      spark.sparkContext.hadoopConfiguration
    } else {
      val spark = SparkSession.builder.getOrCreate()
      spark.sparkContext.hadoopConfiguration
    }
  }
}
8. Compile the Revised Code: Run "sbt compile" to compile the project.
9. Package the Compiled Code: Run "sbt package" to package the project as a JAR file.
10. You can now use the JAR (target/scala-2.12/sparklens_2.12-0.3.2.jar) to run profiling in a Microsoft Fabric notebook: Profiling Microsoft Fabric Spark Notebooks with Sparklens.
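Before uploading the JAR, it can be worth confirming that the listener class was actually packaged. The following is a minimal sketch using only the JDK's java.util.jar API. JarCheckSketch and containsEntry are hypothetical names introduced here, and the entry path assumes the class is compiled into the com.qubole.sparklens package, as the source path in step 6 suggests.

```scala
import java.util.jar.JarFile

object JarCheckSketch {
  // Returns true if the JAR at jarPath contains an entry named entryName.
  def containsEntry(jarPath: String, entryName: String): Boolean = {
    val jar = new JarFile(jarPath)
    try jar.getEntry(entryName) != null
    finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    // Default path matches the JAR produced in step 10; pass another path as the first argument.
    val jarPath = args.headOption.getOrElse("target/scala-2.12/sparklens_2.12-0.3.2.jar")
    // Assumed entry name, derived from src/main/scala/com/qubole/sparklens/QuboleJobListener.scala.
    val listener = "com/qubole/sparklens/QuboleJobListener.class"
    println(s"$listener present: ${containsEntry(jarPath, listener)}")
  }
}
```

If the check reports that the class is missing, rerun the compile and package steps and confirm that the path points at the freshly built JAR.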
Further Reading
qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com)
Profiling Microsoft Fabric Spark Notebooks with Sparklens | Microsoft Fabric Blog | Microsoft Fabric