Microsoft Fabric Updates Blog

Building a Custom Sparklens JAR for Microsoft Fabric

Problem Statement

In the previous blog, Profiling Microsoft Fabric Spark Notebooks with Sparklens, we covered how to run Sparklens to profile and tune the performance of your Spark notebooks in Microsoft Fabric. That blog used a custom Sparklens JAR, because the Sparklens JARs available in the Maven Central repo support only Spark 2.X, which is not compatible with Microsoft Fabric. In this blog, you will learn how to build the Sparklens JAR for Spark 3.X, which can be used in Microsoft Fabric.

Prerequisite Reading

To learn what Sparklens is, how to run it in a Microsoft Fabric Spark notebook, and how to use it to optimize performance, please check out this blog: Profiling Microsoft Fabric Spark Notebooks with Sparklens

Discussion

Sparklens is an open-source tool for profiling Spark jobs and notebooks. The latest JARs in the Maven Central repo support Spark 2.X and don't work with Spark 3.X. The modifications below are needed to run Sparklens on Spark 3.X.

Note: Sparklens is not owned or maintained by Microsoft. Apply the same security precautions you would take with any third-party package or library. Please check out the Sparklens license details here.

Steps to run Sparklens on Spark 3.X:

1. Set Up the Build Tool:

Sparklens is developed in Scala. To package a Scala project, you can use build tools like sbt (simple build tool). Ensure you have sbt installed on your local machine. This blog uses sbt version 0.13.18.

2. Clone the Repository:

Clone the Sparklens GitHub repository to your local machine from the following link: qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com).

git clone https://github.com/qubole/sparklens.git

3. Prepare Your Development Environment:

Use your preferred IDE to make the necessary changes. For this blog, Visual Studio Code is used. Open the terminal and navigate to the cloned Sparklens directory:

cd sparklens

4. Modify plugins.sbt:

Update the plugins.sbt file to comment out the existing addSbtPlugin line for sbt-spark-package (addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")), so that the file reads:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")

resolvers += "Spark Package Main Repo" at "https://dl.bintray.com/spark-packages/maven"

// addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")
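If you prefer to script this edit rather than make it by hand, the following is a minimal sketch using sed. The function name comment_out_spark_package is hypothetical (it is not part of Sparklens or sbt), and the demo runs against a temporary copy of the file contents rather than the real repository.

```shell
# comment_out_spark_package is a hypothetical helper name, not part of Sparklens or sbt.
# It prefixes the sbt-spark-package plugin line with "// " in the given plugins.sbt file.
comment_out_spark_package() {
  local file="$1"
  # Lines that are already commented out do not match, so the edit is safe to repeat.
  # sed keeps a backup of the original next to the file as <file>.bak.
  sed -i.bak -E 's|^[[:space:]]*(addSbtPlugin\("org\.spark-packages".*)$|// \1|' "$file"
}

# Demo on a temporary file with the two plugin lines from plugins.sbt.
tmp="$(mktemp -d)"
cat > "$tmp/plugins.sbt" <<'EOF'
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")
EOF
comment_out_spark_package "$tmp/plugins.sbt"
cat "$tmp/plugins.sbt"
```

To apply it to the cloned repository, pass the path of the real plugins.sbt file instead of the temporary one.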

5. Update build.sbt:

Make the following changes to the build.sbt file:

  • Comment out spName, sparkVersion, and spAppendScalaVersion. These are setting keys (assigned with the := operator) that are defined by the sbt-spark-package plugin, which is commented out in plugins.sbt, so they no longer resolve. Instead, declare these three as plain variables (val).
  • Comment out the line that uses sparkVersion.version and replace it with sparkVersion, since sparkVersion is now a String and does not have a version property.
  • Change the Scala version to 2.12.0 and the Spark version to 3.0.0. Add the spark-sql 3.0.0 library dependency.

Here are the updated sections of build.sbt:

name := "sparklens"
organization := "com.qubole"

scalaVersion := "2.12.0"

crossScalaVersions := Seq("2.10.6", "2.12.0")

// spName := "qubole/sparklens"

// sparkVersion := "2.0.0"

// spAppendScalaVersion := true

val spName = "qubole/sparklens"

val sparkVersion = "3.0.0"

val spAppendScalaVersion = true


// libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion.version % "provided"

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"

6. Update QuboleJobListener.scala:

In QuboleJobListener.scala (src/main/scala/com/qubole/sparklens/QuboleJobListener.scala), change attemptId to attemptNumber() as shown in this code snippet:

override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val stageTimeSpan = stageMap(stageCompleted.stageInfo.stageId)
    if (stageCompleted.stageInfo.completionTime.isDefined) {
      stageTimeSpan.setEndTime(stageCompleted.stageInfo.completionTime.get)
    }
    if (stageCompleted.stageInfo.submissionTime.isDefined) {
      stageTimeSpan.setStartTime(stageCompleted.stageInfo.submissionTime.get)
    }

    if (stageCompleted.stageInfo.failureReason.isDefined) {
      //stage failed
      val si = stageCompleted.stageInfo
      failedStages += s""" Stage ${si.stageId} attempt ${si.attemptNumber()} in job ${stageIDToJobID(si.stageId)} failed.
                      Stage tasks: ${si.numTasks}
                      """
      stageTimeSpan.finalUpdate()
    }else {
      val jobID = stageIDToJobID(stageCompleted.stageInfo.stageId)
      val jobTimeSpan = jobMap(jobID)
      jobTimeSpan.addStage(stageTimeSpan)
      stageTimeSpan.finalUpdate()
    }
  }
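After making this change, it can help to confirm that no other references to attemptId remain in the Scala sources. The sketch below wraps a recursive grep. The function name find_attempt_id_usages is hypothetical, and whether any other occurrences exist in the repository is not stated in this blog.

```shell
# find_attempt_id_usages is a hypothetical helper name, not part of Sparklens.
# It prints every line in the .scala files under the given directory that still
# mentions attemptId, as path:line-number:text.
# The exit status is non-zero when nothing is found, which is the desired end state.
find_attempt_id_usages() {
  local src_dir="$1"
  grep -rn --include='*.scala' 'attemptId' "$src_dir"
}

# Usage (from the cloned repository root):
#   find_attempt_id_usages src/main/scala
```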

7. Update HDFSConfigHelper.scala:

In HDFSConfigHelper.scala (src/main/scala/com/qubole/sparklens/helper/HDFSConfigHelper.scala), the code relies on the SparkHadoopUtil class, which has been changed to a private class in Spark 3. Obtain the Hadoop configuration from a SparkSession instead, as shown below:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HDFSConfigHelper {
  // Obtain the Hadoop configuration from a SparkSession instead of SparkHadoopUtil.
  def getHadoopConf(sparkConfOptional: Option[SparkConf]): Configuration = {
    if (sparkConfOptional.isDefined) {
      // Reuse the active session, or create one from the supplied SparkConf.
      val spark = SparkSession.builder.config(sparkConfOptional.get).getOrCreate()
      spark.sparkContext.hadoopConfiguration
    } else {
      // No SparkConf supplied: fall back to the active (or default) session.
      val spark = SparkSession.builder.getOrCreate()
      spark.sparkContext.hadoopConfiguration
    }
  }
}

8. Compile the Revised Code: Run “sbt compile” to compile the project.

9. Package the Compiled Code: Run “sbt package” to package the project as a JAR file.

10. Use the JAR: You can now use the JAR (target/scala-2.12/sparklens_2.12-0.3.2.jar) to profile your Microsoft Fabric notebooks, as described in Profiling Microsoft Fabric Spark Notebooks with Sparklens.
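Steps 8 and 9 can also be combined into one small helper that compiles, packages, and then lists the resulting JAR. This is a sketch under a few assumptions: build_sparklens is a hypothetical function name, sbt is expected to be on your PATH, and the output location follows the JAR path given in step 10.

```shell
# build_sparklens is a hypothetical helper name; it wraps the sbt commands from steps 8 and 9.
build_sparklens() {
  local repo_dir="$1"
  # Run in a subshell so the caller's working directory is left unchanged.
  (
    cd "$repo_dir" || exit 1
    sbt compile && sbt package
  ) || return 1
  # List the packaged JAR; ls exits non-zero if no JAR was produced.
  ls "$repo_dir"/target/scala-2.12/sparklens_2.12-*.jar
}

# Usage (assumes the repository was cloned into ./sparklens):
#   build_sparklens ./sparklens
```

If either sbt command fails, the function returns a non-zero status without listing anything.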

Further Reading

qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com)

Profiling Microsoft Fabric Spark Notebooks with Sparklens | Microsoft Fabric Blog | Microsoft Fabric
