Microsoft Fabric Updates Blog

Public Preview of Native Execution Engine for Apache Spark on Fabric Data Engineering and Data Science

The Native Execution Engine showcases our dedication to innovation and performance, transforming data processing in Microsoft Fabric. We are excited to announce that the Native Execution Engine for Fabric Runtime 1.2 is now available in public preview.

The Native Execution Engine uses a columnar data format and vectorized processing to speed up query execution. In our internal tests on the TPC-DS 1TB benchmark, it delivered a 4x speedup over OSS Spark. This gain comes from translating Spark SQL code into optimized C++ code, combined with Microsoft’s query optimizations embedded in Apache Spark on Fabric.

One of the standout features of the Native Execution Engine is its seamless integration with Apache Spark™ APIs. Whether you work with PySpark, Scala, R, or Spark SQL, enabling the engine requires no code changes and raises no concerns about vendor lock-in. It works with both Parquet and Delta formats, so your data is processed efficiently whether it resides in OneLake or is accessed via shortcuts. Faster processing also translates into significant operational cost savings.

Powered by innovations

At the heart of the Native Execution Engine are two key open-source components: Velox, a community-sustained C++ database acceleration library created and open-sourced by Meta, and Apache Gluten (incubating), a middle layer initiated by Intel that offloads the execution of JVM-based SQL engines to native engines. Gluten integrates seamlessly with existing Spark applications and uses modern C++ for accelerated native code execution. Using CPU vectorization and advanced memory management, Gluten provides features such as native query execution, columnar-based shuffle, and compatibility with Spark’s Catalyst optimizer. This combination enables the Native Execution Engine to leverage the native capabilities of underlying data sources, minimizing the overhead associated with data movement and serialization in traditional Spark workloads. This high-performance solution is poised to revolutionize big data analytics by significantly improving the efficiency of Spark SQL operations.
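
To make the columnar idea concrete, here is a minimal Python sketch. It is not Gluten or Velox code: the sample table, column names, and helper functions are hypothetical, and the real engines do this work in C++ with CPU vector instructions. The sketch pivots a few rows into one contiguous array per column and then runs a filter plus aggregate that reads only the columns it needs.

```python
from array import array

# Hypothetical sample rows: (order_id, quantity, unit_price_cents).
ROWS = [(1, 2, 500), (2, 1, 1250), (3, 5, 300), (4, 3, 700)]


def to_columns(rows):
    """Pivot row tuples into one typed, contiguous array per column."""
    order_ids, quantities, prices = zip(*rows)
    return {
        "order_id": array("q", order_ids),
        "quantity": array("q", quantities),
        "unit_price_cents": array("q", prices),
    }


def revenue_row_wise(rows, min_quantity):
    """Row-at-a-time: every full row is visited, including unused fields."""
    total = 0
    for _order_id, quantity, price in rows:
        if quantity >= min_quantity:
            total += quantity * price
    return total


def revenue_columnar(columns, min_quantity):
    """Column-at-a-time: only the two referenced columns are scanned."""
    quantities = columns["quantity"]
    prices = columns["unit_price_cents"]
    return sum(q * p for q, p in zip(quantities, prices) if q >= min_quantity)


COLUMNS = to_columns(ROWS)
```

Both functions return the same total; the columnar version never touches the order_id column, which is the property a native engine exploits at much larger scale.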

Query optimization improvements

Microsoft has enhanced Apache Spark query optimizations and integrated these into Apache Spark on Fabric, including: 

  • Split Block Bloom Filters to reduce false positives when verifying the existence of elements in large datasets. 
  • Parquet Footer Caching to reduce I/O operations by caching metadata, which speeds up data retrieval in large-scale analytics projects. 
  • Smart Shuffle Optimizations to improve data distribution across nodes, minimizing network overhead. 
  • Optimized Sorting for Window Functions to accelerate data sorting within partitions for more efficient processing. 
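
As an illustration of the first item, the following Python sketch implements a split block Bloom filter using the block layout and salt constants described in the Apache Parquet format specification. It is an assumption that this layout matches what Spark on Fabric uses internally; the post does not say. The class name is hypothetical, and the hash is a stand-in (BLAKE2b truncated to 64 bits) where Parquet specifies xxHash64.

```python
import hashlib

# Salt constants from the Apache Parquet split block Bloom filter spec.
SALT = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
        0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)


class SplitBlockBloomFilter:
    """Bloom filter split into 256-bit blocks of eight 32-bit words."""

    def __init__(self, num_blocks: int):
        if num_blocks < 1:
            raise ValueError("num_blocks must be at least 1")
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    @staticmethod
    def _hash64(key: bytes) -> int:
        # Stand-in 64-bit hash; Parquet itself uses xxHash64.
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def _locate(self, key: bytes):
        h = self._hash64(key)
        # The upper 32 bits pick the block; the lower 32 bits build the mask.
        block_index = ((h >> 32) * len(self.blocks)) >> 32
        low = h & 0xFFFFFFFF
        # One bit per word: the top 5 bits of (low * salt) mod 2^32.
        mask = [1 << (((low * s) & 0xFFFFFFFF) >> 27) for s in SALT]
        return self.blocks[block_index], mask

    def add(self, key: bytes) -> None:
        block, mask = self._locate(key)
        for i, bit in enumerate(mask):
            block[i] |= bit

    def might_contain(self, key: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        block, mask = self._locate(key)
        return all(block[i] & bit for i, bit in enumerate(mask))
```

Because each lookup touches a single 256-bit block, a membership check costs one cache line access, and a negative answer lets a scan skip data without reading it.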

For further details on these native optimizations, refer to the article and previous blog posts: New Query Optimization Techniques in the Spark Engine of Azure Synapse, Apache Spark in Azure Synapse – Performance Update, and Speed up your data workloads with performance updates to Apache Spark 3.1.2 in Azure Synapse.

The Native Execution Engine architecture

Within Spark’s query execution pipeline, the Catalyst optimizer remains a pivotal component. Spark first parses a query and applies numerous optimization rules to produce an optimized logical plan, which is then transformed into a physical plan for execution. Gluten converts this physical plan into a Substrait plan and uses JNI (Java Native Interface) calls to start execution in Velox. Velox, a C++ database acceleration library, offers reusable, extensible, and high-performance data processing components. When Velox does not support a particular operator, the system falls back to the existing Spark JVM-based operator. This fallback, however, incurs the overhead of converting data between columnar and row formats, which impacts performance.
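
The fallback cost described above can be sketched in a few lines of Python. This is a simplified model, not Gluten's planner: the set of natively supported operators and the example plan are hypothetical, and the model assumes one columnar-to-row or row-to-columnar conversion at each point where execution switches engines.

```python
from dataclasses import dataclass

# Hypothetical set of operators the native engine can run. The real list
# depends on the Gluten and Velox versions and is not given in this post.
NATIVE_SUPPORTED = frozenset({"Scan", "Filter", "Project", "HashAggregate", "Sort"})


@dataclass(frozen=True)
class Stage:
    operator: str
    engine: str  # "native" (Velox) or "jvm" (Spark fallback)


def assign_engines(physical_plan, supported=NATIVE_SUPPORTED):
    """Run each operator natively when supported, else fall back to the JVM."""
    return [
        Stage(op, "native" if op in supported else "jvm")
        for op in physical_plan
    ]


def count_format_conversions(stages):
    """Count engine switches; each one implies a columnar/row conversion."""
    return sum(
        1 for prev, cur in zip(stages, stages[1:]) if prev.engine != cur.engine
    )


# Hypothetical plan with one operator that the native engine does not support.
PLAN = ["Scan", "Filter", "PythonUDF", "HashAggregate", "Sort"]
STAGES = assign_engines(PLAN)
CONVERSIONS = count_format_conversions(STAGES)
```

In this example a single unsupported operator in the middle of the plan causes two conversions, one into row format and one back to columnar, which is why plans that stay fully native benefit the most.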

The architecture diagram shows how Gluten and Velox interact with Apache Spark. 

Microsoft’s recent rollout of a Native Execution Engine marks a significant enhancement to Apache Spark on Fabric, building upon open-source software while also contributing innovations back to the Gluten and Velox projects. This initiative introduces several pivotal improvements: 

  • the integration of the Azure Blob Filesystem (ABFS) storage adapter; 
  • advanced operators such as Expand, BroadcastNestedLoopJoin, and CartesianProduct; 
  • a suite of new functions, including uuid, date_from_unix_date, and from_utc_timestamp; 
  • support for INT96/INT64 timestamps in Velox parquet scans; 
  • the capability to handle metadata columns within Gluten; 
  • significant reliability enhancements achieved by addressing over 300 unit tests across 40 suites for Apache Spark versions 3.3, 3.4, and 3.5. 

Scenarios for accelerated performance 

The current release of the Native Execution Engine performs particularly well when: 

  • working with data in Parquet and Delta formats; 
  • handling queries that involve complex transformations and aggregations, benefiting from the engine’s columnar processing and vectorization capabilities; 
  • running computationally intensive queries, rather than simple or I/O-bound operations. 

Enabling the Native Execution Engine 

To use the Native Execution Engine, no code changes are required. During the preview stage, you have the flexibility to activate the Native Execution Engine either through your environment settings or selectively for an individual notebook or job. Enabling this feature within your environment item ensures that all subsequent jobs and notebooks associated with that environment will automatically inherit this configuration. For comprehensive guidance on how to activate this feature, please visit our documentation.
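
For the per-notebook option, the sketch below builds the body of a session configuration cell in Python. The property name spark.native.enabled is an assumption that is not stated in this post, and the helper function is hypothetical; confirm the exact property names for your runtime version in the documentation before relying on them.

```python
import json

# ASSUMPTION: this Spark property name does not appear in the post itself.
# Verify it against the Fabric documentation for your runtime version.
NATIVE_ENGINE_PROPERTY = "spark.native.enabled"


def configure_cell_body(enable: bool = True) -> str:
    """Return the JSON body for a %%configure cell toggling the engine."""
    conf = {NATIVE_ENGINE_PROPERTY: "true" if enable else "false"}
    return json.dumps({"conf": conf}, indent=2)


# Text to paste into the first cell of a notebook, before the session starts.
CELL_TEXT = "%%configure\n" + configure_cell_body()
```

Generating the cell text from a single constant keeps the setting easy to update if the property name differs in your runtime.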

In the meantime, we are actively developing mechanisms to enable the Native Execution Engine at the tenant, workspace, and environment levels, ensuring seamless integration with the UI. 

Results from early adopters 

 ❝As part of the private preview testing of the Native Execution Engine, I designed typical analytical queries and compared their execution times with the default engine. I was impressed by the performance and reliability of the Native Engine, which handled all queries faster and more efficiently. I did not encounter any case where the queries were slower. All of them showed a significant improvement in speed, ranging from 50% to over 300%.❞

– Damian Widera, Data Platform MVP, Objectivity part of Accenture

The results shared by one of the private preview early adopters of the Native Execution Engine are split into two charts so that the y-axis, which represents execution time, stays readable. The first chart covers queries running longer than 100 seconds; the second covers queries running less than 50 seconds.

Miles Cole (Director, Data & AI, Hitachi Solutions America) and Sandeep Pawar (Senior Power BI Architect, Hitachi Solutions America) tested the Native Execution Engine on various Data Engineering and Data Science workloads in Fabric. Miles tested a wide array of Spark SQL functionality representative of typical analytical queries, including complex joins, aggregations, windowing, and other expressions. He saw an average 1.6x improvement in execution duration at the 100M-row scale. Sandeep tested data science workflows involving NLP applications, such as text processing and transformations, and training ML models. He saw a 1.3x to 2x speedup on average. Hitachi Solutions is excited about the Native Execution Engine and anticipates using it for all Spark workloads, since it provides a significant improvement in execution time without any code changes.

❝Our tests with the Fabric Spark Native Execution Engine on complex gold layer transformations led to a 30% performance improvement. Daily processing times for tables exceeding 5 billion rows dropped from 3 hours to 2, with zero code changes!❞  

– Andrew Dakhov, Managing Partner, Cloud Services

Better together – acknowledging our partners 

We want to recognize the invaluable contributions of our partners in developing the components of the Native Execution Engine:  

  • Meta – for their active contributions to Velox. 
  • IBM and members of the Presto foundation – for their active contributions to Velox.  
  • Intel – for their active contributions to Apache Gluten (incubating) and for the past year of close collaboration with the Microsoft team in incorporating the Native Execution Engine, built on Gluten and Velox, into Apache Spark on Microsoft Fabric.

Additionally, we extend our gratitude to the more than 150 contributors to Velox and Gluten. Their commitment and ongoing efforts to accelerate big data query execution empower every individual and organization that processes big data to achieve more, thanks to elevated performance. This collaboration has also been essential in bringing the Native Execution Engine into the hands of our customers. 

❝Composable Data Management Systems point to an exciting future for Big Data. We are excited to partner with so many contributors and innovators across the industry to build this future and thank them for their efforts. With the Native Execution Engine Spark on Fabric brings together all these layers into a seamless offering that showcases the value across all aspects of data processing in a manner that only Fabric can. We see this as the early days of an emerging stack that already delivers amazing value to our customers. We look forward to a continued partnership with these communities.❞ 

– Ashit Gosalia, Partner – Group Engineering Manager, Spark on Fabric

❝Fragmentation in data management has hindered innovation. We believe that by converging efforts into unified and reusable components such as Velox, we can not only provide a more efficient platform, but also accelerate the pace of innovation. Our eventual goal is to commoditize execution and provide a common software layer that enables hardware accelerators to be more transparently (and pervasively) leveraged by data management. Collaborations with large-scale cloud vendors such as Microsoft Fabric are crucial in that journey.❞

– Pedro Eugenio Rocha Pedreira, Head of Velox & Software Engineer at Meta Platforms

❝For the last 2 years the team at IBM has worked closely with Velox contributors from Meta, Microsoft, Intel, and others. We’re excited to see how far the project has come in those two years with advancements in both the Presto and Spark native engines. Microsoft Fabric’s Native execution engine with Velox is a testament to this cross-functional community work. While it’s just the beginning, we’re looking forward to continue collaborating closely with the Velox community to bring the native engine further along in the market.❞ 

– Ali LeClerc, Chair of the Presto Foundation Outreach Committee | Co-chair of VeloxCon | Open Source at IBM 

Learn more, and help us with your feedback 

We encourage you to share your feedback directly to our product team by using this form. We look forward to your valuable input and are eager to discuss your findings in detail. 
