Microsoft Fabric Updates Blog

Enhancing Open Source: Fabric’s Contributions to FLAML for Scalable AutoML

At Fabric, we’re passionate about contributing to the open-source community, particularly in areas that advance the usability and scalability of machine learning tools. One of our recent endeavors has been making substantial contributions back to the FLAML (Fast and Lightweight AutoML) project, a robust library designed to automate the tedious and complex process of machine learning model selection and hyperparameter tuning.

What is FLAML?

FLAML is an open-source Python library for automating machine learning tasks. AutoML, one of its key capabilities, handles the often-complex work of model selection, hyperparameter tuning, and training. This makes FLAML particularly valuable for quickly building and optimizing models, even with minimal machine learning expertise. Its lightweight design and flexibility help users efficiently create high-performing models across a wide range of applications.

Scaling FLAML for Apache Spark Workloads

Recognizing the growing need for scalable solutions in large-scale data processing, we focused on enhancing FLAML’s capabilities for Spark workloads. Apache Spark is a powerhouse for big data processing, and integrating it seamlessly with AutoML processes is crucial for many enterprises looking to accelerate their machine learning pipelines. To address this, we’ve contributed several new Spark and non-Spark estimators to the FLAML project.

With `use_spark=True`, you can now parallelize training across a broader range of non-Spark models, which gives you more flexibility in model selection. We’ve also added more Spark learners, so users working directly with Spark dataframes can experiment with a wider variety of Spark model flavors.
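As a rough illustration of the `use_spark=True` path, the sketch below builds the keyword arguments for `AutoML.fit` and passes them to FLAML. The helper names `spark_parallel_settings` and `run_parallel_automl` are hypothetical, the estimator names, metric, and budget are placeholder choices, and the code assumes `flaml` and `pyspark` are installed and a Spark session is available (as in a Fabric notebook).

```python
def spark_parallel_settings(time_budget_s=30, n_concurrent_trials=2):
    """Build keyword arguments for AutoML.fit that enable Spark-parallel tuning."""
    if n_concurrent_trials < 1:
        raise ValueError("n_concurrent_trials must be at least 1")
    return {
        "task": "classification",
        "metric": "accuracy",
        "time_budget": time_budget_s,       # total search budget, in seconds
        "estimator_list": ["lgbm", "rf"],   # non-Spark learners, chosen for illustration
        "use_spark": True,                  # run trials as parallel Spark jobs
        "n_concurrent_trials": n_concurrent_trials,
        "force_cancel": True,               # cancel Spark jobs once the budget is exceeded
    }


def run_parallel_automl(X_train, y_train, **overrides):
    """Fit FLAML on in-memory (non-Spark) data while parallelizing trials with Spark."""
    # Imported lazily so the settings helper can be used without flaml installed.
    from flaml import AutoML

    settings = {**spark_parallel_settings(), **overrides}
    automl = AutoML()
    automl.fit(X_train=X_train, y_train=y_train, **settings)
    return automl
```

Calling `run_parallel_automl(X_train, y_train, time_budget=120)` would override the default budget. Whether parallel trials actually speed up the search depends on cluster size and model cost, so treat the values above as a starting point.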

| New Apache Spark model estimators | New non-Apache Spark model estimators |
| --- | --- |
| SparkAFTSurvivalRegressionEstimator | ElasticNetEstimator |
| SparkGBTEstimator | LassoLarsEstimator |
| SparkGLREstimator | SGDEstimator |
| SparkLGBMEstimator | SVCEstimator |
| SparkLinearRegressionEstimator | Average |
| SparkLinearSVCEstimator | LassoLars_TS |
| SparkNaiveBayesEstimator | Naïve |
| SparkRandomForestEstimator | SeasonalAverage |
| | SeasonalNaive |
| | TCNEstimator |

These contributions significantly enhance FLAML’s versatility, making it a more powerful tool for users dealing with large datasets. Whether you’re handling complex workloads or managing distributed data processing, FLAML can now better support your needs, thanks to these new learners.

Improved MLflow Integration for Better Collaboration

In addition to expanding FLAML’s capabilities for Apache Spark, we’ve focused on improving its integration with MLflow, a widely used open-source platform for managing the entire machine learning lifecycle. We’ve enhanced this integration by adding support for automatically capturing key metrics, parameters, and models, even when autologging is disabled. Some of these metrics and parameters are unique to AutoML and aren’t captured by standard MLflow autologging. Additionally, we’ve streamlined the process by removing redundant intermediate runs that are typically logged by standard MLflow autologging but aren’t necessary for AutoML trials.
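A minimal sketch of tracking an AutoML search with MLflow might look like the following. It assumes `flaml` and `mlflow` are installed; the function names, experiment name, and run name are hypothetical, and exactly which metrics, parameters, and models end up in the run depends on the FLAML and MLflow versions in use.

```python
def tracked_automl(X_train, y_train, experiment_name="flaml-automl-demo", time_budget_s=30):
    """Run a FLAML search inside an MLflow run so trial details are recorded there."""
    # Imported lazily so summarize_best_trial works without these packages.
    import mlflow
    from flaml import AutoML

    mlflow.set_experiment(experiment_name)
    automl = AutoML()
    with mlflow.start_run(run_name="automl-search"):
        automl.fit(
            X_train=X_train,
            y_train=y_train,
            task="classification",
            metric="accuracy",
            time_budget=time_budget_s,  # seconds
        )
    return automl


def summarize_best_trial(automl):
    """Collect the best trial's details from a fitted AutoML object for sharing."""
    return {
        "best_estimator": automl.best_estimator,  # name of the winning learner
        "best_config": dict(automl.best_config),  # its hyperparameters
        "best_loss": automl.best_loss,            # validation loss FLAML minimized
    }
```

The summary dictionary is only a convenience for comparing a result against what appears in the MLflow UI; it does not replace the logged run.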

This improvement is crucial for ensuring the reproducibility of models. By automatically capturing the key details of each AutoML trial, users can easily track the parameters and metrics that were critical to the model’s performance. This not only promotes greater transparency but also improves collaboration, as teams can more effectively share insights and build upon each other’s work within the AutoML process.

Support for Python 3.11

We’ve also extended FLAML’s support to include Python 3.11, in addition to the previously supported versions, Python 3.8 and Python 3.10. This contribution ensures that FLAML remains accessible and compatible with the latest advancements in the Python ecosystem, allowing users to leverage the performance improvements and new features introduced in Python 3.11.

A Commitment to the Community

Our contributions to the FLAML project are rooted in a commitment to helping the community. We believe that by improving open-source tools, we empower more people to leverage advanced technologies in their work. Whether you’re a data scientist, a data analyst, or a researcher, these enhancements to FLAML are designed to make your workflow easier, enabling you to scale your machine learning projects with greater ease and confidence.

As we continue to innovate and collaborate, we invite you to try out these new features and improvements and see how they can help your machine learning projects.

Learn more

To dive deeper into what FLAML has to offer and get started with its AutoML capabilities, check out the official FLAML documentation. You’ll find comprehensive guides, examples, and resources to help you make the most of this powerful tool.

You can also try these capabilities within Fabric Data Science, where FLAML’s features are seamlessly integrated for an enhanced machine learning experience.
