Microsoft OneLake in Fabric, the OneDrive for data
Microsoft OneLake brings the first multi-cloud SaaS data lake for the entire organization
See Arun Ulagaratchagan’s blog post to read the full Microsoft Fabric preview announcement.
Organizations invest heavily in data lake strategies with the vision of having a central place to store all their data, break down silos, and simplify data blending, analysis, security, governance, and discovery.
In reality, the vision is highly elusive. Enterprise data lakes are mostly implemented as custom projects using raw storage covered with massive glue code designed to enable scalability, collaboration, compliance, security, and governance. Data mesh patterns with independent business domain-driven lakes add overhead and fragmentation, with multiple teams managing their own siloed lake resources. To break down these silos, these organizations build additional complicated solutions with complex data movement to facilitate sharing and reuse. And for all this to become usable for the business side, IT organizations must also build data warehouses, data marts, and cubes, creating additional copies of the lake data. The resulting data lake implementation is often a complex and hard-to-manage system, rife with both silos and redundant data.
Introducing Microsoft OneLake
Introducing Microsoft OneLake – “the OneDrive for Data”. OneLake is a complete, rich, ready-to-go enterprise-wide data lake provided as a SaaS service. Just like organizations are using OneDrive for their documents, they now have OneLake for their data. OneLake is the core of Fabric’s lake-centric approach. It provides customers with:
- One data lake for the entire organization at scale
- One copy of data for use across multiple analytical engines
- One security model living natively with the data in the lake (coming soon)
- A centralized OneLake data hub for data discovery and management
One data lake for the entire organization
OneLake improves collaboration over a single organization-wide data lake. Each Fabric tenant will have exactly one OneLake where all the data of all the projects and for all the users will be stored. OneLake is automatically available with every Fabric tenant, with no additional resources to set up or manage.
Governed by default with distributed ownership for collaboration
The concept of a tenant is a unique benefit of a SaaS service. It establishes clear governance and compliance boundaries controlled by the tenant admin, and all data in OneLake is governed by the tenant policies. Because the system is well controlled, OneLake can be open to every user: anyone in any part of the organization can add their own contributions without friction.
Just like every Office user can create a new Teams channel or SharePoint site without coordinating with the admin, OneLake enables similar distributed ownership through workspaces. Workspaces enable different parts of the organization to work independently while all building the same data lake. Each workspace has its own administrator and access control. Each workspace is powered by a capacity that resides in a user-selected region. This means that OneLake fully accommodates customers doing business in multiple countries and natively supports local data residency requirements. OneLake spans the globe, with different workspaces residing in different countries while still remaining part of the same logical lake.
Data mesh and domains
With Microsoft OneLake, we provide a unified data lake that eliminates all data silos. However, the capabilities extend further. OneLake also provides the ability to organize and manage data in a logical way allowing different business groups to efficiently operate and control their own data. This pattern is known as “data mesh”.
With OneLake’s native support for data mesh, organizations can easily define business domains, such as Marketing, Sales, Human Resources, and more. Once domains are defined and contain the respective OneLake data, various consumption and governance capabilities light up for the domain. This allows more optimized consumption for business users and more granular control per domain for administrators.
For example, data owners and businesses can discover and consume OneLake data filtered to their areas of interest, and administrators can delegate settings to the domain level, allowing different definitions and governance per business unit.
With the built-in OneLake domains, OneLake is the first data lake that provides native support for data mesh as a service.
Open at every level
OneLake is open at every level. Built on top of Azure Data Lake Storage Gen2, OneLake can support any type of file, structured or unstructured. All Fabric data items like data warehouses and lakehouses will automatically store their data in OneLake in delta parquet format. This enables data engineers to load a lakehouse using Spark, SQL developers to load data in fully transactional data warehouses using T-SQL, and all contributors to build the same data lake.
OneLake supports the same ADLS Gen2 APIs and SDKs, so it is compatible with existing ADLS Gen2 applications, including Azure Databricks. Data in OneLake can be addressed as if it were one big ADLS storage account for the entire organization. Every Fabric workspace appears as a container within that storage account, while different data items appear as folders under those containers.
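To make the "one big storage account" picture concrete, here is a minimal sketch of how a path into OneLake could be assembled, with the workspace in the container position and the data item as the first folder. The endpoint host `onelake.dfs.fabric.microsoft.com` and the `<item>.<type>` folder naming are assumptions that this post does not state, and the helper names are purely illustrative, so check the OneLake documentation before relying on them.

```python
from urllib.parse import quote

# Assumption: the OneLake DFS endpoint host (not given in this post).
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def _segments(path: str) -> list[str]:
    """Split a relative path and drop empty segments (leading/trailing slashes)."""
    return [p for p in path.split("/") if p]


def onelake_abfss_uri(workspace: str, item: str, item_type: str, path: str = "") -> str:
    """ABFS-style URI: the workspace sits where an ADLS Gen2 container would,
    and the data item is the top-level folder inside it.
    Assumption: item folders are named '<item>.<item_type>'."""
    parts = [f"{item}.{item_type}"] + _segments(path)
    return f"abfss://{workspace}@{ONELAKE_HOST}/" + "/".join(parts)


def onelake_https_url(workspace: str, item: str, item_type: str, path: str = "") -> str:
    """HTTPS form of the same address, with each segment URL-encoded."""
    parts = [workspace, f"{item}.{item_type}"] + _segments(path)
    return f"https://{ONELAKE_HOST}/" + "/".join(quote(p) for p in parts)


# Hypothetical workspace and lakehouse names, for illustration only.
print(onelake_abfss_uri("Sales", "Orders", "Lakehouse", "Files/raw/2023.csv"))
print(onelake_https_url("Sales", "Orders", "Lakehouse", "Tables/orders"))
```

An existing ADLS Gen2 client or Spark job would then be pointed at a path of this shape instead of a per-team storage account.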
OneLake file explorer for Windows
OneLake serves as the OneDrive for data. Just like OneDrive, OneLake data is easily accessed from Windows using the OneLake file explorer for Windows. In Windows, you can navigate all your workspaces and data items, and easily upload, download, or modify files just like you can in OneDrive. The OneLake file explorer simplifies data lakes, making them accessible to even non-technical business users.
One copy of data
OneLake aims to give you the maximum value out of a single copy of data without data movement or duplication. You will no longer need to copy data just to use it with another engine, or to break down silos so that data can be analyzed with other data.
Shortcuts let you connect data across business domains without data movement
A large organization will typically have lots of data domains with different data owners. Shortcuts provide connections between different data items across domains so that data can be virtualized into a single data product without data duplication, data movement, or changing the ownership of the data.
A shortcut is a symbolic link: metadata that points from one data location to another, similar to a Windows shortcut. When you create a shortcut from one location to another, files appear in the shortcut location as if they physically exist there. Tables in a warehouse can be made available to another lakehouse without copying the data from the warehouse to the lakehouse. Since all the data is already in OneLake, you can just create a shortcut from the warehouse to the lakehouse and the data will appear in the lakehouse as if you had copied it. Because the data isn’t copied, there is no secondary copy to maintain. When data changes in the warehouse, those changes are automatically reflected in the lakehouse.
Shortcuts are used to consolidate data across workspaces and domains without changing the data ownership. The same data can be used multiple times across different locations while the original owner remains responsible for loading and managing it.
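The symbolic-link behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not how OneLake implements shortcuts: `ToyLake`, its methods, and the workspace and table paths are all hypothetical names invented for the example. The point it shows is that a shortcut stores a target path rather than data, so a change made by the owner is visible through the shortcut immediately and only one copy ever exists.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLake:
    """Toy namespace: real files, plus shortcuts that hold only a target path."""
    files: dict[str, bytes] = field(default_factory=dict)
    shortcuts: dict[str, str] = field(default_factory=dict)  # shortcut path -> target path

    def resolve(self, path: str) -> str:
        """Rewrite a path through shortcut prefixes until it names real storage."""
        for _ in range(16):  # bound the number of hops to guard against cycles
            for src, target in self.shortcuts.items():
                if path == src or path.startswith(src + "/"):
                    path = target + path[len(src):]
                    break  # a shortcut matched; resolve the rewritten path again
            else:
                return path  # no shortcut matched: this is a physical location
        raise RuntimeError("too many shortcut hops")

    def read(self, path: str) -> bytes:
        return self.files[self.resolve(path)]

    def write(self, path: str, data: bytes) -> None:
        self.files[self.resolve(path)] = data


lake = ToyLake()
# The Sales team owns the table and loads it.
lake.write("sales/warehouse/orders/part-0.parquet", b"v1")
# Marketing adds a shortcut in its lakehouse; no bytes are copied.
lake.shortcuts["marketing/lakehouse/Tables/orders"] = "sales/warehouse/orders"

before = lake.read("marketing/lakehouse/Tables/orders/part-0.parquet")
lake.write("sales/warehouse/orders/part-0.parquet", b"v2")  # owner updates the data
after = lake.read("marketing/lakehouse/Tables/orders/part-0.parquet")
print(before, after, len(lake.files))
```

In this model the Marketing lakehouse never holds its own copy, which mirrors why there is no secondary copy to keep in sync.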
Shortcuts to Azure Data Lake Storage Gen2
Many organizations already have data in lakes outside of OneLake. We have extended shortcuts to also support these data stores. You can create shortcuts to existing ADLS Gen2 accounts, enabling all your data to be virtualized into OneLake so that it appears as if it physically exists there. The owners of these accounts can continue to manage them independently of OneLake.
Shortcuts to S3 make OneLake the first multi-cloud data lake
OneLake goes beyond Microsoft and Azure to become the first multi-cloud data lake with shortcuts to Amazon S3 buckets. Through shortcuts, S3 buckets can be virtualized into OneLake. Their data is mapped to the same unified namespace and can be accessed using the same APIs, including the ADLS Gen2 APIs. Notebooks, SQL queries, and Power BI reports can all span multiple clouds without the end users needing to be aware that they are doing so. Transparent smart caching (coming soon) will bring data closer to compute and reduce egress costs.
Shortcuts to Dataverse (coming soon)
Dataverse generates shortcuts to Microsoft Fabric, along with a Synapse lakehouse and a SQL endpoint for Dynamics 365 and Power Apps data, enabling next-generation Power BI capabilities. This direct integration between Dataverse and Microsoft Fabric eliminates the need to build and maintain custom ETL pipelines or use third-party data integration tools. Dataverse shortcuts ensure that data always remains within Dataverse, and as data gets updated in Dynamics 365, changes are reflected in Power BI reports automatically.
Data analysts will be able to launch Microsoft Fabric directly from the Power Apps experience. Data engineers can launch Microsoft Fabric using Synapse Link and work with data using Python or Spark notebooks. Direct integration between Dataverse and Microsoft Fabric saves significant time and effort.
One copy of data with multiple analytical engines
Compute powers all the analytical experiences in Fabric. With OneLake in Fabric, compute is completely separate from storage. While OneLake represents the one data store for the entire organization, Fabric’s multiple analytical engines can all access the same copy of data without needing to import it into another copy. There is no longer a need to copy data just to use it with another engine. You are always able to choose the best engine for the job that you are trying to do.
For example, imagine you have a team of SQL engineers building a fully transactional data warehouse. They can use the T-SQL engine and all the power of T-SQL to create tables, transform data, and load it into tables. If a data scientist wants to make use of this data, they no longer need to go through a special Spark/SQL driver. All the data is stored in OneLake in delta parquet format. Data scientists can use the full power of the Spark engine and its open-source libraries directly over the data.
Business users can build Power BI reports directly on top of OneLake using the new Direct Lake mode in the Analysis Services engine. The Analysis Services engine is what powers Power BI Datasets and has always offered two modes of accessing data, import and direct query. Direct Lake mode gives users all the speed of import without needing to copy the data, combining the best of import and direct query. Learn more about Direct Lake at https://aka.ms/DirectLake.
If you have a data engineering team that prefers to use Spark to build a lakehouse, they can use notebooks to land their data in OneLake in delta/parquet format. That data can be consumed automatically by all engines. The same is true for data that is landed by other engines using the ADLS DFS APIs, or virtually added through shortcuts. When defining your organization’s data strategy, you no longer need to optimize for different teams with different skillsets and preferences. Teams that want to work with SQL can work with SQL. Teams that want to work with Spark can work with Spark. Teams using other engines to land their data can continue to do that. Everyone builds the same data lake. There are no silos.
One security model (coming soon)
Managing data security (table, column, and row levels) across different data engines is a persistent nightmare for customers. OneLake will bring with it a universal security model enabling you to define security definitions just once. Unlike other solutions which require you to define security definitions in some other layer, these security definitions will live in OneLake alongside the data. Security definitions will be enforced uniformly across all engines inside and outside of Fabric. This model is coming soon.
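The "define once, enforce everywhere" idea can be sketched as a toy model. Because the OneLake security model is still coming soon and this post gives no API for it, everything below is hypothetical: `TablePolicy`, the role names, and the two stand-in "engines" are invented purely to illustrate the principle that row and column rules stored with the data are applied identically no matter which engine reads it.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class TablePolicy:
    """Hypothetical policy kept next to the table: row and column rules per role."""
    row_filter: Callable[[str, Row], bool]      # (role, row) -> is the row visible?
    hidden_columns: dict[str, frozenset[str]]   # role -> columns to drop

    def apply(self, role: str, rows: list[Row]) -> list[Row]:
        hidden = self.hidden_columns.get(role, frozenset())
        return [
            {k: v for k, v in row.items() if k not in hidden}
            for row in rows
            if self.row_filter(role, row)
        ]


# One copy of the data and one policy, defined a single time.
ORDERS: list[Row] = [
    {"region": "EU", "customer": "A", "amount": 120},
    {"region": "US", "customer": "B", "amount": 80},
]
POLICY = TablePolicy(
    row_filter=lambda role, row: role == "admin" or row["region"] == "EU",
    hidden_columns={"eu_analyst": frozenset({"customer"})},
)


def governed_read(role: str) -> list[Row]:
    """Every engine reads through the same policy-enforcing path."""
    return POLICY.apply(role, ORDERS)


def sql_engine_total(role: str) -> int:
    """Stand-in for a SQL aggregate over the governed rows."""
    return sum(row["amount"] for row in governed_read(role))


def notebook_engine_rows(role: str) -> list[Row]:
    """Stand-in for a notebook reading the same governed rows."""
    return governed_read(role)


print(sql_engine_total("eu_analyst"), notebook_engine_rows("eu_analyst"))
```

The design choice worth noticing is that neither engine carries its own security rules; both inherit them from the single definition that lives with the data.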
OneLake data hub
Lastly, OneLake provides a central home for all data, but how can this data be accessed, discovered, managed, and reused? These aspects are key, as organizations increasingly require easy access to and discovery of high-quality data for reuse, decision-making, and data-driven insights. We’re excited to introduce the OneLake data hub (an evolution of the Power BI data hub). The OneLake data hub serves as a centralized interface to all data housed within OneLake, including data warehouses, lakehouses and their SQL endpoints, KQL databases, datamarts, and datasets. OneLake data hub is the central location for easy data discovery, data management, and data reuse.
With the OneLake data hub, users can see data across their business domains and filter to a specific domain that they are interested in, see all authoritative endorsed data in one place, and see all the data they own, making data management as easy as possible in one central location.
The OneLake data hub is particularly powerful for users who have access to data across multiple workspaces. The OneLake data hub explorer offers an intuitive and efficient means of browsing through workspaces in order to locate specific data items. With the explorer, users can more quickly and easily access large volumes of data.
Once data is discovered, users can perform a large variety of actions: explore its properties and tables, identify whether it is marked as sensitive and should be treated with caution, track data lineage and perform impact analysis across workspaces, reuse that data and build on top, find valuable insights and make informed business decisions, or take further action.
The OneLake data hub is integrated into multiple experiences within both Fabric service and Power BI Desktop. This integration ensures that users can quickly and easily find necessary data in any context and in a consistent manner. For instance, in Power BI Desktop, users may access the OneLake data hub experience to browse available items and connect with them, thus avoiding the need to create new data sources. This approach fosters a culture of data reusability and helps organizations meet their goals more effectively.
- Read the OneLake documentation: https://aka.ms/onelakedocs
Watch the following video to see how to eliminate data silos with OneLake.
Get started with Microsoft Fabric
Microsoft Fabric is currently in preview. Try out everything Fabric has to offer by signing up for the free trial—no credit card information required. Everyone who signs up gets a fixed Fabric trial capacity, which may be used for any feature or capability from integrating data to creating machine learning models. Existing Power BI Premium customers can simply turn on Fabric through the Power BI admin portal. After July 1, 2023, Fabric will be enabled for all Power BI tenants.
If you want to learn more about Microsoft Fabric, consider:
- Signing up for the Microsoft Fabric free trial
- Visiting the Microsoft Fabric website
- Reading the more in-depth Fabric experience announcement blogs:
  - Data Factory experience in Fabric blog
  - Synapse Data Engineering experience in Fabric blog
  - Synapse Data Science experience in Fabric blog
  - Synapse Data Warehousing experience in Fabric blog
  - Synapse Real-Time Analytics experience in Fabric blog
  - Power BI announcement blog
  - Data Activator experience in Fabric blog
  - Administration and governance in Fabric blog
  - Microsoft 365 data integration in Fabric blog
  - Dataverse and Microsoft Fabric integration blog
- Exploring the Fabric technical documentation
- Reading the free e-book on getting started with Fabric
- Exploring Fabric learn modules
- Exploring Fabric through the Guided Tour
- Watching the free Fabric webinar series
- Joining the Fabric community to post your questions, share your feedback, and learn from others
- Visiting Microsoft Fabric Ideas to submit suggestions for improvements and vote on your peers’ ideas