Microsoft Fabric Updates Blog

Fabric Change the Game: Embracing Azure Cosmos DB for NoSQL

In this new post in our ongoing series, we’ll explore setting up Azure Cosmos DB for NoSQL and leveraging the Vector Search capabilities of the Azure AI Search service through Microsoft Fabric’s Lakehouse features. We’ll also cover Cosmos DB mirroring, highlighting its seamless integration with Microsoft Fabric. It’s important to note that this approach harnesses the Search service’s capabilities, with the Python coding facilitated through the Lakehouse. This is just one of the myriad possibilities available within Fabric, and it is particularly useful if your data resides in Cosmos DB and you wish to use Fabric’s integration capabilities for search or data mirroring. Whether it’s for search enhancement or data replication, Fabric stands ready for integration, offering flexibility and efficiency.

Vector Search

For Azure Cosmos DB for NoSQL specifically, the Vector Search configuration involves the Azure OpenAI and Azure AI Search (formerly Cognitive Search) services.

You will need:

  1. An Azure Cosmos DB for NoSQL account that has already been deployed. You can even use the serverless option for cost management. The following references can help if you are starting with Azure Cosmos DB for NoSQL:
    MS Docs:
    Get started with Azure Cosmos DB for NoSQL – Training | Microsoft Learn
    Quickstart – Create Azure Cosmos DB resources from the Azure portal | Microsoft Learn
    End to End from Cyrille from MS FTA team:
    Getting started with Azure Cosmos DB – end to end example – Azure Cosmos DB Blog (microsoft.com)
  2. The plan is to use Vector Search through the Lakehouse. You also need a Microsoft Fabric workspace with a Lakehouse:
    Getting Started | Microsoft Fabric
    Get started with Microsoft Fabric – Training | Microsoft Learn
    Create a lakehouse – Microsoft Fabric | Microsoft Learn
  3. Create the Search Service: Introduction to Azure AI Search – Azure AI Search | Microsoft Learn.
    Ensure you keep track of the Search key value by navigating to the Search Service, then accessing the keys section and copying the value provided. Additionally, copy the URL values available on the overview page of the Search Service for future reference.
  4. You will also need the Azure OpenAI service; its setup is described in this post – Fabric Change the Game: Unleashing the Power of Microsoft Fabric and OpenAI for Dataset Search | Microsoft Fabric Blog | Microsoft Fabric.
  5. You’ll need to upload the files containing the embeddings into Azure Cosmos DB for NoSQL. These files can be found in the “Data” folder within the “Code_Samples” repository:

    Azure-Samples/azure-vector-database-samples: A collection of samples to demonstrate vector search capabilities using different Azure tools like Azure AI Search, PostgreSQL, Redis etc. (github.com). This repo, created by the Microsoft ISE team with many contributors (check the contribution list) but mainly Jose Perales and Raihan Alam, has solid examples of Vector Search implementations.

    Please note the repo also has a pretty cool example with Fabric and Kusto by Siliang Jiao and Gary Wang, which I encourage you to try out: azure-vector-database-samples/code_samples/fabric_kusto at main · Azure-Samples/azure-vector-database-samples (github.com)

    For more examples of Vector Search using different Cosmos DB versions, this repo has some pretty cool samples that I also used as a reference: AzureDataRetrievalAugmentedGenerationSamples/README.md at main · microsoft/AzureDataRetrievalAugmentedGenerationSamples (github.com)

Step By Step:

  1. Considering that the Cosmos DB service is already deployed (as mentioned above), you will need to create a database; my example uses the name Vector_DB:
    Quickstart – Create Azure Cosmos DB resources from the Azure portal | Microsoft Learn
  2. Inside the service, look for the URI and copy it into a notepad, as Fig 1 – URI shows. Mine, for example, is https://lilem.documents.azure.com:443/:
Fig 1 – URI

3. Also, look for Keys inside your Cosmos DB account and copy the Primary Key into the notepad, as Fig 2 – Keys shows:

Fig 2 – Keys

4. With the information gathered above, let’s proceed to create the container from the Fabric Lakehouse. Alternatively, you can create the container through the Cosmos DB UI.

%pip install azure-cosmos

from azure.cosmos import exceptions, CosmosClient, PartitionKey

cosmos_db_api_endpoint = "COPY THE URI HERE"
cosmos_db_api_key = "COPY THE KEY HERE"
database_name = "Vector_DB"        # this is your database name
text_table_name = 'text_sample'    # this is your container name

# Initialize the Cosmos DB client and create the database if it does not exist
client = CosmosClient(cosmos_db_api_endpoint, credential=cosmos_db_api_key)
database = client.create_database_if_not_exists(id=database_name)

try:
    # Create the container, partitioned on the document id
    container = database.create_container_if_not_exists(
        id=text_table_name,
        partition_key=PartitionKey(path="/id"))
    print(f"Container {text_table_name} created successfully")

except Exception as e:
    print(f"Error: {e}")

5. Upload the data into Cosmos DB.
Data: azure-vector-database-samples/code_samples/data/text/product_docs_embeddings.json at main · Azure-Samples/azure-vector-database-samples (github.com)

When it comes to insertion or uploading, you have the freedom to choose your preferred method. The repositories I mentioned earlier provide Python examples, and the Microsoft documentation offers some Bash examples as well. To simplify matters, I’ll proceed by inserting the embedding file directly from OneLake / the Fabric Lakehouse.

import pandas as pd

cosmosdb_container_name = text_table_name
container = database.get_container_client(cosmosdb_container_name)

# Read data from the JSON file (point the path at the embeddings file in your Lakehouse)
text_df = pd.read_json('/API PATH/product_docs_embeddings.json')
records = text_df.to_dict(orient='records')

# Iterate through the data and insert the items with the embeddings into the container
try:
    for item in records:
        item['@search.action'] = 'upload'
        # Convert the 'id' attribute to a string (Cosmos DB item ids are strings)
        item['id'] = str(item['id'])
        # Insert the item into the container (upsert_item could be used instead
        # to overwrite items that already exist)
        container.create_item(body=item)
    print(f"Data items inserted into the Cosmos DB container {cosmosdb_container_name}")

except exceptions.CosmosResourceExistsError as e:
    print(f"An item with ID {item['id']} already exists in {cosmosdb_container_name}...")
    print(f"Error: {e}")

except Exception as e:
    # Handle other exceptions
    print(f"Error: {e}")

6. Let’s use Azure AI Search for the search. Note: Vector database – Azure Cosmos DB | Microsoft Learn

Create the DataSource
First, let’s create the DataSource for Azure Cosmos DB using the Search Service in the Azure portal, as Fig 3 – Datasource shows:

Fig 3 – Datasource



Connection string for my example database Vector_DB: “AccountEndpoint=URI;AccountKey=YOURKEY==;Database=Vector_DB;”
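If you prefer to script this step instead of using the portal, here is a minimal sketch that creates the same data source from a Lakehouse notebook with azure-search-documents 11.4.0. The data source name (cosmos-text-sample) and the placeholder endpoint and key are illustrative assumptions; the connection string is the one shown above.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(
    "https://YOURSERVICENAME.search.windows.net",
    AzureKeyCredential("YOUR SEARCH KEY"))

# Point the Search service at the Cosmos DB container that holds the embeddings
data_source = SearchIndexerDataSourceConnection(
    name="cosmos-text-sample",  # illustrative name
    type="cosmosdb",
    connection_string="AccountEndpoint=URI;AccountKey=YOURKEY==;Database=Vector_DB;",
    container=SearchIndexerDataContainer(name="text_sample"))
indexer_client.create_data_source_connection(data_source)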

Create the index

As shown in Figures 4 and 5 (Index and Fields, respectively), let’s continue within the Search Service interface. Utilizing the UI, we’ll configure the index. Since this process is performed via the UI, you’ll need to add each field individually.

Fig 4 – Index
Fig 5 – Fields

Please note that for title_vector and content_vector you will have one extra step, which is creating the vector profile – Fig 6 – Profile:
Note:
We are using HNSW: “Hierarchical Navigable Small World (HNSW): HNSW is a leading ANN algorithm optimized for high-recall, low-latency applications where data distribution is unknown or can change frequently.” Ref: VectorSearch
About the vector size: “For each vector field, Azure AI Search constructs an internal vector index using the algorithm parameters specified on the field. Each vector is usually an array of single-precision floating-point numbers, in a field of type Collection(Edm.Single).” Ref: Vector Size

Fig 6 – Profile

Once the fields are created as in Fig 5 – Fields (above), just define a name for the index and hit the Create button. A scripted sketch of an equivalent index definition follows.
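For reference, this is roughly what the UI steps in Figs 4–6 amount to in code. It’s a minimal sketch, assuming 1536-dimensional text-embedding-ada-002 vectors and azure-search-documents 11.4.0; the profile and algorithm names are illustrative, and the index name matches my example (index_textsample3).

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SearchableField,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

index_client = SearchIndexClient(
    "https://YOURSERVICENAME.search.windows.net",
    AzureKeyCredential("YOUR SEARCH KEY"))

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="category", type=SearchFieldDataType.String, filterable=True),
    # The two vector fields each need dimensions and a vector search profile
    SearchField(
        name="title_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="vector-profile"),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="vector-profile"),
]

# HNSW algorithm plus the profile that the vector fields reference (Fig 6)
vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
    profiles=[VectorSearchProfile(
        name="vector-profile",
        algorithm_configuration_name="hnsw-config")])

index_client.create_index(SearchIndex(
    name="index_textsample3",
    fields=fields,
    vector_search=vector_search))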

Create the indexer

Now, still inside the Search Service, create the indexer using the DataSource and the index created previously, then save and run, as Fig 7 – Indexer shows (a scripted sketch follows the figure).

Fig 7 – Indexer
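The scripted equivalent of “save and run” is short. It’s a sketch under the same assumptions as above: the data source and index names come from the earlier snippets, and the indexer name is illustrative.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexer

indexer_client = SearchIndexerClient(
    "https://YOURSERVICENAME.search.windows.net",
    AzureKeyCredential("YOUR SEARCH KEY"))

indexer = SearchIndexer(
    name="cosmos-text-indexer",             # illustrative name
    data_source_name="cosmos-text-sample",  # from the data source sketch
    target_index_name="index_textsample3")  # the index created above

indexer_client.create_indexer(indexer)    # "save"
indexer_client.run_indexer(indexer.name)  # "run"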

For more information on how the Vector Search service works: VectorSearch Works

“On the indexing side, Azure AI Search takes vector embeddings and uses a nearest neighbors algorithm to place similar vectors close together in an index. Internally, it creates vector indexes for each vector field.”

So, all the configuration is done, now let’s Search!!
Libraries:

%pip install --upgrade azure-cosmos openai azure-search-documents==11.4.0

import json
import datetime
import time
from azure.core.exceptions import AzureError
from azure.core.credentials import AzureKeyCredential
from azure.cosmos import exceptions, CosmosClient, PartitionKey
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType )
import numpy as np
from typing import List
import pandas as pd
from ast import literal_eval
import openai

Functions for the vector search:

def get_embedding(text, model="text-embedding-ada-002"):
    # `client` is the AzureOpenAI client created further below; `model` is the
    # name of your embedding deployment
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def cosine_similarity(a, b):
    # Convert the input arrays to numpy arrays
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    # Check for empty arrays or arrays with zero norms
    if np.all(a == 0) or np.all(b == 0):
        return 0.0

    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity
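
A quick sanity check of the helper, with made-up three-dimensional vectors (real ada-002 embeddings have 1,536 dimensions):

a = [1.0, 0.0, 0.0]
b = [0.5, 0.5, 0.0]
print(cosine_similarity(a, a))  # 1.0    -> identical direction
print(cosine_similarity(a, b))  # ~0.707 -> partially similar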

Initialize the Connection:

database_name = "Vector_DB"
text_table_name = 'YOUR CONTAINER NAME'  # mine is text_sample3
cosmos_db_api_endpoint = "URI"
cosmos_db_api_key = "YOUR KEY"

# Configure Azure AI Search (formerly Azure Cognitive Search)
cog_search_endpoint = "https://YOURSERVICENAME.search.windows.net"
cog_search_key = "KEY of your service"

index_name = "YOUR Index Name"  # my example is index_textsample3
credential = AzureKeyCredential(str(cog_search_key))
openai.api_type = "azure"
openai.api_key = "YOUR open AI Key"
openai.api_base = "https://YOUROpenAIService.openai.azure.com/"

cosmos_client = CosmosClient(cosmos_db_api_endpoint, cosmos_db_api_key)
database = cosmos_client.get_database_client(database_name)

Script for the Search:

from openai import AzureOpenAI

container_name = text_table_name

client = AzureOpenAI(
    api_key=openai.api_key,
    api_version="2023-05-15",
    azure_endpoint=openai.api_base)

container = database.get_container_client(container_name)
search_client = SearchClient(cog_search_endpoint, index_name, credential)

query = 'tools for software development'  # example
# Pass the name of your embedding deployment here (the original referenced an
# undefined `model` variable)
query_vector = get_embedding(query, model="text-embedding-ada-002")

# Perform Azure Cognitive Search query
search_results = search_client.search(search_text=query, select=["title", "content", "category", "title_vector", "content_vector"])

for result in search_results:
    result_vector = result.get("content_vector", None)
    if result_vector is not None and len(result_vector) > 0:
        similarity_score = cosine_similarity(query_vector, result_vector)
        print(f"Title: {result['title']}")
        print(f"Score: {result['@search.score']}")
        print(f"Content: {result['content']}")
        print(f"Category: {result['category']}")
        print(f"Cosine Similarity: {similarity_score}\n")
    else:
        print(f"Skipping result with empty or missing vector.\n")

Results – Fig 8 – Search:

Fig 8 – Search


As for Vector Search, if you are interested, I encourage you to check the repositories I mentioned at the beginning of this post, where you will see the many options and implementations. The Python code can be reused inside the Lakehouse with a few changes.
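One variation worth knowing: the script above retrieves by query text and computes cosine similarity client-side. If you would rather have the index itself rank by vector similarity, here is a minimal sketch using VectorizedQuery from azure-search-documents 11.4.0, reusing the search_client and query_vector from the script above; the k value is an arbitrary choice.

from azure.search.documents.models import VectorizedQuery

# Ask the index for the 3 nearest neighbors of the query embedding
vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=3,
    fields="content_vector")

results = search_client.search(
    search_text=None,  # pure vector search; pass the query text too for hybrid search
    vector_queries=[vector_query],
    select=["title", "content", "category"])

for r in results:
    print(r["title"], r["@search.score"])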

Mirror Cosmos DB for NoSQL

Our earlier example showcased how to build Vector Search using the AI Search service with Cosmos DB for NoSQL, running the Python scripts from the Lakehouse. Now, let’s explore another option: mirroring the database into Fabric. Once the mirroring process is finalized, shortcuts can be established across Microsoft Fabric workspaces, pointing to the mirror. Furthermore, the SQL endpoint can be employed to run queries; that means you can use T-SQL commands that query the data objects but not manipulate them, as the SQL endpoint is a read-only copy.
Note: Mirroring can be stopped at any time.

Review the doc to understand the solution: Microsoft Fabric mirrored databases from Azure Cosmos DB (Preview) – Microsoft Fabric | Microsoft Learn

Step by Step:

1 – Inside Fabric – Choose Mirror Azure Cosmos DB.

As Fig 9 – Cosmos DB option illustrates:

Fig 9 – Cosmos DB option

2 – Name the Mirror that will be created, as Fig 10 – Name mirror, shows:

Fig 10 – Name mirror

3 – Choose Azure Cosmos DB for NoSQL, currently in preview, as Fig 11 – Cosmos option shows:

Fig 11 – Cosmos option

4 – Inside the Azure portal, look for your Cosmos DB for NoSQL account, open it, and copy the URI into a notepad, as Fig 12 – URI shows. Mine, for example, is https://lilem.documents.azure.com:443/. (This is the same step as for Vector Search.)

Fig 12 – URI

5 – Look for Keys inside your Cosmos DB account and copy the Primary Key into the notepad, as Fig 13 – Keys shows:

Fig 13 – Keys

6 – Use the information you copied earlier in step 4 and step 5 and input it into their respective fields as shown in Figure 14 – Mirror Fields.

Fig 14 – Mirror Fields

7 – Next, connect, select the database, and start mirroring:

Fig 15 – Mirror

There are some preliminary steps missing in the mirror configuration. The error message indicates: “The database cannot be mirrored to Fabric due to the following error: Continuous backup must be enabled before you mirror an Azure Cosmos DB database to Fabric. Please enable 7-day or 30-day continuous backup on your Azure Cosmos DB account from the Azure portal.”
Therefore, before proceeding with the mirror setup, ensure that continuous backup is enabled on your Azure Cosmos DB account, with either a 7-day or a 30-day retention period, via the Azure portal.

According to the docs (Microsoft Fabric mirrored databases from Azure Cosmos DB (Preview) – Microsoft Fabric | Microsoft Learn): “When you enable mirroring on your Azure Cosmos DB database, insert, update, and delete operations on your online transaction processing (OLTP) data continuously replicate into Fabric OneLake for analytics consumption. The continuous backup feature is a prerequisite for mirroring.”

So, let’s fix!!

Reopen the Azure Portal for Cosmos DB, locate the database you intend to mirror, and navigate to the Backup and Restore section. Select the continuous backup option, as indicated in the message. Refer to Figure 16 – Continuous, which illustrates this configuration.

Fig 16 – Continuous

After making this change, please wait for a moment. You’ll notice that the Point in Time Restore option mentioned in the documentation (Migrate an Azure Cosmos DB account from periodic to continuous backup mode | Microsoft Learn) will become available. If you select this option, you’ll see a message stating, “The Backup Policy is migrating,” as shown in Figure 17 – Policy. Hence, while the migration is in progress, please wait until it’s completed before attempting to restart the mirror in Fabric.

Fig 17 – Policy

Once the backup policy migration is finished, you can go back to Fabric and hit the Mirror button, as Fig 18 – Mirror shows:

Fig 18 – Mirror

8 – Now you can query Cosmos DB for NoSQL from the SQL endpoint, as Fig 19 – CosmosSQL shows:

Fig 19 – CosmosSQL

9 – You can even query the mirror from SSMS by connecting to the SQL endpoint. Copy the SQL connection string, as Fig 20 – Endpoint shows, and open SSMS, as Fig 21 – SSMS shows (a Python sketch for the same endpoint follows the figures):

Fig 20 – Endpoint
Fig 21 – SSMS
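The same read-only endpoint can also be queried from Python. A minimal sketch, assuming pyodbc and the ODBC Driver 18 for SQL Server are installed, and using placeholder server and database names (take the real server name from the SQL connection string in Fig 20):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=YOUR_SQL_ENDPOINT.datawarehouse.fabric.microsoft.com;"  # from Fig 20
    "Database=YOUR_MIRROR_NAME;"
    "Authentication=ActiveDirectoryInteractive;"  # signs in with Microsoft Entra ID
    "Encrypt=yes;")

cursor = conn.cursor()
# T-SQL against the mirrored container; the endpoint is read-only
cursor.execute("SELECT TOP 10 * FROM text_sample")
for row in cursor.fetchall():
    print(row)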

And, as mentioned before, Lakehouse shortcuts can be created in different workspaces (given the right permissions) to access the mirrored Cosmos DB data, as Fig 22 – Shortcuts shows:

Fig 22 – Shortcuts

Summary:


This post explored the diverse options available when integrating Cosmos DB for NoSQL with Microsoft Fabric. It delved into configuring Azure Cosmos DB for NoSQL with the Vector Search services, leveraging Microsoft Fabric’s Lakehouse capabilities, and into Cosmos DB mirroring, highlighting its seamless collaboration with Microsoft Fabric. It’s essential to recognize that this approach maximizes the Search services’ potential, with the Python coding streamlined through the Lakehouse. This represents just one of the myriad possibilities within Fabric, particularly beneficial if your data resides in Cosmos DB, allowing you to harness Fabric’s integration capabilities for search or data mirroring needs. Whether it’s for enhancing search functionality or replicating data, Fabric offers a versatile and efficient integration solution.
