Microsoft Fabric Updates Blog

Privacy by Design: PII Detection and Anonymization with PySpark on Microsoft Fabric

Introduction

Whether you’re building analytics pipelines or conversational AI systems, the risk of exposing sensitive data is real. AI models trained on unfiltered datasets can inadvertently memorize and regurgitate PII, leading to compliance violations and reputational damage. This blog explores how to build scalable, secure, and compliant data workflows using PySpark, Microsoft Presidio, and Faker—covering hands-on examples of detection, masking, hashing, and synthetic data generation that apply equally to data engineering and AI use cases.

Data Anonymization

Data anonymization is the process of transforming personal or sensitive data in such a way that the individuals to whom the data pertains can no longer be identified—either directly or indirectly. This is a critical step in ensuring compliance with privacy regulations like GDPR, PDPA, and HIPAA, and in enabling safe data sharing for analytics and AI model training.

Unlike encryption, which is reversible with a key, anonymization aims to be irreversible, ensuring that once data is anonymized, it cannot be traced back to an individual.

Why PII Anonymization Matters

PII such as names, emails, phone numbers, and national IDs can be inadvertently exposed during data processing. Anonymizing this data ensures:

  • Compliance with regulations like GDPR and PDPA
  • Reduced risk of data breaches
  • Fairness in AI/ML models by removing bias-inducing attributes

Common Data Anonymization Techniques


Effective data anonymization requires the application of techniques suited to the nature of the data and its intended use. Below are the most widely used methods:

Masking –

Masking involves replacing original data values with specified characters, either partially or completely. For example, sensitive information like contact numbers might appear as “XXX-XXX-XXXX.” This technique is particularly useful for safeguarding data in testing or development environments.

Original: jack@example.com -> Masked: j***@example.com
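As a minimal sketch, the partial masking shown above could be written in plain Python as follows. The helper name `mask_email` and the choice to keep only the first character of the local part are illustrative assumptions, not a prescribed rule.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognisable address; return it unchanged rather than guess.
        return email
    return f"{local[0]}***@{domain}"


print(mask_email("jack@example.com"))  # j***@example.com
```

Keeping the domain preserves some analytical value (for example, counting customers per email provider) while hiding the individual mailbox.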

Hashing –

Hashing applies a one-way cryptographic function to convert data into a fixed-length string of characters (the hash value). Because the same input always yields the same hash, it is useful for consistent anonymization across datasets.

Original (Customer Id): 234568 -> hash value: c9e1c6a7b5e2e3e9b8a7c2e6e1a5e8a7b8e1a5c0e6e1a5e8a7b8e1a5c0e6e1a5e8a7b8e1
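A small sketch of hashing with Python's standard `hashlib` module is shown below, assuming SHA-256 as the hash function. The helper name `hash_identifier` and its optional `salt` parameter are illustrative assumptions. A salt is worth considering because short numeric IDs can otherwise be recovered by hashing every candidate value and comparing digests.

```python
import hashlib


def hash_identifier(value, salt: str = "") -> str:
    """Return the SHA-256 hex digest of an identifier (optionally salted)."""
    data = f"{salt}{value}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


digest = hash_identifier(234568)
print(len(digest))                        # 64 hex characters for SHA-256
print(digest == hash_identifier(234568))  # True: same input, same digest
```

If a salt is used, it has to be the same across all datasets that need to be joined on the hashed column, and it should be stored as a secret.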

Encryption –

Encryption secures data by encoding it with sophisticated algorithms, rendering it unreadable without authorized decryption keys. While ideal for securing data in transit and storage, encryption may not align with datasets intended for public or external sharing.

Original (Customer Id): 234568 -> Encrypted value: VGVzdFN0cmluZw==

Generalization –

This technique minimizes the specificity of data to reduce the risk of identification. For instance, sharing only the birth year rather than the complete date of birth is a form of generalization, often employed in demographic studies.

Original: 1985-07-23 → Generalized: 1985
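The generalization above can be sketched in a few lines of standard-library Python. The function names and the 10-year age band are assumptions chosen for illustration; the right level of coarseness depends on the dataset.

```python
from datetime import date


def generalize_birth_date(iso_date: str) -> str:
    """Reduce a full ISO date (YYYY-MM-DD) to its year."""
    return str(date.fromisoformat(iso_date).year)


def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(generalize_birth_date("1985-07-23"))  # 1985
print(generalize_age(35))                   # 30-39
```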

Suppression –

Suppression eliminates sensitive information entirely from a dataset. While effective at maintaining privacy, the utility of the resulting data may be diminished due to the loss of critical details.

Original: "Her SSN number is 123-45-6789" → Suppressed: "Her SSN number is ***-**-****"
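For free text, a regular expression is one simple way to reproduce the example above. This is a sketch that assumes the identifier always appears in the NNN-NN-NNNN layout; the names `SSN_PATTERN` and `suppress_ssn` are illustrative, and a production pipeline would more likely rely on a dedicated detector such as Presidio.

```python
import re

# Matches the common NNN-NN-NNNN layout only; other layouts are not covered.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def suppress_ssn(text: str, placeholder: str = "***-**-****") -> str:
    """Replace every SSN-shaped substring with a fixed placeholder."""
    return SSN_PATTERN.sub(placeholder, text)


print(suppress_ssn("Her SSN number is 123-45-6789"))
# Her SSN number is ***-**-****
```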

Perturbation –

Perturbation introduces noise or intentional modifications to data, creating uncertainty about individual records. Techniques such as differential privacy are categorized under perturbation, offering mathematical assurances of privacy preservation.

Original: "Her age is 35" -> Perturbed: "Her age is 36"
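To make the idea of added noise concrete, here is a hedged sketch of Laplace noise, the distribution commonly used in differential privacy. The function names, the `sensitivity` and `epsilon` values, and the fixed seed are all assumptions for illustration. In practice, differential privacy is usually applied to query results (such as counts or averages) rather than to single fields, and choosing the parameters correctly is beyond this sketch.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def perturb(value: float, sensitivity: float, epsilon: float,
            rng: random.Random) -> float:
    """Add Laplace noise with scale = sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
noisy_age = perturb(35, sensitivity=1.0, epsilon=0.5, rng=rng)
print(round(noisy_age))
```

A smaller `epsilon` means a larger noise scale: stronger privacy, but less accurate values.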

Synthetic Data Generation –

Synthetic data generation creates artificial datasets that mimic the patterns and characteristics of real data but contain no actual personal information.

Original: Her SSN number is 123-45-6789 -> Synthetic: Her SSN number is 987-65-4321

Pseudonymization –

Pseudonymization replaces identifiable data with unique pseudonyms or identifiers. Unlike complete anonymization, pseudonymized data can be reverted to its original form if the linking key is retained, making it highly suitable for controlled environments.

Original: Her SSN number is 123-45-6789 -> Pseudonymized: Her SSN number is TOKEN-ABCD-EFGH
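The reversible mapping described above can be sketched as a small in-memory token vault. The class name `Pseudonymizer`, the `TOKEN-XXXXXXXX` format, and the in-memory dictionaries are assumptions for illustration; a real system would keep the mapping in a secured, access-controlled store.

```python
import secrets


class Pseudonymizer:
    """Keeps a two-way mapping between original values and random tokens."""

    def __init__(self) -> None:
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so a value always maps to the same pseudonym.
        if value not in self._token_by_value:
            token = f"TOKEN-{secrets.token_hex(4).upper()}"
            while token in self._value_by_token:  # regenerate on a collision
                token = f"TOKEN-{secrets.token_hex(4).upper()}"
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        # Only callers with access to the mapping can reverse a token.
        return self._value_by_token[token]


vault = Pseudonymizer()
token = vault.tokenize("123-45-6789")
print(vault.detokenize(token))  # 123-45-6789
```

Because the tokens are random rather than derived from the input, the data cannot be reversed without the mapping, which is what separates this from plain hashing.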

Fabric Implementation: Privacy by design at scale

To operationalize privacy in modern data ecosystems, this implementation combines Microsoft Fabric’s Lakehouse, Data Engineering, and Data Factory with PySpark and Microsoft Presidio. The setup enables automated, scalable, and standardized PII protection across your data pipelines and AI workflows. Whether you’re masking emails, hashing IDs, scrubbing free-text and comment columns, or generating synthetic personas, this stack keeps your data useful without compromising privacy.

Microsoft Fabric Implementation: Privacy by Design at scale

By integrating open-source tools like Presidio with PySpark, we can implement robust PII detection and anonymization strategies at scale that align with privacy-by-design principles.

Tech Stack Overview

  • Platform: Microsoft Fabric (Lakehouse + Data Engineering + Data Pipelines)
  • Processing Engine: PySpark
  • PII Detection: Microsoft Presidio
  • Anonymization Techniques: Masking, Hashing, Encryption
  • Synthetic Data Generation: Faker library

While there are multiple approaches to implementing data privacy at scale (including third-party tools, built-in solutions like MIP labels and Microsoft Purview, or custom PySpark pipelines with AI functions), this blog focuses specifically on the following three approaches, all using PySpark as the processing engine:

1. Identify and Anonymize PII in Structured and Unstructured Data Using Presidio

Microsoft Presidio is an open-source framework developed by Microsoft for detecting and anonymizing sensitive data. Presidio (from the Latin praesidium, ‘protection, garrison’) helps ensure that sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It supports both structured and unstructured data, making it well suited to use cases like:

  • Detecting and anonymizing PII in unstructured text, PDF, and image files.
  • Detecting and anonymizing PII in free-text fields in structured data, such as comments, feedback, or survey responses.
  • Identifying sensitive entities like names, emails, phone numbers, and credit card numbers using NLP and regex-based recognizers.

Illustration – Customer Profile data

Illustration – Anonymize PII data from Customer Profile data

Sample Use Case: Use Presidio in a PySpark UDF to scan customer profiles stored in a file and flag PII entities.
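One possible shape for such a UDF is sketched below. It is not the code from the accompanying repository. It assumes the `presidio-analyzer` and `presidio-anonymizer` packages (plus the spaCy English model that the default analyzer loads) are installed on the Spark environment, and the helper names `anonymize_text`, `_get_engines`, and `make_presidio_udf`, as well as the DataFrame and column names in the usage comment, are hypothetical.

```python
from typing import Optional


def anonymize_text(text: Optional[str], analyzer, anonymizer,
                   language: str = "en") -> Optional[str]:
    """Detect PII with a Presidio analyzer, then replace it via the anonymizer."""
    if text is None or not text.strip():
        return text  # nothing to scan
    results = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=results).text


_ENGINES = None


def _get_engines():
    """Build the engines lazily so they are created on the executor,
    not serialized and shipped from the driver."""
    global _ENGINES
    if _ENGINES is None:
        from presidio_analyzer import AnalyzerEngine      # third-party
        from presidio_anonymizer import AnonymizerEngine  # third-party
        _ENGINES = (AnalyzerEngine(), AnonymizerEngine())
    return _ENGINES


def make_presidio_udf():
    """Wrap anonymize_text as a PySpark string UDF."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def _anonymize(text):
        analyzer, anonymizer = _get_engines()
        return anonymize_text(text, analyzer, anonymizer)

    return udf(_anonymize, StringType())


# Usage in a notebook (hypothetical DataFrame `df` with a `comments` column):
# presidio_udf = make_presidio_udf()
# df = df.withColumn("comments_anonymized", presidio_udf(df["comments"]))
```

With its default settings, the anonymizer replaces each detected entity with a placeholder such as `<PERSON>`. Loading the NLP model is expensive, so creating the engines once and reusing them, rather than once per row, matters for throughput.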

2. Generate Synthetic Data Using Faker for Anonymization

Once PII is detected, you can replace it with synthetic but realistic data using libraries like Faker. This is especially useful for:

  • Creating safe test datasets.
  • Preserving data utility while ensuring privacy.

Example: Replace detected names with fake names, emails with dummy addresses, and phone numbers with randomly generated ones.

This approach ensures that downstream analytics can continue without exposing real user data.
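The replacement step might look like the following sketch. It assumes the third-party `faker` package is installed; the helper names `make_consistent_replacer` and `build_faker_replacers`, and the seed value, are illustrative. The small cache ensures the same original value always receives the same fake value, which keeps repeated records consistent.

```python
from typing import Callable, Dict, Optional


def make_consistent_replacer(
    generate: Callable[[], str]
) -> Callable[[Optional[str]], Optional[str]]:
    """Wrap a generator so each distinct original value gets one stable fake."""
    cache: Dict[str, str] = {}

    def replace(value: Optional[str]) -> Optional[str]:
        if value is None:
            return None  # keep missing values missing
        if value not in cache:
            cache[value] = generate()
        return cache[value]

    return replace


def build_faker_replacers(seed: int = 42) -> Dict[str, Callable]:
    """Create name, email, and phone replacers backed by Faker."""
    from faker import Faker  # imported here so the helper above has no dependency

    fake = Faker()
    fake.seed_instance(seed)  # repeatable output for test datasets
    return {
        "name": make_consistent_replacer(fake.name),
        "email": make_consistent_replacer(fake.email),
        "phone": make_consistent_replacer(fake.phone_number),
    }


# Usage (requires `pip install faker`):
# replacers = build_faker_replacers()
# replacers["name"]("Jack Smith")  # a fake name, stable for this input
```

Note that the cache lives in a single Python process. Inside a distributed Spark job, each worker would hold its own cache, so cross-partition consistency would need a shared mapping table or a deterministic scheme instead.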

Illustration – Customer Profile Data
Illustration – Customer Profile Data replaced with Fake data

3. Use Built-in PySpark Functions for Hashing and Masking

For structured data like customer tables, PySpark provides native functions to anonymize data:

  • Masking: Completely or partially hide data (e.g., mask all but the last 4 digits of a phone number).
  • Hashing: Use PySpark’s built-in sha2 function to compute the SHA-256 hash (as a hexadecimal string) of each value in a column (e.g., hashing the Customer ID). Because SHA-256 is deterministic, the same input always produces the same output, which makes it suitable for anonymizing columns that need to be joined across multiple tables.

These techniques are efficient and scalable, making them ideal for large datasets processed in Fabric pipelines.
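The two techniques could be combined in one transformation, sketched below with functions from `pyspark.sql.functions`. The column names (`customer_id`, `phone`, `date_of_birth`), the helper names, and the masking formats are assumptions for illustration. The pure-Python `reference_sha256` helper is included only as a way to spot-check Spark's output for string or integer IDs.

```python
import hashlib
import re

# Masks every digit that still has at least four digits after it,
# i.e. everything except the last four digits.
PHONE_MASK_PATTERN = r"\d(?=(?:\D*\d){4})"


def reference_sha256(value) -> str:
    """Pure-Python counterpart of sha2(col.cast('string'), 256)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def mask_and_hash(df, id_col="customer_id", phone_col="phone",
                  dob_col="date_of_birth"):
    """Return df with hashed IDs, masked phones, and year-only birth dates."""
    from pyspark.sql import functions as F  # requires a Spark environment

    return (
        df
        .withColumn(f"{id_col}_hash",
                    F.sha2(F.col(id_col).cast("string"), 256))
        .withColumn(f"{phone_col}_masked",
                    F.regexp_replace(F.col(phone_col).cast("string"),
                                     PHONE_MASK_PATTERN, "*"))
        .withColumn(f"{dob_col}_masked",
                    F.concat(F.substring(F.col(dob_col).cast("string"), 1, 4),
                             F.lit("-XX-XX")))
        .drop(id_col, phone_col, dob_col)  # keep only the protected columns
    )


# The same pattern behaves identically in Python's re module:
print(re.sub(PHONE_MASK_PATTERN, "*", "555-123-4567"))  # ***-***-4567
```

Because these are native column expressions rather than Python UDFs, Spark evaluates them inside the JVM without passing each row through a Python worker, which is why they scale well on large tables.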

Illustration – Partial Masking of Date of Birth
Illustration – Customer Identifier Hash

Implementation Guide

The PII-SparkShield repository contains sample implementations and code for the three approaches discussed in the blog.

Conclusion

Data anonymization plays a pivotal role in safeguarding privacy while enabling the productive use of data. Employing techniques such as masking, perturbation, and pseudonymization allows organizations to navigate the delicate balance between privacy preservation and data utility. However, challenges such as re-identification risks and compliance intricacies highlight the need for continuous innovation and vigilance. As privacy concerns grow globally, organizations must prioritize anonymization practices to ensure trust and compliance in the digital landscape.

Acknowledgment

My sincere appreciation goes to Omri Mendels – Thanks for helping me whenever needed. This would not have been possible without your help.

Sincere thanks to Abhishek Narain, Noelle Li, Santosh Kumar Ravindran, Ron Shakutai & Gyani Sinha for their inputs, right from ideation to implementation.
