
What is Microsoft Purview Exact Data Match?

Updated: Jun 4

Cyclotron frequently deploys Microsoft Purview for organizations large and small. This post explains Exact Data Match, an underutilized Purview feature for protecting your highest-value data.


Jordan Mapes

Security Engineer


Why Microsoft Purview Exact Data Match?

If you’ve spent any amount of time with Microsoft Purview, you’ve heard of Sensitive Information Types (SITs). These are built-in or customizable classifiers (or data definitions) that help you identify sensitive data within your tenant.


Most often, an organization will opt to use one or more of Microsoft’s built-in SITs, like U.S. Social Security Number or Credit Card Number, in Data Loss Prevention (DLP) policies. While this can be a great way to prevent sensitive data from being shared outside your company boundary, the pattern-based nature of these SITs may cause unwanted false positives for your admins.


These SITs also rely on confidence levels, which reflect how much supporting evidence is detected around the suspected sensitive data. Raising the required confidence level can help you manage the number of incoming alerts, but it can also increase the number of false negatives, allowing actual sensitive data to slip by. Conversely, lowering the confidence level can significantly increase the number of false positives for SITs that are detected based on a simple pattern.

 

|                                | Detected/blocked by DLP                                          | Missed by DLP                              |
| Bad content violating policy   | True positive - our tool works!                                  | False negative - can ruin careers!         |
| Good content allowed by policy | False positive - noise for admins, wastes time for admin & users | True negative - we don’t care about these  |

A simple detection matrix with DLP. Exact Data Match helps increase true positives and reduce false positives & false negatives.
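To make the threshold trade-off behind this matrix concrete, here is a minimal Python sketch. It is not how Purview scores content: the `Detection` class, the 0-100 confidence scale, and the sample data are all made up for illustration. It simply counts the four outcomes from the matrix at two different alerting thresholds.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    confidence: int     # hypothetical 0-100 score assigned by a pattern-based classifier
    is_sensitive: bool  # ground truth: does the content really violate policy?


def confusion_matrix(detections, threshold):
    """Count the four matrix outcomes when alerting on confidence >= threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for d in detections:
        flagged = d.confidence >= threshold
        if flagged and d.is_sensitive:
            counts["TP"] += 1
        elif flagged:
            counts["FP"] += 1
        elif d.is_sensitive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Made-up sample: four truly sensitive items and three harmless look-alikes.
SAMPLE = [
    Detection(85, True),
    Detection(85, True),
    Detection(75, True),
    Detection(65, True),
    Detection(75, False),
    Detection(65, False),
    Detection(65, False),
]

low = confusion_matrix(SAMPLE, threshold=65)   # catches everything, but noisy
high = confusion_matrix(SAMPLE, threshold=85)  # quiet, but misses real data
print(low)
print(high)
```

With this toy data, the low threshold produces no false negatives but three false positives, while the high threshold removes the false positives at the cost of two false negatives.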


How does Exact Data Match work?

We like to describe it to clients this way:

  • Built-in SITs will find any SSNs.

  • Exact Data Match will find your SSNs.


A common scenario we see is that an organization has a database, such as employee records, patient records, bank account records, or PII records that need to be protected. These databases include fields like names, addresses, identifying numbers, and potential medical or financial information. Rather than setting up a policy to hopefully catch this information with a pattern or proximity to a keyword, they want to detect with certainty when the specific content from this database is shared.


If the above scenario describes your organization, Exact Data Match (EDM) can meet your needs. EDM differs from built-in SITs in its detection method. As mentioned before, built-in SITs detect based on a pattern, keywords, character proximity, and a confidence level, whereas EDM detects exact values from your database. You schedule a secure export of a copy of your data, the export is hashed before it is uploaded, and Purview compares the content of your scanned files and emails against the hashed database records.
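The idea of comparing content against hashed records can be sketched in a few lines of Python. This is a simplified illustration, not Microsoft's implementation: the salt handling, the SHA-256 choice, the normalization, the field names, and the helper functions (`hash_value`, `build_hashed_table`, `find_exact_matches`) are all assumptions made for the example. A pattern first finds SSN-shaped candidates, and only candidates whose hash appears in the table count as matches.

```python
import hashlib
import re

SALT = b"example-salt"  # hypothetical; a real deployment protects its salt
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def hash_value(value: str) -> str:
    """Normalize a value and return its salted SHA-256 digest."""
    normalized = value.strip().lower()
    return hashlib.sha256(SALT + normalized.encode("utf-8")).hexdigest()


def build_hashed_table(rows, primary_field):
    """Map each hashed primary value to the hashed supporting values of its row."""
    table = {}
    for row in rows:
        key = hash_value(row[primary_field])
        table[key] = {hash_value(v) for f, v in row.items() if f != primary_field}
    return table


def find_exact_matches(text, table):
    """Return (candidate, supporting_hits) for pattern hits that exist in the table."""
    tokens = {hash_value(t) for t in re.findall(r"[\w'-]+", text)}
    matches = []
    for candidate in SSN_PATTERN.findall(text):
        supporting = table.get(hash_value(candidate))
        if supporting is not None:  # the SSN is one of *your* SSNs
            matches.append((candidate, len(supporting & tokens)))
    return matches


# Made-up record and message used only for illustration.
rows = [{"ssn": "123-45-6789", "last_name": "Rivera", "dob": "1980-02-14"}]
table = build_hashed_table(rows, primary_field="ssn")
message = "Patient Rivera, SSN 123-45-6789, and unrelated number 987-65-4321."
result = find_exact_matches(message, table)
print(result)
```

Note that the table holds only digests, so the clear-text database never needs to sit next to the content being scanned. The second number in the message looks like an SSN but is ignored because it is not in the table.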


You can think of EDM as a complement to your existing DLP solution. The built-in SITs can be used for general detection of sensitive data like Social Security Numbers or Bank Account Numbers, while the EDM classifier can be used for higher-profile sensitive data such as critical patient records, HR records, or top-secret project data.


Here’s a look at the steps involved in creating an EDM classifier from Microsoft’s portal:

Step-by-step guide to creating an Exact Data Match classifier.

The process of setting up an EDM classifier may seem tedious, but each step is designed to ensure that no part of your sensitive database is exposed during configuration or upload. EDM supports up to 100 million rows of data and can be refreshed up to 5 times in a 24-hour period.


Example: Healthcare records

Now that we know what Exact Data Match is and how to set it up, what does it look like in practice?


Let’s say a healthcare organization has a patient database they want to protect. They’ve created their classifier and uploaded the source data, selecting Social Security number as the primary element for detection and setting additional fields (name, address, DOB, etc.) as supporting evidence to increase confidence in detections. The next step is to create a DLP policy and select the newly created EDM classifier as a condition. From an admin perspective, this will look similar to the built-in SSN definition, but with the added confidence of knowing that the detected SSN is in their database.


An example strategy using EDM:

  • Audit when 1 user’s exact SSN is sent outside the org.

  • Block when at least 1 user’s exact SSN plus any context (such as a name, address, DOB, etc.) is included.

  • Block when 2 or more users’ exact SSNs are sent outside the org.

  • Audit when 5 or more users’ exact SSNs are found in any file shared internally within the org.
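The strategy above can be sketched as a small decision function. This is only an illustration of the rule ordering, not Purview policy syntax: `decide_action` and its parameters are hypothetical, and it assumes the "SSN plus context" rule applies to content sent outside the org and that anything not covered by a rule is allowed.

```python
def decide_action(ssn_matches: int, has_context: bool, external: bool) -> str:
    """Return 'block', 'audit', or 'allow' for one message or file.

    ssn_matches: number of distinct exact SSNs from the database that were found
    has_context: True if supporting fields (name, address, DOB, ...) also matched
    external:    True if the content is leaving the organization
    """
    if external:
        # Strictest rules first, so a block is never downgraded to an audit.
        if ssn_matches >= 2:
            return "block"
        if ssn_matches >= 1 and has_context:
            return "block"
        if ssn_matches == 1:
            return "audit"
        return "allow"
    # Internal sharing: only bulk exposure is worth an audit entry.
    if ssn_matches >= 5:
        return "audit"
    return "allow"


print(decide_action(1, False, external=True))   # single SSN leaving the org
print(decide_action(1, True, external=True))    # SSN plus a matching name
print(decide_action(5, False, external=False))  # bulk SSNs shared internally
```

Ordering the checks from strictest to most lenient mirrors how you would typically prioritize DLP rules so that the most restrictive matching action wins.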


In summary, Exact Data Match offers a flexible, secure, and accurate approach to managing specific sensitive information. Unlike built-in SITs, you can have much stronger confidence that EDM detects the specific data you intend, not just the generic data found by existing SIT definitions. It can be employed alongside standard SITs to form a comprehensive data protection strategy within your organization, while helping to reduce the number of false positives and avoiding alert fatigue.


Frequently Asked Questions

What licensing is required to use Exact Data Match?

Exact Data Match is included with Microsoft 365 E5, Microsoft 365 E5 Compliance, Office 365 E5, or the E5 Information Protection and Governance add-on.


Is there a limit to how much data can be uploaded?

Yes. Your database can contain up to 100 million rows of data, 32 columns per data source, and up to 10 columns marked as searchable.
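Before exporting, it can be handy to sanity-check a source file against these limits. The following Python sketch does that for a CSV export. The limits are the ones quoted above, while the function name `check_edm_limits`, the CSV layout, and the sample data are assumptions for illustration only. It is not part of any Microsoft tooling.

```python
import csv
import io

# Limits as quoted in this post.
MAX_ROWS = 100_000_000
MAX_COLUMNS = 32
MAX_SEARCHABLE = 10


def check_edm_limits(csv_file, searchable_fields):
    """Return a list of problems; an empty list means the file fits the limits."""
    reader = csv.reader(csv_file)
    header = next(reader)               # first line holds the column names
    row_count = sum(1 for _ in reader)  # stream rows instead of loading them all

    problems = []
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceeds {MAX_ROWS}")
    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns exceeds {MAX_COLUMNS}")
    if len(searchable_fields) > MAX_SEARCHABLE:
        problems.append(
            f"{len(searchable_fields)} searchable fields exceeds {MAX_SEARCHABLE}"
        )
    missing = [f for f in searchable_fields if f not in header]
    if missing:
        problems.append(f"searchable fields not in header: {missing}")
    return problems


# Hypothetical two-row export used only to exercise the check.
sample = io.StringIO(
    "ssn,last_name,dob\n"
    "123-45-6789,Rivera,1980-02-14\n"
    "987-65-4321,Chen,1975-11-30\n"
)
ok_result = check_edm_limits(sample, ["ssn"])
print(ok_result)
```

In practice you would pass an open file handle for the real export instead of the in-memory sample.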


How often does my data need to be uploaded?

If your database is static, you only need to upload your data once. However, if you have a dynamic database that changes as employees leave or patient records change, you can refresh the database upload up to 5 times in a 24-hour period.
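If you automate refreshes, a simple guard can keep a scheduler from exceeding that limit. The sketch below assumes the limit is enforced as a rolling 24-hour window, which is our assumption rather than documented behavior, and `can_refresh` is a hypothetical helper, not a Purview API.

```python
from datetime import datetime, timedelta

MAX_REFRESHES = 5              # limit quoted in this post
WINDOW = timedelta(hours=24)   # assumed to be a rolling window


def can_refresh(previous_uploads, now):
    """Return True if another upload would stay within the refresh limit."""
    recent = [t for t in previous_uploads if now - t < WINDOW]
    return len(recent) < MAX_REFRESHES


# Made-up upload history for illustration.
now = datetime(2024, 6, 4, 12, 0)
busy_day = [now - timedelta(hours=h) for h in (1, 2, 3, 4, 5)]
quiet_day = [now - timedelta(hours=h) for h in (1, 2, 3, 4, 25)]

print(can_refresh(busy_day, now))   # five uploads already in the window
print(can_refresh(quiet_day, now))  # oldest upload has aged out
```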


Why would I need two servers?

This is the most secure way to hash and upload your sensitive information source table. In this scenario, you perform the hashing on a server without an internet connection, and then copy the hashed files to a separate server that can connect to your tenant for upload. Technically, this can be accomplished with one server; however, by performing the hashing process on a server connected to the internet, you run the risk of exposing the clear-text sensitive information if that server is compromised. Cyclotron recommends the two-server process to ensure clear-text data is never exposed.


Cyclotron helps ensure your organization’s data is secure with Exact Data Match. Contact Nathan.berger@cyclotron.com to engage with Cyclotron for help on Microsoft Purview implementations including Exact Data Match.
