Using Regular Expressions to Build a Microsoft Purview Custom Sensitive Information Type
Introduction
A colleague of mine needed to create a Microsoft Purview custom Sensitive Information Type (SIT) for a client. The client wanted to use this SIT as part of their implementation of Microsoft Purview and needed it to match occurrences of an organizational identifier that met specific criteria. There is a library of available SITs, but none of them was applicable. SITs are leveraged by several components in the Purview product family and in other areas of the Microsoft compliance universe.
This article will show how you can use Regular Expressions (RegEx) to help create a matching pattern in the “Primary Element” of a custom SIT. We will not be discussing how to configure the other three (3) components of the pattern in the SIT. We will begin with what a SIT is and how they are used in the Microsoft Purview solution suite to protect organizations and their priceless information assets. Microsoft Purview is a cornerstone solution that eGroup | Enabling Technologies uses to help our customers implement and maintain our overriding information security and compliance philosophy:
If you need to create a custom SIT, you will need to use some RegEx. This article is not going to be a lesson on how to write RegEx rules but on how to create a custom SIT with a RegEx rule to provide the matching.
Sensitive Information Types
Sensitive Information Types (SITs) are used to identify and classify sensitive “items” that are in your organization’s data inventory. There are four (4) types of SIT:
SITs are leveraged currently by these components of the Microsoft Security and Compliance suite of products:
SITs are used to detect sensitive information in an organization’s documents, files, emails, chats, etc. Policies can be created to take an action if the SIT gets a match. For example, a policy:
Every SIT has a Name, Description, and a Pattern. The Pattern is the definition of what the SIT is looking for. It is composed of four (4) components:
In this article, we are going to focus on creating a RegEx rule for the Primary Element that will match the criteria of the client’s internal identifier.
What are Regular Expressions (RegEx)?
Regular Expressions (RegEx) have been around since 1951, when the concept was originated by mathematician Stephen Cole Kleene. RegEx is a syntax for describing a pattern to search for in text, and it can be used in “find” or “find and replace” operations. It first came into popular use in Unix text-processing utilities. RegEx has been incorporated into most common programming languages including:
Over the years, many versions of the RegEx syntax have evolved. Microsoft 365 SITs use the Boost.RegEx 5.1.3 engine. This version of Boost is sometimes referred to by a different version number, 1.66. Microsoft Teams uses the .NET Framework version of RegEx. For a .NET Framework RegEx quick reference from Microsoft, refer to the Regular Expression Language – Quick Reference web page.
If you aren’t confused at this point, the SITs have some additional RegEx validation rules that you need to be aware of. Believe me, if you violate any of these rules, Microsoft 365 will let you know!
At eGroup | Enabling Technologies, we have been using RegEx for over a decade. RegEx has been the required syntax when creating telephone number Normalization Rules in Dial Plans from the days of Live Communications Server 2005 through Lync, Skype for Business, and Microsoft Teams. Some Session Border Controllers also use RegEx in their manipulation engines. The most common Dial Plan Normalization Rules translate four (4) digit telephone extensions (5100) dialed by a user into E.164 phone numbers (+14436255100), which in this example is eleven (11) digits preceded by a “+”.
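As a minimal sketch of how such a Normalization Rule works, the Python snippet below pairs a RegEx pattern with a translation string. The pattern `^(5\d{3})$` and the `+1443625` prefix are assumptions worked backwards from the single example above, not an actual Dial Plan, and Python's `re` module stands in here for the engine the real products use.

```python
import re

# Hypothetical rule: a four-digit extension starting with 5 is expanded
# into a full E.164 number by prepending an assumed +1 443 625 prefix.
PATTERN = r"^(5\d{3})$"
TRANSLATION = r"+1443625\1"


def normalize(dialed: str) -> str:
    """Return the translated number, or the input unchanged if the rule does not apply."""
    return re.sub(PATTERN, TRANSLATION, dialed)


print(normalize("5100"))  # +14436255100
print(normalize("4100"))  # 4100 (the rule does not apply)
```

The capture group holds the dialed digits and `\1` re-inserts them after the prefix, which is the same pattern-plus-translation idea a Normalization Rule expresses.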
There are many sources and resources available on the web as well as those old-fashioned things called books! RegEx is supported in many programming languages. If you want to learn RegEx in general, avoid using a resource specific to a programming language; you won’t get the full picture.
Creating the Custom Sensitive Information Type
You can create a custom SIT in the Microsoft Purview Compliance Portal. Custom SITs can also be created offline in an XML file called a rule package. A colleague of ours wrote about this a few years ago in How to Create Data Loss Prevention Custom Sensitive Information Types.
The client asked us to create a SIT that would produce a match if a number in a document, e-mail, chat, etc. matched the definition of an organizational identifier with these criteria:
Here are examples of numeric strings that meet the criteria:
And non-matching numeric strings:
The criteria proved to be more challenging than expected. The restriction against more than three (3) repetitions of the same digit was the most difficult. The rule was first developed and tested using an online RegEx tester; several are available, RegEx Tester and Regular Expressions 101 to name two. Once the rule was tested, we started the process of creating the SIT itself.
3. Click on “Sensitive info types”
4. Click “Create sensitive info type”
5. Type in a name in the “Name” field for the SIT
6. Add a description to the “Description” field. Descriptions are required.
7. Click the “Next” button
8. Click “Create pattern”
9. Click the “+ Add primary element” drop-down
10. Click on “Regular Expression”
11. In the “ID” field, type in a name for the Regular Expression
12. Paste the RegEx into the Regular Expression field. Even though the rule’s syntax was fine in the tool we used to create it, Microsoft 365 rejected it. We’ll discuss this below.
13. Select “String Match”. A match will occur even if the matched number has text immediately before and/or after it, so the rule would match “ID:1234567891Number”. If you select “Word Match”, the rule will only match instances of the string that stand by themselves; the example string would not match.
14. After you fix the syntax, the errors will clear
15. Click the “Done” button
16. Change the “Character proximity” as needed
17. Add “Supporting Elements”
18. Click the “Create” button
19. Click the “Next” button
20. Select a Confidence level
21. Click the “Next” button
22. Click the “Create” button
23. Wait for the SIT to be created then click the “Done” button
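The difference between “String Match” and “Word Match” in step 13 can be approximated locally. The sketch below uses Python's `re` module rather than the Boost engine, and it assumes that “Word Match” behaves like wrapping the rule in `\b` word boundaries; that is an approximation for illustration, not a statement of how Purview implements it. The function names are hypothetical.

```python
import re

# The Primary Element rule presented in this article.
RULE = (
    r"(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})"
    r"(?!(0))\d{9}(?!(0))\d"
)


def string_match(text: str) -> bool:
    """Approximate "String Match": the number may be embedded in other text."""
    return re.search(RULE, text) is not None


def word_match(text: str) -> bool:
    """Approximate "Word Match": the number must have word boundaries on both sides."""
    return re.search(r"\b(?:" + RULE + r")\b", text) is not None


for sample in ("ID:1234567891Number", "ID: 1234567891 Number"):
    print(sample, string_match(sample), word_match(sample))
# ID:1234567891Number True False
# ID: 1234567891 Number True True
```

Running a handful of real-world samples through both functions before choosing the option in the portal can help show how strict the match needs to be.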
3. Click the “Test” button
4. Click “Upload file”
5. Select the test file.
6. Click the “Open” button.
7. Click the “Test” button.
8. Wait for the test to complete, then review the results.
9. Click the “Finish” button.
10. Correct the SIT as needed.
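Before uploading a test file, it can help to know which lines you expect the SIT to flag. The following sketch writes a small sample file and predicts the matching lines with Python's `re` module. The file name, the sample lines, and the helper functions are all hypothetical, and Python's engine is only a stand-in for the one Microsoft 365 uses, so treat the prediction as a guide to compare against the portal's results.

```python
import os
import re
import tempfile

# The Primary Element rule presented in this article.
RULE = re.compile(
    r"(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})"
    r"(?!(0))\d{9}(?!(0))\d"
)

# Hypothetical sample lines; replace them with values that reflect your own criteria.
SAMPLES = [
    "Employee ID: 1234567891",    # ten digits, nothing excluded
    "ID:1234567891Number",        # embedded in other text
    "Leading zero: 0123456789",   # first digit is 0
    "Trailing zero: 1234567890",  # last digit is 0
    "Repeats: 1222234567",        # a digit four times in a row
    "Too short: 123456789",       # only nine digits
]


def write_test_file(path, lines):
    """Write one sample per line to a plain-text test file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


def predict_matches(path):
    """Return the lines of the file in which the rule finds a match."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if RULE.search(line)]


with tempfile.TemporaryDirectory() as folder:
    test_file = os.path.join(folder, "sit_test_samples.txt")
    write_test_file(test_file, SAMPLES)
    for line in predict_matches(test_file):
        print(line)
# Employee ID: 1234567891
# ID:1234567891Number
```

If the portal reports a different set of matches, the gap usually points at either the “String Match” / “Word Match” choice or an engine difference worth investigating.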
The Primary Element’s RegEx Rule of the SIT
((?!.*(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999).*)(?!(0))[0-9]{9}(?!(0))[0-9])
(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})(?!(0))\d{9}(?!(0))\d
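The two rules above are not interchangeable, and a quick local comparison shows one way they differ. This sketch runs both through Python's `re` module, which is an assumption of convenience: it is not the Boost engine, and it says nothing about which validation rule Microsoft 365 applied. It only illustrates that the `.*` lookahead in the first rule scans the rest of the line, while the `\d{0,6}` lookahead in the second stays within the ten digits being examined.

```python
import re

FIRST = (
    r"((?!.*(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999).*)"
    r"(?!(0))[0-9]{9}(?!(0))[0-9])"
)
SECOND = (
    r"(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})"
    r"(?!(0))\d{9}(?!(0))\d"
)


def first_match(rule, text):
    """Return the first substring the rule matches, or None."""
    found = re.search(rule, text)
    return found.group(0) if found else None


for text in ("1234567891", "1222234567", "1234567891 ref 5555"):
    print(repr(text), first_match(FIRST, text), first_match(SECOND, text))
# '1234567891' 1234567891 1234567891
# '1222234567' None None
# '1234567891 ref 5555' None 1234567891
```

In the last sample the identifier itself is acceptable, but the first rule rejects it because an unrelated "5555" appears later on the same line.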
(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})
(?!(0))
\d{9}
(?!(0))\d
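The four fragments above concatenate back into the full rule, and each one can be read as a separate check. The sketch below states that reading in plain Python and cross-checks it against the RegEx on random ten-digit strings. The interpretation in the comments and the `plain_check` helper are my reading of the rule, offered as an assumption rather than the client's official criteria, and Python's `re` module (with `re.ASCII` so that `\d` means 0-9 only) stands in for the Boost engine.

```python
import random
import re

PARTS = [
    # Within the next ten digits, no digit appears four times in a row.
    r"(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})",
    # The first digit is not 0.
    r"(?!(0))",
    # Consume the first nine digits.
    r"\d{9}",
    # The tenth digit exists and is not 0.
    r"(?!(0))\d",
]
RULE = re.compile("".join(PARTS), re.ASCII)


def plain_check(candidate: str) -> str:
    """Classify a candidate without RegEx, following the reading above."""
    if len(candidate) != 10 or not (candidate.isascii() and candidate.isdigit()):
        return "not ten digits"
    if any(digit * 4 in candidate for digit in "0123456789"):
        return "digit repeated four times in a row"
    if candidate[0] == "0":
        return "starts with 0"
    if candidate[-1] == "0":
        return "ends with 0"
    return "match"


# Cross-check the RegEx against the plain-Python reading on random input.
random.seed(0)
for _ in range(1000):
    value = "".join(random.choice("0123456789") for _ in range(10))
    assert (RULE.fullmatch(value) is not None) == (plain_check(value) == "match")

for value in ("1234567891", "0123456789", "1234567890", "1222234567"):
    print(value, plain_check(value))
# 1234567891 match
# 0123456789 starts with 0
# 1234567890 ends with 0
# 1222234567 digit repeated four times in a row
```

Keeping a plain-language version of the rule next to the RegEx makes it easier to review the criteria with the client and to spot where the expression and the intent diverge.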
Summary
eGroup | Enabling Technologies is available and ready to help you harden and protect your organization and its information assets. SITs and custom SITs are basic components used by many of the security and compliance tools in the Microsoft product family. Determining which SITs you need to leverage or create is not always straightforward.
We have been writing RegEx rules for the Microsoft Unified Communications products and using them in various programming projects, PowerShell scripts, etc. for over fifteen (15) years. The need to use RegEx when creating custom SITs is something we are very comfortable with and ready to assist our customers in their implementation.
This series is part of our effort to help our customers implement a “Trust No One and Harden Everything” security infrastructure. If you need help in planning and implementing your organizational security infrastructure, please contact us!
Cloud Solutions Architect - eGroup | Enabling Technologies