To improve the accuracy of the “U.S. Social Security Number” (SSN) sensitive information type, we are making the following changes to its definition:
- Three discrete confidence levels (High, Medium, and Low), depending on the level of accuracy. The three levels indicate the likelihood of a true positive, considering the following:
  - When the SSN was issued. SSNs issued before 2011 have a relatively strong definition due to additional checks.
  - Whether the SSN is formatted (ddd dd dddd or ddd-dd-dddd) or unformatted (ddddddddd).
  - Whether a keyword is found in proximity to the SSN.
- An additional pattern that does not require keywords in proximity, to reduce false negatives. The current definition requires keywords like “SSN” or “Social Security Number” in proximity to the actual number, which can sometimes lead to valid numbers not being detected (for example, in an Excel spreadsheet where the supporting keyword is present only in the header row).
- Added intelligence to detect high volumes of SSNs in tabular data, such as an Excel spreadsheet where the keyword is present only in the header of the table. Use “High confidence” or “Medium confidence” in your policy for this. Please note that this requires at least one instance to be detected with a keyword in proximity.
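As a rough illustration of how the format and keyword factors above could combine, here is a minimal Python sketch. It is not the actual definition used by the service: the keyword list, the 300-character proximity window, the function names, and the mapping of each combination to a confidence level are all assumptions made for this example. It also does not model the issue-date checks or the tabular-data logic.

```python
import re

# All values below are assumptions for illustration, not the real definition.
KEYWORD = re.compile(r"\b(ssn|social security number)\b", re.IGNORECASE)
WINDOW = 300  # characters searched on each side of a candidate (assumed)

# Formatted: ddd dd dddd or ddd-dd-dddd (same separator in both positions).
FORMATTED = re.compile(r"(?<!\d)\d{3}([ -])\d{2}\1\d{4}(?!\d)")
# Unformatted: ddddddddd, not embedded in a longer run of digits.
UNFORMATTED = re.compile(r"(?<!\d)\d{9}(?!\d)")


def has_keyword_nearby(text, start, end):
    """Return True if a supporting keyword appears within WINDOW characters."""
    nearby = text[max(0, start - WINDOW):end + WINDOW]
    return KEYWORD.search(nearby) is not None


def classify(text):
    """Return (candidate, level) pairs using an assumed level mapping."""
    results = []
    for pattern, formatted in ((FORMATTED, True), (UNFORMATTED, False)):
        for match in pattern.finditer(text):
            keyword = has_keyword_nearby(text, match.start(), match.end())
            if formatted and keyword:
                level = "High"      # assumed: formatted plus keyword
            elif keyword:
                level = "Medium"    # assumed: unformatted plus keyword
            else:
                level = "Low"       # assumed: no supporting keyword
            results.append((match.group(), level))
    return results


if __name__ == "__main__":
    for sample in ("SSN: 123-45-6789", "Employee SSN 123121234", "Order 123121234"):
        print(sample, "->", classify(sample))
```

The sketch treats a candidate without any nearby keyword as the weakest evidence, which mirrors the point that any 9-digit number can look like a valid SSN.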
See details of current definition vs. new definition below.
Updated July 19, 2021: We have decided to make additional changes before we proceed with the rollout. To ensure that your policies continue to behave as they do today without impacting accuracy, we are delaying this release until the necessary changes are complete. We will send out another communication with the next updates. Thank you for your patience.
Affected Workloads
- Microsoft 365 suite
When this will happen
We will communicate via Message center when we are ready to proceed.
How this will affect your organization
Your existing policies, including data loss prevention policies, do not need to be changed. However, depending on your needs, you may wish to change the confidence level for US SSN within your policies (such as data loss prevention, communication compliance, sensitivity labeling, or records management). For example, if you wish to have minimal false positives, you may set the confidence level to High, and you can set the confidence level to Low if you want minimal false negatives.
- We recommend that you use the High confidence level in your policies for minimal false positives.
- If you wish to detect unformatted numbers like 123121234 as well, you should use the Medium confidence level.
- Using the Low confidence level may result in many false positives due to the weak definition of US SSN, where any 9-digit number can be a valid SSN. Please note that using Medium or High confidence will still detect high volumes of SSNs without keywords, provided at least one instance has a keyword in proximity.
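The trade-off between these levels can be sketched as a simple threshold filter. The example below is an assumption-laden illustration rather than the product's behavior: the `Confidence` enum, its numeric values, the `matches_for_policy` helper, and the sample detections are all hypothetical. It assumes that a policy set to a given level also matches detections at any higher level.

```python
from enum import IntEnum


class Confidence(IntEnum):
    # Numeric values are illustrative only; they exist to make levels comparable.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def matches_for_policy(detections, minimum):
    """Keep detections whose level is at or above the policy's level (assumed)."""
    return [value for value, level in detections if level >= minimum]


# Hypothetical detections: (matched text, confidence level assigned to it).
detections = [
    ("123-45-6789", Confidence.HIGH),
    ("123121234", Confidence.MEDIUM),
    ("987654321", Confidence.LOW),
]

if __name__ == "__main__":
    for minimum in (Confidence.HIGH, Confidence.MEDIUM, Confidence.LOW):
        print(minimum.name, matches_for_policy(detections, minimum))
```

Under these assumptions, a High policy returns the fewest matches, and lowering the level widens the net at the cost of more potential false positives.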
What you need to do to prepare
Review your policies and set the appropriate confidence level for the US SSN sensitive information type based on what you want to detect.
Message ID: MC256841
Published: 17 May 2021
Updated: 19 July 2021