What Is Data Masking and How Do We Do It?
MAY, 2022
by Michiel Mulders.
Modified by Eric Goebelbecker.
Authors
This post was originally written by Michiel Mulders. Modified for re-publication by Eric Goebelbecker.
Michiel Mulders Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!
Eric Goebelbecker Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).
With the cost of data breaches increasing every year, there’s a need for higher security standards. According to IBM’s 2021 security report, the average total cost of a data breach has risen to $4.24 million. It’s no wonder we need more advanced techniques to protect sensitive data. One of those techniques is data masking.
In this post, I’ll talk you through the pros and cons of data masking and show you some techniques you can use to mask data. Let’s start with a detailed introduction to data masking.
What Is Data Masking?
Enterprises use data masking, also called data obfuscation, to identify and hide sensitive data. This sensitive data can vary from personal data to intellectual property. There are several ways to mask data, but the purpose is always to keep the data safe. A common example is a credit card number that has been scrambled or blurred.
Maybe you’ve already come across different types of data masking, such as static or dynamic masking. Static masking is what we call masking data in place. Dynamic data masking is when the masking happens at request time, based on the requestor’s identity.
Why Is Data Masking Important?
As the amount of data we need increases, so does the risk of a data breach. It’s impossible to create foolproof protection for every copy of your data. This is especially true when not everybody with access has the technological literacy we’d hope for. But an enterprise can neutralize several factors that make data breaches so expensive. A good rule of thumb is that no active database should contain unmasked data.
The goal of masking is to protect the data from abuse while still providing developers with data for testing. This reduces the impact of data breaches and improves data security. Data masking achieves this because the masked values no longer link to any actual sensitive customer information.
Who Uses Data Masking?
Every enterprise that handles personal or private information should consider masking data. This applies especially to larger organizations, which often deal with large amounts of data and should handle that data with extreme care.
In addition to this, some companies have to deal with GDPR. Let’s take a closer look at what that entails.
GDPR and Data Masking
The 2018 General Data Protection Regulation (GDPR) imposes severe penalties for mishandling data. So, many businesses are looking for effective ways of protecting this information.
The most common types of information enterprises want to mask include:
- Personally identifiable information (PII) like name, address, sex, etc.
- Protected health records
- Transaction history and card or payment information
- Intellectual property
All these data types need to be handled with care, and many of them, personal data in particular, are subject to GDPR. Companies dealing with these types of data should look for techniques to protect their sensitive data.
Next, let’s dive further into the need for data masking.
Why Do We Need Data Masking?
We have already learned about the EU’s GDPR requirements introduced in 2018. However, there are many more reasons to mask your data.
- Protect business data from third-party vendors. Often, businesses have to share data with third-party companies like suppliers, marketing teams, or consultants. When handing over this data, the business loses control over data that should remain confidential. Therefore, data masking can be applied before sharing data with third-party vendors to ensure they can access only the data they need, without seeing the confidential values.
- Safeguard against human error. Human error often lies at the source of a data leak. For example, an operator might turn off the firewall for a database. A small mistake like this one can cause a data leak. A business can safeguard itself against such data leaks by masking its data. Then, if an attacker gets unauthorized access to the database, the data is obfuscated and useless to them.
- Not all operations require real data. For example, testing an application can be easily executed with randomly generated data. I recommend using randomly generated data during testing, as an application still in the testing phase might leak sensitive data.
Data Masking Techniques
There are many data masking techniques. However, start by assessing what data you’re using and if you’re returning the minimal required data.
For example, suppose you need only the address and age of a certain user in your application. When you query for the user, you could have the app return all underlying data. But from a security standpoint, it’s better to return only the necessary information and reduce the risk of leaking information. Therefore, we can apply a view that returns only the address and age of this user.
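As a minimal sketch of such a view, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name `users`, the view name `user_address_age`, the columns, and the sample row are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a full table and a restricted view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, address TEXT, age INTEGER)"
    )
    # Hypothetical sample row, not real data.
    conn.execute(
        "INSERT INTO users (name, email, address, age) VALUES (?, ?, ?, ?)",
        ("Alice Example", "alice@example.com", "1 Main Street", 34),
    )
    # The view exposes only the columns the application actually needs.
    conn.execute(
        "CREATE VIEW user_address_age AS SELECT id, address, age FROM users"
    )
    conn.commit()
    return conn


def fetch_address_and_age(conn: sqlite3.Connection, user_id: int):
    """Query the view instead of the underlying table."""
    return conn.execute(
        "SELECT address, age FROM user_address_age WHERE id = ?", (user_id,)
    ).fetchone()
```

Application code that only ever queries the view cannot accidentally return the name or email columns, even if a query is written carelessly.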
Now that we know to limit the data we return to only the data we need, let’s explore five different data masking techniques.
1. Substitution
The substitution technique refers to substituting data with similar values. Substitution is an effective technique to replace production data with realistic data.
For example, let’s say we’re returning a user object with a name and address. To mask the data, we can substitute the user’s real name with a fake (but realistic-looking) name. This ensures the combination of name and address cannot identify a person.
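One possible way to sketch substitution in Python is shown below. It assumes a small, hypothetical pool of replacement names and uses a keyed hash (HMAC) so that the same real name always maps to the same fake name. The secret key, the name pools, and the function names are illustrative assumptions, not part of any particular tool.

```python
import hashlib
import hmac

# Hypothetical pools of realistic-looking replacement values.
FAKE_FIRST_NAMES = ["Alex", "Jordan", "Sam", "Taylor", "Morgan", "Casey"]
FAKE_LAST_NAMES = ["Rivers", "Stone", "Field", "Brooks", "Hill", "Wood"]


def substitute_name(real_name: str, secret: bytes = b"demo-secret") -> str:
    """Map a real name to a fake one, consistently for the same input."""
    digest = hmac.new(secret, real_name.encode("utf-8"), hashlib.sha256).digest()
    first = FAKE_FIRST_NAMES[digest[0] % len(FAKE_FIRST_NAMES)]
    last = FAKE_LAST_NAMES[digest[1] % len(FAKE_LAST_NAMES)]
    return f"{first} {last}"


def mask_user(user: dict) -> dict:
    """Return a copy of the user with the name substituted; other fields stay."""
    masked = dict(user)
    masked["name"] = substitute_name(user["name"])
    return masked
```

With a pool this small, different real names will often map to the same fake name. That is acceptable for a sketch, but a real data set would need a much larger pool.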
2. Shuffling
Next, data shuffling refers to mixing data. However, we want to ensure we retain logical relationships between data columns in the database. Shuffling is a more advanced technique that masks data while ensuring we have real relationships between the data.
Let’s say we have customer objects in the database linked with purchases. We want to retain this link between the tables for customers and their purchases. So, we swap the first and last names of each customer with those of other customers in the table, while leaving the keys that link customers to purchases untouched.
Again, this technique allows you to safely use production data in a test environment.
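A rough sketch of shuffling, assuming customers are held as a list of dictionaries with hypothetical `id`, `first_name`, and `last_name` keys, could look like this. The `id` column is left alone, so any purchases that reference it still point at an existing customer row.

```python
import random


def shuffle_names(customers: list, seed=None) -> list:
    """Shuffle the name columns across rows, keeping every other column."""
    rng = random.Random(seed)
    first_names = [c["first_name"] for c in customers]
    last_names = [c["last_name"] for c in customers]
    # Shuffle each column independently so name pairs are also broken up.
    rng.shuffle(first_names)
    rng.shuffle(last_names)
    return [
        {**customer, "first_name": first, "last_name": last}
        for customer, first, last in zip(customers, first_names, last_names)
    ]
```

A plain random shuffle can leave some rows with their original value, especially in small tables, so this sketch is only a starting point.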
3. Blurring
Another data masking technique is blurring, often used for indirect data identifiers like age.
Thousands of people have the same age. But when enough data points are available, a malicious person might be able to figure out which data points belong to which person. This would mean that the unauthorized person can still identify a user.
Therefore, we can use the blurring technique. Blurring anonymizes data points.
For example, let’s say your application uses the age of users. We can apply a numeric blurring function that adds random noise within a specified range to each value, so the age field is still populated with realistic-looking ages.
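A numeric blurring function for ages might be sketched as follows. The default offset of three years and the bounds of 0 and 120 are assumptions chosen for illustration; pick values that suit your own data.

```python
import random


def blur_age(age: int, max_offset: int = 3, min_age: int = 0,
             max_age: int = 120, rng=None) -> int:
    """Add random noise of up to max_offset years, clamped to a valid range."""
    rng = rng or random
    noisy = age + rng.randint(-max_offset, max_offset)
    # Clamp so the blurred value still looks like a realistic age.
    return max(min_age, min(max_age, noisy))
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which can help when a test data set has to be regenerated.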
4. Credit Card Masking
Credit card masking is tricky, as valid credit card numbers contain a checksum. The final digit of a credit card holds this checksum number. Therefore, we must pay attention when masking credit cards with random numbers, as we don’t want validation to fail for our masked data. Many tools can generate new credit card numbers with a valid checksum.
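The checksum used by payment card numbers is the Luhn algorithm. As a small sketch, the functions below generate a random replacement number of the same length that still passes a Luhn check. The function names are hypothetical, and the input is assumed to be a string of digits with no spaces or dashes.

```python
import random


def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a number without its final digit."""
    total = 0
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:  # every second digit, starting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Check whether the last digit matches the Luhn checksum of the rest."""
    return (
        number.isdigit()
        and len(number) > 1
        and luhn_check_digit(number[:-1]) == number[-1]
    )


def mask_card_number(number: str, rng=None) -> str:
    """Replace a card number with random digits plus a valid check digit."""
    rng = rng or random
    payload = "".join(str(rng.randint(0, 9)) for _ in range(len(number) - 1))
    return payload + luhn_check_digit(payload)
```

This sketch randomizes every digit. If your tests depend on the card brand, you may want to keep the leading issuer digits and randomize only the rest.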
5. Nullification
Finally, nullification masking replaces a column of data with a null value. This technique is only used for hiding highly sensitive data that cannot be mixed or blurred. Applying the nullification technique makes it impossible to discover the original value based on the null value.
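In Python terms, nullification can be sketched as replacing the chosen fields with `None`. The record layout and field names used here are hypothetical.

```python
def nullify_fields(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with sensitive fields set to None."""
    return {
        key: (None if key in sensitive_fields else value)
        for key, value in record.items()
    }
```

The keys are kept so the record retains its shape, but code that consumes the masked data has to cope with null values.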
I want to introduce you to dynamic masking as a final masking technique.
Dynamic Data Masking vs. Traditional Data Masking
We use dynamic data masking in real-time environments where data doesn’t leave the production database. This means that we have a higher level of security for our production data.
With dynamic masking, only authorized users can view the original data. However, the application scrambles the data on the spot for unauthorized users. It’s a performant technique for data masking that protects the production database.
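The idea can be sketched as a read function that decides at request time what to return. The role names, the set of sensitive fields, and the "keep the last four characters" rule below are all assumptions made for illustration; real products usually enforce this inside the database or a proxy layer.

```python
# Hypothetical roles and fields, chosen only for this sketch.
AUTHORIZED_ROLES = {"admin", "auditor"}
SENSITIVE_FIELDS = {"email", "card_number"}


def scramble(value: str, keep: int = 4) -> str:
    """Hide everything except the last `keep` characters."""
    if keep <= 0 or len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def read_record(record: dict, role: str) -> dict:
    """Return original data to authorized roles, masked data to everyone else."""
    if role in AUTHORIZED_ROLES:
        return dict(record)
    return {
        key: scramble(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
```

The stored record is never modified. Only the copy handed back to the requester differs, depending on who is asking.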
In contrast, traditional data masking doesn’t use such a dynamic layer that can mask the data. With a traditional approach, you copy the production database and apply a data masking technique to the copied data. After you’re done, you can safely use the masked copy in your testing environment.
Get Started With Data Masking
Before starting data masking, assess what data you are returning. Always make sure to return the minimum required data.
Many techniques exist for masking data. If you want to use production data in your test environment, first assess the type of data you are handling. Based on that, you can choose the right data masking technique for your needs.
The easiest way to get started with data masking is the substitution technique. It lets you simply swap real values for realistic substitutes, which makes it much harder to identify a record or link it with other records to restore the original. If you need to retain logical relationships in your database, the shuffling technique is the better fit.
If you’re working with highly sensitive data, consider using the nullification method. The nullification technique ensures that no sensitive data is exposed.
Other Reading
Enjoy what you read? Here are a few more articles that you might find interesting.
Enov8 Blog: Types of Test Data you should use for your Software Tests?
Enov8 Blog: Why TDM is so Important!
Enov8 Blog: What is Data Fabrication in TDM?