The Pros and Cons of Test Data Synthetics (or Data Fabrication)
AUG, 2022
by Louay Hazami.
*Update from October 2020
This post was written by Louay Hazami. Louay is a deep-learning engineer specialized in computer vision and NLP. He’s passionate about AI, and he speaks four languages fluently.
Data privacy is one of the most pressing issues in the new digital era. Data holds so much value for normal internet users and for all types of companies that are looking to capitalize on this new resource. To keep data anonymous and private, two types of solutions present themselves: masked data or synthetic data.
In the industry, synthetic data is more widely used. That’s because it’s much more secure than masked data. Therefore, I will be focusing on synthetic data. However, we can further our learning about synthetic data by comparing it with masked data.
So, in this post, I will be discussing the pros and cons of creating a synthetic dataset for testing purposes. First, I will define synthetic data by comparing it with masked data. I will also talk about how we can create synthetic data. Finally, we’ll discuss the challenges and benefits of using synthetic data.
Synthetic vs. Masked Data
First, let me explain each type of data:
- Masked data: Modified data that has a similar structure to the real data. Generally, we anonymize only sensitive data from the original data. It could be something as simple as changing the variable name. In contrast with synthetic data, we can keep some of the original data in the end product.
- Synthetic data: Data that is artificially manufactured from the original data. It has the same statistical properties. Creating synthetic data keeps all of the original data anonymous and unidentifiable. We can use all sorts of algorithms to map the original data to synthetic data. Unlike with masked data, a user who receives only the synthetic data generally cannot tell what the original values were. But because the synthetic data has the same statistical properties as the original data, the user can still use it to reach relevant conclusions.
For example, let’s say we have a database containing the credit card numbers of our customers. Using masking, we can simply hide most of the digits of each number (571 is masked to become **1). Masked data therefore still leaves some of the real information exposed. Alternatively, we can use a machine learning algorithm to map each credit card number to another arbitrary number (571 becomes 41273). The result is synthetic data.
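The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the card number is made up, the function names are hypothetical, and the synthetic value is produced by plain random digit draws rather than by a machine learning algorithm.

```python
import random


def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide every character except the last `visible` ones."""
    visible = max(visible, 0)
    hidden = len(card_number) - visible
    if hidden <= 0:
        # Nothing to hide: the number is no longer than the visible part.
        return card_number
    return mask_char * hidden + card_number[hidden:]


def synthesize_card_number(card_number: str, rng: random.Random) -> str:
    """Return a random digit string of the same length.

    The digits are drawn independently of the original value, so the
    result only preserves the format, not any real information.
    """
    return "".join(rng.choice("0123456789") for _ in card_number)


rng = random.Random(42)            # fixed seed so the run is repeatable
original = "1234567812345678"      # made-up number, not a real card
masked = mask_card_number(original)
synthetic = synthesize_card_number(original, rng)

print(masked)      # last four real digits are still visible
print(synthetic)   # same length, unrelated digits
```

The masked value still exposes the last four real digits, while the synthetic value keeps only the length and character set of the original.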
How to Create Synthetic Data
There are two main methods of creating synthetic data:
- Distribution-based modeling: This method relies on reproducing the statistical properties of the original data. For example, we can reproduce the variance or the mean of the data. Basically, we create new data points that have these same properties.
- Agent-based modeling: This method relies on creating a model. This model focuses on learning the behavior of the data algorithmically on its own. Depending on the data, this behavior can be simple or complex. It can also represent relationships between the different variables of the data. Then, the agent creates random data based on the observed properties.
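As a minimal sketch of distribution-based modeling, the example below estimates the mean and standard deviation of a small column and then draws new values with those properties. It assumes the data is roughly normally distributed, which real data often is not, and the sample values and function names are hypothetical.

```python
import random
import statistics


def fit_normal(values):
    """Estimate the mean and sample standard deviation of the data."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, n, rng):
    """Draw n new points from a normal distribution with the given properties."""
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "original" column, e.g. customer ages.
original_ages = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44]

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma = fit_normal(original_ages)
synthetic_ages = sample_synthetic(mu, sigma, 1000, rng)

print(f"original:  mean={mu:.2f} stdev={sigma:.2f}")
print(f"synthetic: mean={statistics.mean(synthetic_ages):.2f} "
      f"stdev={statistics.stdev(synthetic_ages):.2f}")
```

None of the synthetic values is copied from the original column, yet the two sets share approximately the same mean and spread.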
In short, synthetic data can be created in two ways with varying levels of complexity, even though it may look like a simple operation. Next, I would like to present some of the tools used to create synthetic data.
Tools to Create Synthetic Data
There are multiple open-source resources available to create synthetic data. I will introduce three tools you can use for different methods of creating synthetic data:
- sklearn.datasets: The Python package scikit-learn contains many tools that data scientists use, including a module called datasets. In sklearn.datasets, we can find many functions for generating data samples. For example, make_regression generates a dataset for a regression problem. It draws random features and builds the target from a random linear model with optional noise. In that sense, the underlying linear model plays the role of the agent that produces the data.
- pydbgen: This Python package allows us to generate random tables and databases. It also allows us to include specific types of data such as emails or dates in the tables.
- DCGANs: Deep Convolutional Generative Adversarial Networks (DCGANs) are neural networks capable of generating artificial images that look real. They are trained on real images and output new images that look similar to them. Generally, this deep learning technique is used in data augmentation. We use data augmentation when we don’t have enough original images, so we resort to synthetic images to compensate.
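To make the first tool concrete, here is a short sketch that calls make_regression from sklearn.datasets. It assumes scikit-learn is installed, and the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression

# Generate 200 rows with 3 features, 2 of which actually influence the target.
# noise adds Gaussian noise to the target; random_state makes the run repeatable.
# coef=True also returns the coefficients of the underlying linear model.
X, y, coef = make_regression(
    n_samples=200,
    n_features=3,
    n_informative=2,
    noise=5.0,
    coef=True,
    random_state=0,
)

print(X.shape)  # feature matrix
print(y.shape)  # target vector
print(coef)     # ground-truth coefficients used to build y
```

Because the generating coefficients are returned, a test can check whether a model trained on this data recovers them.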
Depending on the task, different techniques can be used. Generally, I think it’s not necessary to know how to use all of these methods. But knowing at least a few by heart can be helpful. Most importantly, I believe we need the ability to comprehend and study all the techniques whenever we encounter a problem where we need to use them.
When To Use Synthetic Test Data?
Having covered the what and how of synthetic test data, I think “when” makes for a great next question. So, when should you use synthetic data? Here are a few scenarios that can help you decide:
- When there’s no real production data for you to copy. To perform production cloning, you need production data. If you’re starting out a new product, there simply isn’t any, so you have to resort to synthetic data for your test data needs.
- When you don’t need “realistic” data. For some reason, you might need test data that doesn’t resemble data generated by real users. You might need invalid data for chaos testing purposes, for instance. Synthetic data generation can come in handy here.
- When you can’t use data masking. If, for some reason, you can’t use a data masking solution, then it’s a no-brainer: synthetic data is your only choice if you want to remain compliant with GDPR and other similar regulations, and of course you do!
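The first two scenarios can be sketched with nothing but the Python standard library. The record layout, field names, and the definition of "invalid" below are all hypothetical; a real product would have its own schema and validation rules.

```python
import random
import string


def random_email(rng):
    """Build a random address on the reserved example.com domain."""
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{name}@example.com"


def make_user(rng, user_id):
    """Create one well-formed test record without touching production data."""
    return {"id": user_id, "email": random_email(rng), "age": rng.randint(18, 90)}


def corrupt(user, rng):
    """Return a copy of the record with one field deliberately made invalid."""
    bad = dict(user)
    field = rng.choice(["email", "age"])
    bad[field] = "not-an-email" if field == "email" else -1
    return bad


rng = random.Random(7)  # fixed seed so the run is repeatable
valid_users = [make_user(rng, i) for i in range(5)]
invalid_users = [corrupt(user, rng) for user in valid_users]

for user in valid_users + invalid_users:
    print(user)
```

The valid records cover the case of a new product with no production data, and the corrupted copies cover the case where deliberately unrealistic input is needed.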
In short, synthetic data is a valuable alternative in situations in which you can’t or won’t use production data for your test data management needs.
Pros and Cons of Using Synthetic Test Data
Synthetic test data generation is a valuable and powerful process to have at your disposal. However, it’s no silver bullet; you might face non-trivial challenges when implementing it. Let’s now cover some of the pros and cons of using synthetic test data, starting with the cons, or “challenges”, as we dub them.
Challenges of Using Synthetic Data
Undoubtedly, the ability to create a new set of data without revealing precious private information is very powerful. It offers us numerous possibilities for research and testing. However, we need to keep in mind something important: Not everything that works on the synthetic data will work on the original data. We can only study general trends with synthetic data. It’s very important to remember this during the testing phase. But there are other challenges as well:
- Users may lack trust in the data.
- If the quality of the data model is not high, we will reach the wrong conclusions.
- Complex data behavior can be very difficult to replicate.
- If we simplify the representations in the data, the algorithms will not perform well.
- Synthetic data will require validation against the original data.
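The last point, validation against the original data, can start with something as simple as comparing summary statistics. The sketch below is one assumed way to do that: the function name, the 20% tolerance, and the sample values are all hypothetical, and a real validation would also compare distributions and relationships between columns.

```python
import statistics


def compare_summary(original, synthetic, tolerance=0.2):
    """Compare mean and stdev; flag each as ok if the relative gap is within tolerance."""
    report = {}
    for name, func in (("mean", statistics.mean), ("stdev", statistics.stdev)):
        orig_value = func(original)
        synth_value = func(synthetic)
        if orig_value != 0:
            gap = abs(orig_value - synth_value) / abs(orig_value)
        else:
            # Fall back to the absolute gap when the original statistic is zero.
            gap = abs(synth_value)
        report[name] = {
            "original": orig_value,
            "synthetic": synth_value,
            "ok": gap <= tolerance,
        }
    return report


# Hypothetical measurements.
original = [10, 12, 11, 13, 9, 14, 10, 12]
close_synthetic = [11, 12, 10, 13, 9, 13, 11, 12]
poor_synthetic = [20, 25, 22, 30, 18, 27, 21, 24]

good_report = compare_summary(original, close_synthetic)
bad_report = compare_summary(original, poor_synthetic)

print(good_report)
print(bad_report)
```

A check like this can run automatically whenever a synthetic dataset is regenerated, so that drift from the original data is caught before testing begins.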
Because of these challenges, it can be difficult to create and use synthetic data to its full potential. We need to have a detailed strategy before diving in. For example, it’s good to plan to create different types of synthetic data for different use cases.
Benefits of Using Synthetic Data
However difficult it is to use synthetic data, it comes with benefits. Synthetic data becomes more important as the database grows. It helps us overcome the complexity of the data and privacy issues.
Firstly, synthetic data can replicate the trends of the original data. As a result, we can use it without breaking privacy rules. Moreover, we can use it to simulate new situations and conditions, such as rare weather or equipment malfunctioning scenarios. It also works well for prototype testing.
Furthermore, we can use synthetic data to combat overfitting. Overfitting happens when the algorithm performs well while training but fails during testing. In this case, we can use synthetic data to help train the algorithm. We can use it to make the data well balanced.
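One simple way to balance a dataset with synthetic points is to add slightly perturbed copies of the under-represented class. The sketch below uses that approach as a stand-in for more sophisticated techniques; the function name, noise level, and sensor readings are hypothetical.

```python
import random
from collections import Counter


def balance_with_jitter(samples, rng, noise=0.05):
    """Add jittered copies of minority-class rows until every class has equal count.

    samples is a list of (features, label) pairs, where features is a list of floats.
    """
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    balanced = list(samples)
    for label, count in counts.items():
        pool = [features for features, lab in samples if lab == label]
        for _ in range(target - count):
            base = rng.choice(pool)
            # Small Gaussian noise so the new row is not an exact duplicate.
            jittered = [x + rng.gauss(0, noise) for x in base]
            balanced.append((jittered, label))
    return balanced


# Hypothetical sensor readings: 6 normal rows and only 2 fault rows.
samples = [
    ([0.9, 1.1], "ok"), ([1.0, 1.0], "ok"), ([1.1, 0.9], "ok"),
    ([0.8, 1.2], "ok"), ([1.2, 0.8], "ok"), ([1.0, 1.1], "ok"),
    ([3.0, 0.2], "fault"), ([3.2, 0.1], "fault"),
]

rng = random.Random(1)  # fixed seed so the run is repeatable
balanced = balance_with_jitter(samples, rng)

print(Counter(label for _, label in balanced))
```

The original rows are kept unchanged, and only the rare class receives synthetic additions.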
To sum up, synthetic data has multiple use cases. It also provides many solutions to real-world problems.
Synthetic Data as a Secret Recipe
After reading this post, do you feel that synthetic data will revolutionize your project? Certainly, it is exciting. We can solve many issues using it and improve our data models.
However, it’s important to rein in the expectations we have of synthetic data. More importantly, we need to understand how to use it and to what extent. Relying solely on synthetic data can make many algorithms basically useless in the long run.
Companies should test on synthetic data, but they have to test on the original data before deploying their algorithms.
It is important to keep in mind that synthetic data is not 100% accurate. In short, it does not fully match the statistical properties of the original data. This is a persistent problem, but advancements are being made in this area.
In conclusion, I would like to say that synthetic data is an important tool in any data-related job. We should not overlook its benefits, but we should not glorify it. Like any solution, it has its limitations. And, as a data scientist, you should be aware of those limits. It’s important to communicate to management that they shouldn’t rely heavily on synthetic data simply because privacy rules make it convenient. They should know that the end goal of using synthetic data during testing is to get good performance on real-world data.
Next Steps
Looking to address the needs of Data Management?
Why not ask us about DCS, our DataSec & DataOps solution?
It is a platform that uses automated intelligence to identify where data security exposures reside, rapidly remediate these risks without error (mask or encrypt), and centrally validate your compliance success. The solution also comes with IT delivery accelerators, including DataView for Data Mining and a DataOps Library for automation.
Innovate with Enov8, the IT Environment & Data Company.
Specializing in the Governance, Operation & Orchestration of your IT systems and data.
Delivering outcomes like
- Improved visibility of your IT Fabric,
- Streamlined Delivery of IT Projects,
- Operational Standardization,
- Security & Availability,
- DevOps / DataOps Automation,
- Real-Time insights supporting decision making & continuous optimization.
Our Key solutions include
- Environment Manager for IT & Test Environment Management.
- Release Manager for Enterprise Release Management & Implementation Planning.
- Data Compliance Suite (DCS) for Test Data Management, including Data/Risk Profiling /Discovery, Automated Remediation & Compliance Validation
Other TDM Reading
Enjoy what you read? Here are a few more TDM articles that you might find interesting.
Enov8 Blog: Types of Test Data you should use for your Software Tests?
Enov8 Blog: Why TDM is so Important!
Enov8 Blog: What is Data Fabrication in TDM?