Synthetic Data Generation

Product Highlights

We provide a solution for Synthetic Data Generation.

The problem

Deep learning has been successful in a wide range of application domains, such as computer vision, information retrieval, and natural language processing, due to its superior performance and promising capabilities. However, its success depends heavily on the availability of massive amounts of training data. As a result, progress in deploying deep learning models can be crippled in critical domains such as healthcare, where data privacy requirements are more stringent and large amounts of sensitive data are involved.

What makes the problem a problem?

Using these promising but data-hungry methods effectively therefore requires tackling the privacy issues that arise in domains such as medicine. A common practical way to handle sensitive information is to anonymize personally identifiable data. However, such approaches are susceptible to de-anonymization attacks, which has led researchers to explore alternatives. Privacy-preserving machine learning approaches have been developed to make systems immune to such attacks. The privacy challenge is usually compounded by auxiliary ones, because the data is complex and often noisy.

Who benefits from this?

One of the most promising privacy-preserving approaches is Synthetic Data Generation (SDG). Synthetically generated data can be shared publicly without privacy concerns and provides many collaborative research opportunities, including for tasks such as building prediction models and finding patterns.

The Power of Our Model

Generative Aspects

As SDG inherently involves a generative process, Generative Adversarial Networks (GANs) have attracted much attention in this research area, due to their recent success in other domains. GANs are not reversible, i.e., one may not use a deterministic function to go from the generated samples to the real samples, as the generated samples are created using an implicit distribution of the real data. For these reasons, we use GANs to further protect privacy and to generate highly realistic samples.

Privacy Amplification

However, naively using GANs for SDG does not guarantee that the system is privacy-preserving: relying on the fact that GANs are not reversible is not enough, as GANs have already been shown to be vulnerable. The problem becomes much more severe when privacy violations can have serious consequences, such as when dealing with sensitive patient data in the medical domain. The central question is how a system can claim to preserve the privacy of the original data. More precisely, how private is a system? How much information is leaked during the training process? Thus, there is a need to measure the privacy of a system in order to judge whether it is privacy-preserving or not. To guarantee privacy, we employ differential privacy in our system.


Why Differential Privacy?

Differential Privacy (DP) provides a mechanism to ensure and quantify the privacy of a system using a solid mathematical formulation. It has become the de facto standard for statistical exploration of databases that contain sensitive private data. The power of differential privacy lies in its precise mathematical formulation, which ensures privacy without preventing the model from supporting accurate statistical reasoning. Furthermore, differential privacy makes it possible to measure the privacy level of a system. As a strong notion of privacy, it becomes crucial in machine learning and deep learning, where the system itself often uses sensitive information to improve prediction performance.
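To make the idea of a measurable privacy level concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It is an illustration only, not a description of how our system applies differential privacy. A mechanism M is epsilon-differentially private if, for any two datasets D and D' that differ in one record and any set of outputs S, Pr[M(D) in S] <= exp(epsilon) * Pr[M(D') in S]; a smaller epsilon means stronger privacy and more noise. The function names, the toy ages list, and the choice of epsilon below are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count that satisfies epsilon-differential privacy.

    Hypothetical helper for illustration: adding or removing one record
    changes a count by at most 1, so the query's sensitivity is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Toy data (made up): ages of seven patients.
ages = [34, 51, 29, 62, 47, 58, 40]
rng = random.Random(0)  # seeded only to make the demo repeatable
noisy = private_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng)
print(f"noisy count of patients aged 50+: {noisy:.2f}")
```

The true count here is 3, and the released value is 3 plus Laplace noise of scale 1/epsilon. Lowering epsilon widens the noise, which is the trade-off between privacy and accuracy that the epsilon parameter quantifies.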


Instill AI is GDPR compliant. Your data is protected. For further details, please read more in our privacy policy.

Interested in learning more about our products and services?