Improved Method for Generating Synthetic Data Solves Major Privacy Issues in Research

The lack of data is a major bottleneck for many kinds of research, and especially for the development of better medical treatments and drugs. These data are extremely sensitive and, understandably, people and companies alike are often unwilling to share their information with others.

Researchers at the Finnish Center for Artificial Intelligence (FCAI) have developed a machine learning-based method that produces synthetic data on the basis of original datasets, making it possible for researchers to share their data with one another. This could solve the ongoing problem of data scarcity in medical research and other fields where information is sensitive.

The generated data preserve privacy while remaining similar enough to the original data to be used for statistical analyses. With the new method, researchers can conduct any number of analyses without compromising the identities of the individuals involved in the original experiment.

“What we do is we tweak the original data sufficiently so that we can mathematically guarantee that no individual can be recognized,” explains Samuel Kaski, Aalto University professor and director of FCAI, who coauthored the study.

Researchers have produced and used synthetic data before, but the new study solves a major problem with existing methods: to be useful in research, synthetic data need to be very similar to the original dataset. In practice, this has occasionally made it possible to identify individuals despite anonymization.

To address this problem, FCAI researchers make use of artificial intelligence, specifically probabilistic modelling. This enables them to use prior knowledge about the original data without getting too close to the properties of the particular dataset used as the basis for the synthetic data.

Using prior knowledge has also made the synthetic datasets more useful for making correct statistical discoveries, even in cases where the original dataset is limited in size, which is common in medical research.

The results were published June 7 in the journal Patterns.

Edited by Gary Cramer