Twinify solves data privacy issues

This video is part of the FCAI success stories series. In the video series, we explain why fundamental research in AI is needed and how research results create solutions to the needs of people, society and companies.


Researchers at FCAI have developed a machine learning-based method for producing research data synthetically. The resulting application allows academics and companies to share data with each other without compromising the privacy of the individuals involved in a study.

Data-driven technologies are revolutionizing many industries. However, in many areas of research – including health and drug development – there is too little data available because of its sensitive nature and the strict protection of individuals.

“When a person gets sick, of course, they want to get the best possible care. Then it would be important to have the best possible methods of personalized healthcare available”, says Samuel Kaski, Academy Professor and the Director of the Finnish Center for Artificial Intelligence FCAI.

However, developing such methods of personalized healthcare requires a lot of data, which is difficult to obtain because of ethical and privacy issues surrounding the large-scale gathering of personal data.

“For example, I myself would not like to give insurance companies my own genomic information unless I can decide very precisely what the insurance company will do with the information,” says Professor Kaski.

Many industries want to protect their own data so that they do not reveal trade secrets and inventions to their competitors. This is especially true in drug development, which requires big investments with high financial risk. As a result, the development of new drugs has stalled. If pharmaceutical companies could share their data with other companies and researchers without disclosing their own inventions, everyone would benefit.

The ability to produce data synthetically solves these problems. FCAI researchers found that synthetic data can be used to draw statistical conclusions that are as reliable as those drawn from the original data.

"The strong privacy guarantee by differential privacy allows conducting an unlimited number of future analyses on the synthetic data without further privacy concerns, which was not possible with previous approaches”, says Joonas Jälkö, a doctoral student in Professor Kaski's group.

The application works as follows: the researcher feeds the original dataset into the application, which builds a synthetic dataset from it. The synthetic data can then be shared with other researchers and companies in a secure way.
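To make the workflow concrete, here is a minimal Python sketch of what such a pipeline can look like. The names and parameters used here (fit_generative_model, sample_synthetic, epsilon, the file names) are illustrative assumptions, not Twinify's actual interface: the idea is simply that a probabilistic model is fit to the sensitive data under a chosen privacy budget and then sampled to produce the shareable twin.

```python
# Illustrative sketch only -- the names below are hypothetical and do not
# reflect Twinify's real API; see the Twinify repository for actual usage.
import numpy as np
import pandas as pd


def fit_generative_model(data: pd.DataFrame, epsilon: float) -> dict:
    """Hypothetical stand-in for fitting a probabilistic model to the
    sensitive data under a privacy budget epsilon. As a toy example, it
    records noisy per-column means and standard deviations; the noise
    scale is purely illustrative, not a calibrated DP mechanism."""
    rng = np.random.default_rng(0)
    model = {}
    for col in data.columns:
        noise = rng.laplace(scale=1.0 / epsilon, size=2)
        model[col] = (data[col].mean() + noise[0], data[col].std() + noise[1])
    return model


def sample_synthetic(model: dict, n_rows: int) -> pd.DataFrame:
    """Draw a synthetic twin from the (toy) fitted model."""
    rng = np.random.default_rng(1)
    return pd.DataFrame({
        col: rng.normal(loc=mean, scale=abs(std), size=n_rows)
        for col, (mean, std) in model.items()
    })


# 1. The researcher loads the original, sensitive dataset (hypothetical file name).
original = pd.read_csv("sensitive_health_data.csv")

# 2. A generative model is fit under the chosen privacy budget.
model = fit_generative_model(original, epsilon=1.0)

# 3. The synthetic twin is sampled and saved for sharing.
synthetic = sample_synthetic(model, n_rows=len(original))
synthetic.to_csv("synthetic_twin.csv", index=False)
```

Because the privacy protection is applied when the model is fit, anything computed from the synthetic twin afterwards inherits the same guarantee; this is the post-processing property of differential privacy that Jälkö refers to above.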

Researchers are further improving the application to make it easier to use and to add functionality.

“We released the application early to contribute to solving data scarcity during the pandemic. But the method is widely usable for other types of data as well. We welcome others to join developing it further!” says Kaski. 

Twinify your dataset and share the synthetic twin without sacrificing privacy! 

Text and video production by Mia Paju.


Twinify is a software package for privacy-preserving generation of a synthetic twin of a given sensitive dataset. The machine learning methods were developed in the Probabilistic Machine Learning Group in collaboration with the rest of FCAI’s Privacy-preserving and secure AI research program.

Learn more: