Quantifying and Protecting the Privacy of Genomic Data

Genomic data carries highly sensitive information about its owner, such as predispositions to diseases, ancestry, physical attributes, and the genomic data of relatives. Individuals share vast amounts of information on the Web, and some of this information can be used to infer their genomic data. Hence, there is a need to clearly understand the privacy risks to the genomic data of individuals in light of publicly available information on the Web. It is also crucial to protect the genomic privacy of individuals without compromising the use of genomic data in research and healthcare.

The two main objectives of this project are (i) to develop a new unifying framework for quantification of genomic privacy of individuals and (ii) to establish a complete framework for privacy-preserving utilization, sharing, and verification of genomic data.

In the first workpackage, we developed a new unifying framework to quantify the genomic privacy of individuals and contributed significantly to the state of the art. First, we showed how to match profiles of users across different platforms (and hence de-anonymize individuals). The resulting algorithm can also be used to de-anonymize the profiles of individuals on genome-sharing websites. Then, we showed how to infer the missing parts of the genomes of individuals. Our results show that the attacker’s inference power (on the genomic data of individuals) improves significantly when complex correlations and phenotype information are used (along with information about family bonds). We believe that this work is a significant step towards a better understanding of the privacy risks to the genomic data of individuals.

In the second workpackage, we developed techniques that provide recommendations such as how much genomic data to share, which regions of the DNA to share publicly, and what and how much information to share on health-related websites, social networks, and genealogy websites without compromising the genomic privacy of individuals. First, we proposed an optimization-based framework for sharing genomic data in public datasets while protecting against the inference of kinship relationships between individuals. Our work is the first in the literature to propose a solution to this privacy leakage problem. Next, we developed a differential privacy-based framework for sharing individuals’ genomic data while preserving their privacy. Unlike existing differential privacy-based solutions for genomic data (which consider privacy-preserving release of summary statistics), we focused on privacy-preserving sharing of the actual genomic data. In contrast to traditional differential privacy-based data sharing schemes, the proposed scheme does not intentionally add noise to the data; it is based on selective sharing of data points. The proposed framework can thus be seen as a new formulation of differential privacy for genomic data sharing, one that does not rely on noise addition. We think that it will have implications in other domains as well.

In the third workpackage, we developed techniques to support privacy-compliant credibility checks of genomic data even when it is only partially shared. We also developed techniques to address the liability issues that arise when genomic data is shared without the authorization of its owner. First, we proposed a scheme based on both homomorphic signatures and aggregate signatures that links the information about the legitimacy of the data to the consent and the phenotype (or the identity) of the individual. Thus, in order to verify the data, a party also needs to use the correct consent and phenotype of the individual who owns the data. We emphasize that the proposed scheme can be easily adopted by existing works on privacy-preserving processing of genomic data in order to obtain a complete pipeline. Next, we proposed a novel optimization-based watermarking scheme for sharing genomic data. In the case of unauthorized sharing of sensitive data, the proposed scheme can find the source of the leakage by checking the watermark inside the leaked data. The proposed scheme guarantees with high probability that (i) the malicious service provider (SP) that receives the data cannot identify the watermarked data points, (ii) when more than one malicious SP aggregate their data, they still cannot determine the watermarked data points, (iii) even if the unauthorized sharing involves only a portion of the original data or modified data (to damage the watermark), the corresponding malicious SP can be held responsible for the leakage, and (iv) the added watermark is compliant with the nature of the corresponding data. We believe that the proposed techniques will help both users and service providers when sharing and collecting genomic data.
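To make the leak-tracing idea concrete, the following is a minimal Python sketch of how a per-recipient watermark can point back to the SP whose copy was leaked, even when only part of the data is shared. It is a toy illustration, not the optimization-based scheme proposed in the project, and it does not provide guarantees (i), (ii), or (iv) above. The function names, the shared secret, the encoding of each data point as 0, 1, or 2, and the rule for changing a value are all assumptions made for this example.

```python
import random
from typing import Dict, List


def watermark_positions(sp_id: str, length: int, n_marks: int, secret: str) -> List[int]:
    """Derive the (hypothetical) watermark positions for one SP from a secret."""
    rng = random.Random(f"{secret}:{sp_id}")
    return rng.sample(range(length), n_marks)


def embed_watermark(data: List[int], sp_id: str, n_marks: int, secret: str) -> List[int]:
    """Return a copy of the data with SP-specific positions altered.

    Assumption: each data point is encoded as 0, 1, or 2, and a mark
    shifts the value by one (mod 3).
    """
    marked = list(data)
    for pos in watermark_positions(sp_id, len(data), n_marks, secret):
        marked[pos] = (marked[pos] + 1) % 3
    return marked


def trace_leak(leaked: Dict[int, int], original: List[int], sp_ids: List[str],
               n_marks: int, secret: str) -> str:
    """Return the SP whose watermark best matches a (possibly partial) leak.

    `leaked` maps position -> value, so a partial leak simply omits positions.
    """
    scores = {}
    for sp_id in sp_ids:
        expected = embed_watermark(original, sp_id, n_marks, secret)
        positions = watermark_positions(sp_id, len(original), n_marks, secret)
        # Count watermark positions where the leak carries this SP's marked value.
        scores[sp_id] = sum(1 for pos in positions if leaked.get(pos) == expected[pos])
    return max(scores, key=scores.get)


# Example: sp_b leaks only the first 200 of 300 data points.
original = [0] * 300
sp_ids = ["sp_a", "sp_b", "sp_c"]
copy_b = embed_watermark(original, "sp_b", 30, "demo-secret")
partial_leak = {pos: copy_b[pos] for pos in range(200)}
suspect = trace_leak(partial_leak, original, sp_ids, 30, "demo-secret")
print(suspect)
```

In this sketch the watermark survives a partial leak only because enough marked positions remain in the shared portion; resisting collusion between SPs and keeping the marks consistent with the nature of genomic data are the parts that require the optimization described above.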

In the fourth workpackage, we explored the privacy risks of interactive genomic databases. Initially, we focused on genomic data sharing beacons. We proposed a novel re-identification attack and showed that the privacy risk is more serious than previously thought. To determine beacon membership under the same conditions, our attack needs less than 0.5% of the number of queries that existing works require. We further showed that countermeasures such as hiding certain parts of the genome or setting a query budget for the user would fail to protect the privacy of the participants under our adversary model. In ongoing work, we are studying other scenarios for the identified attack as well as protection techniques that would help us develop dynamic access control for genome sharing beacons.
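For readers unfamiliar with the setting, the sketch below shows in Python what a beacon query and a naive membership test look like. It is only a baseline illustration of the setting and not the attack proposed in this project, which needs far fewer queries. The class and function names and the encoding of a variant as a (position, allele) pair are assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

Variant = Tuple[int, str]  # hypothetical encoding: (position, allele)


class ToyBeacon:
    """Answers only yes/no: does any genome in the dataset carry this variant?"""

    def __init__(self, genomes: Iterable[Set[Variant]]):
        self._variants: Set[Variant] = set().union(*genomes)

    def query(self, position: int, allele: str) -> bool:
        return (position, allele) in self._variants


def naive_membership_test(beacon: ToyBeacon, victim_variants: List[Variant],
                          budget: int) -> Tuple[bool, int]:
    """Query the victim's known variants, up to `budget` queries.

    Returns (possibly_member, queries_used). A single "no" rules out
    membership in this noise-free toy model; all "yes" answers are only
    evidence of membership, not proof.
    """
    used = 0
    for position, allele in victim_variants[:budget]:
        used += 1
        if not beacon.query(position, allele):
            return False, used
    return True, used


beacon = ToyBeacon([{(101, "A"), (205, "T")}, {(101, "A"), (310, "G")}])
member_result = naive_membership_test(beacon, [(101, "A"), (205, "T")], budget=10)
outsider_result = naive_membership_test(beacon, [(101, "A"), (999, "C")], budget=10)
print(member_result, outsider_result)
```

The `budget` parameter mirrors the query-budget countermeasure mentioned above: in this naive test a small budget limits the evidence an attacker can collect, whereas the project shows that such a budget fails to protect participants under its adversary model.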

In the last workpackage, we developed privacy-preserving genomic data sharing and utilization techniques between different entities. Notably, (i) we developed a privacy-preserving solution for compressed storage of raw genomic data that outperforms all existing techniques (both in terms of storage overhead and privacy), (ii) we developed, for the first time, a system with one-time programming functionality for genomic testing, and (iii) we developed a system for brute-force resilient management of healthcare (and also genomic) data.

Overall, this project had a positive impact on the European Union. The project results have provided a new vision for the protection of healthcare data. The project idea and results have been presented to several research groups in the EU (including groups in Luxembourg, Belgium, France, and Norway). The project has also contributed to European policies on data protection: in a talk given at the invitation of the EU, Dr. Ayday presented his research ideas about GDPR and provided recommendations.

This project is a significant step towards understanding the privacy risks to the genomic data of individuals and protecting the privacy of that data. It provides a new vision for the security and privacy of health-related data in general and is expected to have implications in other domains such as banking and online social networks. The results of the project also have an impact on future policies and legislation on the protection of health-related data.

Periodic Reporting for period 1 - GenoPri (Quantifying and Protecting the Privacy of Genomic Data)
