Computer scientists to develop machine learning tools to analyze massive sets of genomic data

Jun 8, 2017

By UCLA Samueli Newsroom

National Science Foundation grant could help lead to better treatments for genetic diseases

UCLA computer scientists Sriram Sankararaman and Ameet Talwalkar have received a $718,000 grant from the National Science Foundation to apply machine learning tools to the analysis of massive sets of human genomic data. Ultimately, this research could lead to better treatments for genetic diseases.

Specifically, the pair will focus on genome-wide association studies, which examine the entire genetic code of anywhere from thousands to millions of people in order to find genetic variations associated with a certain disease. These genetic variations can be used to develop better strategies to detect, treat, and prevent diseases.
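To make the idea concrete, the core of a genome-wide association study can be sketched as one statistical test per genetic variant, followed by a correction for the number of tests performed. The example below is a minimal illustration and not the researchers' method: it assumes a simple chi-square test on allele counts with a Bonferroni correction, and the variant names, counts, and function names are all made up for illustration.

```python
import math


def allele_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    # An empty row or column carries no information about association.
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        return 0.0
    diff = case_alt * ctrl_ref - case_ref * ctrl_alt
    return n * diff * diff / (row_case * row_ctrl * col_alt * col_ref)


def p_value(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def scan(variants, alpha=0.05):
    """Return (name, p) for variants that pass a Bonferroni-corrected threshold."""
    if not variants:
        return []
    threshold = alpha / len(variants)  # one test per variant
    hits = []
    for name, counts in variants.items():
        p = p_value(allele_chi2(*counts))
        if p < threshold:
            hits.append((name, p))
    return hits


# Hypothetical counts: (case_alt, case_ref, ctrl_alt, ctrl_ref) per variant.
toy_variants = {
    "variant_A": (300, 700, 200, 800),
    "variant_B": (250, 750, 245, 755),
    "variant_C": (500, 500, 498, 502),
}

for name, p in scan(toy_variants):
    print(name, p)
```

With three variants this runs instantly. The computational challenge arises when the same kind of test, or a more sophisticated statistical model, has to be applied across millions of variants and up to millions of people.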

“Currently, there’s a good set of tools to deal with moderately sized genomic data sets,” said Sankararaman, an assistant professor of computer science. “Naturally, the next generation of tools should target modern, massive data sets. Ultimately, we want to enable the next set of discoveries.”

In the previous decade, landmark studies such as the Human Genome Project, which sequenced the entire human genome, and the International HapMap Project, which aimed to describe common patterns of human genetic variation involved in human health and disease, helped make genomic sequencing data readily available.

However, with such vast quantities of data available, the problem now lies in the statistical and computational challenges of processing and analyzing them. Sankararaman and Talwalkar hope to use their expertise in computational genomics and machine learning to develop methods for performing statistical analyses on these large-scale genomic data sets. Specifically, they will focus on providing the tools and software needed to carry out those analyses.

“We’re very excited to be working on this,” said Talwalkar, an assistant professor of computer science. “This is a relatively new field of research for us as well, so it’s exciting to be able to continue working on it.”

Sankararaman’s current research interests focus on developing novel statistical models and algorithms to analyze large-scale genomic data, with the aim of understanding evolutionary processes and the genetic basis of complex phenotypes – for example, identifying how genetic changes affect risk for a disease.

Talwalkar’s main research interests include problems related to scalability and ease-of-use in the field of statistical machine learning, with applications in computational genomics.

While this research project is one of their first collaborations, Sankararaman and Talwalkar have known each other for a long time. In 2010, both were at UC Berkeley – Sankararaman as a fourth-year Ph.D. student, and Talwalkar as a postdoctoral scholar – and both were advised by Michael I. Jordan, a leading researcher in machine learning, statistics, and artificial intelligence. Now, Sankararaman and Talwalkar have neighboring offices, and they are confident that their complementary skill sets will allow them to make substantial progress in this field of research.

In the long run, they hope to push the field forward in its ability to draw statistical insights from massive data sets. An additional plus is being at UCLA, a leading research institution in computational biology and bioinformatics that consistently ranks among the top ten universities worldwide in the broad subject area of biological sciences.

Given the current bottleneck in computing on and analyzing these massive data sets, Sankararaman and Talwalkar’s research has enormous implications for current and future research efforts within the human genetics community.

By Lucy Liao
Originally published at the UCLA Computer Science Department website.
Image L to R: Sriram Sankararaman and Ameet Talwalkar.
