Data mining for new medicines: 3Qs with Yizhou Sun

Sep 29, 2017

By UCLA Samueli Newsroom

Yizhou Sun, an assistant professor of computer science at UCLA, studies large-scale information and social networks. Beyond just the data in those networks, Sun and her students are looking to design scalable algorithms that take a much deeper look at the millions and millions of connections and nodes in them. Specifically, they want to find ways to extract insights and predictions that would not be obvious on the surface. Sun joined UCLA in 2016, after starting her career in academia at Northeastern University in Boston. Her honors include a CAREER award from the National Science Foundation, one of the top honors for researchers in the early part of their careers. In July, Sun received a second NSF grant. She answered questions on her research, as well as how she gets students interested in the field.

For students who may not have an initial interest in large-scale information networks and data mining, what do you tell them about what you’re looking at, and why they might be interested in being involved in research in this emerging area?

We are living in an interconnected world. Most data or informational objects, individual agents, groups, or components are interconnected or interact with each other, forming numerous, large, interconnected, and sophisticated networks. Without loss of generality, such interconnected networks are called information networks. For example, in social networks, people connect to each other via friendships. Links in such networks indicate the interactions between objects in the network, and usually imply similarity or influence between the objects that can hardly be expressed by traditional features. Clearly, information networks are ubiquitous and form a critical component of modern information infrastructure. The analysis of information networks, or special kinds of them such as social networks and the Web, has gained extremely wide attention from researchers in computer science, social science, physics, economics, biology, and so on, with exciting discoveries and successful applications across all of these disciplines.

The research supported by the CAREER Award is on social networks. What kind of information have you found in exploring these giant datasets, and what can this information be used for?

In this line of research, we try to understand the motivation behind people’s behaviors, which can be modeled as interactions between people and other entities, rather than treating them simply as nodes and links. We have explored many social network and social media datasets, such as Facebook, Twitter, online forums, and political voting networks. We have proposed algorithms that can successfully identify people’s political ideology, their stance on particular political issues, their personal traits, and their roles in some online events. These techniques can be used for better policy design, political campaigns, online extremist detection, and identifying users who need special help.

The research supported by the most recent NSF grant you received will look to improve how users interact with a massive database of biomedical literature. For the doctors and scientists who use this massive database, what do they “see” now when they search? Following this work, what do you hope they’ll see, and how will it be organized behind the screen? And finally, how can they then use that information?

PubMed is an online biomedical literature database that collects millions of publications every year. Unfortunately, no single human being can read all of these publications, even just the ones most related to their research or practice area. The current system provides some primitive search functions that are mostly based on keyword matching, which can hardly meet the needs of doctors and scientists, as the returned results are only articles that contain the exact input keywords. We propose a new data-to-network-to-knowledge (D2N2K) paradigm to transform massive, unstructured but interconnected research text data into actionable knowledge, by integrating semi-structured and unstructured data. By doing so, users can issue much more advanced queries, such as finding the top recommended treatments for a disease or the top related gene mutations for a disease, without going through the details of all the papers. The system can be very helpful for doctors and scientists, who can then provide better diagnoses for patients, recommend better treatments, and even form new hypotheses.
