How researchers are fighting lung cancer using PageRank

Google’s PageRank algorithm has forever changed the way we access information by putting the best stuff first, and now researchers are using the same mathematical models that Google uses to fight the spread of lung cancer within the human body. While there’s no “best” when it comes to cancer cells, the aim is to identify tumors more likely to metastasize and then hit them with targeted treatment before the cells have a chance to spread.

The researchers come from the University of Southern California, Scripps Clinic, the Scripps Research Institute, the University of California, San Diego Moores Cancer Center and Memorial Sloan-Kettering. They combined autopsy data from 163 cancer cases (all from before the advent of radiation therapy, in order to analyze the natural spread) with applied mathematics to carry out their study. What they found, according to a press release about the research, is that

metastatic lung cancer does not progress in a single direction from primary tumor site to distant locations, which has been the traditional medical view. Instead … cancer cell movement around the body likely occurs in more than one direction at a time.

How cancer cells spread. Source: PLOS One

Moreover, they found certain organs tend to spread cancer cells more aggressively, while others tend to act as sponges for cancer cells. These sponge organs might still grow tumors, they just don’t disperse the cells.

The PageRank analogy

The mathematics involved here, known as Markov chain models, is similar to what Google uses to determine which web pages are the highest quality for any given search query. But whereas Google uses the number and quality of links to determine the probability of a web surfer landing on any given page, these researchers are trying to predict the PageRank of tumors, if you will. So, generally speaking, a kidney would likely have a higher PageRank than a liver because the kidney is more likely to spread cancer cells throughout the body (or, in web-search terms, generate a lot of links to itself).
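To make the Markov chain idea a little more concrete, here is a minimal Python sketch of a PageRank-style calculation. It is not the researchers' model and not Google's production algorithm. The site names, the link weights, the damping value of 0.85 and the function name pagerank are all assumptions chosen for illustration. The sketch treats each node as a state, turns the weighted links leaving it into transition probabilities, and repeats the update until the long-run probability of a random walker sitting at each node settles down.

```python
# Illustrative only: site names and weights are invented, not study data.
# Assumption: every link target also appears as a key in LINKS.
LINKS = {
    "lung": {"lymph_nodes": 3.0, "adrenal": 1.0, "liver": 1.0},
    "lymph_nodes": {"lung": 1.0, "liver": 1.0},
    "adrenal": {"lung": 1.0, "liver": 2.0, "lymph_nodes": 1.0},
    "liver": {"lung": 1.0},
}


def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Return the long-run visit probability of each node (power iteration)."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform guess

    for _ in range(max_iter):
        # With probability (1 - damping) the walker jumps to a random node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}

        for src, targets in links.items():
            total = sum(targets.values())
            if total == 0:
                # A node with no outgoing links shares its score evenly.
                for node in nodes:
                    new_rank[node] += damping * rank[src] / n
                continue
            # Otherwise the score follows links in proportion to their weight.
            for dst, weight in targets.items():
                new_rank[dst] += damping * rank[src] * weight / total

        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break

    return rank


# Print the hypothetical sites from highest to lowest score.
for site, score in sorted(pagerank(LINKS).items(), key=lambda kv: -kv[1]):
    print(f"{site}: {score:.3f}")
```

The scores always sum to 1, so each one can be read as the share of time a random walker would spend at that node under these made-up weights.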

The network path of cancer cells from lung to liver. Source: PLOS One

As data volumes proliferate and relationships between data points become more complex, Markov models are actually becoming pretty popular. Netflix uses them to predict the movies users will want to watch next.

The states or web pages or whatever someone is ranking, along with the weighted connections between them, are often expressed as the nodes and edges of a graph. Graphs, of course, have become part of the everyday web lexicon thanks to the various social graphs and interest graphs that analyze who we’re connected to (and how) and the types of topics we browse online.
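As a rough illustration of what "nodes and edges" means in practice, the Python sketch below stores a small weighted graph and totals the weight flowing out of and into each node. The node labels, the weights and the helper names build_graph and strengths are hypothetical and are not taken from the study or from any particular library.

```python
# Hypothetical weighted edges as (source, target, weight); values are invented.
EDGES = [
    ("A", "B", 2.0),
    ("A", "C", 1.0),
    ("B", "C", 3.0),
    ("C", "A", 1.0),
]


def build_graph(edges):
    """Turn an edge list into a dict of {node: {neighbor: weight}}."""
    graph = {}
    for src, dst, weight in edges:
        graph.setdefault(src, {})
        graph.setdefault(dst, {})  # make sure targets are nodes too
        # Repeated edges between the same pair add their weights together.
        graph[src][dst] = graph[src].get(dst, 0.0) + weight
    return graph


def strengths(graph):
    """Return (outgoing, incoming) total edge weight for every node."""
    out_strength = {node: sum(targets.values()) for node, targets in graph.items()}
    in_strength = {node: 0.0 for node in graph}
    for targets in graph.values():
        for dst, weight in targets.items():
            in_strength[dst] += weight
    return out_strength, in_strength


graph = build_graph(EDGES)
out_strength, in_strength = strengths(graph)
for node in sorted(graph):
    print(node, "out:", out_strength[node], "in:", in_strength[node])
```

Comparing a node's outgoing and incoming totals is one simple way to see whether it mostly sends or mostly receives along its edges. That is only a loose analogy for the spreading and sponge-like organs described earlier, not the measure the researchers used.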

The web as a data science proving ground

So in the end, perhaps, the most important contribution of the World Wide Web won’t be the revolution in how we access information, but the web’s function as a proving ground for advanced statistical methods applied to very large and complex data sets like those found in the medical world. Already, for example, another group of medical researchers has used a Markov variant to create a model they think can prescribe better treatment plans because it analyzes the costs and patient outcomes usually associated with a given treatment for a given symptom.

Tracking a cholera outbreak across a river network. Source: Physical Review Letters

Last year, a group of Swiss researchers developed an algorithm that, having access to a relatively small amount of data, can track anything from Twitter rumors to disease outbreaks back to their source. A company called Syapse uses the graph structure to chart the relationships among words across different medical specialties.

One would also be remiss in ignoring the computing and data-storage innovation spurred by the web that has improved our ability to handle massive amounts of genetic and other data. As the lung cancer researchers explain in their paper:

One of the strengths of such a statistical approach is that we need not offer specific biomechanical, genetic, or biochemical reasons for the spread from one site to another, those reasons presumably will become available through more research on the interactions between CTCs and their microenvironment. We [have created] a quantitative and computational framework for the seed-and-soil hypothesis as an ensemble based first step, [that] then can be further refined primarily by using larger, better, and more targeted databases such as ones that focus on specific genotypes or phenotypes, or by more refined modeling of the correlations between the trapping of a CTC at a specific site, and the probability of secondary tumor growth at that location.

The long story short is that the more data we have and the more easily we can analyze and map it, the better we can treat, and perhaps even cure, cancer and other complicated diseases.

Feature image is a network map of how lung cancer spreads between organs, where each numbered node correlates with a specific organ.