Biomarker discovery & validation – how to manage data effectively to maximize the odds of project success

The ability to find and validate novel biomarkers rests on a combination of data from literature sources, high-quality high-throughput data and unbiased analysis. The approaches used by research groups can be broadly classified as data-driven, knowledge-driven and integrated. Of the three, the integrated approach is likely to yield the best results and maximize the odds of a biomarker project succeeding. However, for this approach to be implemented successfully, much work remains to be done on the tools and infrastructure required.

Data-driven approach:

High-throughput data are analyzed on their own to determine the critical marker set. The data are subjected to various statistical methods, broadly grouped as data reduction, clustering and visualization (a brief code sketch follows the list below).

  • Data reduction makes complex biological datasets easier to understand and helps to eliminate noise in measurements.
  • Clustering helps classify molecules based on expression pattern, phenotype, disease subtype, tumor size, serological markers or metastasis. This process helps identify similar markers, predict the function of unknown markers and determine differentially expressed gene sets. Alongside unsupervised clustering, supervised methods such as regression analysis, support vector machines, decision trees and random forests are commonly used for this kind of classification.
  • Visualization allows a user to look at an image and extract conclusions that would not be evident otherwise. An additional benefit is that the process of summarizing and drawing the data often becomes an analysis in itself, yielding novel observations. Principal component analysis and network analysis are widely used for visualization.
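To make this workflow concrete, here is a minimal sketch in Python, assuming scikit-learn and matplotlib are available. The expression matrix is randomly generated and the cluster count is arbitrary, purely for illustration, not a recommended analysis recipe.

```python
# Minimal sketch of a data-driven workflow: dimensionality reduction (PCA),
# unsupervised clustering (k-means) and a 2-D visualization of the result.
# The expression matrix here is randomly generated purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical matrix: 100 samples x 2,000 gene-expression measurements
expression = rng.normal(size=(100, 2000))

# Data reduction: project onto the first two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(expression)

# Clustering: group samples into putative subtypes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Visualization: scatter plot of samples in PCA space, colored by cluster
plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap="viridis")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Samples in PCA space, colored by k-means cluster")
plt.show()
```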

Knowledge-driven approach:

Data become meaningful only when they are refined into hypotheses, robust reports and analyses, and reliable results. Systems biology has emerged as a good approach for understanding biological processes at the molecular level and for understanding disease biology. The information extracted helps reduce the dimensionality of large datasets. Interactions, networks and pathways help identify patterns across biological processes and support exploratory analysis.

  • Protein-protein interaction networks: Physical binding interactions between proteins play a key role in cellular processes. To understand the mechanisms underlying disease progression at the molecular level, it is critical to identify, characterize and interpret protein interactions. Most interaction data are organized around the experimental procedure used to detect the interaction and its mechanism (a small network sketch follows this list).
  • Pathway analysis: The concept of a pathway arises from the fact that genes do not function in isolation. A gene can only be understood when its upstream and downstream molecules, regulators and activators are known. Pathway visualization also helps in comprehending a gene's behaviour in a disease state.
  • Disease-specific repositories: Disease is the outcome of variation arising not only at the gene level, but also from changes at the protein level and in protein-protein interactions, cellular interactions, and molecular synthesis, degradation and transport mechanisms. A disease-specific platform gives the user a single-point overview of the various factors contributing to a disease. A systems biology approach goes one step deeper, allowing the researcher to visualize the molecular and cellular changes that follow changes in molecular concentration or elapsed time.
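As a rough illustration of the protein-protein interaction step, the sketch below builds a tiny network with networkx and ranks proteins by degree centrality as a simple proxy for hub status. The edge list is illustrative, standing in for what would normally come from a curated interaction database.

```python
# Minimal sketch of a knowledge-driven step: build a small protein-protein
# interaction (PPI) network and rank proteins by degree centrality to flag
# potential hubs. The edge list is illustrative; in practice it would come
# from a curated interaction database.
import networkx as nx

ppi_edges = [
    ("TP53", "MDM2"), ("TP53", "EP300"), ("TP53", "ATM"),
    ("EGFR", "GRB2"), ("GRB2", "SOS1"), ("EGFR", "SHC1"),
    ("MDM2", "MDM4"),
]

graph = nx.Graph(ppi_edges)

# Degree centrality as a simple proxy for "hubness" in the network
centrality = nx.degree_centrality(graph)
for protein, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{protein}\t{score:.2f}")
```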

Integrated approach:

This approach brings together very noisy but large datasets with a well-interconnected knowledge base built from various ontologies and literature sources. It can overcome the limitations of the data-driven and knowledge-driven approaches when either is used in isolation, by integrating data exploration (data preprocessing, reduction, clustering) with confirmatory analysis (pathways, networks, the diseasome). Using this approach, a researcher can study and organize the data based on data prevalence, repeated occurrence, weighted algorithms, literature evidence and more. The integrated approach is now widely used in biomarker discovery from gene expression data, where expression patterns are overlaid on canonical and disease-specific pathways, and hypotheses and models are derived and verified using both exploratory and confirmatory analysis (a short sketch follows).
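One minimal sketch of the confirmatory half of this workflow: a hypergeometric over-representation test of a differentially expressed gene list against a pathway gene set, using scipy. The gene identifiers and set sizes below are placeholders, not real results.

```python
# Minimal sketch of confirmatory analysis in an integrated workflow: test
# whether a list of differentially expressed genes is over-represented in a
# pathway gene set, using a hypergeometric test. All identifiers and sizes
# are placeholders for illustration.
from scipy.stats import hypergeom

background_size = 20000                          # genes measured on the platform
pathway_genes = {"TP53", "MDM2", "ATM", "CHEK2", "CDKN1A"}
de_genes = {"TP53", "CDKN1A", "BRCA1", "EGFR"}   # differentially expressed genes

overlap = len(pathway_genes & de_genes)
# P(X >= overlap) under the hypergeometric null
p_value = hypergeom.sf(overlap - 1, background_size,
                       len(pathway_genes), len(de_genes))
print(f"overlap={overlap}, p-value={p_value:.3g}")
```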

To accelerate biomarker discovery and validation, it is necessary to integrate high-throughput omics data and literature evidence with clinical patient data. Although challenges exist, such as ontology mapping, data complexity, and data handling and transformation, developing new, precise and accurate biomarker candidates requires bringing together clinical observations, curated phenotypic data, adverse reactions, biomarker expression data, pathway data, genotype profiles and unstructured text from journals and conference abstracts. tranSMART, an open-source data management system for storing, sharing and analyzing patient data, is a good platform, and many research groups are building on it to develop an end-to-end biomarker discovery platform (a simple illustration of this kind of integration follows).
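As a simple illustration of the integration such a platform automates, the sketch below joins hypothetical clinical records with biomarker expression values on a shared patient identifier using pandas. All column names and values are invented and do not come from tranSMART itself.

```python
# Minimal sketch of joining clinical patient data with biomarker expression
# values on a shared patient identifier. Column names and values are
# hypothetical, for illustration only.
import pandas as pd

clinical = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "disease_subtype": ["luminal", "basal", "luminal"],
    "age": [54, 61, 47],
})

expression = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "ERBB2_expr": [2.1, 8.7, 1.9],
})

# Integrate on the patient identifier, then compare expression by subtype
merged = clinical.merge(expression, on="patient_id", how="inner")
print(merged.groupby("disease_subtype")["ERBB2_expr"].mean())
```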

Must-have features of an integrated platform

  • Ability to both integrate and mine complex datasets from different sources
  • Algorithms to perform the biomarker validation and selection process in a robust and unbiased manner
  • Option to perform disease/biomarker comparison studies across different clinical trial parameters (patient demographics, disease subtypes, therapy used, etc.)
  • Option to perform functional analysis of biomarker candidates against canonical and manually curated disease-specific pathways
  • Inclusion of sample collection, processing and patient demographics data when validating a biomarker candidate
  • Option to integrate additional data from public repositories such as Entrez Gene, Swiss-Prot, dbSNP, COSMIC, SNOMED and MeSH
  • Intuitive interface with pre-packaged analyses and workflows
  • Scalable platform capable of easily integrating with third-party databases/tools

Figure: Data flow from heterogeneous sources to a data warehouse to a web portal.

As newer techniques evolve, increasing amounts of big data are being generated. Research faculty and bioinformaticians will have to approach data management with a combination of statistical techniques, text analytics, cause-effect knowledge models and dynamic data-warehouse platforms. Time will tell which combination delivers the most successful data management solution for biomarker data.

What are some problems you face with managing data from biomarker-related studies? Are there solutions you're coming across that are helping improve your workflow and increase your odds of success?

Please share comments below.