Data Science Research

Multi-type Data Integration and Fusion

Data Integration Example

Brandt-Pearce, Brown, Guerlain, and Horowitz have extensive research programs funded by the Department of Defense in signal processing, data fusion, visualization and human factors, and cyber security, with results that we intend to use to directly support this IGERT. The military is both a producer and consumer of vast quantities of data. The data come from a variety of sensors, including radars, hyperspectral imagery, and receivers across the electromagnetic spectrum. At an abstract level, we can view these data as the outputs of point or vector processes in space and time.

The fusion processes developed in our Predictive Technology Laboratory (Brown) take these data from multiple sources and combine them using hierarchical models. These hierarchical models have components that represent the different sources of data and enable the estimation of the dependencies between the components. For example, we can combine multiple layers of remotely sensed data about an area such as slope, vegetation, surface materials, roads, hydrology, and man-made obstacles and use the resulting integrated model to predict land use.

To make predictions for evolving or changing processes, we use dependency structures. The methods we have developed can exploit massive amounts of contextual data, such as that shown, but they can also use other aspects of the dynamic environment, such as the movements of objects and the changing characteristics of objects in the area.

The overall hierarchical modeling framework we have constructed has broad applicability. However, it is only now being implemented in high-performance environments for use on large, multi-type data integration problems. The key ongoing research in support of this IGERT is the translation of this framework to XSEDE and other CI elements, together with its use on and extension to the more extensive data types found in a broader range of interdisciplinary problems.

Brandt-Pearce’s research in sensor network data processing has focused on applications in health and defense. Body sensor networks (BSNs) collect biometric data from individuals and can produce enormous volumes of data if used over extended periods of time. In Brandt-Pearce’s research, six-degree-of-freedom gait data were collected using a BSN (developed at UVa) and processed to assist in the diagnosis of normal pressure hydrocephalus (NPH). The data were augmented by physician input and clinical observation, making the resulting problem typical of multi-type signal processing. Meaningful features were extracted from the data, and a support vector machine was subsequently used to classify the data. Her signal processing results in these healthcare and defense applications bear directly on the application and creation of new methods in the research projects of students in this IGERT.
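The feature-extraction-then-classification step described above can be illustrated with a minimal sketch. It assumes a linear support vector machine trained by sub-gradient descent on the regularized hinge loss, which is one common way to fit such a classifier and is not necessarily the method used in the NPH study. The function names, the two features, and every number below are hypothetical and invented for illustration.

```python
# Toy linear SVM trained by sub-gradient descent on the regularized hinge loss.
# All feature values and labels below are invented for illustration only.

def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=100):
    """Return (weights, bias) for a linear SVM; labels must be -1 or +1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:
                # Inside the margin: step along the hinge-loss sub-gradient.
                w = [wj - lr * (lam * wj - y * xj) for wj, xj in zip(w, x)]
                b += lr * y
            else:
                # Outside the margin: only the regularizer shrinks the weights.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Classify one feature vector as +1 or -1 from the sign of the score."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, standardized two-feature gait summaries (made-up numbers).
features = [(1.0, 2.0), (2.0, 3.0), (2.0, 1.5),
            (-1.0, -1.5), (-2.0, -1.0), (-1.5, -2.5)]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(features, labels)
predictions = [predict(weights, bias, x) for x in features]
```

In practice the feature vectors would come from the processed sensor signals and clinical inputs, and the classifier would be evaluated on held-out subjects rather than on its own training data.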

Systems Biology

With the development of high-throughput experimental techniques, biology has experienced an explosion of data characterizing the entire component list of cells and tissues. Hundreds of genome sequences, associated gene and protein expression arrays, genome-wide transcription factor binding data, and genome-scale protein-protein interaction maps are available and have necessitated the development of quantitative frameworks that integrate these data into predictive, computational models. With such models, we can begin to predict how cells and tissues respond to a variety of genetic and environmental perturbations. At the University of Virginia, researchers are applying computational tools to integrate large, heterogeneous data to build models, make predictions, and experimentally validate these predictions to address a variety of questions. The members of our IGERT team, Peirce-Cottler, Janes, Papin, and Wu, are working to address questions in microbial physiology, evolution, cellular signaling, and tissue patterning, with potential implications for a host of human pathologies, including cancer, cardiovascular disease, and infectious disease, and for a host of bioengineering applications, including bioenergy and bioremediation.

Biological Networks

Peirce-Cottler’s research for this IGERT addresses questions pertaining to the growth and regeneration of blood vessels, processes that are highly complex, span a wide range of spatial and temporal scales, and can only be described by multi-type data sets. We are specifically interested in understanding how blood vessels (and the biological cells that comprise them) respond to environmental changes (biochemical and mechanical), which are elicited by diseases such as diabetes, cancer, and heart disease. In turn, we investigate how these responses impact the vessels’ functional abilities to deliver blood and nutrients to tissues. Fundamental research questions include: how do changes in blood flow to a tissue impact the regenerative capacity of vascular cells within that tissue; what cell types invigorate a vascular growth response and how do biochemical signals orchestrate this process; what is the role of stem cells; why do vascular beds in different tissues behave differently from one another; and what gives rise to person-specific variability in this context? The types of data that are needed to answer these questions are as varied as the biological components in this complex system, and include gene, protein, cell, and tissue-level data. Our use of agent-based modeling to assemble disparate data sets into a cohesive computational framework enables us to uncover systems-level cause-and-effect relationships that are unattainable using standard experimental methods.

Janes’ research for this IGERT addresses key questions in the regulation of intracellular processes. A major component of cellular regulation involves diffusible ligands, which bind to transmembrane receptors and thereby activate multiple signaling pathways inside the target cell. These signaling pathways crosstalk with one another and are further controlled by multiple layers of feedback. Many important classes of signaling proteins are known. However, measurements of their function are usually time-consuming and limited in the number of samples that can be handled simultaneously. The Janes group approaches the challenge of signaling-network measurements from an engineering perspective. The design goal is to develop bioassays that are sensitive, quantitative, and as high-throughput and multiplex as possible. Application of these new assays typically generates many thousands of experimental data points that must be mined for meaning. Thus, the Janes research group actively uses, refines, and develops “data-driven” modeling approaches that can be immediately applied to multi-type biological data.

Papin’s research for this IGERT is focused on the study of metabolism and transcriptional regulation in microbes and human cancer cells. The activity and control of metabolic processes are fundamental to 1) the growth of a microbial population in its environmental niche as well as in pathological settings, and 2) the uncontrolled proliferation of a cancer cell, leading to tumor development. Metabolic and transcriptional networks are highly interconnected with thousands of genes associated with thousands of reactions interconverting thousands of metabolites. The experimental technologies used to interrogate these networks can characterize multiple facets of these processes, yet each only provides a snapshot of the diverse components and processes necessary for their complete functionality. Papin’s work develops computational methods to integrate these multi-type data to generate predictive models of cellular function and high-throughput methodologies to validate these predictions.

The IGERT research interests of Wu focus on how microbes living on human bodies (also known as the human microbiota) affect human health and disease. The human microbiota encompasses thousands of different bacterial species, which form dynamic and complex interaction networks that aid in the metabolism of nutrients, outcompete pathogenic bacteria, and modulate the development of the host immune system. To this end, terabytes of 16S rRNA and metagenomic sequences have been generated by the scientific community to survey microbial communities from dozens of body sites from hundreds of individuals. Wu’s work develops high-throughput methods to synthesize the meaningful ecological units of bacterial species from this massive data collection and uses novel statistical approaches such as the Maximal Information Coefficient to mine millions of sequence reads to detect significant interactions between bacterial species. The relative abundance of each bacterial species is also routinely integrated with other types of data (human health and disease status, cytokine levels, human SNPs, etc.) to identify possible associations between bacterial species and other host traits.
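As a rough illustration of screening species pairs for association, the sketch below scores pairs of relative-abundance profiles across samples. It deliberately swaps in Spearman rank correlation as a much simpler stand-in; it is not the information coefficient named above, which also captures non-monotonic relationships. The species names and abundance values are hypothetical.

```python
# Spearman rank correlation between relative-abundance profiles.
# A simple stand-in for association screening; all numbers are made up.

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical relative abundances of three species across five samples.
species_a = [0.10, 0.20, 0.35, 0.50, 0.70]
species_b = [0.60, 0.45, 0.30, 0.20, 0.05]  # falls as species_a rises
species_c = [0.02, 0.05, 0.04, 0.09, 0.20]  # mostly rises with species_a

rho_ab = spearman(species_a, species_b)
rho_ac = spearman(species_a, species_c)
```

A strongly negative score, as for the first pair, is the kind of signal that might suggest competition between two species, although correlation alone cannot establish an interaction.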


Our view of the chemical composition of the Universe will be transformed by next-generation radio astronomy observatories like the NSF-supported Atacama Large Millimeter/Sub-millimeter Array (ALMA) and Jansky Very Large Array (JVLA) that are operated by the National Radio Astronomy Observatory (NRAO) with headquarters on the grounds of the University of Virginia. These instruments collect and frequency resolve the radiation reaching Earth in the microwave and millimeter-to-THz frequency ranges where atmospheric opacity, principally due to water vapor absorption, is not a limiting factor.

The frequency range of radio astronomy is important for interstellar chemistry for two reasons. First, it is currently believed that the synthesis of molecules under the extreme conditions of the interstellar medium occurs on nanoscale particulates, called grains or dust. This dust shields the active chemical regions from damaging short-wavelength radiation by scattering and absorbing it. Therefore, the long-wavelength light observed by radio astronomy is needed to penetrate the dust in these star- and planet-forming regions. Second, this frequency region carries the rotational spectra of the gas-phase molecules formed in these chemically active regions, providing a fingerprint for their detection.

Overview of radio astronomy data acquisition, data integration, and analysis for molecular properties.

The image shows an overview of the acquisition, integration, and processing of data from next-generation radio astronomy observatories like ALMA and JVLA. The data from these observatories present new opportunities to study the structure and evolution of astronomical objects based on their chemical composition. This new interdisciplinary field of research requires integration of several types of data from chemistry and astronomy. Laboratory spectroscopy can be used to identify interstellar molecules as shown in the figure by a comparison of the recent science verification observation of Orion by ALMA (black line) with a laboratory measurement of the spectrum of ethyl cyanide (blue line). Once the molecules are identified by their spectral fingerprint, they can be imaged based on their column density. A goal of the IGERT project is to understand how to use the correlations of different molecular images to understand the chemical reaction processes at work. This analysis requires coupling the astronomical images with large scale chemical kinetics models. The IGERT team will also explore ways to use these chemical images to better characterize the astronomical object. This work may include combining the radio astronomy data with observations at other wavelengths.
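The idea of identifying a molecule by its spectral fingerprint can be sketched in a heavily simplified form: check which laboratory line frequencies have an observed line within a frequency tolerance. Real identification also weighs line intensities, line shapes, and source velocity, none of which is modeled here. The function name, the tolerance, and all frequencies below are hypothetical placeholders, not measured values for any molecule.

```python
# Simplified fingerprint matching: which laboratory lines appear in an
# observed line list? All frequencies are invented placeholders in MHz.

def match_lines(observed_mhz, laboratory_mhz, tolerance_mhz=0.5):
    """Return laboratory lines that have an observed line within tolerance."""
    matched = []
    for reference in laboratory_mhz:
        if any(abs(obs - reference) <= tolerance_mhz for obs in observed_mhz):
            matched.append(reference)
    return matched


observed = [100010.2, 100250.7, 100480.1, 100733.9]   # hypothetical survey lines
laboratory = [100010.0, 100480.3, 100900.0]           # hypothetical lab lines

matched = match_lines(observed, laboratory)
matched_fraction = len(matched) / len(laboratory)
```

A high matched fraction across many lines is what makes an identification convincing; a single coincident line is weak evidence in a crowded spectrum.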

An important advance of these new observatories is the ability to acquire broadband spectral data, containing the spectral signatures of the molecules in the astronomical object, with high spatial resolution. This new capability will change molecular astronomy in a fundamental way. In the past, the observatories generally only had enough bandwidth to monitor a single rotational transition of a single molecular species. Applications were thus limited to “molecular imaging” where the column density of a single, high-abundance species, such as NH3, is used to “trace” the distribution of matter in the interstellar medium. The introduction of broadband spectral coverage means that every observation will now be able to simultaneously detect multiple molecular species, revealing the rich chemical composition of the observed object and permitting the construction of a much richer “chemical image” that conveys information about the molecular composition and distribution. These observations bring unprecedented big data and multi-type data integration challenges to radio astronomy and astrochemistry. Single observation data sets can be on the order of 300 GB, and the data archives of radio astronomy are expected to grow at rates of about 1 TB/day beginning this year.
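To put the two quoted figures in proportion, the short calculation below converts the archive growth rate into an equivalent number of 300 GB observations per day and into yearly growth. It assumes decimal units (1 TB = 1000 GB) and a 365-day year; the variable names are illustrative.

```python
# Back-of-the-envelope scale check using the figures quoted in the text.
GB_PER_TB = 1000                 # assumption: decimal units
observation_gb = 300             # order-of-magnitude size of one data set
archive_growth_tb_per_day = 1    # approximate archive growth rate

# How many 300 GB data sets would account for 1 TB of growth per day?
observations_per_day = archive_growth_tb_per_day * GB_PER_TB / observation_gb

# Archive growth over one 365-day year at the same rate.
growth_tb_per_year = archive_growth_tb_per_day * 365
```

Under these assumptions the quoted rate corresponds to a little over three of the largest data sets per day, or a few hundred terabytes per year.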

Hawley, Herbst, Oberg, Johnson, and Pate are currently exploring two grand challenge problems of importance to this IGERT and its students: 1) How does chemistry emerge in the Universe? 2) Is there a link between chemical evolution and star and planet formation?

Pate, Herbst, and Oberg are developing the field of “mechanistic interstellar chemistry” to fill a current gap in our understanding of the emergence of chemistry in the Universe. One of the greatest achievements of 19th and 20th Century chemistry was the development of mechanistic organic chemistry. The systematic, quantitative study of the structure and reactivity of organic molecules has provided chemists with the blueprints to construct new, complex molecules with important physical, medicinal, and materials properties. The stage of chemical development under study by the IGERT group is the initial synthesis of molecules, especially organic molecules, from the elements and from small, abundant interstellar molecules. This area of chemistry provides a link between chemical physics and organic chemistry in the interstellar medium.

The unusual and inhospitable conditions of the interstellar medium require nature to be creative in the ways it constructs molecules. A combination of theory and experimental physical chemistry is needed to test the viability of a wide range of novel synthetic routes. In the gas phase, these include ion-molecule reactions, radiative association, and low-temperature tunneling reactions. In addition, cosmic ray and extreme ultraviolet processing of the interstellar ices coating silicate nanoparticles are thought to generate reactive species that can undergo barrierless reactions to build larger molecules. Although these mechanisms may be exotic, they have created nature’s largest reservoirs of chemically bonded matter in the space between the stars.

The search for a link between chemical evolution and star and planet formation (the second grand challenge problem under study by Hawley, Herbst, Oberg, and Johnson) forms the basis of an interdisciplinary field of science. In this new field, an understanding of mechanistic interstellar chemistry becomes essential for astronomers to understand the structure and evolution of astronomical objects. If a common chemical evolution associated with the star and planet formation process can be identified, then the chemical composition of the object, its molecular fingerprint, becomes a new way to identify the stage of astronomical evolution. More broadly, the next-generation radio astronomy interferometers provide a fundamentally new way to observe and understand chemically rich astronomical objects. Instead of using the electromagnetic spectrum emitted by the object, the dominant way of studying objects in the Universe, the “chemical image” can be constructed by substituting molecular species for colors of light.

Although the unusual reaction conditions encountered in the interstellar medium often produce species that are exotic by terrestrial standards, observations over the past 40 years have produced another result, which has even deeper implications for some of the most fundamental questions in science. Chemistry in the dense molecular clouds associated with star and planet formation yields molecules in essentially all of the families of organic chemistry including simple alkanes, alkenes, alkynes, arenes, alcohols, ethers, aldehydes, ketones, esters, carboxylic acids, amines, and amides. Chemically reactive organic species like radicals and protonated species (for acid catalyzed reactions, for example) are also known. The closer the interstellar conditions approach the requirements for star and planet formation, the closer the chemistry approaches the chemistry of Earth.

Research on these and other grand challenge problems of this group requires the integration of multiple data types including laboratory and observatory spectroscopy data sets, quantum chemical calculations of molecular properties, and kinetics information to obtain a quantitative description of the novel chemical environment of star and planet forming regions of the interstellar medium. At the same time, the amount of data being produced has increased dramatically making the traditional analysis approaches of chemistry and astronomy obsolete. Finally, even with an algorithmic approach to speed data analysis, it will be necessary for the field to recruit a larger pool of scientists to extract the chemical information in the radio astronomy archives. New ways to distribute and analyze these large data sets are needed to broaden the participation in this area of science.

Regulation of Heartbeat in Health and Disease

Our major clinical research effort centers on sepsis, a life-threatening infection of the bloodstream and a major cause of morbidity and mortality in premature newborn infants. Currently, the diagnosis is often not suspected until late in the course of the illness when the infant is very ill indeed. We have developed a new strategy for early diagnosis based on the finding that signs of illness are preceded by abnormal heart rate characteristics (HRC) of reduced variability and transient decelerations. Using a validated predictive algorithm for continuous HRC monitoring, we have recently diagnosed and treated sepsis in infants who never became ill. We are conducting a randomized clinical trial to test the hypothesis that HRC monitoring improves the outcomes of infants in the neonatal intensive care unit. The techniques involve clinical neonatology and mathematical biostatistics.
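The two signs named above, reduced variability and transient decelerations, can be made concrete with a toy calculation. This is only an illustrative sketch and is not the validated predictive algorithm used for HRC monitoring. The function name, the 30 beats-per-minute drop threshold, the use of the median as a baseline, and the heart rate values are all assumptions made for illustration.

```python
# Toy summary of a heart rate series: overall variability and the number of
# transient decelerations. Not the validated HRC algorithm; values are made up.
from statistics import median, pstdev


def hrc_summary(heart_rate_bpm, drop_bpm=30.0):
    """Return the standard deviation and a count of runs below baseline - drop."""
    baseline = median(heart_rate_bpm)
    threshold = baseline - drop_bpm
    decelerations = 0
    in_deceleration = False
    for rate in heart_rate_bpm:
        below = rate < threshold
        if below and not in_deceleration:
            # A new run of low readings starts here; count it once.
            decelerations += 1
        in_deceleration = below
    return {"variability": pstdev(heart_rate_bpm), "decelerations": decelerations}


# Hypothetical heart rate readings in beats per minute.
series = [150, 152, 149, 151, 150, 110, 105, 112,
          150, 151, 149, 150, 152, 108, 150, 151, 149]
summary = hrc_summary(series)
flat_summary = hrc_summary([150] * 10)
```

A clinically useful index combines such characteristics statistically and is calibrated against outcomes, which this sketch does not attempt.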

Our major basic science research effort centers on the FXYD family of single transmembrane proteins that modulate membrane ion transport processes. Of particular interest is FXYD 1, or phospholemman (PLM), a major substrate for diverse protein kinases in the heart that modulates the Na,K-ATPase and the Na-Ca exchanger and may form osmolyte-selective channels. We utilize reagents and models ranging from highly purified wild-type and mutant protein to the knock-out mouse, using most imaginable techniques of animal and cellular cardiac physiology, electrophysiology, structural biology, biochemistry, molecular biology, and cell imaging. Our goal is to understand the physiological role of PLM in the heart, where it is the major substrate for phosphorylation by PKA (activated by beta-adrenergic receptors) and PKC (activated by angiotensin-II receptors and alpha-adrenergic receptors). Since the major interventions to prolong life in congestive heart failure, which affects millions of Americans, are blockade of beta-adrenergic receptors and angiotensin-II receptors, the potential clinical importance of understanding PLM function is enormous.

See the page of Dr. J. Randall Moorman for more information.