2020 Research Session Materials

Researchers will introduce the thought-provoking content provided here, followed by discussion groups focused on these data science projects and ideas.

Session Format: Each presenter will take a few minutes to highlight their work. A discussion of these data science methods follows, starting with an exchange between presenters and then moving to questions entered by the audience at menti.com (code 35 92 12 1).

Chris Miller

Speaker Affiliation: CU Denver

Title: Learning Microbial Community Metabolisms from Terabasepairs of DNA Sequencing Data

Abstract: Our laboratory develops and applies computational and high-throughput DNA sequencing methods to study microbial communities. Since the original human genome draft was completed in 2001, the cost of DNA sequencing has decreased by about 400,000-fold. The application of this new economy to microbial ecology has made the assembly of microbial genomes and partial genomes recovered directly from the environment routine. Inference of key microbial traits and ecosystem processes that drive biogeochemical cycles relies on the computational inference of biochemical protein function encoded in recovered genomes. Although we regularly use state-of-the-field similarity-search-based approaches to annotate protein function, these methods in many cases fail to annotate half or more of the predicted proteins in any given genome. We are starting a new project to learn protein-protein association networks across thousands of high-throughput DNA sequencing datasets, with the goal of developing new algorithms that infer protein function for unannotated proteins from their learned associations with annotated proteins. The successful implementation of these algorithms at scale will allow for new insights into the functioning of microbial communities in a wide range of environments.
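The idea of inferring function from learned associations can be illustrated with a minimal "guilt by association" sketch. This is not the laboratory's algorithm, which the abstract does not describe. It assumes, purely for illustration, that association is measured as Jaccard similarity of the datasets in which two proteins are detected, and that an unannotated protein takes the function with the highest summed edge weight among its annotated neighbours. All protein names, dataset IDs, functions, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of dataset IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_association_network(presence, min_score=0.5):
    """Return {(p, q): score} for protein pairs that co-occur across datasets.

    presence maps a protein name to the set of datasets where it was detected.
    min_score is an arbitrary illustrative cutoff, not a recommended value.
    """
    edges = {}
    for p, q in combinations(sorted(presence), 2):
        score = jaccard(presence[p], presence[q])
        if score >= min_score:
            edges[(p, q)] = score
    return edges


def infer_function(protein, edges, annotations):
    """Pick the function with the largest summed edge weight among
    annotated neighbours; return None if there are no such neighbours."""
    votes = {}
    for (p, q), score in edges.items():
        if protein not in (p, q):
            continue
        neighbour = q if p == protein else p
        function = annotations.get(neighbour)
        if function is not None:
            votes[function] = votes.get(function, 0.0) + score
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy data: protX has no annotation of its own.
presence = {
    "protA": {"d1", "d2", "d3"},
    "protB": {"d1", "d2", "d3", "d4"},
    "protC": {"d5", "d6"},
    "protX": {"d1", "d2", "d3"},
}
annotations = {
    "protA": "nitrate reductase",
    "protB": "nitrate reductase",
    "protC": "hydrogenase",
}

edges = build_association_network(presence)
print(infer_function("protX", edges, annotations))
```

In this toy case protX co-occurs with protA and protB but never with protC, so it inherits their shared annotation. The pairwise loop is quadratic in the number of proteins, so a real network over thousands of datasets would need a more scalable construction.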

Material: Chris Miller's Data Science Symposium Research Session

Masoud Asadi-Zeydabadi, Weldon Lodwick, and Massimo Buscema

Speaker Affiliation:

Title: Analysis of COVID-19 Modeling, Using Topological Weighted Centroid (TWC)

Abstract: In Topological Weighted Centroid (TWC), the spatial distribution of the confirmed cases of COVID-19 for a particular date is used to estimate the places where the dynamics of the disease outbreak start to accelerate. From those geographical locations, the dynamics are potentially most effective in causing the spread that has produced the collected data. There are different algorithms (TWC alpha, beta, gamma, theta, and iota) that estimate the geographical locations most likely to drive the spread of the disease in the past, present, and future. Identifying those locations could be helpful in controlling the disease outbreak. TWC algorithms are based on concepts from statistical thermodynamics and were developed at the Semeion Research Center, Rome, Italy.
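For orientation only, the sketch below computes an ordinary case-weighted centroid of reported locations. It is not any of the TWC alpha through iota algorithms, whose formulas the abstract does not give. It only shows the simpler baseline of summarizing a spatial distribution of cases by a single point. The coordinates and case counts are hypothetical, and planar (projected) coordinates are assumed.

```python
def weighted_centroid(points):
    """Return the (x, y) centroid of locations weighted by case count.

    points is an iterable of (x, y, cases) tuples in planar coordinates.
    This is a plain weighted mean, not a TWC algorithm.
    """
    points = list(points)
    total = sum(cases for _, _, cases in points)
    if total == 0:
        raise ValueError("total case count must be positive")
    x = sum(px * cases for px, _, cases in points) / total
    y = sum(py * cases for _, py, cases in points) / total
    return (x, y)


# Hypothetical confirmed-case counts at three locations on one date.
reports = [(0.0, 0.0, 10), (10.0, 0.0, 30), (0.0, 10.0, 60)]
print(weighted_centroid(reports))
```

Averaging raw latitude and longitude in this way is only reasonable over small regions; larger areas would call for a projection or a spherical mean.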

Material: Masoud Asadi-Zeydabadi, Weldon Lodwick, and Massimo Buscema's Data Science Symposium Research Session

Tiffany Callahan

Speaker Affiliation: CU Anschutz

Title: Systematic Integration of Biomolecular Mechanistic Knowledge and Medical Record Data for Deep Machine Learning

Abstract: My presentation will touch on the integration of biomolecular mechanistic and genomic knowledge to expand the inferencing capabilities of existing medical records data. Despite large-scale biobanking efforts, hospitals do not systematically integrate patient-level genomic data, nor do they have the infrastructure to enable meaningful linkages between these data and current sources of biomedical knowledge. By linking knowledge from generalized molecular experiments to clinical observations, it is possible to infer unobserved molecular mechanisms for each patient. To create biologically accurate representations of molecular mechanisms that are also machine-readable, I have developed software that constructs large-scale biomedical knowledge graphs using a wide variety of publicly available data. To connect mechanistic paths from the knowledge graph to specific patterns of clinical events observed within each patient's medical record, I developed and validated the first hospital-scale mappings between codes in standard clinical terminologies and biological concepts in Open Biomedical Ontologies. By enabling a meaningful integration between clinical codes and biomedical knowledge, we are able to perform more precise patient subphenotyping and generate mechanistic explanations of patient treatment trajectories, which cannot be replicated with observable medical record data alone. My talk will provide an overview of this work and will include links to companion open-source software.
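The two steps described above, mapping clinical codes to ontology concepts and then following paths through a knowledge graph, can be sketched as follows. This is an illustration under stated assumptions, not the speaker's software: every code, concept identifier, and edge below is a made-up placeholder, the graph is treated as undirected, and a breadth-first search stands in for whatever path-finding the real system uses.

```python
from collections import deque

# Hypothetical mapping from clinical codes to ontology concepts.
code_to_concept = {
    "DX:0001": "ONT:disease_a",
    "RX:0042": "ONT:drug_b",
}

# Hypothetical knowledge-graph edges between ontology concepts.
kg_edges = [
    ("ONT:disease_a", "ONT:gene_g"),
    ("ONT:gene_g", "ONT:protein_p"),
    ("ONT:protein_p", "ONT:drug_b"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def mechanistic_path(code_a, code_b, graph):
    """Map two clinical codes to concepts and connect them in the graph.

    Returns None if either code has no mapping or no path exists.
    """
    start = code_to_concept.get(code_a)
    goal = code_to_concept.get(code_b)
    if start is None or goal is None:
        return None
    return shortest_path(graph, start, goal)


graph = build_graph(kg_edges)
print(mechanistic_path("DX:0001", "RX:0042", graph))
```

Here a diagnosis code and a medication code from the same record are linked through a gene and a protein, the kind of intermediate mechanism that is not itself recorded in the medical record.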

Material: Tiffany Callahan's Data Science Symposium Research Session

Amy Roberts

Speaker Affiliation: CU Denver

Title: Jupyter Analysis Environment for Dark Matter Research

Abstract: Physics research requires significant software infrastructure. At best, poor infrastructure slows down and limits the science we can do. At worst, poor infrastructure excludes students who don't have access to experts. This talk focuses on some of the resources we've used to improve our collaboration's software infrastructure. Many of these resources are available to anyone doing research at a US institution, and I hope you'll find them useful!

Material: Amy Roberts' Data Science Symposium Research Session

Alper Yayla

Speaker Affiliation: The University of Tampa

Title: Explainable artificial intelligence in defending critical infrastructure: Artificial neural network for smart grid intrusion detection

Abstract: As a result of the increasing complexity of systems and sophistication of attacks, cybersecurity has become a central issue in recent years. One of the current approaches to overcome these challenges is the use of artificial intelligence (AI) based controls. These defensive AI methods implement various machine learning algorithms for controls such as intrusion detection and malware detection. The AI-based controls require less human intervention and are more effective than traditional signature-based and heuristics-based controls. However, the growing adoption of advanced deep learning algorithms is turning these AI-based controls into black box systems. At the same time, there is a growing initiative for the use of more explainable algorithms in various fields as a result of the increasing scrutiny around black box automated decision-making applications. Parallel to these trends, we postulate that the use of black box algorithms in cybersecurity controls would make proper risk management and informed decision-making challenging. In this paper, we make a call to action for the explainability of AI-based controls in research and practice. Using smart grid cybersecurity as our context, we illustrate our arguments by modeling an artificial neural network (ANN) using simulated attack data and outlining a risk assessment plan to discuss the transparency and interpretability of the proposed model. We posit that an integrated risk assessment plan would provide a platform for more explainable algorithms, which, in turn, would lead to more accountability, better compliance, increased awareness, emphasis on ethical considerations, and a balanced privacy and security focus.

Diane Fritz

Speaker Affiliation: Auraria Library, Geospatial Services Specialist

Title: Big Data in Geochemistry: Taking Advantage of a Disciplinary Repository

Abstract: In data science, we often think about analyzing immense datasets to discover new information from patterns, correlations, and other relationships that emerge. But “big data” can mean different things depending on the discipline. In geochemistry, where obtaining and processing samples can be exceedingly time-intensive, “big data” is orders of magnitude smaller than it typically is in other fields. In this study, we investigated published Ta and Th abundance data from ~2,000 whole‐rock samples of mafic to intermediate composition, Cenozoic volcanic rocks in southwestern North America to look for patterns elucidating the evolution of the deep continental lithosphere. These data now reside in a disciplinary repository called EarthChem.

Material: Diane Fritz's Data Science Symposium Research Session

Ryan Lusk

Speaker Affiliation: CU Anschutz

Title: Alternative Polyadenylation Transcriptome Analysis from RNA Sequencing and DNA Sequencing Information

Abstract: Aptardi accurately incorporates expressed polyadenylation sites into sample-specific transcriptomes using a multi-omics deep learning approach.

Material: Ryan Lusk's Data Science Symposium Research Session

Julia Wrobel

Speaker Affiliation: CU Anschutz

Title: Understanding Circadian Rhythms Using Accelerometer Data

Abstract: My statistical research area is functional data analysis, which models curves, functions, and trajectories. Often I apply these methodologies to wearable device data and neuroimaging studies.

Material: Julia Wrobel's Data Science Symposium Research Session

College of Liberal Arts and Sciences

CU Denver

North Classroom

1200 Larimer Street

5014

Denver, CO 80204

303-556-2557
