Session Format: Each presenter will take a few minutes to highlight their work. A discussion of these data science methods will follow, starting with an exchange between presenters and then moving to questions entered by the audience at menti.com (code 35 92 12 1).
Speaker Affiliation: CU Denver
Title: Learning Microbial Community Metabolisms from Terabasepairs of DNA Sequencing Data
Abstract: Our laboratory develops and applies computational and high-throughput DNA sequencing methods to study microbial communities. Since the original human genome draft was completed in 2001, the cost of DNA sequencing has decreased by about 400,000-fold. Applying this new economy to microbial ecology has made it routine to assemble microbial genomes and partial genomes recovered directly from the environment. Inferring the key microbial traits and ecosystem processes that drive biogeochemical cycles relies on computationally predicting the biochemical protein function encoded in recovered genomes. Although we regularly use state-of-the-field similarity-search-based approaches to annotate protein function, these methods in many cases fail to annotate half or more of the predicted proteins in any given genome. We are starting a new project to learn protein-protein association networks across thousands of high-throughput DNA sequencing datasets, with the goal of developing new algorithms that infer the function of unannotated proteins from their learned associations with annotated proteins. The successful implementation of these algorithms at scale will allow for new insights into the functioning of microbial communities in a wide range of environments.
Material: Chris Miller's Data Science Symposium Research Session
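The idea in the abstract above of inferring function for unannotated proteins from their associations with annotated proteins can be illustrated with a simple "guilt by association" vote. This is only a minimal sketch under stated assumptions, not the laboratory's algorithm: the function name `infer_functions`, the `min_weight` threshold, the protein identifiers, and the association weights are all hypothetical and chosen for illustration.

```python
from collections import Counter


def infer_functions(associations, annotations, min_weight=0.5):
    """Assign each unannotated protein the function with the largest
    total association weight among its annotated neighbours.

    associations: iterable of (protein_a, protein_b, weight) tuples
                  (hypothetical learned association scores).
    annotations:  dict mapping protein id -> known function label.
    min_weight:   associations weaker than this are ignored
                  (an illustrative threshold, not a recommended value).
    """
    # Build an undirected adjacency list from sufficiently strong edges.
    neighbours = {}
    for a, b, weight in associations:
        if weight < min_weight:
            continue
        neighbours.setdefault(a, []).append((b, weight))
        neighbours.setdefault(b, []).append((a, weight))

    predictions = {}
    for protein, edges in neighbours.items():
        if protein in annotations:
            continue  # already annotated, nothing to infer
        votes = Counter()
        for other, weight in edges:
            label = annotations.get(other)
            if label is not None:
                votes[label] += weight  # weighted vote from annotated neighbour
        if votes:
            predictions[protein] = votes.most_common(1)[0][0]
    return predictions


# Hypothetical toy data: p1 and p5 are unannotated.
example_associations = [
    ("p1", "p2", 0.9),
    ("p1", "p3", 0.8),
    ("p1", "p4", 0.6),
    ("p5", "p4", 0.2),  # below min_weight, so ignored
]
example_annotations = {"p2": "nitrogenase", "p3": "nitrogenase", "p4": "hydrogenase"}
example_predictions = infer_functions(example_associations, example_annotations)
```

In this toy example, `p1` receives more total weight from nitrogenase-annotated neighbours than from the hydrogenase-annotated one, while `p5` gets no prediction because its only association falls below the threshold. A real method would also need to account for network noise and for proteins with multiple functions.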
Speaker Affiliation:
Title: Analysis of COVID-19 Modeling, Using Topological Weighted Centroid (TWC)
Abstract: In Topological Weighted Centroid (TWC), the spatial distribution of the confirmed cases of COVID-19 for a particular date is used to estimate the places where the dynamics of the disease outbreak start to accelerate. These are the geographical locations from which the dynamics are potentially most effective in causing the spread that produced the collected data. Several algorithms, TWC alpha, beta, gamma, theta, and iota, estimate the most effective geographical locations that could potentially cause the disease to spread in the past, present, and future. Identifying those locations could be helpful in controlling the disease outbreak. TWC algorithms are based on concepts from statistical thermodynamics and were developed at the Semeion Research Center, Rome, Italy.
Speaker Affiliation: CU Anschutz
Title: Systematic Integration of Biomolecular Mechanistic Knowledge and Medical Record Data for Deep Machine Learning
Abstract: My presentation will touch on the integration of biomolecular mechanistic and genomic knowledge to expand the inferencing capabilities of existing medical records data. Despite large-scale biobanking efforts, hospitals do not systematically integrate patient-level genomic data, nor do they have the infrastructure to enable meaningful linkages between these data and current sources of biomedical knowledge. By linking knowledge from generalized molecular experiments to clinical observations, it is possible to infer unobserved molecular mechanisms for each patient. To create biologically accurate representations of molecular mechanisms that are also machine-readable, I have developed software that constructs large-scale biomedical knowledge graphs using a wide variety of publicly available data. To connect mechanistic paths from the knowledge graph to specific patterns of clinical events observed within each patient's medical record, I developed and validated the first hospital-scale mappings between codes in standard clinical terminologies and biological concepts in Open Biomedical Ontologies. By enabling a meaningful integration between clinical codes and biomedical knowledge, we are able to perform more precise patient subphenotyping and generate mechanistic explanations of patient treatment trajectories, which cannot be replicated with observable medical record data alone. My talk will provide an overview of this work and will include links to companion open-source software.
Material: Tiffany Callahan's Data Science Symposium Research Session
Speaker Affiliation: CU Denver
Title: Jupyter Analysis Environment for Dark Matter Research
Abstract: Physics research requires significant software infrastructure. At best, poor infrastructure slows down and limits the science we can do. At worst, poor infrastructure excludes students who don't have access to experts. This talk focuses on some of the resources we've used to improve our collaboration's software infrastructure. Many of these resources are available to anyone doing research at a US institution, and I hope you'll find them useful!
Material: Amy Roberts' Data Science Symposium Research Session
Speaker Affiliation: The University of Tampa
Title: Explainable artificial intelligence in defending critical infrastructure: Artificial neural network for smart grid intrusion detection
Abstract: As a result of the increasing complexity of systems and sophistication of attacks, cybersecurity has become a central issue in recent years. One of the current approaches to overcome these challenges is the use of artificial intelligence (AI) based controls. These defensive AI methods implement various machine learning algorithms for various controls, such as intrusion detection and malware detection. The AI-based controls require less human intervention and are more effective than traditional signature-based and heuristics-based controls. However, the growing adoption of advanced deep learning algorithms is turning these AI-based controls into black box systems. Conversely, there is a growing initiative for the use of more explainable algorithms in various fields as a result of the increasing scrutiny around black box automated decision-making applications. Parallel to these trends, we postulate that the use of black box algorithms in cybersecurity controls would make proper risk management and informed decision-making challenging. In this paper, we make a call to action for the explainability of AI-based controls in research and practice. Using smart grid cybersecurity as our context, we illustrate our arguments by modeling an artificial neural network (ANN) using simulated attack data and outlining a risk assessment plan to discuss the transparency and interpretability of the proposed model. We posit that an integrated risk assessment plan would provide a platform for more explainable algorithms, which, in turn, would lead to more accountability, better compliance, increased awareness, an emphasis on ethical considerations, and a balanced focus on privacy and security.
Speaker Affiliation: Auraria Library, Geospatial Services Specialist
Title: Big Data in Geochemistry: Taking Advantage of a Disciplinary Repository
Abstract: In data science, we often think about analyzing immense datasets to discover new information from patterns, correlations, and other relationships that emerge. But “big data” can mean different things depending on the discipline. In geochemistry, where obtaining and processing samples can be exceedingly time-intensive, “big data” is orders of magnitude smaller than it typically is in other fields. In this study, we investigated published Ta and Th abundance data from ~2,000 whole-rock samples of mafic to intermediate composition, Cenozoic volcanic rocks in southwestern North America to look for patterns elucidating the evolution of the deep continental lithosphere. These data now reside in a disciplinary repository called EarthChem.
Material: Diane Fritz's Data Science Symposium Research Session
Speaker Affiliation: CU Anschutz
Title: Alternative Polyadenylation Transcriptome Analysis from RNA Sequencing and DNA Sequencing Information
Abstract: Aptardi accurately incorporates expressed polyadenylation sites into sample-specific transcriptomes using a multi-omics deep learning approach.
Material: Ryan Lusk's Data Science Symposium Research Session
Speaker Affiliation: CU Anschutz
Title: Understanding Circadian Rhythms Using Accelerometer Data
Abstract: My statistical research area is functional data analysis, which models curves, functions, and trajectories. Often I apply these methodologies to wearable device data and neuroimaging studies.
Material: Julia Wrobel's Data Science Symposium Research Session