Blogpost on: Clinical and Single-Cell Transcriptomics for Pneumonia Codeathon
Overview
In the vibrant community of data science, codeathons spark the right level of “friendly competition” to tackle pressing challenges, develop open solutions, and generate new ideas. Hosted in September 2023 at Northwestern University in Chicago, USA, the “Clinical and Single-Cell Transcriptomics for Pneumonia Codeathon” enabled its 23 participants, organized into 5 teams, to work with a large, as yet unpublished clinical and transcriptomic dataset. The scientific goal was to compare and develop computational approaches to better understand pneumonia, one of the leading causes of death in the USA. Datasets, code, and manuscripts are updated on GitHub.
The idea for a codeathon originated in a working group of the Systems Biology Centers of the National Institute of Allergy and Infectious Diseases (NIAID). There, Michael Yeaman, the PI of one of the Systems Biology Centers for Infectious Diseases, “System Epigenomics of Persistent Bloodstream Infection” (UCLA), suggested a shared activity around coding to encourage exchange between experts from different institutions. Thomas Stoeger, a member of Northwestern’s Successful Clinical Response In Pneumonia Therapy (SCRIPT) Systems Biology Center (PI: Richard G. Wunderink), led the implementation, with further support from the NIAID’s Bacterial and Viral Bioinformatics Resource Center and the Chan Zuckerberg Initiative.
The codeathon featured 23 participants from 12 institutions and 9 countries. Additionally, 11 people from SCRIPT were present to support the teams and help with data and science. Frequent questions, raised by multiple teams, were how the human host response differs between bacterial and viral pneumonia, and which molecular and clinical measurements are informative about the future recovery and successful treatment of patients. Using distinct analytical strategies, the five teams arrived at complementary findings.
Team Projects
Team 1, led by Ewa Szczurek, developed a large language model-based approach to identifying pathogen-specific immune responses to ICU pneumonia from single-cell and clinical data. To this end, the samples were divided into four groups based on the infection status of the patients: bacterial, viral, both, and no infection. The team first constructed transformer-based classifiers that distinguish each of these groups from the rest based on single-cell RNA sequencing, and extracted the genes that the model paid most attention to when making its decisions. Next, they identified pathways that were attended to significantly differently between the viral and bacterial groups, including interferon gamma and alpha signaling as well as TNF-alpha signaling via NF-kappaB. Furthermore, the team developed a novel large language model-based model, called EHRformer, and pretrained it on clinical data for all samples. This model was then fine-tuned to classify the same infection-based groups. This analysis revealed that specific clinical features, such as procalcitonin, were differentially attended when predicting viral versus bacterial infection.
Team 2, led by Slim Fourati, tested whether previously published transcriptomic signatures for distinguishing bacterial from viral infections also separate the two in the new, unpublished datasets. The team detected these transcriptomic signatures primarily in myeloid cells (viral: macrophages/T-cells and bacterial: monocytes). Genes in those signatures can also predict clinical outcomes, such as the need for ventilation for more than 14 days and mortality after hospitalization.
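A check of this kind can be sketched as a per-sample signature score. The snippet below is a minimal illustration and not the team's actual pipeline: it assumes a score defined as the mean z-score of the signature genes, and the gene names, expression values, and the function name `signature_scores` are hypothetical placeholders.

```python
from statistics import mean, pstdev


def signature_scores(expression, signature):
    """Return the mean z-score of the signature genes for each sample.

    expression: dict mapping gene name -> list of values, one per sample.
    signature:  list of gene names (hypothetical placeholders in this sketch).
    """
    genes = [g for g in signature if g in expression]
    if not genes:
        raise ValueError("no signature gene found in the expression table")
    n_samples = len(expression[genes[0]])
    scores = [0.0] * n_samples
    for gene in genes:
        values = expression[gene]
        mu, sigma = mean(values), pstdev(values)
        for i, value in enumerate(values):
            # Genes without variance contribute 0 instead of dividing by zero.
            z = (value - mu) / sigma if sigma > 0 else 0.0
            scores[i] += z / len(genes)
    return scores


# Toy data (invented): two "viral-like" samples, then two "bacterial-like" ones.
expression = {
    "GENE_A": [8.0, 9.0, 2.0, 1.0],
    "GENE_B": [7.0, 6.0, 1.0, 2.0],
    "GENE_C": [3.0, 3.0, 3.0, 3.0],
}
viral_signature = ["GENE_A", "GENE_B", "GENE_C"]
scores = signature_scores(expression, viral_signature)
```

If a published signature carries over to new data, samples from the matching infection group should receive higher scores than the others, as the first two toy samples do here.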
Team 3, led by Jackson Chin and Michael Yeaman, applied PARAFAC2, a tensor-based decomposition method, to reduce single-cell transcriptomic measurements to a set of principal components that capture immunological patterns across patients, cells, and genes. They discovered that PARAFAC2 components are strongly associated with infection outcome, predicting pneumonia mortality with an accuracy over 70%. Interpretation of the PARAFAC2 components yielded novel insights into the determinants of pneumonia mortality, identifying over-proliferative inflammatory responses, altered immune competence, and bacterial etiologies as critical drivers of mortality (Figures 1B-1E). Collectively, these findings surprised the participants of the codeathon by demonstrating the potency of tensor-based methods in analyzing single-cell transcriptomic measurements and by highlighting critical immunological determinants of pneumonia mortality.
Team 4, led by Yixiang Deng, applied graph neural networks to single-cell RNA sequencing data. They uncovered novel cellular interactions and gene expression patterns critical in pneumonia pathology. This finding has potential implications for identifying unique biomarkers and understanding the cellular heterogeneity in patient responses within critical care. Integrating electronic health records and cytokine profiles could further improve the existing framework and support personalized medicine.
Team 5, led by Meghan Hutch and Jenny Ding, predicted prolonged respiratory failure (PRF, defined for this task as intubation for ≥14 days) in patients requiring mechanical ventilation. The team applied supervised and unsupervised machine learning models to an ICU cohort with suspected pneumonia for PRF prediction. Group-based multivariate trajectory modeling (GBMT) on five ventilator parameters collected during the first five days of intubation identified four groups that represent unique phenotypes and are predictive of PRF. A Multivariate Time Series Transformer (MVTS) using time-series clinical data showed superior discrimination of PRF compared with XGBoost on baseline data. Their work underscored the critical need for risk models that can optimize clinical treatment strategies by alerting physicians sooner to the patients most at risk for PRF, and that can help with the management of ICU resources.
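The PRF definition and the notion of "discrimination" can be made concrete with a small sketch. It assumes discrimination is measured by the area under the ROC curve (AUROC), which the post does not state explicitly. The intubation durations and risk scores are invented, and `prf_label`, `auroc`, `model_a`, and `model_b` are hypothetical names that do not correspond to the team's models.

```python
def prf_label(intubation_days, threshold=14):
    """Return 1 if the patient was intubated for at least `threshold` days, else 0."""
    return 1 if intubation_days >= threshold else 0


def auroc(labels, scores):
    """AUROC as the chance that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: days of intubation for six patients.
intubation_days = [3, 21, 14, 7, 30, 10]
labels = [prf_label(d) for d in intubation_days]

# Invented risk scores from two hypothetical models.
model_a = [0.2, 0.9, 0.6, 0.3, 0.8, 0.4]
model_b = [0.5, 0.6, 0.4, 0.3, 0.9, 0.7]

auc_a = auroc(labels, model_a)
auc_b = auroc(labels, model_b)
```

In this toy example the first set of scores ranks every PRF patient above every non-PRF patient, so it discriminates better than the second.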
Preparations and Datasets
Finding good data is a challenge for data science and essential for a codeathon. The Clinical and Single-Cell Transcriptomics for Pneumonia Codeathon used clinical and single-cell transcriptomic data from Northwestern’s SCRIPT cohort of severely ill, intubated patients with suspected pneumonia. Cell populations used for transcriptomics were obtained through bronchoalveolar lavages, allowing direct sampling of patients’ lungs during their stay in the intensive care unit.
For 691 patients, the dataset described a total of 99 clinical and demographic properties, with 48 clinical measurements recorded daily over a total of 15,306 days. For single-cell transcriptomics, data originated from a total of 171 patients, with some patients having provided samples on multiple days, yielding a total of 266 different single-cell transcriptome samples. The dataset will be made public in summer 2024 (see here for availability). In essence, this dataset makes it possible to correlate and combine clinical and molecular data and to identify signatures that are informative about subsequent change, including recovery.
Getting this dataset ready for the codeathon required the input of many people. First, we wanted to ensure that the data respect patients’ privacy, rights, and consent to SCRIPT. Luke Rasmussen and Marjorie Kang de-identified the data, and Luke Rasmussen and Justin Starren developed a legal agreement for participants of the codeathon, including cybersecurity requirements. For added security, this agreement required participants to handle the data according to the standards that apply to data containing identifiable information. Second, the single-cell transcriptomic data was large compared to many other efforts, requiring additional quality control to ensure that participants could jump into the data without further cleaning. Sample and data processing was done by Alexander Misharin and his team, particularly Nickolay Markov, Samuel Fenske, Stanislav Bratchikov, and Karolina Senkov.
Additional preparations included the creation of a wiki documenting the information contained in the datasets, which involved additional help from Cathy Gao. Further, Scott Coughlin from Northwestern IT allocated each team a computer within Northwestern’s high-performance computing cluster. Datasets, documentation, and computing environment were tested by James “Jim” Davis and Marcus Nguyen from the Bacterial and Viral Bioinformatics Resource Center, another NIAID-supported initiative. This resulted in adjustments to the computing environment and the creation of subsampled data to promote quick exploration of the data and development of code.
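The post does not say how the subsampled data were created. As one possible approach, the sketch below draws a fixed number of cells uniformly at random from each sample with a fixed seed, so that the smaller dataset is reproducible. The function name `subsample_cells`, the sample identifiers, and the cell counts are hypothetical.

```python
import random


def subsample_cells(cells_by_sample, max_cells, seed=0):
    """Keep at most `max_cells` randomly chosen cells per sample.

    cells_by_sample: dict mapping sample id -> list of cell identifiers.
    A fixed seed makes the subsample reproducible across runs.
    """
    rng = random.Random(seed)
    subsampled = {}
    # Iterate in sorted order so results do not depend on dict insertion order.
    for sample_id in sorted(cells_by_sample):
        cells = cells_by_sample[sample_id]
        if len(cells) <= max_cells:
            subsampled[sample_id] = list(cells)
        else:
            subsampled[sample_id] = rng.sample(cells, max_cells)
    return subsampled


# Invented example: one large and one small sample.
cells_by_sample = {
    "sample_1": [f"cell_{i}" for i in range(1000)],
    "sample_2": [f"cell_{i}" for i in range(50)],
}
small = subsample_cells(cells_by_sample, max_cells=100, seed=42)
```

Sampling per sample rather than across the pooled dataset keeps every sample represented, which matters when samples differ widely in cell count.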
Outcomes
Following the codeathon, teams presented their findings in a symposium, which was also attended by further members of the Systems Biology Centers of the National Institute of Allergy and Infectious Diseases (NIAID). Since then, three teams have continued to work toward manuscripts based on their work. Further, the organizers will write a manuscript to accompany the publication of the full dataset, introducing the data and the framework behind the codeathon as a learning resource for everyone.
Acknowledgment
This codeathon was supported by an NIAID grant for Systems Biology for Infectious Diseases (U19AI135964), an NIAID contract for Bioinformatics Center for Infectious Diseases (75N93019C00076), and the Chan Zuckerberg Initiative.