Giacomo Marino

A list of projects I've led or contributed to. All are open source and available on GitHub:

TypeScript, PL/pgSQL, PostGraphile, Rust, Python, Next.js, TailwindCSS, Docker

RummaGEO

Automatically generated signatures from GEO

The Gene Expression Omnibus (GEO) is a major open biomedical research repository for transcriptomics and other omics datasets. It currently contains millions of gene expression samples from tens of thousands of studies collected by many biomedical research laboratories from around the world. While users of the GEO repository can search the metadata describing studies and samples to locate relevant studies, there is currently no method or resource that facilitates global search of GEO at the data level. To address this shortcoming, we developed RummaGEO, a web server application that enables gene expression signature search against all human and mouse RNA-seq studies deposited into GEO. To enable such a search engine, we performed offline automatic identification of conditions from uniformly aligned GEO studies available from ARCHS4, and then computed differential expression signatures to extract gene sets from these signatures. In total, RummaGEO currently contains 178,975 human and 203,427 mouse gene sets from 30,576 GEO studies. Overall, RummaGEO provides an unprecedented resource for the biomedical research community, enabling hypothesis generation for many future studies.
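To give a feel for what a gene set search against such a library involves, here is a minimal sketch in Python. It is not RummaGEO's implementation: the description above does not name the statistic used, so the right-tailed hypergeometric test, the background size of 20,000 genes, the function names, and the toy library entries are all assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(overlap, query_size, set_size, background):
    """P(X >= overlap) when query_size genes are drawn from a background
    that contains set_size library genes (right-tailed hypergeometric test)."""
    upper = min(query_size, set_size)
    total = comb(background, query_size)
    favourable = sum(
        comb(set_size, k) * comb(background - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


def search_gene_sets(query, library, background=20000):
    """Rank library gene sets by the significance of their overlap with query.

    `background` is an assumed number of measurable genes, not a RummaGEO value.
    """
    query = set(query)
    results = []
    for name, genes in library.items():
        shared = query & genes
        if not shared:
            continue  # skip gene sets with no overlap at all
        p = hypergeom_tail(len(shared), len(query), len(genes), background)
        results.append((name, len(shared), p))
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy library; real entries would be GEO-derived gene sets.
toy_library = {
    "study_A up": {"STAT1", "IRF7", "ISG15", "MX1", "OAS1"},
    "study_B down": {"MYC", "CCND1", "ISG15"},
    "study_C up": {"ALB", "APOA1"},
}
hits = search_gene_sets(["STAT1", "IRF7", "ISG15", "MX1"], toy_library)
for name, n_shared, p_value in hits:
    print(f"{name}: overlap={n_shared}, p={p_value:.3g}")
```

A production search over hundreds of thousands of gene sets would need a faster overlap computation and multiple-testing correction; the sketch only shows the ranking idea.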

React, PL/pgSQL, Python, Next.js, MaterialUI, Docker

TargetRanger

Immunotherapy target discovery

TargetRanger is a web server application that identifies targets from user-submitted RNA-seq samples collected from the cells to be targeted. By comparing the submitted samples with processed RNA-seq and proteomics data from several atlases, TargetRanger identifies genes that are highly expressed in the target cells but lowly expressed across normal human cell types, tissues, and cell lines.
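The selection criterion described above (high in the target cells, low everywhere in normal tissue) can be sketched as follows. This is only an illustration under stated assumptions: the description does not specify the statistical test TargetRanger uses, so a simple ratio against the highest normal-tissue level stands in for it, and the function name, pseudocount, gene labels, and expression values are all hypothetical.

```python
from statistics import mean


def rank_targets(target_samples, normal_atlas, pseudocount=1.0):
    """Rank genes by target expression relative to the worst-case normal tissue.

    target_samples: {gene: [expression in each target sample]}
    normal_atlas:   {tissue: {gene: mean expression in that tissue}}
    The ratio score is an assumed stand-in, not TargetRanger's statistic.
    """
    ranked = []
    for gene, values in target_samples.items():
        target_level = mean(values)
        # A safe target should be low in every normal tissue, so compare
        # against the highest normal level rather than the average.
        normal_levels = [profile.get(gene, 0.0) for profile in normal_atlas.values()]
        worst_normal = max(normal_levels, default=0.0)
        score = (target_level + pseudocount) / (worst_normal + pseudocount)
        ranked.append((gene, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical toy data.
tumor = {
    "GENE_A": [120, 150, 130],
    "GENE_B": [200, 220, 210],
    "GENE_C": [5, 8, 6],
}
atlas = {
    "liver": {"GENE_A": 2, "GENE_B": 180, "GENE_C": 4},
    "brain": {"GENE_A": 3, "GENE_B": 90, "GENE_C": 7},
}
ranking = rank_targets(tumor, atlas)
for gene, score in ranking:
    print(f"{gene}: score={score:.2f}")
```

In this toy example GENE_B is the most highly expressed gene in the tumor samples, but GENE_A ranks first because it is nearly absent from both normal tissues.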

Python, Flask, Docker

D2H2

Diabetes Data and Hypothesis Hub (D2H2)

There is a rapid growth in the production of omics datasets collected by the diabetes research community. However, such published data are underutilized for knowledge discovery. To make bioinformatics tools and published omics datasets from the diabetes field more accessible to biomedical researchers, we developed the Diabetes Data and Hypothesis Hub (D2H2).

Python

MadHappy

Applying real-time video filters based on emotional state

We use a deep learning model to detect the user's emotional state in real time. Based on the detected emotional state, we apply a video filter to the user's face. We use TensorFlow for emotion detection and OpenCV to apply the video filter.

Python, TensorFlow

Plant Cell Segmentor

Deep learning framework for plant cell segmentation

A deep learning framework for 2D plant cell segmentation. The model is based on the paper by Wolny et al., “Accurate and versatile 3D segmentation of plant tissues at cellular resolution.” The model architecture is defined in segmentor.py, and the training, testing, and visualization functions are in assignment.py. Data is available at https://osf.io/uzq3w; the specific sets used in preprocessing and testing were the LateralRootPrimordia images in the test and train folders. With only a small set of these images, a 97% accuracy was achieved.

Python, Appyter, Jinja2

Tumor Gene Target Screener (Appyter)

Gene expression across human cell types and tissues

This Appyter is inspired by the work of Kristopher R. Bosse et al., who compared neuroblastomas with normal tissue in GTEx to identify a promising candidate immunotherapeutic target. The goal is to allow rapid screening of targets with the help of normal tissue data from GTEx and GEO data through ARCHS4, as well as single-cell data from Tabula Sapiens and the Human Cell Atlas. The Appyter takes tumor expression data and attempts to rank significantly differentially expressed genes when compared with either bulk RNA-seq data from GTEx or ARCHS4, or single-cell RNA-seq data from Tabula Sapiens or the Human Cell Atlas, across all tissues. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. GTEx Version 8 gene counts were processed to produce gene summary statistics. ARCHS4 provides access to gene counts from HiSeq 2000, HiSeq 2500 and NextSeq 500 platforms for human and mouse experiments from GEO and SRA. We processed ARCHS4 Version 11 to produce gene summary statistics. The Tabula Sapiens dataset was created by the Tabula Sapiens Consortium. We processed the Tabula Sapiens dataset to produce gene summary statistics. The Human Cell Atlas provides access to single-cell data contributed by the scientific community. We combined and processed 15 datasets from the Human Cell Atlas to produce gene summary statistics. Immunotherapeutic candidates must have limited expression in normal tissues to be considered safe targets, so proteomic visualizations of the highly expressed genes in normal tissues may be useful in assessing gene candidacy. Proteomics data were obtained from the Human Protein Atlas with IHC-based expression profiling, the Human Proteome Map with MS-based expression quantification, and a GTEx proteome project using TMT MS.

Python, Jinja2, Svelte, Appyter, Docker

Multiomics2Targets

Identification of Cell Surface Targets and Driver Kinases from Multiomics Data

The availability of data from the profiling of cancer patients with multiomics technologies is rapidly increasing. However, integrative analysis of such data for knowledge extraction and practical hypothesis generation for clinical applications is not trivial. Here we present Multiomics2Targets, a bioinformatics workflow that enables users to upload three data matrices collected from the same cohorts of cancer patients. After uploading transcriptomics, proteomics, and phosphoproteomics data matrices as well as accompanying metadata, Multiomics2Targets produces a report that resembles a research publication. The uploaded data matrices are processed, analyzed, and visualized using the tools Enrichr, KEA3, ChEA3, Expression2Kinases, and TargetRanger to produce ~80 figures and ~30 tables. Figure and table legends, as well as descriptions of the methods and results, are provided. The reports include an abstract, an introduction, methods, results, discussion, conclusions, and references sections. Multiomics2Targets reports can be exported as PDF or Jupyter Notebooks, and can be cited. Additionally, since the pipeline is implemented as a Jupyter Notebook, the source code used to perform the analysis and produce the report is embedded within the report and can be easily viewed, modified, and run locally. Multiomics2Targets can also be used to perform alternative analyses when only one or two omics datasets are uploaded.

Python, Flask, Docker

lncHUB2

Functional predictions of human long non-coding RNAs

A long non-coding RNA (lncRNA) is a transcript with more than 200 nucleotides that is not translated into protein. Based on gene-gene co-expression correlations created from ARCHS4's processed RNA-seq samples, we present 18,705 human and 11,274 mouse landing pages for long non-coding RNAs that include expression statistics across tissues and cell lines, predicted biological functions, pathway membership, subcellular localization, and predicted small molecules and CRISPR KO genes that may regulate their expression.

MATLAB

Hippocampal Replay

Simulating Hippocampal Replay with Reinforcement Learning

The hippocampus is a brain region that plays a key role in memory formation and recall. The hippocampus replays memories, which is thought to be important for memory consolidation. However, the mechanisms underlying hippocampal replay are not well understood. In this project, we use reinforcement learning to simulate hippocampal replay. We train an agent to navigate a maze and then replay the agent's trajectory.
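As a rough illustration of the idea, the sketch below trains a tabular Q-learning agent on a tiny corridor "maze" and, after each episode, replays the stored trajectory in reverse order so that the reward propagates back toward the start. Everything here is an assumption made for illustration: the corridor environment, the hyperparameters, the reverse-order replay rule, and the use of Python are not taken from the project itself.

```python
import random

N_STATES = 6              # corridor cells 0..5; the reward sits at the right end
GOAL = N_STATES - 1
ACTIONS = (-1, +1)        # action 0 steps left, action 1 steps right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # assumed hyperparameters


def step(state, action):
    """Move within the corridor; reaching the goal gives reward 1 and ends the episode."""
    nxt = min(max(state + action, 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def q_update(q, s, a, r, s2, done):
    """Standard one-step Q-learning update."""
    target = r if done else r + GAMMA * max(q[s2])
    q[s][a] += ALPHA * (target - q[s][a])


def run_episode(q, rng, max_steps=200):
    """Explore with epsilon-greedy actions and record the trajectory."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        if rng.random() < EPSILON or q[state][0] == q[state][1]:
            a = rng.randrange(2)                      # explore or break ties
        else:
            a = 0 if q[state][0] > q[state][1] else 1
        s2, r, done = step(state, ACTIONS[a])
        q_update(q, state, a, r, s2, done)
        trajectory.append((state, a, r, s2, done))
        state = s2
        if done:
            break
    return trajectory


def replay(q, trajectory):
    """Offline 'replay': re-apply the updates from the goal back to the start."""
    for s, a, r, s2, done in reversed(trajectory):
        q_update(q, s, a, r, s2, done)


def train(episodes=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        replay(q, run_episode(q, rng))
    return q


q_table = train()
# Greedy action in each non-goal cell (1 means "step right", toward the reward).
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Replaying in reverse order means a single rewarded episode is enough to assign value to every cell on the path, which is one reason replay is often discussed as a mechanism for fast credit assignment.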

React, PL/pgSQL, Python, Next.js, MaterialUI, Docker

GeneRanger

Gene and transcript expression across human tissue and cell atlases

GeneRanger is a web server application that provides access to processed data about the expression of human genes and proteins across human cell types, tissues, and cell lines from several atlases. It is a sister site to TargetRanger.

Python, Appyter, Jinja2

Gene Expression across Cells and Tissues (Appyter)

Gene expression across human cell types and tissues

The Gene Expression across Cells and Tissues Appyter takes a human gene symbol as input and produces box plots that display its expression across human cell types and tissues at the mRNA and protein levels. This Appyter uses normal tissue gene and protein expression from GTEx, ARCHS4, Tabula Sapiens, the Human Protein Atlas, the Human Proteome Map, the GTEx proteome project, and the CCLE. GTEx Version 8 and ARCHS4 Version 11 gene counts were processed to produce gene summary statistics for cell types and tissues. The Tabula Sapiens dataset was processed to produce expression values for all human genes in 469 cell types from 456,101 single cells collected from 14 donors. Proteomics data were obtained from the Human Protein Atlas with IHC-based expression profiling, the Human Proteome Map with MS-based expression quantification, and a GTEx proteome project using TMT MS.

TypeScript, PL/pgSQL, PostGraphile, Rust, Python, Next.js, TailwindCSS, Docker

LINCS L1000 Signature Search (L2S2)

L2S2 includes over 1.4 million chemical perturbation and over 280,000 CRISPR knockout signatures

As part of the Library of Integrated Network-Based Cellular Signatures (LINCS) NIH initiative, 248 human cell lines were profiled with the L1000 assay to measure the effect of 33,621 small molecules and 7,508 single-gene CRISPR knockouts. From this massive dataset, we computed 1.678 million sets of up- and down-regulated genes. These gene sets are served for search by the LINCS L1000 Signature Search (L2S2) web server application. With L2S2, users can identify small molecules and single-gene CRISPR KOs that produce gene expression profiles similar or opposite to their submitted single or up/down gene sets. L2S2 also includes a consensus search feature that ranks perturbations across all cellular contexts, time points, and concentrations. To demonstrate the utility of L2S2, we crossed the L2S2 gene sets with gene sets collected for the RummaGEO resource. The analysis identified clusters of differentially expressed genes that match drug classes, tissues, and diseases, pointing to many opportunities for drug repurposing and drug discovery. Overall, the L2S2 web server application can be used to further the development of personalized therapeutics while expanding our understanding of complex human diseases.
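The "similar or opposite" search with up/down gene sets can be sketched as below. This is a simplified illustration, not the L2S2 implementation: raw overlap counts stand in for whatever enrichment statistic the server actually uses, and the function names, perturbation labels, and gene symbols are hypothetical.

```python
def score_signature(query_up, query_down, sig_up, sig_down):
    """Count concordant (mimicking) and discordant (reversing) overlaps."""
    mimic = len(query_up & sig_up) + len(query_down & sig_down)
    reverse = len(query_up & sig_down) + len(query_down & sig_up)
    return mimic, reverse


def find_mimickers_and_reversers(query_up, query_down, signatures):
    """Split perturbation signatures into mimickers and reversers of the query.

    signatures: {perturbation name: (up gene set, down gene set)}
    Overlap counts are an assumed stand-in for a proper enrichment test.
    """
    q_up, q_down = set(query_up), set(query_down)
    rows = []
    for name, (sig_up, sig_down) in signatures.items():
        mimic, reverse = score_signature(q_up, q_down, sig_up, sig_down)
        rows.append((name, mimic, reverse))
    mimickers = sorted(
        (row for row in rows if row[1] > row[2]),
        key=lambda row: row[1] - row[2],
        reverse=True,
    )
    reversers = sorted(
        (row for row in rows if row[2] > row[1]),
        key=lambda row: row[2] - row[1],
        reverse=True,
    )
    return mimickers, reversers


# Hypothetical toy signatures with placeholder gene symbols.
toy_signatures = {
    "drug_X": ({"A", "B", "C"}, {"D", "E"}),
    "drug_Y": ({"D", "E"}, {"A", "B"}),
    "ko_Z": ({"F"}, {"G"}),
}
mimickers, reversers = find_mimickers_and_reversers(["A", "B"], ["D"], toy_signatures)
print("mimickers:", mimickers)
print("reversers:", reversers)
```

Here drug_X pushes the query genes in the same direction and is reported as a mimicker, drug_Y pushes them in the opposite direction and is reported as a reverser, and ko_Z shares no genes with the query and appears in neither list.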