# COTAN project

COTAN is a mathematical model and a statistical framework for studying single-cell RNA sequencing (scRNA-seq) data, an important category of biological datasets.

In recent years there has been exceptional growth
in *omics* techniques and approaches. One of the
most important is transcriptomics, the detection
of RNA molecules inside cells, also called RNA-seq, of
which there are several kinds. The oldest is bulk RNA-seq,
which can only count RNA molecules pooled from thousands of
cells. Its evolution works at the level of single cells
and is hence called scRNA-seq.

Single-cell RNA sequencing comes with several drawbacks. The first is its poor signal-to-noise ratio (SNR): it relies on PCR amplification, an exponential process better suited to the qualitative detection of a few genes than to quantitative genome-wide analysis. The second is its very low efficiency, which leads to dropout artefacts (random disappearance of genes that should be present).

While not much can be done about the latter, the recent introduction of unique molecular identifiers (UMIs) has almost solved the former, allowing several authors to address the low efficiency through suitable probability models. COTAN is our attempt in this direction, and it is described in [15].

Excerpt from the introduction: *COTAN is based on
three main elements: a robust estimation of the UMI
detection efficiency (UDE) of each cell, a flexible
model for the probability of zero UMI counts, and a
generalized contingency table framework for
zero/non-zero UMI counts for couples of genes. In fact,
COTAN estimates the co-expression of gene pairs by
comparing the number of cells that have zero UMI counts
for both genes, with the expected value under
independence hypothesis. This approach is based on the
fact that technical zeros are always independently
distributed, while biological zeros may be correlated
for genes associated to cell differentiation, providing
a way to recover information on the joint
distribution.*

Our model assumes that each gene has a real expression level Λ with unknown distribution, and that each cell has a coefficient ν describing its detection efficiency. The actual read count is modelled as a discrete random variable R which, conditionally on Λ and ν, has a Poisson distribution with mean νΛ.
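To make the model concrete, here is a minimal simulation sketch. It is not the COTAN implementation: the function names are hypothetical, and because the distribution of Λ is unknown, the gamma distribution used here is only an illustrative assumption.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(gene_means, nus, shape=2.0, seed=0):
    """Return a genes x cells matrix with R | Lambda, nu ~ Poisson(nu * Lambda).

    ASSUMPTION: Lambda is drawn per gene and cell from a gamma distribution
    with the given shape and with mean gene_means[g]; the real distribution
    of Lambda is unknown.
    """
    rng = random.Random(seed)
    counts = []
    for mean in gene_means:
        row = []
        for nu in nus:
            lam = rng.gammavariate(shape, mean / shape)  # mean = shape * scale
            row.append(sample_poisson(nu * lam, rng))
        counts.append(row)
    return counts


# Hypothetical toy values: three genes, four cells.
counts = simulate_counts(gene_means=[0.5, 2.0, 8.0], nus=[0.6, 1.0, 1.4, 1.0])
```

Cells with a small ν produce more zero counts for the same gene, which is how low detection efficiency shows up in the data.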

We use the method of moments to estimate the parameters, and then compute the estimated probability that a gene-cell pair yields a count of R=0 molecules. To this end, we introduce one additional parameter a for each gene, and fit a simple universal family of functions for this probability.
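The estimation step can be sketched as follows. Everything here is an assumption made for illustration: ν is taken proportional to each cell's total count (normalised to mean 1), the gene mean is the average count, and the one-parameter family for the zero probability is the negative-binomial form (1 + aνλ)^(-1/a), which reduces to the Poisson value exp(-νλ) as a tends to 0. The text above does not specify the family, and the function names are hypothetical.

```python
import math


def estimate_moments(counts):
    """Moment-style estimates: per-gene mean lambda and per-cell efficiency nu."""
    n_genes, n_cells = len(counts), len(counts[0])
    cell_totals = [sum(counts[g][c] for g in range(n_genes)) for c in range(n_cells)]
    mean_total = sum(cell_totals) / n_cells
    nus = [t / mean_total for t in cell_totals]  # normalised so that mean(nu) = 1
    lambdas = [sum(row) / n_cells for row in counts]
    return lambdas, nus


def prob_zero(mu, a):
    """ASSUMED family: P(R = 0) = (1 + a*mu)^(-1/a), Poisson limit when a <= 0."""
    if a <= 0:
        return math.exp(-mu)
    return math.exp(-math.log1p(a * mu) / a)  # numerically stable for small a


def fit_dispersion(row, lam, nus, a_max=100.0, iters=60):
    """Choose a so that the expected number of zeros matches the observed one."""
    observed = sum(1 for r in row if r == 0)

    def expected(a):
        return sum(prob_zero(nu * lam, a) for nu in nus)

    if expected(0.0) >= observed:
        return 0.0  # the Poisson model already predicts enough zeros
    lo, hi = 0.0, a_max
    for _ in range(iters):  # bisection: expected(a) increases with a
        mid = (lo + hi) / 2
        if expected(mid) < observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical toy data: two genes, four cells.
counts = [[0, 0, 5, 0], [1, 2, 1, 2]]
lambdas, nus = estimate_moments(counts)
a_values = [fit_dispersion(counts[g], lambdas[g], nus) for g in range(len(counts))]
```

Matching the observed number of zeros is one simple way to fit a per-gene parameter; other fitting criteria would be equally compatible with the description above.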

We can then build classical 2x2 contingency tables (zero/non-zero counts) for all pairs of genes, and use the estimated probability of R=0 to obtain the "expected" counts under the independence hypothesis.
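As an illustration of this step, the sketch below builds the observed and expected tables for one pair of genes. It assumes that the per-cell zero probabilities of each gene are already available as lists; the function names and dictionary keys are hypothetical.

```python
def observed_table(row1, row2):
    """Count cells by zero ('z') / non-zero ('n') status of two genes."""
    table = {"zz": 0, "zn": 0, "nz": 0, "nn": 0}
    for r1, r2 in zip(row1, row2):
        key = ("z" if r1 == 0 else "n") + ("z" if r2 == 0 else "n")
        table[key] += 1
    return table


def expected_table(p0_1, p0_2):
    """Expected counts under independence, summing per-cell probabilities.

    p0_1[c] and p0_2[c] are the estimated probabilities that gene 1 and
    gene 2 have a zero count in cell c.
    """
    table = {"zz": 0.0, "zn": 0.0, "nz": 0.0, "nn": 0.0}
    for p1, p2 in zip(p0_1, p0_2):
        table["zz"] += p1 * p2
        table["zn"] += p1 * (1 - p2)
        table["nz"] += (1 - p1) * p2
        table["nn"] += (1 - p1) * (1 - p2)
    return table


# Hypothetical toy inputs: counts of two genes in four cells.
obs = observed_table([0, 0, 3, 1], [0, 2, 0, 4])
exp = expected_table([0.5, 0.5, 0.5, 0.5], [0.2, 0.2, 0.2, 0.2])
```

Because the zero probabilities differ from cell to cell, the expected counts are sums over cells and not the product of the table margins used in the textbook contingency test.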

Inference is then performed as usual with a chi-squared distribution with 1 degree of freedom, a choice that was verified on simulated data.
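A minimal sketch of the test follows. It assumes the standard Pearson form of the statistic, which may differ in detail from the one COTAN uses, and takes observed and expected 2x2 tables as dictionaries with hypothetical keys. For 1 degree of freedom the chi-squared tail probability equals erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def chi2_statistic(observed, expected, floor=1e-12):
    """Pearson statistic: sum of (O - E)^2 / E over the four table cells.

    The floor only guards against division by zero in degenerate tables.
    """
    return sum(
        (observed[k] - expected[k]) ** 2 / max(expected[k], floor)
        for k in observed
    )


def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical tables for one gene pair over 100 cells.
observed = {"zz": 30, "zn": 10, "nz": 10, "nn": 50}
expected = {"zz": 16.0, "zn": 24.0, "nz": 24.0, "nn": 36.0}
stat = chi2_statistic(observed, expected)
p_value = chi2_pvalue_1df(stat)
```

A small p-value suggests that the zero/non-zero patterns of the two genes are not independent, that is, that the genes are co-expressed or mutually exclusive.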

One of the main outputs of the analysis is a gene-gene correlation matrix, which proved to be much more robust than Pearson's or Spearman's correlation on this kind of data.

[15] Galfrè, Morandin, Pietrosanto, Cremisi, Helmer-Citterich. *COTAN: scRNA-seq data analysis based on gene co-expression*. NAR Genomics and Bioinformatics, 2021.