Hi-C analysis
TBC
Compartments
TBC
call compartments
- input: a dense cis matrix
- output: eigvec_table, the contribution (projection coefficient) of each column (bin) to the first principal component (PC1)
- prerequisite: ref to PCA
- Why observed_over_expected is needed: the observed matrix alone cannot discriminate “biological interactions” from the distance-dependent decay in contact frequency that any polymer shows.
- calculate the observed_over_expected matrix: since the observed matrix is what you get from the valid pairs, the key step is to calculate the expected matrix.
The expected matrix follows the expected interaction frequency in the polymer model (reference: Eq. (1), Eq. (4) and Eq. (5)).
Put simply: the diagonals of the matrix, which correspond to contacts between pairs of loci at a fixed distance, are grouped into exponentially growing distance bins; the diagonals in each bin are then normalized by the bin's average value.
- hand-drawn illustration
    # divide each pixel on the diagonal at this offset by the mean pixel of its distance bin
    data[offset + j, j] /= mean_pixel
    return data
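The normalization described above can be sketched as follows. This is a minimal illustration, not the implementation the fragment comes from: the function name `observed_over_expected` and the choice of doubling bin edges (offsets 0, 1, 2-3, 4-7, ...) are assumptions made here to stand in for "exponentially growing bins".

```python
import numpy as np

def observed_over_expected(mat):
    """Divide every pixel of a dense cis matrix by the mean of its distance bin.

    Hypothetical helper: diagonals are grouped into bins whose width doubles
    (offsets [0,1), [1,2), [2,4), [4,8), ...), an assumed binning scheme.
    """
    mat = np.asarray(mat, dtype=float)
    n = mat.shape[0]
    out = mat.copy()

    # Build doubling bin edges that cover the offsets 0 .. n-1.
    edges = [0, 1]
    while edges[-1] < n:
        edges.append(min(edges[-1] * 2, n))

    for lo, hi in zip(edges[:-1], edges[1:]):
        # All pixels on the diagonals lo .. hi-1 share one expected value.
        pixels = np.concatenate([np.diagonal(mat, k) for k in range(lo, hi)])
        mean_pixel = np.nanmean(pixels)
        if not np.isfinite(mean_pixel) or mean_pixel == 0:
            continue  # leave empty bins untouched
        for offset in range(lo, hi):
            j = np.arange(n - offset)
            out[j, offset + j] /= mean_pixel      # upper triangle
            if offset > 0:
                out[offset + j, j] /= mean_pixel  # mirrored lower triangle
    return out

# Toy input: a pure distance decay with no compartment signal.
i, j = np.indices((8, 8))
decay = 1.0 / (np.abs(i - j) + 1)
oe_decay = observed_over_expected(decay)
```

After the division, each distance bin averages to 1, so values above 1 mark contacts that are more frequent than expected at that distance.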
- eigendecomp (if symmetric):
    mat -= 1.0
    _eigvals, _eigvecs = scipy.sparse.linalg.eigsh(mat, _n)
    return _eigvecs
Solves A * x[i] = w[i] * x[i], the standard eigenvalue problem for w[i] eigenvalues with corresponding eigenvectors x[i].
- since the variance of A projected onto a unit vector x[i] is x[i]^T * A * x[i] = w[i], maximizing it means finding the largest w[i] and its corresponding x[i]
- why the final PC1 values lie in [-1, 1]: x[i] has unit length, so each entry is a projection coefficient (a cosine); the maximal projection is along the collinear direction, with projection coefficient cos(0) = 1
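The eigendecomposition step can be tried on a toy matrix. The sketch below assumes a made-up checkerboard observed_over_expected matrix and a hypothetical wrapper `leading_eigenvector`; only `scipy.sparse.linalg.eigsh` comes from the notes. The sign of the returned vector is arbitrary, so assigning A and B labels needs an external track (for example gene density), which is not shown here.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def leading_eigenvector(oe, n_eigs=1):
    """Return the n_eigs largest-magnitude eigenpairs of (oe - 1)."""
    mat = np.array(oe, dtype=float)  # copy, so the caller's matrix is not modified
    mat -= 1.0                       # centre the O/E values around 0
    eigvals, eigvecs = eigsh(mat, k=n_eigs)  # default: largest magnitude
    order = np.argsort(-np.abs(eigvals))
    return eigvals[order], eigvecs[:, order]

# Toy checkerboard: bins 0-3 and bins 4-7 form two compartments.
s = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
oe = 1.0 + 0.5 * np.outer(s, s)

w, v = leading_eigenvector(oe)
pc1 = v[:, 0]

# Projected "variance" along pc1 equals the eigenvalue: x^T A x = w.
projected = pc1 @ (oe - 1.0) @ pc1
```

Because `eigsh` returns unit-length eigenvectors, every entry of `pc1` has absolute value at most 1, and bins from the same compartment share a sign.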
saddle plot and compartment strength
- input: a dense cis matrix, eigvec_table
- output: a matrix to be visualized, compartment strength
- sort the dense matrix by PC1 values
- aggregate the sorted matrix
- calculate the compartment strength:
a: $\mathrm{mean}(mat_{i,j})$, where both $i, j \geq \mathrm{percentile}(.75)[\mathrm{sorted\_index}]$
b: $\mathrm{mean}(mat_{i,j})$, where both $i, j \leq \mathrm{percentile}(.25)[\mathrm{sorted\_index}]$
c: $\mathrm{mean}(mat_{i,j})$, where one of $i, j \geq \mathrm{percentile}(.75)[\mathrm{sorted\_index}]$ and the other $\leq \mathrm{percentile}(.25)[\mathrm{sorted\_index}]$
$\mathrm{compartment\ strength} = a + b - c$
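The sorting and strength steps above can be sketched as below. The function name `compartment_strength` and the quantile argument `q` are hypothetical, the aggregation into quantile bins is skipped for brevity, and a is read as the top-quartile block, b as the bottom-quartile block and c as the two mixed blocks.

```python
import numpy as np

def compartment_strength(oe, pc1, q=0.25):
    """Sort a dense O/E matrix by PC1 and return (sorted matrix, a + b - c)."""
    oe = np.asarray(oe, dtype=float)
    pc1 = np.asarray(pc1, dtype=float)

    order = np.argsort(pc1)               # lowest PC1 first, highest last
    sorted_mat = oe[np.ix_(order, order)]

    n = len(pc1)
    k = max(1, int(round(q * n)))         # number of bins in each tail
    low, high = slice(0, k), slice(n - k, n)

    a = np.nanmean(sorted_mat[high, high])   # both bins in the top quartile
    b = np.nanmean(sorted_mat[low, low])     # both bins in the bottom quartile
    c = np.nanmean(np.concatenate([sorted_mat[low, high].ravel(),
                                   sorted_mat[high, low].ravel()]))  # mixed pairs
    return sorted_mat, a + b - c

# Toy checkerboard O/E matrix and its (known) PC1.
s = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
toy_oe = 1.0 + 0.5 * np.outer(s, s)
saddle, strength = compartment_strength(toy_oe, s / np.sqrt(8))
```

`saddle` is the matrix one would visualize: after sorting, same-compartment contacts collect in the two corners on the main diagonal and mixed contacts in the off-diagonal corners.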
TADs
call TADs
- insulation score:
- delta vector
For each bin (reference point), the average insulation difference is calculated between the reference point and all points up to 100 kb to its left. The same is repeated for all points up to 100 kb to its right. The delta value is then defined as the mean left difference minus the mean right difference.
- call boundary
local minima of the insulation score (where the delta vector crosses zero) with a boundary strength greater than a threshold
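A minimal sketch of the delta vector and boundary calling, given an insulation vector that has already been computed. Several things here are assumptions: the function names, the window expressed in bins rather than kb (100 kb at 50 kb resolution would be `window=2`), and in particular the boundary strength, which these notes do not define and which is taken here as the largest delta to the left minus the smallest delta to the right.

```python
import numpy as np

def delta_vector(ins, window):
    """Mean left difference minus mean right difference, per bin."""
    ins = np.asarray(ins, dtype=float)
    delta = np.full(len(ins), np.nan)
    for i in range(len(ins)):
        left = ins[max(0, i - window):i]
        right = ins[i + 1:i + 1 + window]
        if len(left) == 0 or len(right) == 0:
            continue  # no neighbours on one side: leave as NaN
        delta[i] = np.mean(left - ins[i]) - np.mean(right - ins[i])
    return delta

def call_boundaries(ins, window, min_strength):
    """Insulation minima whose (assumed) strength exceeds min_strength."""
    ins = np.asarray(ins, dtype=float)
    delta = delta_vector(ins, window)
    boundaries = []
    for i in range(1, len(ins) - 1):
        if not (ins[i] < ins[i - 1] and ins[i] <= ins[i + 1]):
            continue  # not a local minimum of the insulation score
        left = delta[max(0, i - window):i]
        right = delta[i + 1:i + 1 + window]
        if np.all(np.isnan(left)) or np.all(np.isnan(right)):
            continue
        # Assumed definition of boundary strength (not from the notes).
        strength = np.nanmax(left) - np.nanmin(right)
        if strength > min_strength:
            boundaries.append(i)
    return boundaries

# Toy insulation profile with a single valley at bin 3.
toy_ins = np.array([3, 2, 1, 0, 1, 2, 3], dtype=float)
toy_delta = delta_vector(toy_ins, window=2)
toy_boundaries = call_boundaries(toy_ins, window=2, min_strength=1.0)
```

On the toy profile the delta vector is positive while the insulation score falls, zero at the valley and negative while it rises, which is why the valley coincides with a zero crossing.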
ChIP-seq
peak calling
signal enrichment
chromHMM
Agglomerative clustering
The AgglomerativeClustering object performs a hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy:
Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
$s = \frac{b - a}{\max(a, b)}$
The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
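The clustering and its evaluation can be tried together on toy data. The sketch below assumes two made-up Gaussian blobs; `silhouette_of_sample` is a hypothetical helper that recomputes a, b and s by hand for one sample so the result can be compared with scikit-learn's own value.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated toy blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

# Bottom-up clustering with Ward linkage.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

def silhouette_of_sample(X, labels, idx):
    """Compute s = (b - a) / max(a, b) for one sample, by hand."""
    dists = np.linalg.norm(X - X[idx], axis=1)
    same = labels == labels[idx]
    same[idx] = False                       # exclude the sample itself
    a = dists[same].mean()                  # mean distance within its own cluster
    b = min(dists[labels == other].mean()   # mean distance to the nearest other cluster
            for other in np.unique(labels) if other != labels[idx])
    return (b - a) / max(a, b)

manual_s0 = silhouette_of_sample(X, labels, 0)
score = silhouette_score(X, labels)         # mean over all samples
```

With clusters this dense and well separated, the score should come out close to +1; rerunning with overlapping blobs pushes it toward 0.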
Sampling
TBC
Single cell tutorial
TBC