2021-03-31

K-means clustering

  • k-means clustering is a method of cluster analysis which aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean.
  • It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.

How does K-Means work?

  • We would like to partition the data set into \(K\) clusters \(C_1, ..., C_K\)
  • Each observation belongs to at least one of the \(K\) clusters
  • The clusters are non-overlapping, i.e. no observation belongs to more than one cluster
  • The objective is to have a minimal “within-cluster-variation”, i.e. the elements within a cluster should be as similar as possible
  • One way of achieving this is to minimize the sum of all the pair-wise squared Euclidean distances between the observations in each cluster.

\[minimize_{C_1, ..., C_K}\{\sum_{k=1}^K\frac{1}{\lvert C_k \rvert}\sum_{i,i' \in C_k}\sum_{j=1}^p(x_{ij}-x_{i',j})^2\}\]

K-means clustering algorithm

  • Initialize: choose \(k\) points as cluster means

  • Repeat until convergence:
    • Assignment: place each point \(X_i\) in the cluster with the closest mean.
    • Update: recalculate the mean for each cluster
  • K-means always converges.
    • The assignment and update steps always either reduce the objective function or leave it unchanged.
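
As a minimal sketch (not part of the original slides), base R's `kmeans()` implements exactly this loop; the toy two-cluster data below are an assumption for illustration:

```r
set.seed(1)                                # results depend on the random starting centers
X <- matrix(rnorm(100 * 2), ncol = 2)      # toy data: 100 observations, 2 features
X[1:50, ] <- X[1:50, ] + 3                 # shift half the points to create two clusters

km <- kmeans(X, centers = 2, nstart = 25)  # nstart = 25 restarts guards against poor local optima
table(km$cluster)                          # cluster sizes
km$centers                                 # final cluster means
km$tot.withinss                            # total within-cluster sum of squares (the objective)
```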

K-means clustering algorithm

K-means clustering

  • Advantage: gives sharp partitions of the data
  • Disadvantage: need to specify the number of clusters (K).
  • Goal: find a set of \(k\) clusters that minimizes the distances of each point in the cluster to the cluster mean:

\[centroid_j = \hat{Y_j} = \frac{1}{N_{Y_j}}\sum_{i \in Y_j}{X_i}\]

\[argmin_C \sum_{i=1}^k \sum_{j \in C(i)} {|X_j - \hat{Y_i}|^2}\]

K-means steps

  • Simplified example
  • Expression for two genes for 14 samples
  • Some structure can be seen

K-means steps

  • Choose \(K\) centroids
  • These are starting values that the user picks.
  • There are also some data-driven ways to choose them

K-means steps

  • Find the closest centroid for each point
  • This is where distance is used
  • This is the “first partition” into \(K\) clusters

K-means steps

  • Take the middle of each cluster
  • Re-compute each centroid as the mean (middle) of the points assigned to it
  • Use the new centroids to re-calculate distances

K-means steps

  • Expression for two genes for 14 samples

PAM (K-medoids)

  • Centroid - The average of the samples within a cluster
  • Medoid - The “representative object” within a cluster
  • Initializing requires choosing medoids at random.
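
A brief sketch of PAM with the `cluster` package (the toy data are a placeholder):

```r
library(cluster)                       # provides pam()

set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)  # toy data: 100 observations, 2 features
X[1:50, ] <- X[1:50, ] + 3

pm <- pam(X, k = 2)                    # partitioning around medoids
pm$medoids                             # medoids are actual observations, unlike k-means centroids
table(pm$clustering)                   # cluster sizes
```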

K-means limitations

  • Final results depend on starting values
  • How do we choose \(K\)? There are methods, but not much theory saying which is best.
  • Where are the pretty pictures?

Alternatives

K-means

  • Initialize: choose \(k\) points as cluster means
  • Repeat until convergence:
    • Assignment: place each point \(X_i\) in the cluster with the closest mean.
    • Update: recalculate the mean for each cluster

Fuzzy k-means

  • Initialize: choose \(k\) points as cluster means
  • Repeat until convergence:
    • Assignment: calculate probability of each point belonging to each cluster.
    • Update: recalculate the mean for each cluster using these probabilities
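
A hedged sketch of fuzzy (soft) k-means using `cmeans()` from the `e1071` package; the fuzzifier `m = 2` and the toy data are assumptions:

```r
library(e1071)                          # provides cmeans() for fuzzy c-means clustering

set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)   # toy data: 100 observations, 2 features
X[1:50, ] <- X[1:50, ] + 3

fcm <- cmeans(X, centers = 2, m = 2)    # m > 1 controls how "fuzzy" the memberships are
head(fcm$membership)                    # probability-like membership of each point in each cluster
fcm$centers                             # cluster means weighted by the memberships
```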

Alternatives

Limits of K-means

K-means uses Euclidean distance

\[centroid_j = \hat{Y_j} = \frac{1}{N_{Y_j}}\sum_{i \in Y_j}{X_i}\]

\[argmin_C \sum_{i=1}^k \sum_{j \in C(i)} {|X_j - \hat{Y_i}|^2}\]

  • Gives most weight to largest differences
  • Can’t be used if data are qualitative
  • Centroid usually does not represent any datum

Self-organizing (Kohonen) maps

  • A self-organizing map (SOM) is a learning method which produces a low-dimensional (e.g. \(2D\)) representation of high-dimensional (\(nD\)) data through the use of self-organizing neural networks
  • E.g. an apple differs from a banana in more than two ways, but the two can be differentiated based on their size and color only.
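
A minimal sketch with the `kohonen` package (the grid size and toy data are arbitrary choices):

```r
library(kohonen)                        # self-organizing maps

set.seed(1)
X <- scale(matrix(rnorm(200 * 10), ncol = 10))  # toy high-dimensional data: 200 points, 10 features

sm <- som(X, grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"))
plot(sm, type = "mapping")              # where each observation lands on the 2D grid
plot(sm, type = "codes")                # codebook vectors of the map units
```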

Dimensionality reduction methods

Big picture

In general, most dimensionality reduction methods can be placed into one of two categories: feature-mapping approaches versus distance-preserving approaches.

Feature mapping approaches directly find a mapping function from the original, high-dimensional space to a low-dimensional space, subject to some optimization goal. A commonly used feature mapping technique is principal component analysis (PCA), which aims to find a linear transformation of a given set of features while preserving, as much as possible, the variance of the original data.

Distance preserving approaches project the data points to a low-dimensional space in which the original pairwise distances (or similarities) between data objects are preserved. Multi-dimensional scaling (MDS, Kruskal and Wish, 1977) is a canonical example of a distance-preserving approach.

Dimensionality reduction overview

  • Dimensionality Reduction (DR) essentially aims to find low dimensional representations of data while preserving their key properties
  • DR methods can be modeled after physical models with attracting and repelling forces (Force Directed Methods), projections onto low dimensional planes (PCA, ICA), divergence of statistical distributions (SNE family), the reconstruction of local spaces or points by their neighbors (LLE)

Guido Kraemer, Markus Reichstein, and Miguel D. Mahecha, “DimRed and CoRanking Unifying Dimensionality Reduction in R,” The R Journal, 2018, https://journal.r-project.org/archive/2018/RJ-2018-039/index.html.

Dimensionality reduction overview

Projection methods (PCA)

  • Linearly decompose the dataset into components that have a desired property.
  • There are largely two kinds of projection methods: principal component analysis (PCA) and independent component analysis (ICA).

  • PCA produces a low-dimensional representation of a dataset.
  • Each successive principal component is selected to be orthonormal to the previous ones, and to capture the maximum information that is not already present in the previous components.
  • Components are linear combinations of the original data
  • PCA also serves as a tool for data visualization

Why dimensionality reduction

  • Start with many measurements (high dimensional).
  • Want to reduce to few features (lower-dimensional space).
  • One way is to extract features based on capturing groups of variance.
  • Another could be to preferentially select some of the current features most representative of the data.

Intuition behind dimensionality reduction

Intuition behind dimensionality reduction

PCA: quick theory

Principal Components Analysis

  • Performs a rotation of the data that maximizes the variance in the new axes
  • Projects high dimensional data into a low dimensional sub-space (visualized in 2-3 dims)
  • Often captures much of the total data variation in a few dimensions (< 5)
  • Exact solutions require a fully determined system (matrix with full rank), i.e. a “square” matrix with independent rows

Principal Components Analysis: details

  • The first principal component of a set of features \(X_1, X_2,...,X_p\) is the normalized linear combination of the features

\[Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + ... + \phi_{p1}X_p\]

that has the largest variance. Note “normalized” - \(\sum_{j=1}^p\phi_{j1}^2=1\)

  • The elements \(\phi_{11}, \phi_{21},... \phi_{p1}\) are the loadings of the first principal component. Together, they make up the principal component loading vector \(\phi_1 = (\phi_{11}, \phi_{21},... \phi_{p1})^T\)
  • The loadings are constrained so that their sum of squares equals one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

Computation of Principal Components

  • Input: a \(n \times p\) data set \(X\). Since we are only interested in variance, we assume that each of the variables in \(X\) has been centered to have mean zero (that is, the column means of \(X\) are zero).
  • We then look for the linear combination of the sample feature values of the form

\[z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip}\]

for \(i=1,...,n\) that has the largest sample variance, under the constraint that \(\sum_{j=1}^p\phi_{j1}^2=1\)

  • Since each of the \(x_{ij}\) has mean zero, so does \(z_{i1}\).
  • Hence the sample variance of the \(z_{i1}\) can be written as \(\frac{1}{n}\sum_{i=1}^n z_{i1}^2\)

Computation of Principal Components

  • Plugging this into the sample variance, the first principal component loading vector solves the optimization problem

\[maximize_{\phi_{11},...\phi_{p1}}\frac{1}{n}\sum_{i=1}^n(\sum_{j=1}^p\phi_{j1}x_{ij})^2\]

subject to \(\sum_{j=1}^p\phi_{j1}^2=1\)

  • The problem can be solved via a singular value decomposition of the matrix \(X\)
  • \(Z_1\) is the first principal component with values \(z_{11}, ... z_{n1}\)
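
In R, `prcomp()` computes the principal components via this SVD route; a minimal sketch on toy data:

```r
set.seed(1)
X <- matrix(rnorm(50 * 5), ncol = 5)            # toy data: n = 50 observations, p = 5 features

pc <- prcomp(X, center = TRUE, scale. = FALSE)  # centering is required for the derivation above
pc$rotation[, 1]                                # loading vector phi_1
head(pc$x[, 1])                                 # scores z_11, ..., z_n1 of the first component
sum(pc$rotation[, 1]^2)                         # the normalization constraint: equals 1
```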

Geometry of PCA

  • The loading vector \(\phi_1\) with elements \(\phi_{11}, \phi_{21},... \phi_{p1}\) defines a direction in feature space along which the data vary the most

  • If we project the \(n\) data points \(x_1,..., x_n\) onto this direction, the projected values (the new coordinates) are the principal component scores \(z_{11}, ... z_{n1}\) themselves

Further principal components

  • The second principal component is the linear combination of \(X_1, ...X_p\) that has maximal variance among all linear combinations that are uncorrelated with \(Z_1\)
  • The second principal component scores \(z_{12}, ... z_{n2}\) take the form

\[z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + ... + \phi_{p2}x_{ip}\]

where \(\phi_2\) is the second principal component loading vector, with elements \(\phi_{12}, \phi_{22},... \phi_{p2}\)

Further principal components

  • Constraining \(Z_2\) to be uncorrelated with \(Z_1\) is equivalent to constraining the direction \(\phi_2\) to be orthogonal to the direction \(\phi_1\). And so on for the other components

  • The principal component directions \(\phi_1, \phi_2, \phi_3,...\) are the ordered sequence of right singular vectors of the matrix \(X\)
  • The variances of the components are \(\frac{1}{n}\) times the squares of the singular values
  • There are at most \(min(n-1, p)\) principal components

Singular Value Decomposition

PCA for gene expression

  • Given a gene-by-sample matrix \(X\) we decompose (centered and scaled) \(X\) as \(USV^T\)
  • We don’t usually care about the total expression level or the dynamic range, which may depend on technical factors
  • \(U\), \(V\) are orthonormal
  • The diagonal elements of \(S\) are the singular values; their squares are proportional to the variance explained by each component

PCA for gene expression

  • Columns of \(V\) are
    • Principal components
    • Eigengenes/metagenes that span the space of the gene transcriptional responses
  • Columns of \(U\) are
    • The “loadings”, or the correlations between the columns and the components
    • Eigenarrays/metaarrays that span the space of the array (sample) expression profiles
  • Truncating \(U\), \(S\), \(V\) to the first \(k\) dimensions gives the best rank-\(k\) approximation of \(X\)
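
A hedged sketch of this decomposition with base R's `svd()` (the toy gene-by-sample matrix is a placeholder):

```r
set.seed(1)
X <- matrix(rnorm(1000 * 20), nrow = 1000)  # toy matrix: 1000 genes x 20 samples

Xc <- t(scale(t(X)))                        # center and scale each gene across samples
sv <- svd(Xc)                               # Xc = U %*% diag(d) %*% t(V)

var_explained <- sv$d^2 / sum(sv$d^2)       # proportion of variance per component
head(var_explained)

k  <- 3
Xk <- sv$u[, 1:k] %*% diag(sv$d[1:k]) %*% t(sv$v[, 1:k])  # best rank-k approximation of Xc
```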

Principal Components Analysis

Example: Leukemia data sets by Golub et al.: Classification of ALL and AML

PCA applied to cell cycle data

How many components?

  • Kaiser’s criterion: eigenvalue \(\ge\) 1

  • Cattell’s scree test: The number of components to retain corresponds to the point where the graph of eigenvalues of successive components vs. the component number begins to level off
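
Both criteria can be read off the eigenvalues returned by `prcomp()`; a brief sketch on toy data:

```r
set.seed(1)
X <- matrix(rnorm(100 * 10), ncol = 10)     # toy data: 100 observations, 10 features

pc  <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- pc$sdev^2                            # eigenvalues (of the correlation matrix, since scaled)
sum(eig >= 1)                               # Kaiser's criterion: components with eigenvalue >= 1
screeplot(pc, type = "lines")               # Cattell's scree test: look for the "elbow"
```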

ICA - Independent Component Analysis

  • PCA assumes multivariate normally distributed data - gene expression data are super-Gaussian
  • ICA models observations as a linear combination of latent feature variables, or components, which are chosen to be as statistically independent as possible. Think of a number of sound sources mixed together; ICA tries to “un-mix” them
  • ICA is a linear rotation of the data just as PCA but instead of recovering the maximum variance, it recovers statistically independent components
  • For microarray data, observations consist of microarray gene expression measurements, and independent components are interpreted to be transcriptional modules that often correspond to specific biological processes

http://www.sciencedirect.com/science/article/pii/S1532046410001000

ICA - Independent Component Analysis

  • \(X\) - an \(m \times n\) matrix of \(n\) genes and \(m\) experiments
  • ICA models this expression matrix as a linear combination of independent biological processes by decomposing \(X\) as: \[X = AS\]
  • \(S\) is a \(k \times n\) source matrix
  • \(A\) is a \(m \times k\) mixing matrix
  • \(k\) is a user supplied parameter \(\leq min(m,n)\)

Same preprocessing as for PCA - filter, center, scale
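
A hedged sketch with the `fastICA` package cited below; the orientation of the toy matrix (samples in rows, genes in columns) and \(k = 3\) are assumptions, and the package's `S`/`A` naming may be transposed relative to the slide notation:

```r
library(fastICA)

set.seed(1)
X  <- matrix(rnorm(20 * 500), nrow = 20)     # toy matrix: 20 samples (rows) x 500 genes (columns)
Xc <- scale(X, center = TRUE, scale = TRUE)  # same preprocessing as for PCA: center and scale

ic <- fastICA(Xc, n.comp = 3)                # extract k = 3 independent components
dim(ic$S)                                    # estimated source matrix
dim(ic$A)                                    # estimated mixing matrix
```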

ICA - Independent Component Analysis

  • \(S\) is a \(k \times n\) source matrix

  • The components, or rows of \(S\), are independent in the sense that the gene weights in each component reflect samplings of independent random variables.
  • In the context of gene expression, this suggests that the sets of genes comprising the groups strongly contributing to each component have independent compositions.
  • Each column of \(A\) gives the distribution across arrays of the corresponding component (row of \(S\))

fastICA R package, https://cran.r-project.org/web/packages/fastICA/index.html

http://www.sciencedirect.com/science/article/pii/S1532046410001000

Independent component analysis

  • The source matrix \(S\) is used to biologically interpret the components by studying their contributing genes
  • The matrix \(A\) is used to associate the component with sample features by studying the distribution of the samples on the components according to their characteristics (e.g. clinical or molecular variables).

  • MineICA - Analysis of an ICA decomposition obtained on genomics data https://bioconductor.org/packages/release/bioc/html/MineICA.html

Other decomposition techniques

  • Non-negative matrix factorization
  • \(a_{\cdot j}=\sum_{k=1}^K w_{\cdot k}h_{kj}\), or \(A=WH\) (\(A\), \(W\), \(H\) are non-negative)
  • A linear, non-negative approximation of the data
  • By not allowing negative entries in \(W\) and \(H\), NMF enables a non-subtractive combination of parts to form a whole

Jean-Philippe Brunet et al. PNAS 2004;101:4164-4169

Non-negative matrix factorization (NMF)

  • \(a_{\cdot j}=\sum_{k=1}^K w_{\cdot k}h_{kj}\), or \(A=WH\) (\(A\), \(W\), \(H\) are non-negative)
  • \(A\) is an \(M \times N\) matrix, input gene expression by samples matrix
  • \(W\) is an \(M \times K\) matrix, contains \(K\) basis vectors
  • \(H\) is a \(K \times N\) matrix, contains the coefficient vectors
  • \(H\) defines a meta-gene space: similar to eigengenes
  • Classification can be done in the meta-gene space

Non-negative matrix factorization (NMF)

  • A key feature of NMF is the ability to identify nonsubtractive patterns that together explain the data as a linear combination of its basis vectors.
  • \(A=WH\) (\(A\), \(W\), \(H\) are non-negative)
  • \(A\) is an \(M \times N\) matrix, input gene expression by samples matrix
  • \(W\) is an \(M \times K\) matrix, contains \(K\) basis vectors
  • \(H\) is a \(K \times N\) matrix, contains the coefficient vectors
  • The \(K\) basis vectors in \(W\) can be regarded as the ‘building blocks’ of the data
  • The \(K\) coefficient vectors describe how strongly each ‘building block’ is present in the data
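
A hedged sketch with the CRAN `NMF` package (the rank and toy non-negative matrix are placeholders):

```r
library(NMF)                              # CRAN package 'NMF'

set.seed(1)
A <- matrix(runif(200 * 20), nrow = 200)  # toy non-negative gene-by-sample matrix

res <- nmf(A, rank = 3)                   # factorize A ~ W %*% H with K = 3
W <- basis(res)                           # M x K matrix of basis vectors (metagenes)
H <- coef(res)                            # K x N matrix of coefficients
dim(W); dim(H)
```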

NMF, general formulation

Why nonnegativity

NMF is more than 30 years old!

  • previous variants referred to as:
    • nonnegative rank factorization (Jeter and Pye, 1981; Chen, 1984);
    • positive matrix factorization (Paatero and Tapper, 1994);
  • popularized by Lee and Seung (1999) for “learning the parts of objects”.

Since then, widely used in various research areas for diverse applications

NMF for clustering

NMF can handle overlapping clusters and provides soft cluster membership indications.

NMF

  • Many computational methods
    • Cost function \(|A-WH|\)
    • The squared Euclidean error function: \(F(W,H)=\parallel A - WH \parallel_F^2\)
    • Kullback–Leibler divergence \(D_{KL}(A \parallel \hat{A}) = \sum_{ij}(A_{ij}log\frac{A_{ij}}{\hat{A_{ij}}})\), where \(\hat{A}=WH\)
  • Optimization procedure
    • Most use stochastic initialization, and the results don’t always converge to the same answer

Lee, Daniel D., and H. Sebastian Seung. “Algorithms for Non-Negative Matrix Factorization.” In Advances in Neural Information Processing Systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 556–62. MIT Press, 2001. http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf.

NMF

  • \(A=WH\) : Toy Biological interpretation
  • Assume \(k=2\)
  • We have 2 transcription factors that activate gene signatures \(W_1\) and \(W_2\)
  • \(H\) represents the activity of each factor in each sample
  • TF effects are additive

NMF

  • NMF operates in the original non-negative measurement space
  • Highly expressed genes matter more
  • Positivity constraint is advantageous: positive correlation among genes is more likely to be biologically meaningful
  • NMF may more accurately capture the data generating process

NMF vs. PCA

  • Results of PCA vs. NMF for reducing the leukemia data with 72 samples for visualization. Sample 66 is mislabeled. In the 2-D display, the NMF-reduced data clearly reveal this mistake, while the PCA-reduced data do not. ‘PC’ stands for principal component and ‘BE’ means basis experiment.

Weixiang Liu, Kehong Yuan, and Datian Ye, “Reducing Microarray Data via Nonnegative Matrix Factorization for Visualization and Clustering Analysis,” Journal of Biomedical Informatics, 2008.

Multidimensional scaling

MDS attempts to

  • Identify abstract variables which have generated the inter-object similarity measures
  • Reduce the dimension of the data in a non-linear fashion
  • Reproduce non-linear higher-dimensional structures on a lower-dimensional display

Kruskal’s stress

\[stress=\sqrt{\frac{\sum{(d_{ij} - \hat{d_{ij}})^2}}{\sum{d_{ij}^2}}}\]

  • Goodness-of-fit - Measures degree of correspondence between distances among points on the MDS map and the matrix input.
  • Start with distances \(d_{ij}\)
  • Fit decreasing numbers \(\hat{d_{ij}}\)
  • Subtract, square, sum
  • Take a square root
  • Divide by a scaling factor

MDS Basic Algorithm

  • Obtain and order the \(M\) pairs of similarities
  • Try a configuration in \(q\) dimensions
    • Determine inter-item distances and reference numbers
    • Minimize Kruskal’s stress
  • Move the points around to obtain an improved configuration
  • Repeat until minimum stress is obtained
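
A brief sketch of non-metric MDS with `isoMDS()` from the MASS package, which iteratively minimizes Kruskal's stress (the toy distance matrix is a placeholder):

```r
library(MASS)                           # provides isoMDS()

set.seed(1)
X <- matrix(rnorm(50 * 10), nrow = 50)  # toy data: 50 objects, 10 features
d <- dist(X)                            # pairwise Euclidean distances

fit <- isoMDS(d, k = 2)                 # configuration in q = 2 dimensions
fit$stress                              # final value of Kruskal's stress
plot(fit$points, xlab = "Dim 1", ylab = "Dim 2")
```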

Comparison Between PCA, MDS

  • PCA tries to preserve the covariance of the original data

  • MDS tries to preserve the metric (ordering relations) of the original space

t-SNE: Nonlinear Dimensional Reduction

Stochastic Neighbor Embedding

  • Stochastic Neighbor Embedding (SNE) is a technique that minimizes the Kullback-Leibler divergence between the scaled similarities of points \(i\) and \(j\) in high-dimensional space, \(p_{ij}\), and in low-dimensional space, \(q_{ij}\)

\[KL(P||Q) = \sum_{i \ne j}p_{ij}log\frac{p_{ij}}{q_{ij}}\]

  • SNE uses a Gaussian kernel (\(exp(-\frac{||x_i - x_j||^2}{2\sigma^2})\)) to compute similarities in high and low dimensional space. \(\sigma\) is a length scale parameter accounting for the width of the kernel

t-Distributed Stochastic Neighborhood Embedding (t-SNE)

  • Improves on SNE by using a t-Distribution as a kernel in low dimensional space
  • Because of the heavy-tailed t-distribution, t-SNE maintains local neighborhoods of the data better and penalizes wrong embeddings of dissimilar points. This property makes it especially suitable to represent clustered data and complex structures in few dimensions
  • t-SNE has one tuning parameter, perplexity, which determines the neighborhood size of the kernels used.

t-Distributed Stochastic Neighborhood Embedding (t-SNE)

  • The t-SNE algorithm maps points from high-dimensional space to low-dimensional space by minimizing the difference in all pairwise similarities between points in high-and low-dimensional spaces.
  • The axes of the low-dimensional spaces are given in arbitrary units.
  • First, the pairwise distance matrix is calculated in high-dimensional space.
  • Next, the distance matrix is transformed to a similarity matrix using a varying Gaussian kernel, so that the similarity between points \(X_i\) and \(X_j\) represents the joint probability that \(X_i\) will choose \(X_j\) as its neighbor or vice versa (based on their Euclidean distance and local density).
  • Then, a random low-dimensional mapping is rendered and pairwise similarities are computed for points in the low-dimensional map. However, the low-dimensional similarities are computed using Student’s t-distribution rather than a Gaussian distribution.
  • Finally, gradient descent is used to minimize the Kullback-Leibler divergence between the two probability distributions, leading to the final low-dimensional map.

Amir, El-ad David, Kara L Davis, Michelle D Tadmor, Erin F Simonds, Jacob H Levine, Sean C Bendall, Daniel K Shenfeld, Smita Krishnaswamy, Garry P Nolan, and Dana Pe’er. “ViSNE Enables Visualization of High Dimensional Single-Cell Data and Reveals Phenotypic Heterogeneity of Leukemia.” Nature Biotechnology 31, no. 6 (June 2013): 545–52. https://doi.org/10.1038/nbt.2594.
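
A minimal sketch with the `Rtsne` package (perplexity and the toy data are arbitrary choices):

```r
library(Rtsne)

set.seed(1)
X <- matrix(rnorm(300 * 50), nrow = 300)               # toy data: 300 points, 50 features

ts <- Rtsne(X, dims = 2, perplexity = 30, pca = TRUE)  # PCA pre-reduction before t-SNE
plot(ts$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")         # axes are in arbitrary units
```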

t-SNE: Collapsing the Visualization to 2D

t-SNE: How it works.

PCA and t-SNE Together

  • Often t-SNE is performed on PCA components
  • A liberal number of components is kept.
  • This removes mild signal (assumed to be noise).
  • Faster, on less data, but hopefully with the same signal.
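
One way to sketch this workflow (the number of retained components, 30, is an arbitrary choice):

```r
library(Rtsne)

set.seed(1)
X  <- matrix(rnorm(300 * 200), nrow = 300)       # 300 samples, 200 features
pc <- prcomp(X, center = TRUE, scale. = TRUE)    # liberal PCA pre-reduction

top <- pc$x[, 1:30]                              # keep the first 30 components
ts  <- Rtsne(top, perplexity = 30, pca = FALSE)  # run t-SNE on the PCA scores, not the raw data
plot(ts$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")
```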

Learn More About t-SNE

Other approaches

  • Bi-clustering - cluster both the genes and the experiments simultaneously to find appropriate context for clustering
    • R packages: iBBiG, FABIA, biclust
    • Stand-alone: BicAT (Biclustering Analysis Toolbox)

Summary

Kossenkov, Andrew V., and Michael F. Ochs. “Matrix Factorisation Methods Applied in Microarray Data Analysis.” International Journal of Data Mining and Bioinformatics 4, no. 1 (2010): 72. https://doi.org/10.1504/IJDMB.2010.030968.