Clustering
Hierarchical clustering
Daxin Jiang, Chun Tang, and Aidong Zhang. “Cluster Analysis for Gene Expression Data: A Survey.” IEEE Transactions on Knowledge and Data Engineering, (November 2004) - Clustering overview for gene expression studies. Definitions, proximity measures (Euclidean, Pearson), clustering (K-means, SOM, hierarchical, graph-theoretical, model-based, density, the use of PCA), biclustering. Metrics for clustering QC (homogeneity, separation, Rand, Jaccard, reliability)
Patrik D’haeseleer, “How Does Gene Expression Clustering Work?,” Nature Biotechnology, (December 2005) - Clustering distances. Recommendations for choosing clustering methods for gene expression data
Satagopan, Jaya M., and Katherine S. Panageas. “A Statistical Perspective on Gene Expression Data Analysis.” Statistics in Medicine, (February 15, 2003) - Intro to microarray technology, statistical questions. Hierarchical clustering - clustering metrics. MDS algorithm. Class prediction - linear discriminant analysis algorithm and cross-validation. SAS and S examples
Altman, Naomi, and Martin Krzywinski. “Points of Significance: Clustering.” Nature Methods, (May 30, 2017) - Clustering depends on gene scaling, clustering method, number of simulations in k-means clustering.
Krzywinski, Martin, and Naomi Altman. “Points of Significance: Importance of Being Uncertain.” Nature Methods 10, no. 9 (September 2013)
Altman, Naomi, and Martin Krzywinski. “Points of Significance: Association, Correlation and Causation.” Nature Methods, (September 29, 2015)
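Several entries above note that clustering results depend on the proximity measure (Euclidean, Pearson) and on gene scaling. The following is a minimal sketch of that effect, not code from any of the cited papers. The three expression profiles and the helper names (euclidean, pearson_distance, zscore) are made up for illustration.

```python
import math
from statistics import mean, pstdev


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def zscore(x):
    """Center a profile to mean 0 and scale it to unit standard deviation."""
    m, s = mean(x), pstdev(x)
    return [(a - m) / s for a in x]


# Hypothetical profiles across four samples (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, 10x the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # same magnitude as gene_a, opposite trend

d_ab_euclid = euclidean(gene_a, gene_b)
d_ac_euclid = euclidean(gene_a, gene_c)
d_ab_pearson = pearson_distance(gene_a, gene_b)
d_ac_pearson = pearson_distance(gene_a, gene_c)
d_ab_scaled = euclidean(zscore(gene_a), zscore(gene_b))

print("Euclidean a-b:", d_ab_euclid, "a-c:", d_ac_euclid)
print("Pearson   a-b:", d_ab_pearson, "a-c:", d_ac_pearson)
print("Euclidean a-b after z-scoring:", d_ab_scaled)
```

On raw values, Euclidean distance places gene_a closer to the anti-correlated gene_c than to the co-regulated gene_b, while the Pearson-based distance does the opposite. Z-scoring each gene before computing Euclidean distance removes the magnitude difference.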
Dimensionality reduction
Abdi, Hervé, and Lynne J. Williams. “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics, (July 2010) - PCA in-depth review. Mathematical formulations, terminology, examples, interpretation. Figures showing PC axes, rotations, projections, circle of correlation. Rules for selecting number of components. Rotation - varimax, promax, illustrated. Correspondence analysis for nominal variables, Multiple Factor Analysis for a set of observations described by several groups (tables) of variables. Appendices - eigenvalues and eigenvectors, positive semidefinite matrices, SVD
Wall, Michael. “Singular Value Decomposition and Principal Component Analysis,” - SVD and PCA statistical intro. Relation of SVD to PCA, Fourier transform. Examples of applications, including genomics.
Lever, Jake, Martin Krzywinski, and Naomi Altman. “Points of Significance: Principal Component Analysis.” Nature Methods, (June 29, 2017) - PCA explanation, the effect of scale. Limitations
Lee, D. D., and H. S. Seung. “Learning the Parts of Objects by Non-Negative Matrix Factorization.” Nature, (October 21, 1999) - Non-negative matrix factorization (NMF) principles, compared with vector quantization (VQ) and PCA. Intuition behind NMF learning parts and PCA learning the whole.
Lee, Daniel D., and H. Sebastian Seung. “Algorithms for Non-Negative Matrix Factorization.” In Advances in Neural Information Processing Systems, MIT Press, 2001 - Two algorithms for solving NMF - Euclidean distance and Kullback-Leibler divergence, with proofs.
Meng, Chen, Oana A. Zeleznik, Gerhard G. Thallinger, Bernhard Kuster, Amin M. Gholami, and Aedín C. Culhane. “Dimension Reduction Techniques for the Integrative Analysis of Multi-Omics Data.” Briefings in Bioinformatics, (July 2016) - Dimensionality reduction techniques - PCA and its derivatives, NMF. Table 1 - Terminology. Table 2 - methods, tools, visualization packages. Methods for integrative data analysis of multi-omics data.
Lee, Su-In, and Serafim Batzoglou. “Application of Independent Component Analysis to Microarrays.” Genome Biology 4, no. 11 (2003) - Independent Components Analysis theory and applications.
Stein-O’Brien, Genevieve L., Raman Arora, Aedin C. Culhane, Alexander V. Favorov, Lana X. Garmire, Casey S. Greene, Loyal A. Goff et al. “Enter the matrix: factorization uncovers knowledge from omics.” Trends in Genetics, (2018) - Matrix factorization and visualization. Refs to various types of MF methods. Terminology, Fig 1 explanation of MF in terms of gene expression and biological processes. References to biological examples.
Yeung, K. Y., and W. L. Ruzzo. “Principal Component Analysis for Clustering Gene Expression Data.” Bioinformatics, (September 1, 2001) - PCA is not always good for denoising data before clustering, clustering of PCs often worse than the original data. Simulated and real-life data. Data used for benchmarks: http://faculty.washington.edu/kayee/pca/
Libbrecht, Maxwell W., and William Stafford Noble. “Machine Learning Applications in Genetics and Genomics.” Nature Reviews. Genetics, (June 2015) - Machine learning in genomics. Supervised/unsupervised learning, semi-supervised, Bayesian (incorporating prior knowledge), feature selection, imbalanced class sizes, missing data, networks.
Meng, Chen, Bernhard Kuster, Aedín C. Culhane, and Amin Moghaddas Gholami. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics (May 29, 2014) - MCIA - multiple co-inertia analysis for integrating multiple datasets. Statistics and implementation in omicade4 - Multiple co-inertia analysis of omics datasets.
Singular Value Decomposition (SVD) Tutorial: Applications, Examples, Exercises blog post
Liu, Yanchi, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. “Understanding of Internal Clustering Validation Measures,” IEEE, 2010 - Internal clustering validation metrics, table, concise description of each.
Guido Kraemer, Markus Reichstein, and Miguel D. Mahecha, “DimRed and CoRanking Unifying Dimensionality Reduction in R,” The R Journal, 2018 - R packages implementing 15 methods for dimensionality reduction, from PCA, ICA, MDS to Laplacian eigenmaps. Brief but very good overview of each method, its complexity. Quality metrics to judge the quality of embedding. GitHub
Amir, El-ad David, Kara L Davis, Michelle D Tadmor, Erin F Simonds, Jacob H Levine, Sean C Bendall, Daniel K Shenfeld, Smita Krishnaswamy, Garry P Nolan, and Dana Pe’er. “ViSNE Enables Visualization of High Dimensional Single-Cell Data and Reveals Phenotypic Heterogeneity of Leukemia.” Nature Biotechnology, (June 2013) - viSNE paper - tSNE (Barnes-Hut) implementation for single-cell data, and the cyt tool for visualization. Supplementary methods - details of t-SNE algorithm, details of usage
Belacel, Nabil, Qian Wang, and Miroslava Cuperlovic-Culf. “Clustering Methods for Microarray Gene Expression Data.” Omics: A Journal of Integrative Biology, (2006) - Clustering methods overview. Hierarchical (agglomerative, divisive), partitional clustering (K-means, K-medoids, SOM). DBSCAN and other density-based algorithms. Graph-theoretical clustering. Fuzzy clustering, expectation-maximization methods. Table with software.
Kossenkov, Andrew V., and Michael F. Ochs. “Matrix Factorisation Methods Applied in Microarray Data Analysis.” International Journal of Data Mining and Bioinformatics, (2010) - Matrix factorization methods for genomics data. SVD, PCA, ICA, NCA, NMF (sparse and least squares NMF), Bayesian decomposition
Chavent, Marie, Vanessa Kuentz-Simonet, Amaury Labenne, and Jérôme Saracco. “Multivariate Analysis of Mixed Data: The R Package PCAmixdata.” ArXiv, December 8, 2017 - PCAmixdata - R package for PCA on a mixture of numerical and categorical variables. Other packages - ade4, FactoMineR. Theory, statistics, code examples with interpretation. PCAmixdata
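The Lee and Seung entries above describe NMF and an algorithm that minimizes Euclidean distance. As a rough illustration, here is a dependency-free sketch of the commonly cited multiplicative update rules for the squared Euclidean (Frobenius) objective. The toy matrix, the function names, the random initialization, and the small eps added to the denominators are assumptions for this example, not details from the papers. For real data a numerical library would be more appropriate.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start (an arbitrary choice for this sketch).
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [squared_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
        errors.append(squared_error(v, w, h))
    return w, h, errors


# Toy non-negative matrix of rank 2 (rows: hypothetical genes, columns: samples).
V = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 1.0],
    [1.0, 3.0, 3.0, 2.0],
    [2.0, 5.0, 3.0, 3.0],
]

W, H, errors = nmf(V, rank=2)
print("initial error:", errors[0])
print("final error:  ", errors[-1])
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what gives NMF its additive, parts-based factors in contrast to PCA.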
Tools
NbClust - Determining the Best Number of Clusters in a Data Set. It provides 30 indexes for determining the optimal number of clusters in a data set and offers the best clustering scheme from different results to the user.
philentropy - Similarity and Distance Quantification Between Probability Functions. Computes 46 optimized distance and similarity measures for comparing probability functions.
fpc - Flexible Procedures for Clustering R package. prediction.strength - function to calculate the optimal number of clusters.
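The tools above choose the number of clusters by scoring candidate partitions with validation indexes. The sketch below shows the general idea with one widely used internal index, the mean silhouette width, on toy one-dimensional data. It is an illustration only and does not reproduce the NbClust or fpc implementations. The data, the function names, and the simple clustering step (cutting sorted values at the largest gaps, which matches single-linkage hierarchical clustering in one dimension) are assumptions for this example.

```python
def cut_largest_gaps(values, k):
    """Split 1-D values into k clusters by cutting at the k-1 largest gaps."""
    xs = sorted(values)
    # Candidate break positions, ordered from the largest gap to the smallest.
    by_gap = sorted(range(1, len(xs)), key=lambda i: xs[i] - xs[i - 1], reverse=True)
    breaks = sorted(by_gap[:k - 1])
    clusters, start = [], 0
    for b in breaks:
        clusters.append(xs[start:b])
        start = b
    clusters.append(xs[start:])
    return clusters


def mean_silhouette(clusters):
    """Average silhouette width over all points (requires at least 2 clusters)."""
    widths = []
    for ci, cluster in enumerate(clusters):
        for idx, x in enumerate(cluster):
            if len(cluster) == 1:
                widths.append(0.0)  # usual convention for singleton clusters
                continue
            own = cluster[:idx] + cluster[idx + 1:]
            a = sum(abs(x - y) for y in own) / len(own)  # within-cluster cohesion
            b = min(  # mean distance to the nearest other cluster
                sum(abs(x - y) for y in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Toy data with three visually separated groups (illustrative values only).
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.7]

scores = {k: mean_silhouette(cut_largest_gaps(data, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

The candidate with the highest mean silhouette is taken as the suggested number of clusters. In practice it is worth comparing several indexes, since they can disagree.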