Clustering

Hierarchical clustering

  • Jiang, Daxin, Chun Tang, and Aidong Zhang. “Cluster Analysis for Gene Expression Data: A Survey.” IEEE Transactions on Knowledge and Data Engineering (November 2004) - Overview of clustering for gene expression studies. Definitions, proximity measures (Euclidean, Pearson), clustering methods (K-means, SOM, hierarchical, graph-theoretical, model-based, density-based, the use of PCA), biclustering. Metrics for clustering QC (homogeneity, separation, Rand, Jaccard, reliability).

  • D’haeseleer, Patrik. “How Does Gene Expression Clustering Work?” Nature Biotechnology (December 2005) - Clustering distances. Recommendations for choosing a clustering method for gene expression data.

  • Satagopan, Jaya M., and Katherine S. Panageas. “A Statistical Perspective on Gene Expression Data Analysis.” Statistics in Medicine (February 15, 2003) - Intro to microarray technology and the statistical questions it raises. Hierarchical clustering and clustering metrics. MDS algorithm. Class prediction: linear discriminant analysis algorithm and cross-validation. SAS and S examples.

  • Altman, Naomi, and Martin Krzywinski. “Points of Significance: Clustering.” Nature Methods (May 30, 2017) - Clustering results depend on gene scaling, the clustering method, and the number of simulations in k-means clustering.

  • Krzywinski, Martin, and Naomi Altman. “Points of Significance: Importance of Being Uncertain.” Nature Methods 10, no. 9 (September 2013)

  • Altman, Naomi, and Martin Krzywinski. “Points of Significance: Association, Correlation and Causation.” Nature Methods (September 29, 2015)
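
Several entries above mention proximity measures (Euclidean, Pearson) and partition-comparison metrics (Rand, Jaccard). The following is a minimal sketch of those four quantities in plain Python, included only to make the terms concrete. The function names and the toy expression profiles are assumptions made for illustration and are not taken from any of the cited papers.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation; 0 for identical shape, 2 for opposite shape.
    Assumes neither profile is constant (otherwise the denominator is zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def pair_counts(labels_a, labels_b):
    """Count item pairs by whether each partition puts them together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)

def jaccard_index(labels_a, labels_b):
    """Like Rand, but ignores pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / (both + only_a + only_b)

# Hypothetical profiles: gene_b has the same shape as gene_a at 10x the
# magnitude; gene_c has the opposite trend at a similar magnitude.
gene_a = [1, 2, 3, 4, 5]
gene_b = [10, 20, 30, 40, 50]
gene_c = [5, 4, 3, 2, 1]

print(euclidean(gene_a, gene_b), euclidean(gene_a, gene_c))
print(pearson_distance(gene_a, gene_b), pearson_distance(gene_a, gene_c))
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]),
      jaccard_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

In this toy case Euclidean distance places gene_a closer to gene_c (similar magnitude), while Pearson distance places it closer to gene_b (same shape), which is why the choice of measure and of gene scaling matters.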

Dimensionality reduction

Tools

  • NbClust - Determining the Best Number of Clusters in a Data Set. R package that provides 30 indices for determining the optimal number of clusters and proposes the best clustering scheme from the results obtained by varying the number of clusters, distance measures, and clustering methods.

  • philentropy - Similarity and Distance Quantification Between Probability Functions. Computes 46 optimized distance and similarity measures for comparing probability functions.

  • fpc - Flexible Procedures for Clustering (R package). The prediction.strength function computes the prediction strength of clusterings with different numbers of clusters and uses it to estimate the optimal number.
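
The tools above choose a number of clusters by scoring candidate clusterings. As a rough illustration of that idea, the sketch below cuts a naive average-linkage hierarchical clustering at several values of k and keeps the k with the highest mean silhouette width (the silhouette is one of the indices NbClust offers). This is plain Python written for illustration only: it does not reproduce the NbClust or fpc interfaces, it does not implement prediction strength, and the function names and toy points are assumptions.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(points, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest average pairwise distance until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(euclidean(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

scores = {k: mean_silhouette(points, average_linkage(points, k))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The pairwise search makes this sketch far too slow for real expression matrices; it is meant only to show the scoring loop that the packages above automate with many more indices and clustering methods.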