McKmeans: A highly efficient multi-core k-means algorithm for clustering extremely large datasets.

BMC Bioinformatics publication (2010)

Basic usage

McKmeans supports clustering gene expression, exon array, and SNP data sets. To select the cluster number estimation method, click the second tab at the top of the window. The following picture shows an example clustering of a microarray data set:
GUI screenshot

Load data, transpose data

Loading data is done via 'Menu -> File -> Load'. The cluster analysis is always done row-wise. However, after loading a data set, the data can be transposed (exchanging rows and columns). The input files must be formatted as CSV files (comma-separated text files). Please export your data from Excel to '.csv'. Please ensure that the decimal point character is '.' and that numerical values are separated by ',' (some language versions of Excel handle this differently). Header lines starting with '#' are ignored. SNP data must be encoded as 0, 1, 2 for 'homozygous reference', 'heterozygous', and 'homozygous alternative', respectively. The suffix for SNP files is '.snp'. Example data sets (gene expression, exon array, and SNP data sets) are available for download below.
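
To check a file against the format described above before loading it, a short script can help. The following Python sketch is not part of McKmeans; the function names read_mckmeans_csv and is_valid_snp are made up for illustration. It assumes purely numeric rows, ',' as separator, '.' as decimal point, and header lines starting with '#'.

```python
def read_mckmeans_csv(text, transpose=False):
    """Parse comma-separated numeric rows; skip blank lines and '#' header lines.

    Hypothetical helper for illustration, not the McKmeans loader itself.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # float() expects '.' as decimal point, as the format requires
        rows.append([float(value) for value in line.split(",")])
    if transpose:
        # exchange rows and columns
        rows = [list(column) for column in zip(*rows)]
    return rows


def is_valid_snp(rows):
    """Return True if every value is 0, 1 or 2 (the SNP encoding described above)."""
    return all(value in (0, 1, 2) for row in rows for value in row)


example = "# sample header\n1.5,2.0\n3.25,4.0\n0.5,1.0\n"
data = read_mckmeans_csv(example)
transposed = read_mckmeans_csv(example, transpose=True)
```

A file exported with ',' as decimal point makes float() raise a ValueError or changes the number of columns, so this kind of check catches the Excel locale problem early.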

Visualizing data

After loading, the data is shown in the plot panel. Gene expression and exon array data are plotted as a scatterplot displaying the first two columns of the data set. The columns to plot can be changed in the 'columns to plot' text fields. After pressing the 'Update columns to plot!' button, the scatterplot changes. Alternatively, a Sammon mapping (a non-linear projection to a two-dimensional space) can be computed by pressing the 'Sammon mapping' button. After clustering, the data points in the scatterplot are coloured to reflect the cluster result. SNP data is displayed as a parallel coordinate plot. The results from clustering SNP data are displayed as a set of parallel coordinate plots, each single plot referring to one cluster. Use the mouse to select a region of the plot and zoom into the graphic. All graphics can be saved to an SVG file via 'Menu -> File -> Export to SVG'.
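
Sammon mapping looks for a low-dimensional layout whose pairwise distances stay close to the original ones, measured by Sammon's stress. The Python sketch below only evaluates that stress for a given projection; it does not perform the optimization, and it is not taken from the McKmeans source. It assumes Euclidean distances and no duplicate points (a zero original distance would cause a division by zero).

```python
import math


def pairwise_distances(points):
    """Euclidean distance for every pair (i, j) with i < j."""
    n = len(points)
    return {
        (i, j): math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


def sammon_stress(original, projected):
    """Sammon's stress: distance mismatch, weighted by 1 / original distance."""
    d_orig = pairwise_distances(original)
    d_proj = pairwise_distances(projected)
    total = sum(d_orig.values())
    mismatch = sum(
        (d_orig[pair] - d_proj[pair]) ** 2 / d_orig[pair] for pair in d_orig
    )
    return mismatch / total


original = [(0, 0, 0), (3, 4, 0), (6, 8, 0)]
perfect = [(0, 0), (3, 4), (6, 8)]    # keeps all pairwise distances
squashed = [(0, 0), (3, 0), (6, 0)]   # drops the second coordinate
stress_perfect = sammon_stress(original, perfect)
stress_squashed = sammon_stress(original, squashed)
```

A stress of 0 means the projection preserves all pairwise distances; larger values indicate more distortion in the scatterplot.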

Cluster analysis

First choose the number of clusters and the maximal number of iterations for the cluster algorithm. The clustering starts by clicking the 'Cluster!' button. The cluster analysis may take a while, e.g. 25 minutes for clustering our example data set containing 1,000,000 data points into 20 clusters on a dual quad-core computer. Optionally, K-means can be started repeatedly. As K-means is initialized randomly, different runs of the algorithm give different results. Increasing the number of restarts will reduce the effects of random initialization. If the number of restarts is larger than 1, only the best clustering (in terms of the quantization error) is reported. The resulting clustering is displayed in the plotting area. Additionally, the number of rows per cluster is displayed on the bottom right. The resulting assignment vector can be saved via 'Menu -> File -> Save'. This vector shows the cluster number for each data point in the ordering of the original data set.
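
The effect of restarts can be sketched in a few lines of Python. This is a plain single-threaded k-means written for illustration, not the multi-core McKmeans implementation. The function names are made up, and the quantization error is assumed here to be the sum of squared Euclidean distances between each data point and its assigned cluster centre.

```python
import math
import random
from collections import Counter


def kmeans(data, k, max_iter, rng):
    """One k-means run (max_iter >= 1); returns (assignment vector, quantization error)."""
    centers = rng.sample(data, k)  # random initialization from the data points
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(point, centers[c]))
            for point in data
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:  # keep the old centre if a cluster runs empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(math.dist(p, centers[a]) ** 2 for p, a in zip(data, assignment))
    return assignment, error


def best_of_restarts(data, k, max_iter=100, restarts=10, seed=0):
    """Run k-means several times and keep the run with the lowest error."""
    rng = random.Random(seed)
    runs = [kmeans(data, k, max_iter, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, error = best_of_restarts(data, k=2)
rows_per_cluster = Counter(assignment)  # number of rows in each cluster
```

The assignment list plays the role of the saved assignment vector: entry i is the cluster number of row i of the input data.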

Cluster number estimation

The cluster number estimation method is selected by switching to the cluster number estimation tab. First choose the minimal and maximal number of clusters to test, the maximal number of iterations for the cluster algorithm, and the maximal number of repetitions for each cluster number k. Press the 'Cluster number estimation' button to start. The result is shown as a boxplot chart (displaying the median and interquartile range). The most stable clustering (greatest difference between the mean cluster result and the mean random clustering) is marked in the figure by a '*' and shown on the lower right panel. Additionally, a p-value from testing the null hypothesis of no difference between the MCA-index from clustering and the random baseline (U-test) is displayed. The best clustering (from 100 random restarts with k*) is shown in the cluster analysis tab and can be saved via 'Menu -> File -> Save clustering'.

The following picture shows a result of the cluster number estimation.
Cluster number estimation screenshot
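
The idea of comparing the stability of repeated clusterings with a random baseline can be illustrated with a small Python sketch. It rests on an assumption: the agreement between two clusterings is taken to be the fraction of data points with the same label under the best one-to-one relabelling of clusters. The exact definition of the MCA-index is given in the publication and may differ; the function names and the score values below are made up for illustration. The brute-force search over label permutations is only practical for small k.

```python
from itertools import permutations
from statistics import mean


def agreement(labels_a, labels_b, k):
    """Fraction of points that agree under the best relabelling of clusters."""
    best = 0
    for perm in permutations(range(k)):
        matches = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] == b)
        best = max(best, matches)
    return best / len(labels_a)


def mean_pairwise_agreement(clusterings, k):
    """Average agreement over all pairs of repeated clusterings."""
    pairs = [
        (i, j)
        for i in range(len(clusterings))
        for j in range(i + 1, len(clusterings))
    ]
    return sum(agreement(clusterings[i], clusterings[j], k) for i, j in pairs) / len(pairs)


def most_stable_k(cluster_scores, random_scores):
    """Pick the k with the greatest gap between mean clustering and mean random score."""
    return max(
        cluster_scores,
        key=lambda k: mean(cluster_scores[k]) - mean(random_scores[k]),
    )


same_partition = agreement([0, 0, 1, 1], [1, 1, 0, 0], k=2)
half_agreement = agreement([0, 0, 1, 1], [0, 1, 0, 1], k=2)
stability = mean_pairwise_agreement([[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]], k=2)

# made-up illustrative scores per cluster number k, not measured results
cluster_scores = {2: [0.90, 0.95], 3: [0.99, 0.97], 4: [0.80, 0.85]}
random_scores = {2: [0.60, 0.65], 3: [0.50, 0.52], 4: [0.45, 0.50]}
k_star = most_stable_k(cluster_scores, random_scores)
```

The p-value shown in the GUI comes from a U-test on the two score distributions, which this sketch leaves out.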

Download

The GUI version can be downloaded from this link: McKmeans.jar.

An R package making McKmeans available in R can be downloaded from this link: rMcKmeans.tar.gz

The script version can be downloaded from this link: McKmeans.clj.

Data sets

The data must be formatted as a CSV file (comma-separated text file). Header lines starting with '#' are ignored. Clustering is performed row-wise, but the data can be transposed after loading (exchanging rows and columns). The following examples are available for download:
  1. Gene expression data (20 rows, 2 columns)
  2. Exon array example (58510 rows, 20 columns)
  3. SNP data (50 rows, 10 columns)
A script for generating artificial microarray experiment data can be downloaded here.

Additionally, the microarray data set from Smirnov et al.[1] is available from this link. You can use this R-script to transform the data into the input format of McKmeans.

[1] Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459:587-591

Installation

Java GUI software

The software requires the Java Runtime Environment (JRE) version 1.6, which can be downloaded from java.sun.com. Hint for Mac OS X 10.5 users: You can change your default JVM version using the Java Preferences application under /Applications/Utilities/Java Preferences.app.

After downloading, the McKmeans GUI version is started by double-clicking McKmeans.jar or by running the following command in a terminal:

java -jar McKmeans.jar

If large input files are processed, the Java virtual machine (JVM) needs more memory (e.g. 4 GB). The command then is:

java -Xmx4g -jar McKmeans.jar

Alternatively, Windows users can download the start script McKmeans.bat and start the software by double-clicking this file.

R package

Additionally, an R package including McKmeans is available here. This package also requires JRE version 1.6. The R package can be installed in a terminal window with the command R CMD INSTALL rMcKmeans_0.43.tar.gz. Windows users can install it from within a running R session with install.packages("rMcKmeans_0.43.tar.gz",repos=NULL,type="source"). See the help files in R with ?mckmeans for multi-core cluster analysis or ?cne for starting a cluster number estimation.

Alternative usage

Command line interface

For batch processing, McKmeans provides a command line interface. Usage help is shown by calling java -jar McKmeans.jar --help. The command line parameters give access to all functions of the GUI version, i.e. clustering microarray/SNP data and cluster number estimation.

The script version (Clojure)

The script version requires Clojure and Java 1.6 or higher. Clojure can be downloaded from clojure.org.
  1. Install JRE if it is not available on your system.
  2. Download Clojure.
  3. Download the McKmeans script.
  4. Start the Clojure REPL:

    java -Xmx1G -server -cp clojure.jar clojure.lang.Repl

    Hint: For larger data sets the value of the parameter -Xmx must be increased to allocate the appropriate amount of memory for Java.
  5. Load the parallel k-means script by typing (load-file "McKmeans.clj").
  6. After loading the script, a data set must be loaded. A simple example is available in the Download Section.
    The data set is loaded and defined with (def dataset (read-csv-file "cluster_example.csv" "," false csv-parse-double)).
  7. The number of clusters k and the maximum number of iterations can be set via (def k 20) and (def maxiter 100).
  8. K-means is started by calling the function (kmeans). The runtime can be seen by calling (time (kmeans)).
  9. Results of the clustering are available with the function (read-out).

License

This software is licensed under the Artistic License 2.0 (The Perl Foundation).

Contact/Imprint

Johann M. Kraus, Hans A. Kestler
last modified: 2010-05-12