Tuesday, September 6, 2011

Clustering analysis on Gene Set

Clustering analysis is widely used in gene expression microarray studies, and many techniques and routines are readily available. This approach to exploring data structure is also useful for interpreting sequencing-based datasets. Just check our Browser!

A lightweight clustering function has been implemented as an additional graph type in Gene Plot, making it a useful follow-up to the Gene Set View.

I will walk through an example to demonstrate this feature. First, let's run Gene Set View on a gene set.


I'm using the human glycolysis pathway, identified by its ID: "path:hsa00010". But don't rush to submit it. Let's first change the default gene part by clicking the "change" button:


Options will appear for choosing which part of each gene is displayed. Click the "custom region..." option and select a 5 KB region around the transcription start site, where most of the interesting stuff lies (batteries of regulatory elements, mysterious CpG methylation, crazy nucleosome positioning and... )
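
For readers who want to reproduce such a region outside the Browser, here is a minimal sketch of how a TSS-centered window could be computed. The function name `tss_window`, the example coordinates, and the exact centering are my assumptions for illustration, not the Browser's actual code; coordinate conventions (0-based vs. 1-based) are ignored for simplicity.

```python
def tss_window(start, end, strand, size=5000):
    """Return (left, right) genomic coordinates of a window centered on the TSS.

    Assumption: the TSS is the gene start on the "+" strand and the gene
    end on the "-" strand. The left edge is clamped at 0, so a window
    near the chromosome start comes out shorter than `size`.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    tss = start if strand == "+" else end
    half = size // 2
    return max(0, tss - half), tss + half


# Hypothetical gene coordinates, for illustration only.
print(tss_window(10000, 20000, "+"))
print(tss_window(10000, 20000, "-"))
```

Note that the window is the same size on both strands; only the anchor point changes.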

Then click "use pathway". The view will update shortly to something like the one below:


In Step 2 of the "Gene plot" panel, a fourth glyph type with a heatmap icon has been added for the clustering function. Click it to reveal its content:


You will notice that the Step 3 graph rendering method option is gone. This is because clustering and heatmap rendering are only carried out by native code (on our server and in your web browser).

The "Number of data points" option is still there and lets you control the resolution of the data. Following it is the clustering method, currently with two choices: hierarchical and K-means. Each has its own options ("distance metric" and "agglomeration" for hierarchical, and "number of clusters" for K-means), which are just the standard parameters for running a clustering analysis.
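
To make the "Number of data points" idea concrete, here is a rough sketch of how a per-base signal could be summarized into a fixed number of points by averaging. The function `bin_signal` is a hypothetical helper of mine; how the Browser actually summarizes the data is not described in this post.

```python
def bin_signal(values, n_points):
    """Average `values` into `n_points` roughly equal-width bins."""
    n = len(values)
    if n_points <= 0 or n_points > n:
        raise ValueError("n_points must be between 1 and len(values)")
    binned = []
    for i in range(n_points):
        # Integer arithmetic keeps the bins contiguous and non-empty.
        lo = i * n // n_points
        hi = (i + 1) * n // n_points
        chunk = values[lo:hi]
        binned.append(sum(chunk) / len(chunk))
    return binned


# Six made-up signal values reduced to three data points.
print(bin_signal([1, 2, 3, 4, 5, 6], 3))
```

Fewer data points give a smoother, coarser profile per gene; more data points keep finer detail but also more noise.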

Let's run the hierarchical analysis first. Click the "Make gene plot" button:


This is what the hierarchical clustering result looks like for those sweet 5 KB regions of the glycolysis pathway genes, on an MRE-Seq experiment done on a CD4 cell sample... On the right side is the heatmap, where each row is one 5 KB region (the midpoint is the TSS and the left side is upstream, no matter which strand the gene is on). Darker color means higher MRE signal, indicating a higher likelihood that the CpGs are unmethylated. Mouse over it for the tooltip:
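
The "left side is upstream" convention can be sketched in a couple of lines. I'm assuming here that the signal for each region is stored in increasing genomic coordinate order, so rows for minus-strand genes need to be flipped; `orient_row` is a hypothetical name, not something from the Browser.

```python
def orient_row(values, strand):
    """Return a copy of `values` ordered upstream-to-downstream for the gene.

    Assumes `values` is given in increasing genomic coordinate order,
    so a minus-strand gene has its upstream side at the end of the list.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return list(reversed(values)) if strand == "-" else list(values)


# Made-up signal values for a minus-strand gene.
print(orient_row([0.1, 0.5, 0.9], "-"))
```

With every row oriented the same way, a column in the heatmap means the same position relative to the TSS for all genes, which is what makes the rows comparable.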


And on the left is the dendrogram, drawn horizontally. In addition to looking at the lines and branches, you can pick out a sub-tree by clicking its branching point:


I clicked on a juncture, and the entire sub-tree has turned red. The genes making up this sub-tree are also displayed in a list.

That was a brief intro to hierarchical clustering. The result is sensitive to the choice of distance metric and agglomeration method. Just play with it and you'll see.
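
If you'd like to see that sensitivity outside the Browser, below is a naive agglomerative clustering sketch in plain Python. It is a textbook version written for clarity, not the code running on our server; the names `agglomerate`, `euclidean`, `manhattan` and `LINKAGE` are mine, and the toy rows are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Agglomeration rules: how to turn all pairwise distances between two
# clusters into a single cluster-to-cluster distance.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(rows, metric=euclidean, linkage="average"):
    """Merge the two closest clusters repeatedly until one remains.

    Returns the merge history as (members_a, members_b, distance),
    where members are tuples of row indices.
    """
    combine = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [metric(rows[a], rows[b])
                      for a in clusters[i] for b in clusters[j]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy profiles: compare merge heights under two agglomeration methods.
rows = [[0, 0], [0, 1], [5, 5], [5, 6]]
for name in ("single", "complete"):
    print(name, agglomerate(rows, linkage=name))
```

Each entry in the merge history corresponds to one branching point of a dendrogram, and the index tuples are the members of the sub-tree under it. Swapping `metric` or `linkage` changes the merge distances and, on real data, often the tree shape as well.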

Next is K-means. Run it on the same data and search for 3 clusters:


So... the result doesn't look *so interesting*. It might be because the data profiles of the glycolysis genes are just similar to each other, or maybe I haven't tried hard enough? Anyway, the clusters are denoted by buttons on the right side of the heatmap, which you can click to show the genes belonging to each cluster.
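
For the curious, here is a minimal K-means sketch (Lloyd's algorithm) in plain Python. Again this is a generic illustration under my own assumptions, not the Browser's implementation: the function names, the seeded random initialization and the `max_iter` cap are all hypothetical choices.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(row, centers):
    """Index of the center closest to `row`."""
    return min(range(len(centers)), key=lambda c: sq_dist(row, centers[c]))


def kmeans(rows, k, max_iter=100, seed=0):
    """Return (labels, centers) after alternating assign and update steps."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen rows.
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(row, centers) for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up profiles; k plays the role of "number of clusters".
rows = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(kmeans(rows, k=2))
```

K-means always returns the number of clusters you ask for, even when the profiles are all alike. In that case the cluster centers end up close to each other, which is one way a result can look uninteresting.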

That's it. It would be great to hear your opinions, so just leave a comment below. Enjoy the day!