Friday, February 3, 2012

Showing data distribution using Gene Plot

The Gene Plot function has been reworked: new features have been added, known bugs fixed, and the interface made more informative.

Enhancements and changes to the four graph types are summarized here:
  • Type 1 - previously only the "average value curve" was drawn. Now the value distribution at each data point is displayed.
  • Type 2 - same as before, but now named "Spaghetti plot".
  • Type 3 - same as before: five functional parts of genes are plotted individually. As in Type 1, the value distribution at each data point is now displayed.
  • Type 4 - hierarchical or k-means clustering, now visualizes negative values correctly.
In Types 1 and 3, the value distribution is presented either as boxplots (using R) or as "quartile & extremes" curves (using the Google Chart service). Here are the details.

To run Gene Plot as a standalone app, click and select Gene Plot from the apps list:


The Gene Plot panel is then displayed:


At Step 0, enter a list of genes or coordinates into the text area. A sample list of cytochrome P450 genes is used here:


CYP4Z1
CYP2A7
CYP2A6
CYP3A4
CYP1A1
CYP4V2
CYP51A1
CYP2C19
CYP26B1
CYP11B2
CYP24A1
CYP4B1
CYP2C8


At Step 1, I select a heatmap track for the demo. Click and select the track named "H3K9me3 vHMEC":



Then go to Step 2 and look at its interface. By default, the first plot type is chosen (quartiles & extremes):


Check the "plot average values" checkbox at the bottom, then press the button to generate the plot:


In this graph, the histone data over each P450 gene body is summarized into the same number of data points (50). The distribution of the histone data at each sampling point is presented as 6 curves: min, max, lower quartile, upper quartile, median, and average. The average curve can be removed by unchecking the checkbox.
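The summarization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code Gene Plot runs: the post does not say how gene bodies are reduced to 50 points, so the sketch assumes simple averaging within equal-width bins, and the function names (`summarize_to_points`, `distribution_curves`), gene names, and signal values are all hypothetical.

```python
import statistics

def summarize_to_points(values, n_points=50):
    """Collapse a signal of any length into n_points values.

    Assumption: each point is the mean of an equal-width bin.
    The real tool may summarize differently.
    """
    if not values:
        raise ValueError("signal must contain at least one value")
    n = len(values)
    points = []
    for i in range(n_points):
        start = i * n // n_points
        # Keep at least one value per bin, even if the signal is shorter than n_points.
        end = max((i + 1) * n // n_points, start + 1)
        chunk = values[start:end]
        points.append(sum(chunk) / len(chunk))
    return points

def distribution_curves(profiles):
    """Compute the six curves across genes at each data point (needs >= 2 genes)."""
    curves = {name: [] for name in ("min", "q1", "median", "q3", "max", "mean")}
    for column in zip(*profiles):  # one column = every gene's value at one data point
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        curves["min"].append(min(column))
        curves["q1"].append(q1)
        curves["median"].append(median)
        curves["q3"].append(q3)
        curves["max"].append(max(column))
        curves["mean"].append(sum(column) / len(column))
    return curves

# Made-up per-gene signals of different lengths (not real H3K9me3 data).
genes = {
    "geneA": [0.1 * i for i in range(120)],
    "geneB": [1.0 + 0.05 * i for i in range(300)],
    "geneC": [2.0 - 0.01 * i for i in range(75)],
}
profiles = [summarize_to_points(signal) for signal in genes.values()]
curves = distribution_curves(profiles)
print({name: len(curve) for name, curve in curves.items()})
```

Because every gene is reduced to the same number of points, genes of very different lengths can be compared position by position, which is what makes the per-point quartiles meaningful.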

This graph is interactive: move the cursor over it to get details (data point #, curve type, and value). The lower and upper quartiles are the 25th and 75th percentiles, so 50% of the data values lie between them. In this example the average and median curves don't differ much. But when there are outliers, the median and average can differ greatly.
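To see why outliers pull the average away from the median, here is a small Python illustration. The numbers are invented for the example and the helper name `mean_median_gap` is hypothetical; nothing here comes from the actual histone track.

```python
import statistics

def mean_median_gap(values):
    """Absolute difference between the mean and the median of a sample."""
    return abs(statistics.mean(values) - statistics.median(values))

# Made-up values at a single data point across several genes.
typical = [1.0, 1.2, 0.9, 1.1, 1.0]
with_outlier = typical + [25.0]  # one gene with an extreme value

print("gap without outlier:", mean_median_gap(typical))
print("gap with outlier:   ", mean_median_gap(with_outlier))
```

A single extreme value shifts the mean substantially while the median barely moves, so a wide gap between the two curves at some data point is a hint that a few genes dominate the signal there.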

This graph is generated by the Google Chart service, which is fancy and interactive. R rendering used to be merely a fallback mechanism, but now it has some highlights of its own.

At Step 3, select R rendering from the drop-down menu. Notice how the Step 2 panel updates:


Two new options show up when using R for graph Type 1. With the above configuration, generate the plot:


Now boxplots are used instead of quartile curves. However, the average value curve is still there. You can turn it off by unchecking the checkbox. Graphs generated by R are still images and are not interactive.

Graph Type 3 is similar to Type 1 in using quartile curves or boxplots to represent the data distribution. As an example, select the genomic feature track "vertebrate PhyloP" (sequence conservation data of the human genome against vertebrate genomes), and generate a Type 3 plot for the short list of P450 genes:


The plot shows that gene exons have higher scores, which agrees with the idea that coding regions tend to be conserved. Average value curves are not shown here. The graph looks like the following when generated by Google Chart: