Sunday, May 20, 2012

Variant of BED file format (and how to make it)

UPDATE: the contents of this post are obsolete, as the WashU Epigenome Browser no longer supports the bigBed file format.

See this post for how to reformat your data into the tabix format.

-----------------------------------

In the WashU Epigenome Browser, we use a slightly altered version of the BED format to encode positional data of genomic features: the 5th field is set to a unique integer, which serves as the ID of the genomic feature represented by that line. There is no upper bound on the ID value; it can go as high as 10 million if the BED file has 10 million lines. (example)
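As a rough illustration of the ID convention described above, here is a minimal Python sketch that overwrites the 5th field of each BED line with a running integer. The function name assign_feature_ids, the starting value of 1, and the "." placeholder used to pad short lines are assumptions made for this example, not part of the Browser's own tooling.

```python
def assign_feature_ids(lines, start=1):
    """Yield BED lines whose 5th field is replaced by a running unique integer."""
    next_id = start
    for line in lines:
        stripped = line.rstrip("\n")
        # Pass blank lines and header/comment lines through unchanged.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            yield stripped
            continue
        fields = stripped.split("\t")
        # Assumption: lines with fewer than 5 columns are padded with "."
        # so that the ID always lands in the 5th field.
        while len(fields) < 5:
            fields.append(".")
        fields[4] = str(next_id)
        next_id += 1
        yield "\t".join(fields)


if __name__ == "__main__":
    example = [
        "chr1\t100\t200\tgeneA\t0\t+",
        "chr1\t300\t400\tgeneB\t0\t-",
    ]
    for out in assign_feature_ids(example):
        print(out)
```

Every data line gets a distinct ID regardless of what the 5th field held before, so the result no longer carries a meaningful "score".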

But there is a catch. Users who diligently follow our guidelines to prepare a custom genomic feature track will now encounter the following error when converting a BED file into a bigBed file with the bedToBigBed program:


At line xx, score (xxx) must be between 0 and 1000


... where "xxx" is an integer bigger than 1000.


This is because the BED format specification defines the 5th field as "score", whose value must be between 0 and 1000. The bedToBigBed program checks the input BED file and squawks when it sees a "score" greater than 1000.
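To see ahead of time which lines would trip this check, something like the following Python sketch could be run over a BED file. It is not the bedToBigBed code itself; it only mirrors the 0 to 1000 range test on the 5th field, and the function name find_out_of_range_scores is made up for this example.

```python
def find_out_of_range_scores(lines, low=0, high=1000):
    """Return (line_number, score) pairs whose 5th field lies outside [low, high]."""
    offenders = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Lines with fewer than 5 columns have no score field to check.
        if len(fields) < 5:
            continue
        try:
            score = int(fields[4])
        except ValueError:
            # Non-integer 5th field: outside the scope of this sketch.
            continue
        if score < low or score > high:
            offenders.append((line_number, score))
    return offenders


if __name__ == "__main__":
    sample = [
        "chr1\t100\t200\tgeneA\t1000\t+",
        "chr1\t300\t400\tgeneB\t1001\t-",
    ]
    for line_number, score in find_out_of_range_scores(sample):
        print(f"line {line_number}: score {score} is outside 0..1000")
```

Any BED file using IDs above 1000 in the 5th field will show up here, which is exactly the case that a stock bedToBigBed rejects.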

To work around this hurdle and generate properly working bigBed files, use the following bedToBigBed binary instead of the stock one:

http://epigenomegateway.wustl.edu/bedToBigBed

This binary was compiled on a PC running 32-bit Ubuntu, using the Kent Source Tree downloaded on Apr 25, 2012. It should work on both 32-bit and 64-bit Linux PCs.


The following is the recipe to rebuild a bedToBigBed program that doesn't squawk:

  1. Download the Kent Source Tree from http://hgdownload.cse.ucsc.edu/admin/jksrc.zip and decompress it. The directory "kent" will be created in your working directory.
  2. Open the file "kent/src/lib/basicBed.c".
  3. If line 1375 reads "if (!isCt && (bed->score < 0 || bed->score > 1000))", remove lines 1375 and 1376. Otherwise do nothing.
  4. Save your edit to this file.
  5. Resume the normal procedure to build the library and the bedToBigBed binary:
    1. Remove the "-Werror" flag from the file kent/src/inc/common.mk
    2. Go to kent/src/
    3. Run "make libs"
    4. Go to kent/src/utils/bedToBigBed/
    5. Run "make"; a new "bedToBigBed" binary will be generated
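Steps 2 to 4 of the recipe can also be scripted. The Python sketch below removes lines 1375 and 1376 only when line 1375 matches the expected text, and otherwise leaves the file alone. The helper names drop_score_check and patch_file are invented for this example, and it assumes that line 1375 contains nothing but the quoted "if" statement apart from surrounding whitespace.

```python
EXPECTED = "if (!isCt && (bed->score < 0 || bed->score > 1000))"


def drop_score_check(source_lines, line_no=1375):
    """Return a copy of source_lines without lines line_no and line_no + 1,
    but only if line line_no holds the expected score check."""
    index = line_no - 1  # convert the 1-based line number to a list index
    if index < len(source_lines) and source_lines[index].strip() == EXPECTED:
        return source_lines[:index] + source_lines[index + 2:]
    # Content differs (e.g. another source version): do nothing.
    return list(source_lines)


def patch_file(path="kent/src/lib/basicBed.c"):
    """Apply drop_score_check to the file in place; return the number of lines removed."""
    # latin-1 maps every byte to a character, so the file round-trips unchanged.
    with open(path, encoding="latin-1") as handle:
        lines = handle.readlines()
    patched = drop_score_check(lines)
    if len(patched) != len(lines):
        with open(path, "w", encoding="latin-1") as handle:
            handle.writelines(patched)
    return len(lines) - len(patched)
```

Calling patch_file() from the directory that contains "kent" returns 2 when the check was found and removed, and 0 when the file was left untouched. The build steps under item 5 still need to be run by hand afterwards.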


We have to stick to this variation of the BED format because genomic feature tracks need the ID field to scroll (the ID is a neat way for the Browser to tell which genomic features have been extended by scrolling, so that new data can be correctly appended to cached data). We don't think it's bizarre, savage, or ruthless: the 5th field of a BED file is already of integer type, so why not free it from the limit and let it bear whatever value it wants? We apologize for any inconvenience this might cause, and we're happy to hear your thoughts.

Free(dom) is good, isn't it?