I recently added a new concatenation operation to filo’s groupBy tool. Similar to the “collapse” operation, it combines column values from multiple lines in a file or stream that share a common group. For example, imagine we have the following simple BED file (names.bed):
$ cat names.bed
chr1 10 20 aaa
chr1 10 20 bbb
chr1 10 20 ccc
Using the “collapse” operation, groupBy will give us a comma-separated list of the names for this common interval:
$ groupBy -i names.bed -g 1,2,3 -c 4 -o collapse
chr1 10 20 aaa,bbb,ccc
Now, by using the “concat” operation, groupBy will instead join the names for this common interval end to end, with no delimiter between them:
$ groupBy -i names.bed -g 1,2,3 -c 4 -o concat
chr1 10 20 aaabbbccc
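To make the difference between the two operations concrete, here is a minimal Python sketch of the same grouping logic. It is not how groupBy is implemented; the function name group_by and its arguments are made up for illustration, column indices are 0-based (unlike the 1-based -g and -c options above), and it simply gathers every row with the same key, whatever the row order.

```python
def group_by(rows, group_cols, value_col, op):
    """Group rows on group_cols and combine value_col per group.

    op is "collapse" (comma-separated list) or "concat" (no delimiter).
    Hypothetical helper for illustration only; not part of filo.
    """
    joiner = {"collapse": ",", "concat": ""}[op]
    groups = {}  # dicts keep insertion order, so groups appear in first-seen order
    for row in rows:
        key = tuple(row[i] for i in group_cols)
        groups.setdefault(key, []).append(row[value_col])
    return [list(key) + [joiner.join(values)] for key, values in groups.items()]


# The three records from names.bed, already split into columns.
rows = [
    ["chr1", "10", "20", "aaa"],
    ["chr1", "10", "20", "bbb"],
    ["chr1", "10", "20", "ccc"],
]

collapsed = group_by(rows, [0, 1, 2], 3, "collapse")
concatenated = group_by(rows, [0, 1, 2], 3, "concat")

print(collapsed)     # one row ending in "aaa,bbb,ccc"
print(concatenated)  # one row ending in "aaabbbccc"
```

The only difference between the two operations in this sketch is the string used to join the grouped values.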
This feature allows one to do many useful things, especially with DNA sequences. Below is an example that uses groupBy with BEDTools to create cDNA sequences from a BED file of exons for each gene/transcript. Such a starting file could be created by using the UCSC “knownGene” track.
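The core of that idea, joining each transcript's exon sequences in order, can be sketched in plain Python. This is only an illustration of the concatenation step, not the groupBy/BEDTools commands themselves: the function build_cdna, the transcript names, the coordinates, and the sequences are all made up, and it assumes plus-strand transcripts whose exon sequences have already been extracted (minus-strand transcripts would also need reverse complementing, which is not handled here).

```python
def build_cdna(exons):
    """Concatenate exon sequences per transcript, ordered by exon start.

    exons is a list of (transcript, exon_start, sequence) tuples.
    Hypothetical helper; assumes plus-strand transcripts only.
    """
    by_transcript = {}
    for transcript, start, seq in exons:
        by_transcript.setdefault(transcript, []).append((start, seq))
    return {
        transcript: "".join(seq for _, seq in sorted(parts, key=lambda p: p[0]))
        for transcript, parts in by_transcript.items()
    }


# Toy exon records, deliberately out of order to show the sort by start.
exons = [
    ("tx1", 300, "GGA"),
    ("tx1", 100, "ATG"),
    ("tx2", 50, "ATGC"),
    ("tx1", 200, "CCT"),
]

cdna = build_cdna(exons)
print(cdna)  # tx1 -> "ATGCCTGGA", tx2 -> "ATGC"
```

Grouping on the transcript name and concatenating the sequence column is the same pattern as the “concat” example with names.bed, just applied to exon sequences instead of names.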