This paper presents TANTAN, a new method for detecting and masking simple repetitive sequences that can confound homology search. The design of TANTAN was inspired by the strand-slippage mechanism that generates simple repeats. Working from the assumption that slippage events occur at different offsets, the authors designed an algorithm that integrates self-similarity across a range of offsets. This is combined in a single model with posterior decoding, which should allow the algorithm to estimate, position by position, the probability of repeat versus background sequence. The method is said to enable reliable homology search for protein–protein, protein–DNA and DNA–DNA comparisons, even for extremely AT-rich DNA.
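To make the idea of "self-similarity at different offsets" concrete, here is a minimal Python sketch. It is not TANTAN's actual model (which is probabilistic and uses posterior decoding); it only counts how often each letter matches the letter `k` positions earlier. The function name `offset_self_similarity` and the `max_offset` parameter are made up for illustration.

```python
def offset_self_similarity(seq: str, max_offset: int = 10) -> dict[int, float]:
    """For each offset k, return the fraction of positions i with seq[i] == seq[i - k].

    Illustrative only: a crude stand-in for the self-similarity signal,
    not the probabilistic model described in the paper.
    """
    seq = seq.upper()
    scores: dict[int, float] = {}
    for k in range(1, max_offset + 1):
        pairs = len(seq) - k  # number of (i, i - k) pairs available
        if pairs <= 0:
            break
        matches = sum(1 for i in range(k, len(seq)) if seq[i] == seq[i - k])
        scores[k] = matches / pairs
    return scores


if __name__ == "__main__":
    # A perfect trinucleotide repeat matches itself only at multiples of 3.
    for offset, score in offset_self_similarity("ACGACGACGACG", 4).items():
        print(offset, round(score, 2))
```

A tandem repeat with period 3 scores highly at offset 3 and poorly at the other offsets, which is the kind of signal a slippage-inspired model would want to pick up at whichever offset happens to fit.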
To test the reliability of TANTAN, the authors ran a series of comparisons against other commonly used methods: Tandem Repeats Finder (TRF), DustMasker and SegMasker. TRF is unique here in that it uses short repeated sequences as seeds to build consensus sequences, which are then matched against similar sequences in the genome [1]. DustMasker and SegMasker are included in the BLAST+ package, and are used to mask nucleotide and protein sequences respectively [2-3].
The standard method of comparison is to align two genomes and compare the number of alignments to the number expected for random sequence data (the E-value). Either one or both of the genomes may be masked by one of the programs being compared. The initial results in Figure 4 of the paper show an extremely high number of alignments following masking by DustMasker, SegMasker and TRF. Figure 5 shows similar alignments, but with TANTAN as the masking program. Here the number of alignments is much closer to the E-value, and in many cases is lower than you would expect from random sequence data. Figure 6 shows the effect of different r values on the percentage of sequence masked, where anything above 0.02 seems unreasonably high. Figure 7 shows the number of alignments when TANTAN is used to mask the sequences at different points in the alignment process; masking the genome at the nucleotide level and then translating to amino acid sequences gives a particularly high number of alignments. Lastly, Figure 8 shows the result of soft-masking, where the mask is applied during the earlier stages of the sequence comparison but not the later ones, which yields a much higher number of alignments than the comparable hard masking shown in Figure 5. The authors' final conclusion is that TANTAN is much more adept at masking simple repeats than the other software tested.
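The difference between the hard masking of Figure 5 and the soft-masking of Figure 8 is easier to see on a toy sequence. The sketch below assumes the common convention that soft-masked residues are written in lowercase while hard-masked residues are replaced by a placeholder such as `N`; the function `apply_mask` and its interval format are hypothetical, not part of any of the tools discussed.

```python
def apply_mask(
    seq: str,
    intervals: list[tuple[int, int]],
    hard: bool = False,
    mask_char: str = "N",
) -> str:
    """Mask 0-based, half-open intervals of seq.

    Soft masking lowercases the residues, so the information is kept and a
    later alignment stage can still use it. Hard masking overwrites the
    residues with mask_char, so the information is gone for every stage.
    """
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char if hard else chars[i].lower()
    return "".join(chars)


if __name__ == "__main__":
    seq = "ACGTATATATGC"
    repeat = [(4, 10)]  # the ATATAT run
    print(apply_mask(seq, repeat))             # ACGTatatatGC
    print(apply_mask(seq, repeat, hard=True))  # ACGTNNNNNNGC
```

With a soft mask the aligner can ignore lowercase letters when seeding but still extend alignments through them, which is one plausible reason the soft-masked counts come out higher than the hard-masked ones.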
The program is designed to help with the comparison of homologous sequences between species; however, there seem to be a number of factors that the authors don't take into account. The most obvious is that nowhere do they factor in transposable elements (TEs), which may comprise a significant portion of the genome. For example, 45% of the human genome, 55% of the opossum genome, and 73% of the maize genome are composed of TEs [4-6]. During the testing stages they state that "since sequences do not evolve by reversal, there are no true homologs in these tests", but many transposable elements can insert in both forward and reverse orientation [7]. This makes it likely that many of their "spurious similarities" were in fact informative insertion sites, and may explain some of the high number of alignments seen in the theoretically random data when testing the plant and animal genomes. To compound this error, there never seems to be a point where the so-called spurious sites were collected and compared at the sequence level to check for any actual homology.
In addition, there doesn't seem to be any direct comparison between the sequences that make up the different ranges of masking. In Figure 6 we see a wide range of values depending on the r value used, but there is no indication of what is actually being masked beyond the baseline of 0.005. Obviously in the heavily AT-rich genomes you are getting a lot of false positives, which TANTAN accounts for, but what about the other genomes? Is TANTAN masking genomic features similar to simple repeats, is it that at that level we're seeing transposon masking, or is it masking random nucleotides that bear some vague repetition? Their protein–protein alignment testing seems to be a bit more grounded, since protein sequences will be more conserved, but TANTAN doesn't seem to handle translated sequences very well, as seen in Figure 7.
Overall, the paper seems to be lacking in practical testing. TANTAN lacks RepeatMasker's convenient masked-regions-only output file, which would have made digging through the actual data much easier. Too many bioinformaticians treat these kinds of graphs and statistical analyses as gospel, or as black boxes that you never look into. The test design here is very clever, but the understanding of the genomic landscape of different taxa seems to have been a bit deficient. The paper should have focused on TANTAN's specific strengths relative to other available programs, rather than on a set of somewhat ill-suited comparisons. The lack of a RepeatMasker comparison is particularly jarring, since I believe it would be the go-to program for masking large portions of the genome for homology analysis. It should also be pointed out that the scoring system used and the variables set made a large difference in the quality of TANTAN's output, and while a previous paper covers similar ground for the other programs, I feel that a direct comparison of the "optimized" output for each program would have been more valuable. I also feel that a figure comparing one genome masked with TANTAN and the other masked with another program against both genomes masked with TANTAN and both masked with the other program would be useful; the difference in the Y-axis range between Figure 4 and Figure 5 seems designed to make the number of alignments look excessively high for the other programs. The other programs in the paper also have functions that TANTAN doesn't perform, the usefulness of TRF in TE identification being of particular note.
TANTAN's primary strength seems to be its ability to perform simple repeat masking in AT-rich genomes. It does an admirable job of masking the simple repeats and should be comparable to the other programs used for de novo masking, and its modest computational requirements should make it usable by a casual desktop user for most genomes. However, as mentioned in the paper, the hand-optimized parameters will likely make it poorly suited for "right out of the box" use.
As mentioned above, there is no "only repeats" output to provide easy access to the masked sections for analysis. This makes TANTAN particularly ill-suited for the study of repeats, especially when compared to other programs such as RepeatMasker or TRF. There is always the possibility of extracting the masked areas using another program, which wouldn't necessarily be computationally difficult, but it's an extra step that should be unnecessary. However, the algorithm itself seems to be well designed, and is probably better suited to being implemented as part of a larger framework or in more specialized tools. While it would be functionally the same, a frontend offering commonly used parameter combinations tuned for a given set of circumstances would make the program much more accessible to the casual user.
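As a rough idea of what that extra extraction step might look like, here is a minimal Python sketch. It assumes the masker has written soft-masked FASTA, with masked residues in lowercase, and reports each lowercase run as a 0-based, half-open interval in the style of a BED file. The function name `masked_intervals` is made up for illustration and is not part of TANTAN or any other tool.

```python
import io
from typing import Iterator, TextIO


def masked_intervals(fasta: TextIO) -> Iterator[tuple[str, int, int]]:
    """Yield (record_name, start, end) for every lowercase run in a FASTA stream.

    Coordinates are 0-based and half-open. Assumes soft-masked input where
    lowercase letters mark the masked residues.
    """
    name = None
    pos = 0           # position within the current record
    run_start = None  # start of the currently open lowercase run, if any
    for line in fasta:
        line = line.strip()
        if line.startswith(">"):
            if run_start is not None:  # close a run that reaches the record end
                yield name, run_start, pos
            name = line[1:].split()[0] if len(line) > 1 else ""
            pos, run_start = 0, None
            continue
        for ch in line:
            if ch.islower():
                if run_start is None:
                    run_start = pos
            elif run_start is not None:
                yield name, run_start, pos
                run_start = None
            pos += 1
    if run_start is not None:
        yield name, run_start, pos


if __name__ == "__main__":
    example = io.StringIO(">chr1 test\nACGTat\natGGcc\n>chr2\nttAA\n")
    for rec, start, end in masked_intervals(example):
        print(f"{rec}\t{start}\t{end}")
```

Runs that span line breaks are merged because the position counter carries across lines within a record. The resulting intervals could then be used to pull out the masked sequence itself for inspection.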
- Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.
- Morgulis,A., Gertz,E.M., Schäffer,A.A. and Agarwala,R. (2006) A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol., 13, 1028–1040.
- Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409:860-921
- Mikkelsen, T.S., B. Aken, C.T. Amemiya, J.L. Chang, S. Duke, M. Garber, A.J. Gentles, L. Goodstadt, A. Heger, J. Jurka, M. Kamal, E. Mauceli, S.M.J. Searle, T. Sharpe, M.L. Baker, M.A. Batzer, P.V. Benos, K. Belov, M. Clamp, A. Cook, J. Cuff, R. Das, J.E. Deakin, M. Grabherr, J.M. Greally, W. Gu, R.L. Jirtle, S. Mahony, M.A. Marra, R.D. Miller, R.D. Nicholls, A.T. Papenfuss, Z.E. Parra, D.D. Pollock, D.A. Ray, J.E. Schein, T.P. Speed, J.L. VandeBerg, M.J. Wakefield, C.M. Wade, J.A. Walker, C. Webber, J.R. Weidman, X. Xie, M.C. Zody, Broad Institute Genome Sequencing Platform, Broad Institute Whole Genome Assembly Team, J.A. Marshall Graves, C.P. Ponting, M. Breen, P.B. Samollow, E.S. Lander, K. Lindblad-Toh (2007) Genome of the marsupial Monodelphis domestica reveals lineage-specific innovation in coding and non-coding sequences. Nature 447(7141):115-230.
- Meyers, B.C., Tingey, S.V., and Morgante, M. (2001) Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Genome Research, 11:1660-1676
- Sela et al (2010): The role of transposable elements in the evolution of non-mammalian vertebrates and invertebrates. Genome Biology, 11:R59