A fast, alignment-free, conservation-based method for transcription factor binding site discovery
Raluca Gordân, Leelavati Narlikar, and Alexander J. Hartemink

Abstract

As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Based on the premise that selective pressure forces functional DNA elements to evolve at a slower rate than non-functional elements, comparative approaches generally consider the DNA sites that are well conserved in orthologous regulatory regions to be good candidates for functional transcription factor (TF) binding sites. Most conservation-based methods for TF binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not correctly align these binding sites. Here, we present a novel, alignment-free approach for incorporating conservation information into TF motif discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use the conservation information to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is thus very simple and fast. It does not require sequence alignments, nor the phylogenetic relationships between the orthologous sequences, and yet it is more effective on real biological data than methods that do. We show that when applied to yeast ChIP-chip data, our algorithm with the conservation information performs better in terms of both prediction accuracy and speed than current conservation-based motif discovery tools.
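The relaxed definition of conservation described above can be illustrated with a minimal sketch. It assumes exact string matching of the site or its reverse complement; the function names are hypothetical and are not taken from the authors' code, and the sketch does not reproduce how the priors themselves are computed.

```python
# Illustrative sketch only: assumes a site counts as "conserved" in an
# ortholog when it, or its reverse complement, occurs exactly anywhere
# in that sequence. Function names are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(site.upper()))


def is_conserved(site: str, ortholog: str) -> bool:
    """True if the site occurs anywhere in the ortholog, in either orientation."""
    site = site.upper()
    ortholog = ortholog.upper()
    return site in ortholog or reverse_complement(site) in ortholog


def conservation_fraction(site: str, orthologs: list) -> float:
    """Fraction of orthologous sequences in which the site is conserved."""
    if not orthologs:
        return 0.0
    hits = sum(is_conserved(site, seq) for seq in orthologs)
    return hits / len(orthologs)
```

No alignment or phylogenetic tree is needed here: each orthologous sequence is scanned independently, which is what keeps this kind of check simple and fast.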


Data and scripts

  • Fasta files: FASTA.zip

  • Discriminative conservation priors (DC): DCpriors.zip

  • Literature consensus motifs (as PSSMs): literature.zip

  • Script to compute the inter-motif distance: distancePSSMtoPSSM.pl

    The script takes as input two files, each containing one PSSM with 4 rows (corresponding to A, C, G, T, in that order) and W columns (corresponding to the W positions in the motif). Elements on a row are separated by tabs. Please see the literature motifs for examples of such files.

    Running the script: perl distancePSSMtoPSSM.pl your_motif.txt literature_motif.txt
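
The PSSM file layout described above (4 tab-separated rows in the order A, C, G, T, each with W entries) can be checked before running the script. The following is a minimal sketch, assuming the entries are plain numbers; `parse_pssm` is a hypothetical helper and is not part of distancePSSMtoPSSM.pl.

```python
# Illustrative sketch only: validates and loads a PSSM in the layout
# described above. Assumes entries are numeric; parse_pssm is a
# hypothetical helper, not part of the distributed Perl script.

BASES = ("A", "C", "G", "T")


def parse_pssm(text: str) -> dict:
    """Parse PSSM text into a dict mapping each base to its W column values."""
    # Skip blank lines; split each remaining row on tabs.
    rows = [line.strip().split("\t") for line in text.splitlines() if line.strip()]
    if len(rows) != len(BASES):
        raise ValueError(f"expected 4 rows (A, C, G, T), found {len(rows)}")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns (W)")
    return {base: [float(x) for x in row] for base, row in zip(BASES, rows)}


if __name__ == "__main__":
    example = "0.7\t0.1\t0.1\n0.1\t0.7\t0.1\n0.1\t0.1\t0.7\n0.1\t0.1\t0.1\n"
    pssm = parse_pssm(example)
    print("motif width W =", len(pssm["A"]))
```

A file that fails this check (wrong row count or ragged rows) is unlikely to match the format the script expects.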


Supplementary files