RSA-tools - Tutorials - peak-motifs

Contents

  1. Prerequisite
  2. Introduction
  3. Study case
  4. Test sets
  5. Tuning peak-motifs parameters
  6. Interpreting the result
  7. Additional exercises
  8. References


Prerequisite

This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.

  1. PSSM theory
  2. String-based motif discovery

It is better to follow the corresponding tutorials before this one.

  1. Position-specific scoring matrices.
  2. oligo-analysis: detection of over-represented words.
  3. position-analysis: detection of words having a positional bias in sequences aligned on some reference position.

A companion tutorial explains how to retrieve peak sequences from Galaxy.


Introduction

The program peak-motifs combines various programs of the RSAT suite to discover cis-regulatory motifs and predict putative transcription factor binding sites from a set of peak sequences identified by high-throughput methods such as ChIP-seq, ChIP-on-chip or related methods.

In this tutorial, we explain how to tune the parameters and interpret the results of the different steps of the peak-motifs workflow:

  1. Composition of the peak sequences (peak length distribution, mono- and di-nucleotides composition).
  2. Motif discovery with a series of complementary algorithms (oligo-analysis, position-analysis, dyad-analysis, local-words) relying on distinct criteria (global or local over-representation, positional heterogeneity) for selecting exceptional motifs.
  3. Comparison of discovered motifs with collections of known motifs, to identify transcription factors that may be associated with the discovered motifs.
  4. Scanning of the input sequences with the discovered motifs to predict putative binding sites, and analysis of their positional profiles and enrichment.
  5. Visualization of binding sites in the UCSC Genome Browser.
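
To make the first step more concrete, here is a minimal Python sketch of what "sequence composition" means: peak lengths plus mono- and di-nucleotide frequencies. This is not the code run by peak-motifs. The function names (read_fasta, composition) and the two toy sequences are our own assumptions, and words are counted on a single strand only, for simplicity.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def composition(sequences, k):
    """Relative frequencies of overlapping k-mers made of A, C, G, T only."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing N or other ambiguous residues.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in sorted(counts.items())}


# Toy input (hypothetical sequences, for illustration only).
example = [">peak1", "ACGTAC", ">peak2", "GGCCNA"]
records = list(read_fasta(example))
lengths = [len(seq) for _, seq in records]          # peak length distribution
mono = composition([seq for _, seq in records], 1)  # mono-nucleotide frequencies
di = composition([seq for _, seq in records], 2)    # di-nucleotide frequencies
```

To try it on a real file, replace the toy list with an open file handle, e.g. read_fasta(open("my_peaks.fa")), where my_peaks.fa stands for your own FASTA file.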


Study case

To illustrate the features of peak-motifs, we will analyze a set of peak sequences that were obtained by pulling down genomic regions bound by the transcription factor Oct4 in the mouse. The experiment was performed in the context of a wider study, where X. Chen and colleagues characterized the binding location of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).


Test sets

A set of test sequences is available on the supplementary material Web site:
http://rsat.bigre.ulb.ac.be/rsat/data/published_data/peak-motifs_2011/

The peak sequences from Chen's article are in the subdirectory
data/sequences/Chen_2008/peaks_from_galaxy/

Note: these peak sequences differ from those available in the GEO database. Indeed, Chen and colleagues filtered their peaks on the basis of discovered motifs in order to submit a "cleaned" collection of peaks to GEO. Since the goal of this tutorial is to show how peak-motifs performs on a raw collection of peak sequences, we have re-generated a complete peak collection from the original reads submitted by Chen in the GEO database. The mapping of the reads was performed with Bowtie against the mm9 assembly, then we used the program MACS to call the peak regions, and PeakSplitter to split the large regions into effective peaks. The peak sequences were then collected from Galaxy.


Tuning peak-motifs parameters

  1. Open a connection to the RSAT Web server.

  2. In the menu on the left side, expand the title NGS - ChIP-seq and select the tool peak-motifs.

Peak sequences panel

  1. Unless you have your own set of peak sequences, download the test set provided on the supplementary material Web site (file Oct4vsGFP_MACS_fdr0.02_splitted_peaks_sorted.fa).

    Note: the sequences should be saved as an unformatted (plain) text file.

  2. Enter a Title for this analysis (e.g. Oct4 dataset Chen 2008)

  3. Under Peak sequences, click on the Browse button to select your peak sequence file.

    • Note: alternatively, you may copy-paste the sequences directly in the box, but this imposes restrictions on the size of the data set.
    • Caution: make sure you upload a file containing peak sequences (usually fewer than 100 000 sequences of a few hundred base pairs each) and not the raw reads (usually several million sequences of a few tens of bp each).
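
Before uploading, the caution above can be checked with a short Python sketch that counts the sequences in a FASTA file and looks at their lengths. The thresholds below (100 000 sequences, median length of 100 bp) are our own rough assumptions derived from the orders of magnitude given above, not values used by peak-motifs, and the function names are hypothetical.

```python
import statistics


def fasta_lengths(lines):
    """Return the length of each sequence found in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the sequence being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def looks_like_raw_reads(lengths, max_count=100_000, min_median_length=100):
    """Heuristic flag: too many sequences, or sequences that are too short.

    Both thresholds are illustrative assumptions, to be adapted to your data.
    """
    if not lengths:
        return False
    too_many = len(lengths) > max_count
    too_short = statistics.median(lengths) < min_median_length
    return too_many or too_short


# Toy inputs (hypothetical): two peak-like sequences and two read-like sequences.
peak_like = [">p1", "A" * 300, ">p2", "C" * 150, "G" * 100]
read_like = [">r1", "A" * 36, ">r2", "C" * 36]
peak_lengths = fasta_lengths(peak_like)
read_lengths = fasta_lengths(read_like)
```

On a real file, pass an open file handle to fasta_lengths, for example the test set Oct4vsGFP_MACS_fdr0.02_splitted_peaks_sorted.fa once it has been downloaded.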

Reduce input peak sequences panel

This panel can be expanded by double-clicking on the triangle on the right.

It allows you to limit the analysis to a given number of top peaks from the input file, or to clip sequences around their centers in order to restrict them to a maximal size. With the peak sequences used in this tutorial, there is no specific need to apply those restrictions. The two steps below explain why you generally do not need to activate the restrictions on peak number and peak size.

  1. Make sure that the option Number of top sequences to retain is left blank.
    • Note: most existing tools for motif discovery in ChIP-seq peaks systematically restrict the number of top sequences because the underlying algorithms do not scale up to large sequence sets. In contrast, the algorithms used in peak-motifs are linear in time and can analyze several megabases in a few minutes. There is thus no need to restrict the number of peaks.
  2. Make sure that the option Cut peak sequences is left blank.
    • Note: this option restricts the width of the peaks by clipping the left and right extremities of each peak sequence beyond a given distance from its center. As a result, the analysis is restricted to the central region of each peak, which is supposed to contain the highest density of transcription factor binding sites. However, the benefit strongly depends on the precision of the procedure previously used for peak calling. In practice, it is generally safe to let the program analyze the whole dataset, but in some cases it might be useful to also restrict the analysis to peak centers and compare the results.
    • Note: this option should not be used for collections of large regions such as those obtained from chromatin accessibility or histone methylation studies.

Change motif discovery parameters panel

This panel contains the parameters for the motif discovery step. For the case study, we will keep the default settings, using the programs oligo-analysis and position-analysis to discover over-represented motifs and motifs with a positional bias.

We explain hereafter how to tune the parameters depending on the properties of the peak collection and the expected structure of the transcription factor binding motif.

Compare motifs panel

Discovered motifs can be compared to databases of known motifs. Several public databases, such as JASPAR and UniPROBE, are directly supported. Users may also upload a private collection of matrices here (e.g. TRANSFAC).

  1. Keep JASPAR core Vertebrates checked, and also check JASPAR PBM (UNIPROBE) Mouse, since our dataset was obtained from mouse.

Locate motifs and export as UCSC custom track

If the sequences are provided in an appropriate format, the positions of the predicted sites can automatically be converted from peak-relative to genomic coordinates.

  1. Keep the Search putative binding sites option checked.

  2. Assuming that you followed the steps above, select Sequences were fetched from Galaxy. This will recalculate the genomic coordinates of the predicted binding sites, and generate a custom UCSC track to visualize the results in this popular genome browser.

  3. Enter your email address and click GO.

    • Note: peak-motifs runs a complete workflow involving several tasks. Even though the motif discovery algorithms are time-efficient (the computing time increases linearly with sequence size), the complete treatment can take several minutes. For this reason, this tool requires an email address in order to notify users when the results are ready.
    • Note: as soon as peak-motifs starts, an information message appears, indicating the URL where the results will become available. Optionally, you can click on this URL and periodically reload the page to follow the progress of the analysis. The report page will be progressively updated until the whole analysis is finished.
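
The conversion from peak-relative to genomic coordinates mentioned in step 2 can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: we assume sequence headers of the form assembly_chromosome_start_end_strand (e.g. mm9_chr1_3473041_3473370_+), peak coordinates that are zero-based and half-open as in BED files, and site positions that are 1-based and inclusive relative to the start of the peak sequence. Check the headers of your own file before relying on this; peak-motifs performs the conversion itself and may use other conventions.

```python
def parse_galaxy_header(header):
    """Split a header such as 'mm9_chr1_3473041_3473370_+' into its fields.

    The header layout is an assumption; chromosome names may themselves
    contain underscores (e.g. chr1_random), hence the split from the right.
    """
    assembly, rest = header.lstrip(">").split("_", 1)
    chrom, start, end, strand = rest.rsplit("_", 3)
    return assembly, chrom, int(start), int(end), strand


def site_to_genomic(header, site_start, site_end):
    """Convert a site given in peak-relative positions (1-based, inclusive)
    to genomic coordinates (zero-based start, exclusive end)."""
    _, chrom, peak_start, peak_end, strand = parse_galaxy_header(header)
    if strand == "-":
        # The sequence is the reverse complement of the genomic region,
        # so positions are counted backwards from the peak end.
        return chrom, peak_end - site_end, peak_end - site_start + 1
    return chrom, peak_start + site_start - 1, peak_start + site_end


def bed_line(header, site_start, site_end, name="site"):
    """Format one predicted site as a minimal tab-separated BED line."""
    chrom, start, end = site_to_genomic(header, site_start, site_end)
    return f"{chrom}\t{start}\t{end}\t{name}"


# Hypothetical 8 bp site found at positions 11 to 18 of a peak sequence.
example = bed_line("mm9_chr1_3473041_3473370_+", 11, 18, name="predicted_site")
```

Lines of this kind, preceded by a track definition line, are what a UCSC custom track in BED format contains.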


Interpreting the result

A link to the result should appear on the new page. The results appear progressively, so that you can start examining them before the whole analysis is finished.

  1. Click on this link to see the results.
  2. TO BE CONTINUED...

Additional exercises


References

  1. Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V. B., Wong, E., Orlov, Y. L., Zhang, W., Jiang, J., Loh, Y. H., Yeo, H. C., Yeo, Z. X., Narang, V., Govindarajan, K. R., Leong, B., Shahab, A., Ruan, Y., Bourque, G., Sung, W. K., Clarke, N. D., Wei, C. L. and Ng, H. H. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106-17. [Pubmed 18555785].
  2. Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D. and van Helden, J. (2011). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, doi:10.1093/nar/gkr1104, 9. [Open access].
