RSA-tools - Tutorials - dna-pattern


Searching pattern positions

    1. Retrieve upstream sequences from -800 to -1 for the following yeast genes (as you have seen in the tutorial on sequence retrieval)

    2. Since you are working with a eukaryote, make sure to inactivate the option Prevent overlap with neighbour genes. Check that all your sequences are 800 bp long.

      Now that we have the upstream sequences, we will scan them with the consensus patterns for GATA boxes and HAP sites. At the bottom of the sequences, a series of buttons is presented. These buttons allow you to send your sequences to a selection of sequence analysis programs. Click on dna-pattern (IUPAC). A new form appears.

      Note that the search will automatically be performed on the sequences you just retrieved (sequences transferred from your previous query). This differs from the form you would receive by clicking on "dna-pattern" in the left frame, which would contain an empty box for entering your own sequences.

    3. In the Query pattern(s) box, we will enter the patterns to be searched for. Each pattern must come on a separate line. The first word of each line is the pattern itself (a string of nucleotides), and the second word is an identifier for this pattern. Type the following text in the Query pattern(s) box:

      GATAAG	Gata_box
      CCAAY	Hap_site

      Note the use of the degenerate IUPAC code: the Y in CCAAY on the second line means "either C or T".

    4. Leave all other parameters unchanged and click GO.
    5. You now see the positions of all matches with the patterns you entered within the upstream sequences of the selected genes. Each line shows a single match, and the columns indicate, respectively:

      • pattern identifier
      • strand on which the match was found (D for direct, R for reverse)
      • pattern searched for (i.e. the query strings you provided)
      • name of the sequence in which it was found
      • starting position of the match
      • end position of the match
      • match sequence. The matching bases are shown in uppercase. The 4 flanking bases on the left and right are in lowercase.
      • matching score. In this case all scores equal 1, but we will see later how to use this column.

      Notice that positions are returned in negative coordinates, relative to the end of the sequence (the last nucleotide has position -1). This behaviour was selected with the "Origin" option in the dna-pattern form (Origin=end). This option is particularly useful for analyzing regulatory sequences, but it can be inactivated in other cases.
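The search described above can be imitated with a short script. The following is a minimal sketch, not the dna-pattern implementation: it assumes that IUPAC codes can be translated to regular-expression character classes, that the reverse strand is searched by scanning the direct strand with the reverse complement of the pattern, and that reverse-strand matches are reported with direct-strand coordinates and direct-strand sequence. The function names (find_matches, iupac_to_regex, reverse_complement) are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. Y = C or T is complemented to R = G or A).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern):
    """Return the reverse complement of an IUPAC pattern."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern (letters only) into a regular expression."""
    return "".join(IUPAC[base] for base in pattern.upper())


def find_matches(seq, pattern, pattern_id, seq_name, flank=4):
    """Return one tuple per match, with the eight columns listed above.

    Positions are relative to the end of the sequence (Origin=end),
    so the last nucleotide has position -1.
    """
    seq = seq.upper()
    rows = []
    for strand, query in (("D", pattern), ("R", reverse_complement(pattern))):
        # The lookahead lets overlapping matches be reported as well.
        regex = re.compile("(?=(" + iupac_to_regex(query) + "))")
        for m in regex.finditer(seq):
            start, end = m.start(1), m.end(1)  # end is exclusive
            left = seq[max(0, start - flank):start].lower()
            right = seq[end:end + flank].lower()
            rows.append((pattern_id, strand, pattern, seq_name,
                         start - len(seq), end - 1 - len(seq),
                         left + m.group(1) + right, 1))
    return rows


if __name__ == "__main__":
    # Made-up sequence for illustration, not a real yeast upstream region.
    demo = "TTTTGATAAGCCCCCTTATCAAAACCAATGG"
    for pat, pid in (("GATAAG", "Gata_box"), ("CCAAY", "Hap_site")):
        for row in find_matches(demo, pat, pid, "demo"):
            print("\t".join(str(field) for field in row))
```

With this convention, a match ending at the last nucleotide of the sequence gets an end position of -1, as in the dna-pattern output.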

      We will now display the same results graphically.

      1. Click on the Feature map button at the bottom of the result page.
      2. In the Title box, type

         Gata boxes and Hap sites in the upstream regions of NIT genes

      3. After the Display limits label, fill
        • the from box with -800,
        • the to box with 0

      4. In the pop-up menu "feature handle", select symbol
      5. Make sure the Dynamic map option is checked.
      6. Leave other parameters unchanged and click GO.

      After a few seconds, the feature map should appear. A few comments:

      • Gata boxes appear in blue, Hap sites in red
      • A specific symbol is associated with each pattern, so that the patterns can be distinguished when the feature map is printed in black and white
      • Colored boxes are displayed either above or below the horizontal black lines, according to the strand of the match.
      • Coordinates are given relative to the ORF start position: negative values indicate an upstream position, and positive coordinates are within the coding sequence (0 corresponds to the first nucleotide of the start codon).

      • If your browser is recent, the map is dynamic. With your mouse, position the cursor just above one of the sites in the sequences, then look at the status bar at the bottom of your browser window. The complete information about this site is displayed there. Move the cursor to another site and check that the information is updated accordingly. If you are using Internet Explorer, make sure to activate the status bar (in the View menu).

      Searching for complex patterns

        We will now show an example of a search for patterns containing spacings.

        Another characteristic of GATA boxes is that they often come clustered in the upstream region: nitrogen-responding genes usually have a pair of GATA boxes, separated by 0 to 60 base pairs. dna-pattern allows you to search for spaced motifs by using a notation called regular expressions. For example:

        • a repetition is specified by a number within curly brackets (e.g. A{6} is equivalent to AAAAAA)
        • this can be combined with the IUPAC notation to specify a fixed spacing (e.g. n{30} means a spacing of exactly 30 nucleotides)
        • a variable number of repeats can be specified by entering two numbers, separated by a comma, within the curly brackets (e.g. n{0,60} means "between 0 and 60 nucleotides")

        Run the tutorial as above, but enter the following patterns.

        	GATAAGn{0,60}GATAAG	Gata_tandem
        	CTTATCn{0,60}GATAAG	Gata_inv1
        	GATAAGn{0,60}CTTATC	Gata_inv2
        	GATAA			Gata_box
        	GATAAG			Gata_box_strict
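To see what these spaced patterns mean, the notation can be reproduced with Python's re module. This is only an illustrative sketch under the assumption that each IUPAC letter can be replaced by a character class while the curly-bracket repeat counts are left untouched; it is not how dna-pattern is implemented, and the function name spaced_pattern_to_regex is hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def spaced_pattern_to_regex(pattern):
    """Compile a pattern such as GATAAGn{0,60}GATAAG into a regular expression.

    Only letters are translated; digits, commas and curly brackets
    (the repeat counts) are passed through unchanged.
    """
    translated = re.sub(r"[A-Za-z]",
                        lambda m: IUPAC[m.group().upper()],
                        pattern)
    return re.compile(translated)


if __name__ == "__main__":
    # Made-up sequence: two GATAAG boxes separated by 10 nucleotides.
    seq = "AC" + "GATAAG" + "T" * 10 + "GATAAG" + "CC"
    regex = spaced_pattern_to_regex("GATAAGn{0,60}GATAAG")
    match = regex.search(seq)
    if match:
        print(match.start(), match.end(), match.group())
```

Note that n{0,60} is greedy in Python: when several downstream GATAAG boxes lie within 60 nucleotides, the match extends to the farthest one.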

      Counting multiple patterns in multiple sequences

        A characteristic of yeast GATA boxes is that they act in a synergistic way, i.e. nitrogen-responsive genes generally contain multiple GATA boxes in their upstream sequences. Thus, for this particular regulation, one might be interested in counting the number of matches, rather than returning their precise positions. This can be done with dna-pattern.

        1. Return to the dna-pattern form.
        2. Enter the same list of patterns as before.
        3. Deselect the checkbox match positions
        4. Select the checkbox match count table
        5. Click GO.

        The program returns a table, where each row represents a sequence and each column a pattern. Totals per row and per column are optionally included.
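The layout of such a count table can be sketched in a few lines of Python. This is a simplified illustration, not the dna-pattern code: it assumes that matches are counted on both strands and that overlapping matches are counted separately, and the names count_matches and count_table are hypothetical. The sequences in the usage example are made up.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def count_matches(seq, pattern):
    """Count matches of an IUPAC pattern on both strands of seq.

    Assumption: a palindromic pattern would be counted twice per site.
    """
    seq = seq.upper()
    direct = pattern.upper()
    reverse = direct.translate(COMPLEMENT)[::-1]
    total = 0
    for query in (direct, reverse):
        # Lookahead so that overlapping matches are all counted.
        regex = "(?=" + "".join(IUPAC[base] for base in query) + ")"
        total += len(re.findall(regex, seq))
    return total


def count_table(sequences, patterns):
    """Build a table with one row per sequence and one column per pattern.

    sequences: dict mapping sequence name to sequence
    patterns:  dict mapping pattern identifier to IUPAC pattern
    A "total" column and a "total" row are appended.
    """
    table = {}
    for name, seq in sequences.items():
        row = {pid: count_matches(seq, pat) for pid, pat in patterns.items()}
        row["total"] = sum(row.values())
        table[name] = row
    columns = list(patterns) + ["total"]
    table["total"] = {col: sum(table[name][col] for name in sequences)
                      for col in columns}
    return table


if __name__ == "__main__":
    sequences = {"geneA": "TTGATAAGCCCTTATCAA", "geneB": "CCAATCCAAC"}
    patterns = {"Gata_box": "GATAAG", "Hap_site": "CCAAY"}
    table = count_table(sequences, patterns)
    columns = list(patterns) + ["total"]
    print("\t".join(["seq"] + columns))
    for name, row in table.items():
        print("\t".join([name] + [str(row[col]) for col in columns]))
```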

        You can now return to the tutorial main page and follow the next tutorials.