Using DNA Barcodes to Identify and Classify Living Things:
Answers to Bioinformatics Questions

I. Use BLAST to Find DNA Sequences in Databases (Electronic PCR)

    1. Why are some alignments longer than others?
      The main difference in length occurs between hits that align to both primers and those that align only to the forward or the reverse primer. The lengths and colors of the alignment bars tell how much of your query matched sequences in the database. Where both the forward and reverse primers match, you will see a black vertical line between the forward and reverse primer in the graphic summary. Typically, most of the significant alignments will have complete matches to both primers.
      • What is the E value of your most significant hit, and what does it mean? What does it mean if there are multiple hits with similar E values?
        The lowest E value obtained for a match to both primers should be in the range of 0.001 to 2e-04 (0.0002). This might seem high, but an E value is the number of matches of this quality expected to occur by chance in a search of this database, so each of these values means that such a match would be expected by chance far less than once. For example, an E value of 0.33 would mean that a single match would be expected to occur by chance once in every three searches. E values depend on the length of the search sequence, so the relatively short primers used in this experiment produce relatively high E values. Searches with longer primers or long DNA sequences return smaller E values. Multiple hits with similar E values typically come from closely related species.
      • What do the descriptions of significant hits have in common?
        For the plant primers, the sequence sources should all be chloroplast genomes. For the vertebrate, fish, and invertebrate primers, the hits should all be mitochondrial genomes. For the fungi primers, the hits should all be to the nuclear internal transcribed spacer of the 5.8S ribosomal RNA gene.
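The E value interpretation given above can be sketched in a few lines of Python. This is a minimal illustration, not part of BLAST itself: it assumes chance hits follow a Poisson distribution with mean E, so the probability of at least one chance hit is 1 - e^(-E). The function names are made up for this example.

```python
import math


def chance_hit_probability(e_value: float) -> float:
    """Probability of at least one chance hit, assuming a Poisson model with mean E."""
    return 1.0 - math.exp(-e_value)


def searches_per_chance_hit(e_value: float) -> float:
    """Average number of searches needed to see one chance hit of this quality."""
    return 1.0 / e_value


# E values mentioned in the answer above
for e in (0.33, 0.001, 0.0002):
    print(
        f"E = {e}: P(at least one chance hit) = {chance_hit_probability(e):.4f}, "
        f"about one chance hit per {searches_per_chance_hit(e):.0f} searches"
    )
```

For small E values the probability and the E value are nearly identical, which is why the two are often used interchangeably for significant hits.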
    1. Which nucleotide positions do the primers match in the subject sequence?
      The answers will vary for each hit and primer set. For Phoenix dactylifera (NC_013991.2), the plant primers match 56930-56955 and 57509-57528, respectively. For Pucrasia macrolopha (NC_020587.1), the vertebrate (non-fish) primers match 6589-6613 and 7272-7298, respectively. For Mallotus villosus (NC_015244.1), the fish primers match 5556-5584 and 6233-6258, respectively. For Candida orthopsilosis (NC_018301.1), the fungi primers match 344066-344085 and 344559-344577, respectively. For Choristoneura longicellana (NC_019996.1), the invertebrate primers match 1474-1498 and 2155-2180, respectively.
    1. What value do you get if you calculate the fragment size for other species that have matches to the forward and reverse primer? Do you get the same number?
      The products amplified with these primers will be between 450 and 800 nucleotides long. For the plant primers, using Phoenix dactylifera (NC_013991.2) as an example gives 57509 - 56955 = 554 nucleotides. These are the absolute nucleotide coordinates for this BLAST hit, and the total length will vary from hit to hit. The range in possible lengths should be between 550 and 600 nucleotides. For the other primer sets, the examples count both end positions. For the vertebrate (non-fish) primers, Pucrasia macrolopha (NC_020587.1) as an example gives 7298 - 6589 + 1 = 710 nucleotides. For the fish primers, Mallotus villosus (NC_015244.1) as an example gives 6258 - 5556 + 1 = 703 nucleotides. For the fungi primers, Candida orthopsilosis (NC_018301.1) as an example gives 344559 - 344066 + 1 = 494 nucleotides. For the invertebrate primers, Choristoneura longicellana (NC_019996.1) as an example gives 2180 - 1474 + 1 = 707 nucleotides.
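The fragment-size arithmetic can be checked with a short Python sketch. It assumes the length is measured from the first base of the forward primer match to the last base of the reverse primer match, counting both endpoints, which is why each result is one more than the plain difference of the coordinates. The helper name `span_length` is made up for this example. The plant and fungi examples in the answer use different endpoints, so they are left out here.

```python
def span_length(start: int, end: int) -> int:
    """Number of nucleotides from start to end, counting both endpoints."""
    return end - start + 1


# (forward primer start, reverse primer end) for hits listed in the answers
examples = {
    "Pucrasia macrolopha (NC_020587.1)": (6589, 7298),
    "Mallotus villosus (NC_015244.1)": (5556, 6258),
    "Choristoneura longicellana (NC_019996.1)": (1474, 2180),
}

for name, (start, end) in examples.items():
    print(f"{name}: {span_length(start, end)} nucleotides")
```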
    1. Identify the feature(s) located between the nucleotide positions identified by the primers, as determined in 3.b. above.
      Depending on the hit, the names of features may vary. However, for plant primers, the feature is usually a gene named rbcL that codes for a product called “ribulose 1,5-bisphosphate carboxylase/oxygenase large subunit.” For the vertebrate (non-fish), fish, and invertebrate primers, the feature is usually a gene named COI or COXI, which codes for cytochrome c oxidase subunit I. For the fungi primers, the feature is usually the nuclear internal transcribed spacer (ITS), a variable region that surrounds the 5.8S ribosomal RNA gene.

II. Identify Species and Phylogenetic Relationships Using DNA Subway

      • What is the error rate and accuracy associated with a Phred score of 20?
        A Phred score of 20 equals 1 error in 100 or 99% accuracy.
      • What do you notice about the electropherogram peaks and quality scores at nucleotide positions labeled "N"?
        At "N" positions, peaks representing different nucleotides have similar amplitudes (heights) and overlap, or no single peak rises above the background of lower amplitude peaks. Quality scores are very low.
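The relationship between a Phred score and the error rate follows the standard definition Q = -10 log10(P), where P is the probability that a base call is wrong. A minimal Python sketch, with function names chosen only for this example:

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Probability that a base call is correct."""
    return 1.0 - phred_to_error_probability(q)


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 error in {round(1 / p)} calls, accuracy {phred_to_accuracy(q):.1%}")
```

Each increase of 10 in the Phred score reduces the error rate tenfold.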
    2. Why is it important to remove excess Ns from the ends of the sequences?
      Each "N" is scored as a misalignment, causing experimental sequences to appear to be less related to reference sequences than they actually are. This will significantly impact tree building, potentially placing related sequences in different clades.
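The effect of terminal Ns can be illustrated with a toy Python example. This is a simplified sketch, not the scoring used by any particular alignment or tree-building program: it assumes the two sequences are already aligned without gaps and measures simple percent identity. The sequences and function names are invented for the illustration.

```python
def percent_identity(query: str, reference: str) -> float:
    """Percent of aligned positions with the same base; an N never matches A, C, G, or T."""
    if len(query) != len(reference) or not query:
        raise ValueError("sequences must be aligned, equal in length, and non-empty")
    matches = sum(q == r for q, r in zip(query, reference))
    return 100.0 * matches / len(query)


def trim_terminal_ns(query: str, reference: str) -> tuple[str, str]:
    """Drop the alignment columns where the query begins or ends with N."""
    start = len(query) - len(query.lstrip("N"))
    end = len(query.rstrip("N"))
    return query[start:end], reference[start:end]


reference = "ACGTACGTACGTACGTACGT"
# Same sequence as the reference, but with three unreadable bases at each end
query = "NNN" + reference[3:17] + "NNN"

print(percent_identity(query, reference))                      # Ns counted as mismatches
print(percent_identity(*trim_terminal_ns(query, reference)))   # after trimming
```

Before trimming, the query looks only 70% identical to a reference it actually matches perfectly; after trimming, the identity is 100%.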
    1. How does the consensus sequence optimize the amount of sequence information available for analysis? Why does this occur?
      The consensus sequence extends the length of the sequence and improves its accuracy in regions where one read is of low quality. Sequence immediately following each primer has many errors and should be trimmed from the results. The read from the opposite strand usually extends into this region and provides data for the sequence at either end of the amplicon that would otherwise be lost. Also, the sequence quality can be low at different positions because of high GC content or other characteristics of the DNA. Often, the sequence quality from one direction is better than from the other direction. By selecting the best sequence for these regions, the overall quality of the consensus will be better than that of either the forward or the reverse sequence alone.
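The idea of selecting the better read at each position can be sketched in Python. This is a deliberately simplified illustration, not the method DNA Subway uses to build a consensus. It assumes the reverse read has already been reverse-complemented and aligned to the forward read, that positions a read does not cover are written as N with quality 0, and that ties go to the forward read. All sequences and quality scores are invented.

```python
def consensus(forward: str, forward_q: list[int], reverse: str, reverse_q: list[int]) -> str:
    """At each aligned position, keep the base call with the higher quality score."""
    if not (len(forward) == len(forward_q) == len(reverse) == len(reverse_q)):
        raise ValueError("reads and quality lists must all have the same length")
    bases = []
    for f_base, f_qual, r_base, r_qual in zip(forward, forward_q, reverse, reverse_q):
        bases.append(f_base if f_qual >= r_qual else r_base)
    return "".join(bases)


# The forward read is poor just after its primer (left end);
# the reverse read is poor just after its primer (right end).
forward = "NNACGTTGCA"
forward_q = [0, 0, 15, 30, 40, 40, 40, 40, 38, 35]
reverse = "GTACGTTGNN"
reverse_q = [35, 38, 40, 40, 40, 40, 30, 15, 0, 0]

print(consensus(forward, forward_q, reverse, reverse_q))
```

Neither read alone covers all ten positions, but the consensus does, because each read supplies the end that the other one lacks.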
    2. Do differences tend to occur in certain areas of the sequence? Why?
      Differences cluster at the 5’ and 3’ ends because the sequence quality at the ends is poor.
      • Why do the most significant hits typically have E-values of 0? (This is not the case with BLAST searches with primers.) What does it mean when there are multiple BLAST hits with similar E-values?
        The lower the E-value, the lower the probability of a random match and the higher the probability that the BLAST hit is related to the query. Searching with a long (500 bp or more) barcode sequence increases the number of significant alignments with high scores compared to searches with short primers, and the resulting E-values are often so small that BLAST reports them as 0. It is common to have multiple hits with identical or very similar E-values. Of course, identical matches to the same species would be expected to have an E-value of 0. However, other hits with 0 or very low E-values are often found for members of the same genus. In some families of plants or animals, the barcode regions used in this experiment are not variable enough to make a conclusive species determination. Similar E-values would also be obtained when two sequences have the same number of sequence differences, but at different positions.
      • What causes these problems?
        The quality of sequences may be low at either end, contributing to gaps and Ns, and the sequences in the databases may also differ in length, which can lead to gaps.
      • Why is it important to remove sequence gaps and unaligned ends?
        Gaps and unaligned ends are scored as mismatches by the tree-building algorithms. This makes sequences appear less related than they actually are and can force related sequences into different clades.
      • What assumptions are made when one infers evolutionary relationships from sequence differences?
        The major assumption is that mutations occur at a constant rate; this “molecular clock” provides the measure of evolutionary time. Since branch lengths of a phylogenetic tree reflect accumulated mutations, which are read as evolutionary time, an increase in the mutation rate at some point in evolutionary time would artificially lengthen branches. If the barcode region mutates more frequently in one clade, then a larger number of differences would be incorrectly interpreted as increased phylogenetic distance between it and other clades. Also, although there is a chance that any given nucleotide has undergone multiple substitutions (for example A>T>C or A>T>A), tree-building algorithms only evaluate nucleotide positions as they occur in the sequences being compared. If the sequences being evaluated do not include a variation that happened during evolution, it will not be taken into account, and the algorithm will assume the minimum number of substitutions. Since the chance of multiple substitutions increases over time, the phylogenetic tree will tend to overestimate relatedness between distantly related species that diverged extremely long ago.
      • Why do gene and phylogenetic trees sometimes disagree?
        Traditional phylogenetic trees are primarily based on morphological (physical) features. Related clades share morphological features by descent from a common ancestor. However, unrelated groups may develop a similar morphological feature when they independently adapt to similar challenges or environments. (For example, bats and birds both have wings, but this feature arose independently in each group rather than by descent from a common winged ancestor.) Gene trees can call attention to situations, at many taxonomic levels, where morphological similarities have been misinterpreted as a close phylogenetic relationship. Also, gene trees may identify new species that cannot be differentiated by morphology alone.
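The point made earlier about multiple substitutions at the same site (A>T>C or A>T>A) can be made concrete with a small Python sketch. It uses the Jukes-Cantor correction, a standard textbook model introduced here only as an illustration; it is not stated in this exercise which correction, if any, the tree-building tools apply. The model assumes all substitutions are equally likely, and the function name is made up for this example.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site, given the observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.20, 0.50):
    d = jukes_cantor_distance(p)
    print(f"observed difference {p:.2f} -> estimated substitutions per site {d:.3f}")
```

For closely related sequences the observed and estimated values are almost the same, but the gap widens as sequences diverge. Counting only the observed differences therefore understates the distance between species that diverged long ago.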
    1. How does it compare to the maximum likelihood tree? What does this tell you?
      The trees will likely have a different arrangement of nodes and place some sequences on different nodes. This tells you that there are multiple possible solutions for most phylogenetic trees, and different algorithms will calculate different optimum trees.