Gene & Genome Sequencing

Sequence output

Sequencing Quality
A sample ab1 file displaying the base calls, the chromatogram, and the quality score for each base. Notice the poor quality in the red box and the corresponding peaks and base calls.

Next-generation sequencing methods write their output in the fastq format. Each read occupies four lines. The first line is the header, or title line, which begins with ‘@’ (remember that a fasta header begins with ‘>’). The second line is the raw sequence, as in fasta. The third line is a separator that begins with ‘+’ and may optionally repeat the header text; it carries no additional information. The fourth line is a string of symbols exactly as long as the sequence line, and it encodes the quality score of each base call. As with Sanger sequencing, a base call may be ambiguous, and the certainty of each call is recorded in its quality score.

Fastq
Sample fastq file displaying 5 short read sequences
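The four-line record layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production parser: the names `FastqRecord` and `read_fastq` are hypothetical, the two example reads are invented for demonstration, and the sketch assumes each sequence and quality string sits on a single line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # title line without the leading '@'
    sequence: str  # raw base calls
    quality: str   # one quality symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a fastq stream, four lines at a time."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a blank line) stops the sketch
            return
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Invented example data: two short reads, not taken from a real run.
EXAMPLE = """\
@read1
GATTACA
+
IIIIHH#
@read2
ACGT
+read2
!5I?
"""

records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.header, record.sequence, record.quality)
```

For real data, an established library such as Biopython is usually a better choice than a hand-written parser.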

Phred scores were developed to assess the quality of base calls arising from fluorescent Sanger sequencing during the Human Genome Project. The phred program scans the peaks of the chromatogram and scores each call according to its estimated accuracy. The scores are logarithmic: a score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 represents 99% accuracy of the base call and a score of 30 represents 99.9% accuracy.

Phred Table
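The logarithmic relationship between a phred score and the accuracy of a base call can be made concrete with a short Python sketch. The function names are hypothetical, and the decoding step assumes the common encoding in which each quality symbol is the ASCII character with code Q + 33; older files may use a different offset.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Convert a fastq quality string to phred scores (assumes offset 33)."""
    return [ord(symbol) - offset for symbol in quality]


for q in (10, 20, 30, 40):
    accuracy = 1 - phred_to_error_prob(q)
    print(f"Q{q}: accuracy = {accuracy:.4%}")

print(decode_quality("I5!"))  # 'I' -> 40, '5' -> 20, '!' -> 0
```

Under this encoding, the symbol ‘I’ marks a very confident call (Q40, 99.99% accuracy), while ‘!’ marks a call with no confidence at all (Q0).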

Using the phred scores encoded in the fourth line of each fastq record, poor-quality reads can be removed. A program like FastQC assesses the reads and produces a graphical representation of their quality.

FastQC
FastQC quality output illustrating the phred score for each base call. This short read sequence of about 100 nucleotides has every base called with a score greater than 30, or greater than 99.9% accuracy.
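As a rough illustration of how poor-quality reads might be removed, the sketch below keeps only reads whose mean phred score meets a threshold. It does not reproduce how FastQC or any particular trimming tool works: the function names, the threshold of 20, the ASCII offset of 33, and the three example reads are all assumptions chosen for demonstration.

```python
def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean phred score of a quality string (assumes offset 33)."""
    if not quality:
        return 0.0
    scores = [ord(symbol) - offset for symbol in quality]
    return sum(scores) / len(scores)


def filter_reads(reads, min_mean_q: float = 20.0):
    """Keep (header, sequence, quality) tuples with mean score >= min_mean_q."""
    return [read for read in reads if mean_quality(read[2]) >= min_mean_q]


# Invented example reads: (header, sequence, quality string).
reads = [
    ("read1", "GATTACA", "IIIIHH#"),
    ("read2", "ACGT", "!5I?"),
    ("read3", "TTGA", "##5#"),
]

kept = filter_reads(reads)
for header, sequence, quality in kept:
    print(header, sequence, f"mean Q = {mean_quality(quality):.1f}")
```

A mean score is a simple heuristic. Dedicated tools often trim low-quality bases from the ends of a read instead of discarding the whole read.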

 

Assembly and Alignment

Short reads must be assembled into a usable sequence. After adapter sequences are trimmed using automated methods, a reference genome may aid in the assembly. If there is no reference genome, the genome of a related species may be used, or a more computationally intensive process of de novo assembly must take place. With de novo assembly, it may be useful to have some long reads, such as those generated with PacBio, to serve as scaffolds for assembling the short reads into contiguous sequences, or contigs.
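The core idea behind building contigs without a reference can be sketched as merging reads whose ends overlap. The toy Python example below joins two reads when the end of one matches the start of the other. It is an illustration only: real assemblers handle sequencing errors, reverse complements, repeats, and millions of reads, and the function names, the minimum overlap of 3, and the example reads are assumptions.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap by at least min_overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


# Invented reads that share the four bases "GTAC".
contig = merge_reads("ATGCGTAC", "GTACCTGA")
print(contig)
```

Here the two 8-base reads share a 4-base overlap, so they merge into a single 12-base contig.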
Mapping Reads