Module 3

Data Analysis & Identification Process

Process of Identification

The following methodologies can be followed for 16S rRNA/ITS/18S/custom markers as a tool to identify the bacterial/fungal/any organisms, respectively:

1. Source of Sample:

Environmental samples, Soil, Water, Clinical, or any place where microbes can exist.

2. Collection/ isolation of pure culture:

Thes samples can be collected in various forms but must be screened further and the pure cultures should be isolated using different isolation methods in petri plates/ slants/ suspension etc. One must remember that in case of Sanger’s Sequencing, the samples must be a pure culture without any contamination.

3. Selection of PCR Primers and amplification:

This is a crucial step that plays an important role in the identification process. Selection of the correct marker depends on type of organisms that need to be identified. According to the International Barcode of Life (iBOL), certain DNA regions (markers) have unique patterns and can be used as DNA barcodes for organism identification. There are specific markers designated as universal barcodes for bacteria and fungi, such as 16S rRNA and ITS, respectively.

The total length of the 16S rRNA gene is generally around 1500 bp, though it varies slightly from organism to organism. Since the gene length is approximately 1500 bp, it is possible to amplify the entire gene in one go using forward and reverse primers, such as 27F and 1492R. To obtain the complete sequence of 1500 bp, sequencing PCR requires multiple primers (up to 3-4 primers), such as 27F, 704F, and either 907R or 1492R. This increase in the number of primers is due to the limitations of Sanger sequencing, which can produce clean sequences of a maximum length of 800-900 bp.

In contrast, ITS sequences typically range from 500 to 800 bp in length and can be sequenced using two universal primers, such as ITS1 or ITS5 as forward primers and ITS4 as the reverse primer. Additionally, customized primers can be used to sequence specific regions of interest.

4. Sanger Sequencing:

The Sanger sequencing PCR can be set using Big-Dye terminator kit provided by Applied Biosystem/ Thermo. Following Sequencing PCR, the samples can be run on any of the variants of sanger sequencer available by AB such as ABI 3500XL Genetic Analyzer/ SeqStudio Genetic Analyzers and the raw data can be exported in the form of .ab1 (Applied Biosystem sequence data file) file and .fasta file format.

5. Data quality check (QC) check and generating FASTA files:

The quality of data can be tested using various available software/ application. The software can be accessed freely / paid subscription based. During QC of Sanger Sequencing data, the following steps can be followed.

Before we move forward, one must know what FASTA is:

FASTA format: The fasta is a unique format that can be utilized by various bioinformatics tools to read/ understand the genetic codes. In this case the format is represented in two lines (Fig. 4):

  • Line 1 – it always starts with > sign followed by unique identifier/ organism name followed by marker/ gene name and any other required details.
  • Line 2 – it always starts with the actual sequences such as A/T/G/C, the sequences may appear in multiple lines based on the sequence length, but the new sequence can be added only after next > sign.
Figure 4 - FASTA Format
Figure 4. The fasta format representing the ITS sequences of Aspergillus brasiliensis.

To analyze the data provided for practice, one can follow the steps below:

Steps to be followed:

Software/ applications to be used: Tracy (by Applied Biosystem)/ Chromas/ BioEdit/ FinchTV, etc. (Try any of the above or any alternate software and install them in your system Windows/ MAC/ Linux based on the compatibility of application).

a. Open the forward and reverse primer sequence files in Chromas 2.6.6
b. Check the sequence quality using the Quality box (trace QC) and Peak intensity (fig. 5, 6 and 7).
Note: In case the quality box is filled below 50% and the peak intensity is not good (over lapping/ mixed peaks), the data/ bases should be verified carefully using both the sequences, if this issue persist frequently throughout the sequence (other than both ends), one must discard the data and go for resequencing.
c. Remove the poor-quality bases from both the ends using delete or backspace option (the poor bases at the end may range up to +/- 50bp).
d. Export the data using EXPORT option with save as text (type as text/ fast format).
e. Please repeat the same process for reverse primer
f. During reverse primer, please make sure that sequences are reverse complimented (Reverse option mentioned in the tool OR Edit options OR shortcut (Ctrl+=)) before exporting them in fasta or text format.
Figure 5
Figure 5. Initial sequences of raw .ab1 files (the yellow part can be directly trimmed off to remove poor quality bases from start and end of the data).
Figure 6
Figure 6. Clean data after removing initial few bases from the raw files.
Figure 7
Figure 7. The export option from Chromas 2.6.6. The red dotted line denotes various option and red arrow highlights the reverse complement option in Chromas 2.6.6.

6. Assembly/ consensus contig sequence:

To generate the consensus sequences/ contigs, you may use any Multiple Sequence alignment tools offered by various databases or software such as ClustalW, EMBL, MEGA, BioEdit, etc. In this process we have used EMBL webpage to align and generate the consensus sequences (Fig. 8 and 9). For practice we used ITS (ITS1 and ITS4) (the following steps will remain same for all primers).

a. During the alignment process one should make sure that both the primers do have overlapping regions, otherwise there will not be a proper alignment.

b. The aligned files can be cross verified with the chromatograms and the final consensus sequences can be generated either by copying extra flanking regions from forward primer to reverse or vice-versa.

Figure 8
Figure 8. EMBL webpage (https://www.ebi.ac.uk/jdispatcher/msa/muscle?stype=protein) interface to paste/ upload the data to be aligned.
Figure 9
Figure 9. The alignment of forward and reverse primer-based sequences to generate the consensus sequence (EMBL muscle alignment tool was used to generate this data).

5.1. Possible errors and troubleshooting:

a. Primers do not show overlapping regions

In case the primers do not show overlapping regions, this could happen due to following possibilities:

  • i. Missing overlapping primers
  • ii. Presence of homopolymers
  • iii. Poor quality sequences
  • iv. Selection of gene/ marker with multicopy gene
  • v. Low primer efficiency with target gene/ marker
b. Mixed peaks / high background noise in chromatograms

If sequence quality is not good or some of the bases depict mixed peaks/ high background noise in chromatograms, the alignment might appear with some ambiguities. Also, there is possibility of some other IUPAC based alphabets than standard A/T/G/C. Placement of such alphabet indicates uncertain bases or possible in-dels present in the sequenced region and such sequences must be verified followed by re-sequencing of same region.