Module 4

BLAST & Databases

Identification of organisms using standard databases:

What is a Database?

A database for DNA, RNA, and proteins is a specialized repository that stores biological information about these molecules. Such databases organize and provide access to the sequences, structures, functions, and interactions of nucleic acids and proteins. Key features include searchable sequences and annotations covering gene function, sequence variation, and evolutionary data. Prominent examples include GenBank for DNA sequences, UniProt for protein sequences and functions, and RNAcentral for RNA sequences. These resources are invaluable for researchers studying genetics, molecular biology, and bioinformatics, because they support data analysis, comparison, and visualization in the exploration of the molecular basis of life. In addition to these general databases, there are specialized databases for specific target markers, such as UNITE for ITS sequences and SILVA for 16S rRNA.

Correct identification is an important step before the organisms are used in any further application. The consensus sequences provided above for practice can be identified using the BLAST tool of GenBank (NCBI). The following steps can be used to identify or match the organisms against the NCBI database (Fig. 10 to 15).

Steps to be followed (Fig. 10 to 15):

1. Use the correctly curated contig sequence in FASTA format. The sequence can be copied and pasted directly into the query box of the nBLAST interface (Fig. 11).
2. The remaining parameters can be left at their defaults unless you have specific requirements for the search or organism-specific match criteria (Fig. 12 and 13). You may also try the different databases under the option “choose search set”; if you are unsure about your organism or data, please use the “standard database” option.
3. Next, select the “Highly Similar Sequences (megablast)” option under the section “program selection”.
4. Once all the parameters are set, click the “BLAST” button. A new window will appear (Fig. 13); wait for the results. The search may take a few seconds to several minutes, depending on the sequence length, similarity, and databases (Fig. 14).
5. Once the search is complete, the window will refresh automatically and the results will appear (Fig. 15).

Figure 10. GenBank – NCBI database homepage
Figure 11. NCBI BLAST tool page (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
Figure 12. nBLAST (nucleotide BLAST) webpage for assessing the similarity of an organism against the nucleotide dataset.
Figure 13. nBLAST tool for identification of ITS marker-based sequences.
Figure 14. The BLAST analysis may take a few seconds to a few minutes.
Figure 15. The nBLAST result showing the best match based on percent identity (100%) with an E-value of 0.0.
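
The preparation in steps 1 to 3 can also be sketched in code. The following is a minimal Python illustration, not the procedure used in this module: it checks that the pasted text is a single FASTA record and assembles the form fields that a megablast search would need. The field names (CMD, PROGRAM, MEGABLAST, DATABASE, QUERY) are assumed from NCBI's BLAST URL API and should be checked against the current NCBI documentation. The function names, the database value "nt", and the example sequence are hypothetical. Nothing is sent to NCBI.

```python
from urllib.parse import urlencode

# IUPAC nucleotide codes accepted in a query sequence
VALID_BASES = set("ACGTURYSWKMBDHVN")


def read_single_fasta(text):
    """Return (header, sequence) from a one-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    header = lines[0][1:]
    # Join wrapped sequence lines into one uppercase string
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("no sequence found after the header")
    bad = set(sequence) - VALID_BASES
    if bad:
        raise ValueError(f"unexpected characters in sequence: {sorted(bad)}")
    return header, sequence


def build_megablast_form(sequence, database="nt"):
    """URL-encode the form fields for a megablast search (assumed field names)."""
    return urlencode({
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastn",   # nucleotide BLAST
        "MEGABLAST": "on",     # highly similar sequences
        "DATABASE": database,  # hypothetical default; adjust as needed
        "QUERY": sequence,
    })


# Hypothetical example record, for illustration only
example = ">contig_1 hypothetical ITS consensus\nACGTTGCA\nNNACGT\n"
header, seq = read_single_fasta(example)
form = build_megablast_form(seq)
print(header)
print(seq)
print(form)
```

Validating the record before submission catches the most common paste errors (a missing header line or stray characters) before any search time is spent.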

Interpretation of the BLAST result:

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics tool used to compare nucleotide or protein sequences against a database. When interpreting a BLAST result, you should focus on several key components:

1. Query Sequence

This is the sequence you submitted for comparison. Understanding this sequence is crucial for interpreting the results accurately.

2. Subject Sequences

These are the sequences from the database that showed similarities to your query. Each hit represents a potential match.

3. E-value (Expect value)

The E-value is the number of hits with an equal or better score that one would expect to see by chance when searching a database of a given size. A lower E-value (typically less than 0.01, or close to 0) indicates a more statistically significant match.

4. Score

This reflects the quality of the alignment, taking into account matches, mismatches, and gaps. Higher scores indicate better alignments.

5. % Identities

This value shows how many of the aligned residues in the query are identical to those in the subject sequence (identities). For protein alignments, BLAST also reports how many residues are similar (positives). These values help assess the degree of similarity. A higher % identity suggests a closer evolutionary relationship and potentially shared functions.

6. Alignment

This is the actual alignment of the sequences, highlighting matches, mismatches, and gaps. Visualizing the alignment can help you understand the regions of conservation or variability.

7. Organism Information

Knowing the organism from which the matched sequences are derived can provide context and relevance, especially if you’re studying a specific biological question.

8. Accession Numbers

These are unique identifiers for the subject sequences in the database, allowing further investigation into specific sequences or publications related to them.

For each hit, the BLAST run fetches the details listed above from the database and displays them as shown in Fig. 15.
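
Two of the values described above can be reproduced by hand, which may help when reading a result page. The sketch below is illustrative only and assumes two things that are not stated in this module: percent identity is taken as identical columns divided by the total alignment length (gap columns included), and the E-value is approximated from the bit score S' as E = m × n × 2^(−S'), where m is the query length and n the database length. BLAST itself uses effective lengths, so its reported E-values will differ somewhat. The function names and example strings are hypothetical.

```python
def percent_identity(query_aln, subject_aln):
    """Percent of alignment columns with identical residues; '-' marks a gap."""
    if len(query_aln) != len(subject_aln) or not query_aln:
        raise ValueError("aligned strings must be non-empty and of equal length")
    matches = sum(
        1 for q, s in zip(query_aln, subject_aln) if q == s and q != "-"
    )
    return 100.0 * matches / len(query_aln)


def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical aligned pair: 10 columns, one gap, one mismatch
query = "ACGT-ACGTA"
subject = "ACGTTACCTA"
print(percent_identity(query, subject))

# The E-value halves for every additional bit of score
print(expect_value(40, query_len=600, db_len=1_000_000_000))
print(expect_value(41, query_len=600, db_len=1_000_000_000))
```

The second function shows why a long, near-perfect match is reported with an E-value of 0.0: the expected number of chance hits shrinks exponentially as the score grows.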

| Percent Identity | Interpretation | Significance |
|---|---|---|
| > 97% | Strong conservation | Indicates strong conservation, often same genus and species or gene families (for microbes). Please note that in case of higher eukaryotes the % similarity alone cannot define the identity. |
| 90% – 97% | Strong conservation | Indicates strong conservation, often seen in closely related genus/species or gene families. But the species-level identity will not be reliable. |
| 70% – 90% | Strong similarity | May still represent strong similarity and could suggest functional conservation between genus/families. |
| 50% – 70% | Moderate similarity | Moderate similarity that could indicate shared ancestry but may also reflect divergent evolution. |
| < 50% | Low similarity | Often indicates that the sequences are distantly related. |
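
The percent identity bands in the table can be applied to a result with a small lookup. This is a minimal Python sketch under one assumption: the table's ranges share their boundary values (90%, 70%, 50%), so the sketch assigns each boundary to the higher band, while exactly 97% falls in the 90% to 97% band because the top band is strictly greater than 97%. The function name is hypothetical, and the labels are copied from the table.

```python
def interpret_identity(pident):
    """Map a percent identity (0-100) to (band, interpretation) from the table."""
    if not 0 <= pident <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if pident > 97:
        return ("> 97%", "Strong conservation")
    if pident >= 90:
        return ("90-97%", "Strong conservation")
    if pident >= 70:
        return ("70-90%", "Strong similarity")
    if pident >= 50:
        return ("50-70%", "Moderate similarity")
    return ("< 50%", "Low similarity")


# Hypothetical percent identity values from a set of hits
for value in (100.0, 97.0, 85.5, 62.0, 41.3):
    band, label = interpret_identity(value)
    print(f"{value:6.1f}%  {band:8s} {label}")
```

As the table notes, a band is only a guide: for higher eukaryotes the percent identity alone cannot define the identity, so the label should be read together with the E-value, score, and alignment.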

If all the above steps are followed correctly, the organisms can be identified. However, please note that if your organism is novel, or the database does not contain enough information about it, the parameters described above may show poor values. To strengthen the identification further and avoid ambiguity, use multiple marker-based identification or an NGS approach.
For more details, please feel free to connect with us: higx360@himedialabs.com