NCBI Conserved Domain Database (CDD) and Setup on Google Colab

 

Introduction

The NCBI Conserved Domain Database (CDD) is a collection of annotated multiple sequence alignments representing protein domain families. It is widely used in bioinformatics to identify functional and evolutionary relationships in proteins.

One of the most efficient ways to analyze conserved domains in proteins is through RPS-BLAST (Reverse Position-Specific BLAST). In this guide, we will set up and run RPS-BLAST with CDD on Google Colab.


Why Use CDD?

  • Identifies functional protein domains in sequences.
  • Helps in annotating proteins with domain structures.
  • Supports comparative genomics and phylogenetic studies.

Setting Up CDD on Google Colab

Google Colab provides a cloud-based Linux environment, making it easy to install and run RPS-BLAST with CDD.

1. Create Required Directories

!mkdir -p /content/ncbi
!mkdir -p /content/cdd
!mkdir -p /content/query

This step ensures that:

  • BLAST executables are stored in /content/ncbi
  • CDD database files are stored in /content/cdd
  • Query sequences are saved in /content/query

2. Install RPS-BLAST

Download and install NCBI BLAST+, which includes rpsblast:

!wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
!tar -xvzf ncbi-blast-2.16.0+-x64-linux.tar.gz -C /content/ncbi --strip-components=1

Add BLAST to the system path:

import os
os.environ["PATH"] += ":/content/ncbi/bin"

Check if RPS-BLAST is installed:

!rpsblast -version

3. Download the CDD Database

CDD contains domain alignments used for RPS-BLAST searches. Download and extract the CDD database:

!wget https://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cdd_LE.tar.gz
!tar -xvzf Cdd_LE.tar.gz -C /content/cdd

4. Install rpsbproc for Processing Results

The rpsbproc tool is used to interpret and format RPS-BLAST results.

!wget https://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/RpsbProc-x64-linux.tar.gz
!tar -xvzf RpsbProc-x64-linux.tar.gz -C /content/ncbi/bin
!chmod +x /content/ncbi/bin/RpsbProc-x64-linux/rpsbproc

Confirm the installation:

!/content/ncbi/bin/RpsbProc-x64-linux/rpsbproc -help

5. Prepare a Protein Query Sequence

Create a FASTA file with a sample protein sequence:

query_fasta = """>example_protein
MAVQGPELFVRLVKPDVDYLGAGGELFDSLGKTVVKVGRGAIMPYMGAGAHTYFSNYPMFYDE
"""

with open("/content/query/protein.fasta", "w") as file:
    file.write(query_fasta)

6. Run RPS-BLAST

Now, run RPS-BLAST using the CDD database:

!rpsblast -query /content/query/protein.fasta -db /content/cdd/Cdd -out /content/query/output.asn -outfmt 11

7. Process Results with rpsbproc

Format the output for easier interpretation:

!/content/ncbi/bin/RpsbProc-x64-linux/rpsbproc -i /content/query/output.asn -o /content/query/output.txt
!cat /content/query/output.txt

This will display the conserved domain annotations found in the query protein.


Conclusion

Setting up RPS-BLAST with CDD on Google Colab is straightforward and enables powerful protein domain analysis. This approach is especially useful for biologists and bioinformaticians working with functional annotations and evolutionary studies.

By following these steps, you can quickly analyze conserved domains in protein sequences without requiring a local installation.