5 Fetching sequence data

⏳ Time

Teaching: 15 min
Exercises: 10 min

🤔 Questions

What databases does NCBI maintain, and how do I navigate them?
How do I find the taxonomy identifier for an organism?
How do I download a genome sequence from the command line?

🎯 Objectives

Describe the major NCBI databases and how they relate to each other
Use esearch and efetch to query the taxonomy and nucleotide databases
Retrieve a genome in FASTA format and verify the download
Use curl as a backup download method

5.1 The NCBI databases

The National Center for Biotechnology Information (NCBI) maintains a large family of interconnected biological databases. They can all be queried from the command line using the Entrez Direct tools we installed in the previous chapter (Kans 2013).

The most commonly used databases in genomics work are:

Database	Contents	`esearch -db` name
Nucleotide	DNA and RNA sequences (GenBank, RefSeq)	`nucleotide`
Protein	Protein sequences	`protein`
Genome	Whole genome assemblies	`genome`
Taxonomy	Organism classification and IDs	`taxonomy`
SRA	Raw sequencing reads	`sra`
PubMed	Biomedical literature	`pubmed`
Assembly	Genome assembly records	`assembly`

These databases are cross-linked: a genome record links to its taxonomy entry, its raw reads in SRA, and the papers that described it.

Tip

You can list all available databases with:

einfo -dbs

5.2 Finding an organism’s taxonomy ID

Every taxon in NCBI, whether a strain, a species, or a genus, has a unique taxonomy identifier (TaxID). This number is stable and is used across all NCBI databases to refer to an organism unambiguously.

Let’s try to find the TaxID for Pelagibacter ubique:

esearch -db taxonomy -query "Pelagibacter ubique"

You should see something like:

<ENTREZ_DIRECT>
  <Db>taxonomy</Db>
  <Count>0</Count>
  ...
</ENTREZ_DIRECT>

A count of 0. The command ran without error, but found nothing. This is an important lesson: the absence of an error message does not mean the result is correct. Always check whether the output actually makes sense.

Warning

Command-line tools fail silently far more often than they fail loudly. A zero result, an empty file, or an unexpectedly small file are all signs that something may have gone wrong. Always sanity-check your output.

In this case, the issue is the Candidatus prefix. EDirect’s taxonomy search does not index Candidatus names reliably, so the full species name returns nothing.
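The zero-count check described above can be scripted so that an empty search result is never missed. The sketch below assumes the esearch output contains a single `<Count>` element, as in the example output shown earlier. `get_count` and `require_hits` are hypothetical helper names introduced here for illustration; they are not part of Entrez Direct.

```shell
# Hypothetical helpers (not part of Entrez Direct).

# get_count: read esearch XML on stdin and print the number inside <Count>.
get_count() {
  sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# require_hits: succeed only if the count read from stdin is greater than zero.
require_hits() {
  count=$(get_count)
  if [ -z "$count" ] || [ "$count" -eq 0 ]; then
    echo "WARNING: search returned no records" >&2
    return 1
  fi
  echo "Found $count record(s)"
}

# Offline demonstration using XML like the example above (no network needed).
sample='<ENTREZ_DIRECT>
  <Db>taxonomy</Db>
  <Count>0</Count>
</ENTREZ_DIRECT>'
printf '%s\n' "$sample" | require_hits || echo "Check the query before continuing"
```

In a live session the same check could be attached to a real search, for example `esearch -db taxonomy -query "Pelagibacter ubique" | require_hits`.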

Cross-checking with the web portal

This is exactly the kind of situation where stepping away from the command line and using the NCBI web portal is the right move. Open your browser and go to:

https://www.ncbi.nlm.nih.gov/taxonomy

Search for Pelagibacter ubique there. You will find the entry for Candidatus Pelagibacter ubique HTCC1062 with TaxID 335992.

Note

The Candidatus designation indicates a microorganism that has been characterised genetically but could not be cultivated in pure culture under standard laboratory conditions. P. ubique was finally brought into pure culture in 2002, after years of failed attempts. Using a TaxID rather than the species name sidesteps naming convention issues entirely, which is why TaxIDs are preferred in scripted workflows.

Searching by genus

Back on the command line, searching at genus level works:

esearch -db taxonomy -query "Pelagibacter" \
  | efetch -format docsum \
  | xtract -pattern DocumentSummary -element TaxId ScientificName Rank

This confirms the TaxID 335992 we found on the web portal. From here, we use the TaxID for everything downstream.

5.3 How many genomes are available?

Now that we have the TaxID, let’s ask NCBI how many nucleotide sequences exist for this organism. We use txid335992[ORGN] rather than the species name, which avoids the Candidatus indexing problem entirely:

esearch -db nucleotide -query "txid335992[ORGN]" | grep "<Count>"

We can narrow down to complete genomes using the [TITL] field tag:

esearch -db nucleotide -query "txid335992[ORGN] AND complete genome[TITL]" \
  | grep "<Count>"

The [ORGN] and [TITL] in brackets are search field tags that restrict where the query term is matched – the same syntax used in the NCBI web search interface, just expressed on the command line.
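Because a query is just a string, it can be assembled in a variable before being passed to esearch, which helps when the same search is repeated for several organisms. This is a minimal sketch; `build_query` is a hypothetical helper name, and it only covers the TaxID-plus-title pattern used in this section.

```shell
# Hypothetical helper: compose an Entrez query from a TaxID and a title phrase.
build_query() {
  taxid="$1"   # numeric TaxID, e.g. 335992
  title="$2"   # phrase to match in the record title
  printf 'txid%s[ORGN] AND %s[TITL]\n' "$taxid" "$title"
}

query=$(build_query 335992 "complete genome")
echo "$query"
# prints: txid335992[ORGN] AND complete genome[TITL]
```

The resulting string can then be used as `esearch -db nucleotide -query "$query" | grep "<Count>"`. Keeping the double quotes around `$query` matters, since the query contains spaces.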

🚀 Bonus: for those who want more

Try querying for other SAR11 relatives. How many complete genomes exist for the broader SAR11 clade?

esearch -db nucleotide -query "SAR11[ORGN] AND complete genome[TITL]" \
  | grep "<Count>"

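You can also search by TaxID with the txid prefix. An [ORGN] search on a TaxID includes every taxon beneath it, so a species- or genus-level TaxID covers all of its strains, whereas txid335992 refers to strain HTCC1062 itself: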

esearch -db nucleotide -query "txid335992[ORGN]" | grep "<Count>"

5.4 The organism: Pelagibacter ubique HTCC1062

We will work with strain HTCC1062, the first SAR11 isolate ever brought into pure culture. Its genome (accession NC_007205.1) is small at ~1.3 Mb and extraordinarily well-studied.

Note

Genome streamlining is an evolutionary strategy common in bacteria living in stable, nutrient-poor environments like the open ocean. P. ubique has one of the smallest genomes of any free-living organism, with almost no non-coding DNA and very few regulatory elements: every gene kept is there for a reason.

5.5 The FASTA format (What does sequence data look like?)

A FASTA file is one of the simplest and most widely used formats in bioinformatics. Each sequence entry has two parts:

A header line starting with >, containing the sequence identifier and optional description
One or more lines of sequence data (60-70 characters per line by convention)

>NC_007205.1 Candidatus Pelagibacter ubique HTCC1062, complete sequence
AGTTTTCGAATTTGAATTTTAAGAAGTTTCGAATTTGAATTTGAAGAAATTTCGAATTT
GAATTTGAAAAAGTTTCGAATTTGAATTTGAAGAAATTTCGAATTTGAATTTGAAAAAG
...

A FASTA file can hold a single sequence (as here) or many sequences. The latter is called a multi-FASTA file and is common for collections of genes or protein sequences.
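The two-part structure makes FASTA files easy to inspect with standard tools: header lines start with > and everything else is sequence. The sketch below counts sequences and bases in a small made-up file. `fasta_stats` is a hypothetical helper name, and the toy sequences are invented for the demonstration.

```shell
# Hypothetical helper: report the number of sequences and total bases in a FASTA file.
fasta_stats() {
  awk '
    /^>/ { n++; next }          # header line: count one more sequence
    { bases += length($0) }     # sequence line: add its length
    END { printf "%d sequences, %d bases\n", n, bases }
  ' "$1"
}

# Small made-up multi-FASTA file for demonstration (not real data).
demo=$(mktemp)
cat > "$demo" <<'EOF'
>seq1 first toy sequence
ACGTACGT
ACGT
>seq2 second toy sequence
GGCC
EOF

fasta_stats "$demo"
# prints: 2 sequences, 16 bases
rm -f "$demo"
```

For a quick sequence count alone, `grep -c ">" file.fasta` gives the same number, provided > appears only at the start of header lines.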

5.6 Fetching (Downloading from NCBI)

Let’s move into our data directory:

cd ~/training/data

Method 1: `efetch`

If Entrez Direct is installed (see Chapter 4), fetch the genome directly:

efetch -db nucleotide -id NC_007205.1 -format fasta > pelagibacter_ubique.fasta

Part	Meaning
`efetch`	The NCBI fetch tool
`-db nucleotide`	The nucleotide database
`-id NC_007205.1`	The accession number
`-format fasta`	Output in FASTA format
`> pelagibacter_ubique.fasta`	Redirect output to a file

Note

The > symbol redirects command output into a file instead of printing to the screen. We explore this in depth in Chapter 6.

Method 2: `curl` (backup for all systems)

If efetch is not available, download directly from GitHub:

curl -L -o pelagibacter_ubique.fasta \
  https://raw.githubusercontent.com/clarajegousse/training-data/main/pelagibacter_ubique.fasta

Flag	Meaning
`-L`	Follow redirects
`-o pelagibacter_ubique.fasta`	Save to this filename

Tip

wget does the same job for simple downloads and is common on Linux and in MobaXterm. Note that it takes an uppercase -O for the output filename, where curl uses a lowercase -o:

wget -O pelagibacter_ubique.fasta \
  https://raw.githubusercontent.com/clarajegousse/training-data/main/pelagibacter_ubique.fasta
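The choice between the download methods can be made automatically by testing which tool is installed. The sketch below is one way to do that, assuming the accession and URL used in this chapter. `fetch_genome` is a hypothetical wrapper name, and it only falls back when a tool is missing, not when a download fails partway.

```shell
# Hypothetical wrapper: use efetch if installed, otherwise curl, otherwise wget.
# The accession and URL are the ones used in this chapter.
fetch_genome() {
  out="${1:-pelagibacter_ubique.fasta}"   # output filename, with a default
  url="https://raw.githubusercontent.com/clarajegousse/training-data/main/pelagibacter_ubique.fasta"
  if command -v efetch >/dev/null 2>&1; then
    efetch -db nucleotide -id NC_007205.1 -format fasta > "$out"
  elif command -v curl >/dev/null 2>&1; then
    curl -L -o "$out" "$url"
  elif command -v wget >/dev/null 2>&1; then
    wget -O "$out" "$url"
  else
    echo "No download tool found (efetch, curl, wget)" >&2
    return 1
  fi
}
```

Calling `fetch_genome` with no argument from the data directory would write pelagibacter_ubique.fasta there. `command -v` prints nothing useful here; only its exit status is used, to test whether each tool exists.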

5.7 Verifying (Did it work?)

Always check the file exists and has a sensible size:

ls -lh pelagibacter_ubique.fasta

You should see around 1.3M, which matches the ~1.3 Mb genome at one byte per base plus line breaks. A file of 0 bytes or a few hundred bytes means something went wrong.

Peek at the beginning:

head pelagibacter_ubique.fasta

You should see a header starting with >NC_007205.1 followed by lines of A, T, G, C.
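These manual checks can be bundled into one function for reuse in scripts. The sketch below tests that the file is non-empty, starts with a > header, and contains only the expected letters. `verify_fasta` is a hypothetical helper name, and the assumption that sequence lines contain only A, C, G, T, or N is a simplification: a file with other ambiguity codes would be flagged even though it is valid FASTA.

```shell
# Hypothetical helper: basic sanity checks on a downloaded nucleotide FASTA file.
verify_fasta() {
  file="$1"
  # -s is true only if the file exists and is not empty
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty" >&2
    return 1
  fi
  # The first line of a FASTA file must be a header
  if ! head -n 1 "$file" | grep -q '^>'; then
    echo "FAIL: $file does not start with a > header" >&2
    return 1
  fi
  # Count sequence lines containing anything other than A, C, G, T, N
  # (assumption: no other ambiguity codes are expected in this genome)
  bad=$(grep -v '^>' "$file" | grep -c '[^ACGTNacgtn]' || true)
  if [ "$bad" -gt 0 ]; then
    echo "FAIL: $bad sequence line(s) contain unexpected characters" >&2
    return 1
  fi
  echo "OK: $file passed the basic checks"
}

# Offline demonstration on a made-up file (not real data).
good=$(mktemp)
printf '>toy example\nACGTACGT\n' > "$good"
verify_fasta "$good"
rm -f "$good"
```

On the real download this would be run as `verify_fasta pelagibacter_ubique.fasta`. An HTML error page saved under the FASTA filename, a common silent failure, is caught by the header check.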

✏️ Exercise 5.1

Search NCBI taxonomy to confirm the TaxID of Pelagibacter ubique
Check how many complete genome sequences exist for this organism
Download the genome using either efetch or curl
Verify the download with ls -lh and head -n 5

Solution

# 1. Taxonomy search (genus level, then cross-check on web portal)
esearch -db taxonomy -query "Pelagibacter" \
  | efetch -format docsum \
  | xtract -pattern DocumentSummary -element TaxId ScientificName Rank

# 2. Genome count using TaxID
esearch -db nucleotide \
  -query "txid335992[ORGN] AND complete genome[TITL]" \
  | grep "<Count>"

# 3. Download
efetch -db nucleotide -id NC_007205.1 -format fasta > pelagibacter_ubique.fasta
# or:
curl -L -o pelagibacter_ubique.fasta \
  https://raw.githubusercontent.com/clarajegousse/training-data/main/pelagibacter_ubique.fasta

# 4. Verify
ls -lh pelagibacter_ubique.fasta
head -n 5 pelagibacter_ubique.fasta

🔑 Key points

NCBI maintains interconnected databases for sequences, genomes, taxonomy, literature, and raw reads
Every organism has a unique TaxID; esearch -db taxonomy retrieves it
esearch counts and filters records; efetch downloads them
FASTA is a simple two-part format: a > header line followed by sequence data
Always verify downloads with ls -lh and head
