6  Exploring sequence data

Note⏳ Time
  • Teaching: 8 min
  • Exercises: 5 min
Note🤔 Questions
  • How do I search for patterns inside a file?
  • How do I count lines, words and characters?
  • How do I chain commands together?
Note🎯 Objectives
  • Use grep to search for patterns in a FASTA file
  • Use wc to count lines and estimate sequence length
  • Use the pipe | to chain commands together
  • Redirect output to a file with > and >>

6.1 Searching (Is this pattern in here?)

The command grep (global regular expression print) reads a file line by line and prints every line that contains a given pattern.

Let’s confirm our FASTA file has a header:

grep ">" pelagibacter_ubique.fasta

This prints every line containing >. In a single-sequence FASTA file, there is exactly one such line.

Useful grep flags

Flag Effect
-c Count matching lines instead of printing them
-i Case-insensitive matching
-v Invert: print lines that do not match
-n Show line numbers alongside matches

How many header lines are in the file?

grep -c ">" pelagibacter_ubique.fasta

For a single genome this gives 1. In a multi-FASTA file say, all predicted proteins of an organism this tells you the number of sequences.


6.2 Counting (How much is here?)

The command wc (word count) counts lines, words and characters in a file:

wc pelagibacter_ubique.fasta

The output format is: lines words characters filename

To count only lines:

wc -l pelagibacter_ubique.fasta
Note

Each line of sequence in a FASTA file is typically 60–70 characters wide. If you subtract 1 (the header line) and multiply the remaining line count by the line length, you get a rough estimate of the genome size though the last line is usually shorter.


6.3 Piping (Chaining commands together)

A pipe (|) takes the output of one command and passes it directly as input to the next, without creating an intermediate file. Think of it as connecting two tools in a pipeline.

Let’s count only the sequence lines (excluding the header):

grep -v ">" pelagibacter_ubique.fasta | wc -l

This does two things in one step:

  1. grep -v ">" selects all lines that are not headers
  2. | wc -l counts those lines

You can chain as many commands as you like:

grep -v ">" pelagibacter_ubique.fasta | wc -c

This counts the total number of characters in the sequence lines a close approximation of the genome length in base pairs (note: it includes the newline character at the end of each line, so the true bp count is slightly lower).


6.4 Redirecting (Saving output to a file)

You already used > to redirect efetch output into a file. You can redirect the output of any command:

grep ">" pelagibacter_ubique.fasta > headers.txt
cat headers.txt

Use >> to append to an existing file rather than overwriting it:

echo "Source: NCBI accession NC_007205.1" >> headers.txt
cat headers.txt
Warning

> overwrites the destination file if it already exists. Use >> when you want to add to a file. It is an easy mistake to accidentally overwrite data you wanted to keep.


Caution✏️ Exercise 6.1
  1. How many sequence lines (non-header lines) are in pelagibacter_ubique.fasta?
  2. Use a pipe to count the total number of characters in the sequence lines (excluding the header). How does this compare to the expected genome size of ~1,308,759 bp?
# Count sequence lines
grep -v ">" pelagibacter_ubique.fasta | wc -l

# Count sequence characters
grep -v ">" pelagibacter_ubique.fasta | wc -c

The character count will be slightly higher than 1,308,759 because wc -c also counts the newline character (\n) at the end of each line. Subtract the number of sequence lines from the character count to get the true base pair count.

Important🚀 Bonus: for those who want more

The -o flag in grep prints only the matched portion of each line one occurrence per line. Use it to count how many times the start codon ATG appears in the genome:

grep -o "ATG" pelagibacter_ubique.fasta | wc -l

Then try searching for the sigma-70 promoter consensus TATAAT:

grep -c "TATAAT" pelagibacter_ubique.fasta

This is a naive search it finds the string anywhere in the file, in any reading frame, and on the forward strand only. Real promoter and gene prediction requires more sophisticated tools. But it demonstrates how quickly you can ask biological questions with just a few shell commands.


Tip🔑 Key points
  • grep searches for patterns; -c counts matches; -v inverts the search; -o prints only the match
  • wc -l counts lines; wc -c counts characters
  • The pipe | passes output from one command directly to the next no intermediate file needed
  • > redirects output to a file (overwrites); >> appends without overwriting