6 Exploring sequence data
6.1 Searching (Is this pattern in here?)
The command grep (global regular expression print) reads a file line by line and prints every line that contains a given pattern.
Let’s confirm our FASTA file has a header:
grep ">" pelagibacter_ubique.fastaThis prints every line containing >. In a single-sequence FASTA file, there is exactly one such line.
Useful grep flags
| Flag | Effect |
|---|---|
-c |
Count matching lines instead of printing them |
-i |
Case-insensitive matching |
-v |
Invert: print lines that do not match |
-n |
Show line numbers alongside matches |
How many header lines are in the file?
grep -c ">" pelagibacter_ubique.fastaFor a single genome this gives 1. In a multi-FASTA file say, all predicted proteins of an organism this tells you the number of sequences.
6.2 Counting (How much is here?)
The command wc (word count) counts lines, words and characters in a file:
wc pelagibacter_ubique.fastaThe output format is: lines words characters filename
To count only lines:
wc -l pelagibacter_ubique.fastaEach line of sequence in a FASTA file is typically 60–70 characters wide. If you subtract 1 (the header line) and multiply the remaining line count by the line length, you get a rough estimate of the genome size though the last line is usually shorter.
6.3 Piping (Chaining commands together)
A pipe (|) takes the output of one command and passes it directly as input to the next, without creating an intermediate file. Think of it as connecting two tools in a pipeline.
Let’s count only the sequence lines (excluding the header):
grep -v ">" pelagibacter_ubique.fasta | wc -lThis does two things in one step:
grep -v ">"selects all lines that are not headers| wc -lcounts those lines
You can chain as many commands as you like:
grep -v ">" pelagibacter_ubique.fasta | wc -cThis counts the total number of characters in the sequence lines a close approximation of the genome length in base pairs (note: it includes the newline character at the end of each line, so the true bp count is slightly lower).
6.4 Redirecting (Saving output to a file)
You already used > to redirect efetch output into a file. You can redirect the output of any command:
grep ">" pelagibacter_ubique.fasta > headers.txt
cat headers.txtUse >> to append to an existing file rather than overwriting it:
echo "Source: NCBI accession NC_007205.1" >> headers.txt
cat headers.txt> overwrites the destination file if it already exists. Use >> when you want to add to a file. It is an easy mistake to accidentally overwrite data you wanted to keep.