Sequence File Formats — FASTA, FASTQ and GenBank
Most sequence data is exchanged as plain text in one of a few standard formats. The three you will meet most often are FASTA (sequences), FASTQ (sequencing reads with quality scores) and GenBank (annotated records). Here is what each looks like and when to use it.
FASTA
The simplest format: each record is a header line starting with > and an identifier, followed by the sequence on one or more lines. No quality scores. Used for reference genomes, genes and protein sequences.
>seq1 example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
ATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTG
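The record layout described above (a > header followed by sequence lines that may wrap) can be read with a short generator. This is a minimal sketch rather than a full parser: the names read_fasta and split_header are hypothetical, and splitting the header at the first whitespace into identifier and description is an assumption based on common practice.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a header (without the leading '>') into identifier and description."""
    parts = header.split(None, 1)
    ident = parts[0] if parts else ""
    desc = parts[1].strip() if len(parts) > 1 else ""
    return ident, desc


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Wrapped sequence lines are joined into one string.
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


example = """>seq1 example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
ATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTG
"""

records = list(read_fasta(io.StringIO(example)))
for ident, desc, seq in records:
    print(ident, desc, len(seq))
```

Both sequence lines belong to the single record seq1, so the generator returns one record whose sequence is the two lines joined together.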
FASTQ
The standard for raw sequencing reads. Each record is four lines: an @ header, the sequence, a + separator, and a line of Phred quality characters — one per base, so the sequence and quality lines are always the same length.
@read1
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
+
IIIIIIIIIIIIIIHHHHHHGGGGGFFFFEEEDDDCCCB
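The four-line layout and the same-length rule can be checked directly while reading. The sketch below assumes the strict four-line form described above and a Phred offset of 33 (the Sanger / Illumina 1.8+ convention); older data may use a different offset. The names read_fastq, error_probability and PHRED_OFFSET are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, quality scores) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header.strip())
        # Each quality character encodes one score: ASCII code minus the offset.
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield header[1:].strip(), seq, scores


def error_probability(q: int) -> float:
    """Convert a Phred score Q to an error probability, 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


example = "@read1\nATGGCCATTG\n+\nIIIIIHHGFB\n"
reads = list(read_fastq(io.StringIO(example)))
for name, seq, scores in reads:
    print(name, seq, scores)
```

With offset 33 the character I decodes to a score of 40, which corresponds to an error probability of 1 in 10,000 for that base.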
GenBank
An annotated format. Beyond the sequence (in the ORIGIN block) it carries structured metadata — a LOCUS line, DEFINITION and a FEATURES table describing genes, CDS and their coordinates. Records end with //.
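To show how these parts fit together, here is a minimal sketch that pulls the LOCUS name, the DEFINITION, the feature keys with their locations and qualifiers, and the sequence out of a single record. It is deliberately incomplete: it assumes one-line locations and one-line qualifiers, ignores every other header field, and the function name parse_genbank_record and the dictionary layout are hypothetical. An established parser is the better choice for real files.

```python
import re
from typing import Any, Dict, List


def parse_genbank_record(text: str) -> Dict[str, Any]:
    """Parse a single, simple GenBank record into a dictionary (sketch only)."""
    record: Dict[str, Any] = {"locus": None, "definition": None,
                              "features": [], "sequence": ""}
    section = None
    seq_parts: List[str] = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("LOCUS"):
            parts = line.split()
            record["locus"] = parts[1] if len(parts) > 1 else None
            section = None
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
            section = None
        elif line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line and not line[0].isspace():
            section = None  # some other top-level keyword; ignored here
        elif section == "features" and line.strip():
            stripped = line.strip()
            if stripped.startswith("/"):
                # Qualifier line such as /gene="example" for the latest feature.
                if record["features"]:
                    key, _, value = stripped[1:].partition("=")
                    record["features"][-1]["qualifiers"][key] = value.strip('"')
            else:
                parts = stripped.split(None, 1)
                if len(parts) == 2:
                    record["features"].append(
                        {"type": parts[0], "location": parts[1], "qualifiers": {}})
        elif section == "origin":
            # Drop the position numbers and spaces, keep only the bases.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
    record["sequence"] = "".join(seq_parts)
    return record


sample = """\
LOCUS       SEQ1                      39 bp    DNA     linear       01-JAN-2026
DEFINITION  Example sequence record.
FEATURES             Location/Qualifiers
     CDS             1..39
                     /gene="example"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatag
//
"""

rec = parse_genbank_record(sample)
print(rec["locus"], rec["features"], len(rec["sequence"]))
```

The sketch relies on the indentation of the FEATURES table to tell feature lines from top-level keywords, so it will not work on text whose leading whitespace has been collapsed.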
LOCUS       SEQ1                      39 bp    DNA     linear       01-JAN-2026
DEFINITION  Example sequence record.
FEATURES             Location/Qualifiers
     CDS             1..39
                     /gene="example"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatag
//
FASTA vs FASTQ vs GenBank
| Format | Extensions | Stores | Quality | Typical use |
|---|---|---|---|---|
| FASTA | .fasta .fa .fna .faa | ID + sequence | No | Reference sequences, genomes, proteins |
| FASTQ | .fastq .fq | ID + sequence + per-base quality | Yes (Phred) | Raw sequencing reads |
| GenBank | .gb .gbk | Sequence + rich annotation | No | Annotated records (genes, plasmids) |
Other common formats serve later analysis steps: SAM/BAM store read alignments, and VCF stores genetic variants. Both fall outside the scope of this comparison.
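The Quality column in the table is the practical difference between FASTQ and FASTA: dropping the + separator and the quality line from each read leaves a FASTA record. The sketch below illustrates that relationship, assuming strict four-line FASTQ input; the function name fastq_to_fasta and the default wrap width of 60 are hypothetical choices, not part of either format.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq_handle: TextIO, fasta_handle: TextIO, width: int = 60) -> int:
    """Write each FASTQ read as a FASTA record; return the number of reads."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # skip blank lines between records
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator, not needed in FASTA
        fastq_handle.readline()  # quality line, discarded
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header.strip())
        fasta_handle.write(">" + header[1:].strip() + "\n")
        # Wrap the sequence to a fixed width, as FASTA files commonly do.
        for start in range(0, len(seq), width):
            fasta_handle.write(seq[start:start + width] + "\n")
        count += 1
    return count


fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nHHHH\n"
out = io.StringIO()
n_reads = fastq_to_fasta(io.StringIO(fastq_text), out)
fasta_text = out.getvalue()
print(fasta_text)
```

Reading records in fixed groups of four lines, rather than searching for lines that start with @, matters because @ is also a valid quality character and can appear at the start of a quality line. The conversion only works in this direction: FASTA carries no quality information from which a FASTQ file could be rebuilt.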
Frequently asked questions
- What is a FASTA file?
- A FASTA file is a plain-text format for one or more sequences. Each record starts with a header line beginning with '>' followed by an identifier, then one or more lines of sequence. It stores no quality information and is the most common format for reference DNA, RNA and protein sequences.
- What is the difference between FASTA and FASTQ?
- FASTA stores just an identifier and the sequence. FASTQ adds a per-base quality score: each record is four lines — an '@' header, the sequence, a '+' separator, and a line of Phred quality characters the same length as the sequence. FASTQ is used for raw sequencing reads; FASTA for finished sequences.
- Why does a FASTQ record have four lines?
- Line 1 is the read ID (starting with @), line 2 is the sequence, line 3 is a separator (a '+', optionally repeating the ID), and line 4 encodes a quality score for every base as an ASCII character. The sequence and quality lines must be the same length.
- What does a GenBank file contain that FASTA does not?
- GenBank is an annotated format: alongside the sequence it carries structured metadata — a LOCUS line, DEFINITION, source organism and, crucially, a FEATURES table describing genes, CDS, regulatory elements and their coordinates. FASTA carries only a header and the bases.