SeqBench

Sequence File Formats — FASTA, FASTQ and GenBank

Almost all sequence data is stored as plain text in one of a few standard formats. The three you meet most often are FASTA (sequences), FASTQ (sequencing reads with quality scores) and GenBank (annotated records). Here is what each looks like and when to use it.

FASTA

The simplest format: each record is a header line starting with > and an identifier, followed by the sequence on one or more lines. No quality scores. Used for reference genomes, genes and protein sequences.

>seq1 example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
ATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTG
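The record layout described above (a > header, then sequence lines until the next header) can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the names parse_fasta and _make_record are hypothetical, and it assumes the identifier is the first whitespace-delimited word of the header.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # Assumption: the identifier is the first word of the header and the
    # rest is a free-text description.
    parts = header.split(None, 1)
    ident = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    return ident, desc, "".join(chunks)


example = """>seq1 example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
ATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTG
"""

for ident, desc, seq in parse_fasta(example):
    print(ident, desc, len(seq))
```

For real projects an established parser is usually the safer choice, for example Biopython's SeqIO.parse(handle, "fasta").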

FASTQ

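The standard for raw sequencing reads. Each record is four lines: an @ header, the sequence, a + separator, and a line of Phred quality characters. There is one quality character per base, so the sequence and quality lines are always the same length. Each character encodes a Phred score through its ASCII code; in the widely used Phred+33 encoding, I stands for Q40, roughly a 1 in 10,000 chance that the base call is wrong.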

@read1
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
+
IIIIIIIIIIIIIIHHHHHHGGGGGFFFFEEEDDDCCCB
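The four-line layout and the equal-length rule described above are easy to check in code. The sketch below is illustrative only: parse_fastq, phred_scores and error_probability are hypothetical names, and it assumes strict four-line records and the Phred+33 encoding (older Illumina files used an offset of 64).

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def parse_fastq(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("input is not a whole number of four-line records")
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        n = i // 4 + 1
        if not header.startswith("@"):
            raise ValueError(f"record {n}: header must start with '@'")
        if not sep.startswith("+"):
            raise ValueError(f"record {n}: third line must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"record {n}: sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert quality characters to integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


example = "@read1\nACGT\n+\nII5!\n"

for read_id, seq, qual in parse_fastq(example):
    scores = phred_scores(qual)
    print(read_id, seq, scores)
    print([error_probability(q) for q in scores])
```

A record whose quality line is even one character longer or shorter than its sequence fails the length check, which is a common first sign of a truncated or corrupted file.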

GenBank

An annotated format. Beyond the sequence (in the ORIGIN block) it carries structured metadata — a LOCUS line, DEFINITION and a FEATURES table describing genes, CDS and their coordinates. Records end with //.

LOCUS       SEQ1         39 bp    DNA     linear   01-JAN-2026
DEFINITION  Example sequence record.
FEATURES             Location/Qualifiers
     CDS             1..39
                     /gene="example"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatag
//
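To show how the pieces of the record above fit together, here is a rough sketch that pulls the locus name, the definition and the sequence out of a single GenBank record. The function name parse_genbank_record is hypothetical, and the code makes simplifying assumptions: a one-line DEFINITION, one record per input, and no parsing of the FEATURES table.

```python
import re
from typing import Dict, List


def parse_genbank_record(text: str) -> Dict[str, str]:
    """Extract locus name, definition and sequence from one GenBank record."""
    record = {"locus": "", "definition": "", "sequence": ""}
    in_origin = False
    seq_chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # ORIGIN lines look like "        1 atggccattg taatgggccg ...":
            # drop the position number and the spaces between blocks.
            seq_chunks.append(re.sub(r"[\d\s]", "", line))
        elif line.startswith("LOCUS"):
            fields = line.split()
            record["locus"] = fields[1] if len(fields) > 1 else ""
        elif line.startswith("DEFINITION"):
            # Simplification: continuation lines of a long DEFINITION are ignored.
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(seq_chunks).upper()
    return record


example = """LOCUS       SEQ1         39 bp    DNA     linear   01-JAN-2026
DEFINITION  Example sequence record.
FEATURES             Location/Qualifiers
     CDS             1..39
                     /gene="example"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatag
//
"""

rec = parse_genbank_record(example)
print(rec["locus"], rec["definition"], len(rec["sequence"]))
```

Feature tables, multi-line fields and multi-record files need considerably more care; a library such as Biopython (SeqIO.parse(handle, "genbank")) handles those cases.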

FASTA vs FASTQ vs GenBank

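Format  | Extensions           | Stores                           | Quality     | Typical use
--------|----------------------|----------------------------------|-------------|---------------------------------------
FASTA   | .fasta .fa .fna .faa | ID + sequence                    | No          | Reference sequences, genomes, proteins
FASTQ   | .fastq .fq           | ID + sequence + per-base quality | Yes (Phred) | Raw sequencing reads
GenBank | .gb .gbk             | Sequence + rich annotation       | No          | Annotated records (genes, plasmids)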
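The extensions in the table, together with the first character of each format's first line, are enough for a rough format guess. The sketch below is a heuristic under stated assumptions, not a validator: format_from_extension and format_from_content are hypothetical names, compressed files such as reads.fq.gz are not handled, and a leading @ is also how SAM header lines begin.

```python
from pathlib import Path

# Extension-to-format map taken from the comparison table.
EXTENSION_FORMATS = {
    ".fasta": "fasta", ".fa": "fasta", ".fna": "fasta", ".faa": "fasta",
    ".fastq": "fastq", ".fq": "fastq",
    ".gb": "genbank", ".gbk": "genbank",
}


def format_from_extension(filename: str) -> str:
    """Look up the format by file extension; 'unknown' if it is not listed."""
    return EXTENSION_FORMATS.get(Path(filename).suffix.lower(), "unknown")


def format_from_content(text: str) -> str:
    """Guess the format from the first non-blank line of the file."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"  # caveat: SAM header lines also start with '@'
        if line.startswith("LOCUS"):
            return "genbank"
        return "unknown"
    return "unknown"


print(format_from_extension("genome.fna"))
print(format_from_content("@read1\nACGT\n+\nIIII\n"))
```

Checking the content is generally more reliable than trusting the extension, since files are often renamed or mislabeled.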

Other common formats handle later analysis steps: SAM/BAM store read alignments, and VCF stores genetic variants. They fall outside the scope of this guide.

Frequently asked questions

What is a FASTA file?
A FASTA file is a plain-text format for one or more sequences. Each record starts with a header line beginning with '>' followed by an identifier, then one or more lines of sequence. It stores no quality information and is the most common format for reference DNA, RNA and protein sequences.

What is the difference between FASTA and FASTQ?
FASTA stores just an identifier and the sequence. FASTQ adds a per-base quality score: each record is four lines, namely an '@' header, the sequence, a '+' separator, and a line of Phred quality characters the same length as the sequence. FASTQ is used for raw sequencing reads; FASTA for finished sequences.

Why does a FASTQ record have four lines?
Line 1 is the read ID (starting with @), line 2 is the sequence, line 3 is a separator (a '+', optionally repeating the ID), and line 4 encodes a quality score for every base as an ASCII character. The sequence and quality lines must be the same length.

What does a GenBank file contain that FASTA does not?
GenBank is an annotated format: alongside the sequence it carries structured metadata, including a LOCUS line, DEFINITION, source organism and, crucially, a FEATURES table describing genes, CDS, regulatory elements and their coordinates. FASTA carries only a header and the bases.
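
As the answers above note, FASTA keeps only the identifier and the sequence, so converting FASTQ to FASTA amounts to dropping the separator and quality lines of each record. A minimal sketch, assuming strict four-line records; the function name fastq_to_fasta and the default line width of 60 are illustrative choices, not part of either format.

```python
from typing import List


def fastq_to_fasta(fastq_text: str, width: int = 60) -> str:
    """Convert four-line FASTQ records to FASTA, discarding quality scores."""
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("input is not a whole number of four-line records")
    out: List[str] = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"record {i // 4 + 1}: header must start with '@'")
        out.append(">" + header[1:])  # '@read1' becomes '>read1'
        # Wrap the sequence at a fixed width, as FASTA files commonly do.
        out.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""


print(fastq_to_fasta("@read1\nACGTACGT\n+\nIIIIIIII\n", width=4), end="")
```

The conversion is one-way: the quality scores are lost, so a FASTA file cannot be turned back into the original FASTQ.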
