Sequence File Formats — FASTA, FASTQ and GenBank

Most sequence data is stored and exchanged as plain text in one of a few standard formats. The three you will meet most often are FASTA (sequences only), FASTQ (sequencing reads with per-base quality scores) and GenBank (annotated records). Here is what each looks like and when to use it.

FASTA

The simplest format: each record is a header line that starts with > followed by an identifier and an optional description, then the sequence on one or more lines. It stores no quality scores. Used for reference genomes, genes and protein sequences.

>seq1 example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
ATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTG
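
The record layout above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name `read_fasta` is hypothetical, and it assumes the input is well formed (every sequence line belongs to the most recent > header).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


example = """>seq1 example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
ATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTG
"""

records = list(read_fasta(example.splitlines()))
header, sequence = records[0]
identifier = header.split()[0]  # text up to the first space
print(identifier, len(sequence))
```

The two sequence lines are joined into one string, which is why line wrapping in a FASTA file does not change the sequence itself.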

FASTQ

The standard format for raw sequencing reads. Each record is four lines: an @ header, the sequence, a + separator, and a line of Phred quality characters. There is one quality character per base, so in a valid record the sequence and quality lines are the same length.

@read1
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
+
IIIIIIIIIIIIIIHHHHHHGGGGGFFFFEEEDDDCCCB
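
The four-line layout and the length rule can be checked in code. The sketch below uses a hypothetical function name, `read_fastq`, and a short made-up read. It assumes the Phred+33 encoding (quality score = ASCII code minus 33), which is the usual modern convention; older files may use a different offset.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, scores) from four-line FASTQ records."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, seq, sep, qual = rows[i:i + 4]
        record_no = i // 4 + 1
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record {record_no}")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record {record_no}")
        # Assumption: Phred+33 encoding, so 'I' (ASCII 73) means Q40.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


example = """@read1
ATGGCC
+
IIHHGB
"""

read_id, seq, scores = next(read_fastq(example.splitlines()))
print(read_id, seq)
print(scores)  # [40, 40, 39, 39, 38, 33]
```

A record whose quality line is longer or shorter than its sequence raises an error here, which is the same check most downstream tools apply.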

GenBank

An annotated format. Beyond the sequence (in the ORIGIN block) it carries structured metadata: a LOCUS line, a DEFINITION line and a FEATURES table that describes genes, coding sequences (CDS) and their coordinates. Each record ends with a // line.

LOCUS       SEQ1         39 bp    DNA     linear   01-JAN-2026
DEFINITION  Example sequence record.
FEATURES             Location/Qualifiers
     CDS             1..39
                     /gene="example"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatag
//
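
To show how the pieces of the record fit together, here is a rough sketch that pulls the locus name, the definition and the sequence out of the example above. The function name `parse_genbank` is hypothetical, and the sketch handles only a single record with a one-line DEFINITION; it does not read the FEATURES table.

```python
def parse_genbank(text):
    """Return locus name, definition and sequence from one GenBank record."""
    record = {"locus": None, "definition": None, "sequence": ""}
    in_origin = False
    bases = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # ORIGIN lines hold a position number and spaced blocks of bases;
            # keep only the letters.
            bases.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(bases).upper()
    return record


example = """LOCUS       SEQ1         39 bp    DNA     linear   01-JAN-2026
DEFINITION  Example sequence record.
FEATURES             Location/Qualifiers
     CDS             1..39
                     /gene="example"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatag
//
"""

rec = parse_genbank(example)
print(rec["locus"], len(rec["sequence"]))
print(rec["sequence"])
```

For real records, an established parser such as Biopython's `Bio.SeqIO` is a safer choice than hand-written code like this.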

FASTA vs FASTQ vs GenBank

Format	Extensions	Stores	Quality	Typical use
FASTA	.fasta .fa .fna .faa	ID + sequence	No	Reference sequences, genomes, proteins
FASTQ	.fastq .fq	ID + sequence + per-base quality	Yes (Phred)	Raw sequencing reads
GenBank	.gb .gbk	Sequence + rich annotation	No	Annotated records (genes, plasmids)
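
Because each format opens with a distinctive marker (> for FASTA, @ for FASTQ, LOCUS for GenBank), a first-pass format check can look at the first non-blank line. This is a heuristic sketch with a hypothetical function name, `guess_format`; it does not validate the rest of the file, and file extensions are not consulted.

```python
def guess_format(text):
    """Guess the sequence format from the first non-blank line of the text."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        if line.startswith("LOCUS"):
            return "genbank"
        return "unknown"  # first real line matches none of the markers
    return "unknown"  # empty input


print(guess_format(">seq1 example sequence\nATGGCC\n"))
print(guess_format("@read1\nATGGCC\n+\nIIHHGB\n"))
print(guess_format("LOCUS       SEQ1         39 bp    DNA\n"))
```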

Other common formats handle later analysis steps: SAM/BAM store read alignments, and VCF stores genetic variants. Both are outside the scope of this page.

Frequently asked questions

What is a FASTA file?

A FASTA file is a plain-text format for one or more sequences. Each record starts with a header line beginning with '>' followed by an identifier, then one or more lines of sequence. It stores no quality information and is the most common format for reference DNA, RNA and protein sequences.

What is the difference between FASTA and FASTQ?

FASTA stores just an identifier and the sequence. FASTQ adds a per-base quality score: each record is four lines, namely an '@' header, the sequence, a '+' separator, and a line of Phred quality characters the same length as the sequence. FASTQ is used for raw sequencing reads; FASTA is used for assembled or reference sequences.
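
Since FASTQ is FASTA plus quality, converting a FASTQ record to FASTA means keeping the header and sequence and dropping the other two lines. A minimal sketch, assuming strict four-line records; the function name `fastq_to_fasta` and the `width` parameter are hypothetical.

```python
def fastq_to_fasta(fastq_text, width=60):
    """Convert four-line FASTQ records to FASTA text, discarding qualities."""
    rows = [row for row in fastq_text.splitlines() if row.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    out = []
    for i in range(0, len(rows), 4):
        header, seq = rows[i], rows[i + 1]  # '+' and quality lines are dropped
        out.append(">" + header[1:])  # swap the leading '@' for '>'
        # Wrap the sequence at `width` characters per line.
        out.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""


fasta_text = fastq_to_fasta("@read1\nATGGCC\n+\nIIHHGB\n")
print(fasta_text, end="")
```

The conversion only goes one way: once the quality line is discarded it cannot be recovered from the FASTA output.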

Why does a FASTQ record have four lines?

Line 1 is the read ID (starting with @), line 2 is the sequence, line 3 is a separator (a '+', optionally repeating the ID), and line 4 encodes a quality score for every base as an ASCII character. The sequence and quality lines must be the same length.

What does a GenBank file contain that FASTA does not?

GenBank is an annotated format: alongside the sequence it carries structured metadata — a LOCUS line, DEFINITION, source organism and, crucially, a FEATURES table describing genes, CDS, regulatory elements and their coordinates. FASTA carries only a header and the bases.
