SeqBench

Genomics File Formats — SAM/BAM, VCF, BED and GFF/GTF

Once reads leave the FASTA/FASTQ stage, downstream analysis flows through a handful of tab-delimited formats. SAM/BAM hold read alignments, VCF holds variant calls, and BED and GFF3/GTF describe genomic intervals and annotations. This reference lists their columns, the SAM FLAG bits, and the coordinate system each one uses.

Coordinate systems — the #1 gotcha

The single most common source of off-by-one errors is mixing coordinate conventions. BED is 0-based and half-open [start, end), where the start counts from 0 and the end is exclusive. Every other format here is 1-based and fully closed: positions count from 1 and both ends are included.

| Format | Base | Interval |
|---|---|---|
| BED | 0-based | Half-open [start, end) |
| SAM / BAM | 1-based | Fully closed |
| VCF | 1-based | Fully closed |
| GFF3 / GTF | 1-based | Fully closed |
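
The conversion between the two conventions in the table is a one-line shift of the start coordinate. The helper names below are hypothetical and only illustrate the arithmetic; this is a minimal sketch, not part of any particular toolkit.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start, end]."""
    # Only the start moves; the exclusive end equals the inclusive 1-based end.
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Length of a 0-based half-open interval."""
    return end - start


def closed_length(start: int, end: int) -> int:
    """Length of a 1-based closed interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
print(bed_to_one_based(0, 100))   # (1, 100)
print(one_based_to_bed(1, 100))   # (0, 100)
print(bed_length(0, 100), closed_length(1, 100))  # 100 100
```

Note that the end coordinate is numerically identical in both systems, which is why a forgotten conversion usually shows up as an interval that is one base too long or too short at its start.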

SAM / BAM

SAM (Sequence Alignment/Map) is a tab-delimited text format of read alignments against a reference; BAM is its compressed binary form. Each alignment line has 11 mandatory fields. POS is 1-based.

| Col | Field | Meaning |
|---|---|---|
| 1 | QNAME | Read name |
| 2 | FLAG | Bitwise flags |
| 3 | RNAME | Reference name |
| 4 | POS | 1-based leftmost position |
| 5 | MAPQ | Mapping quality |
| 6 | CIGAR | Alignment (M/I/D/S/…) |
| 7 | RNEXT | Mate reference |
| 8 | PNEXT | Mate position |
| 9 | TLEN | Template length |
| 10 | SEQ | Read sequence |
| 11 | QUAL | Phred quality (ASCII, like FASTQ) |
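
As a rough illustration of the 11 mandatory fields, the sketch below splits one alignment line into a named record and derives the 1-based inclusive end of the alignment from POS and CIGAR. The names `SamRecord`, `parse_sam_line` and `reference_end`, and the example read itself, are assumptions made for this example. The set of reference-consuming CIGAR operations (M, D, N, =, X) follows the SAM specification rather than the table above. For real work a library such as pysam is the usual choice.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory columns; optional tags (column 12+) are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end of the alignment on the reference."""
    if cigar == "*":
        raise ValueError("CIGAR is unavailable for this record")
    # Only M, D, N, = and X advance along the reference.
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos + ref_len - 1


# Hypothetical read: 10 bases, one of them an insertion relative to the reference.
example = "read1\t99\tchr1\t1000\t60\t5M1I4M\t=\t1100\t150\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(example)
print(rec.pos, reference_end(rec.pos, rec.cigar))  # 1000 1008
```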

SAM FLAG bits

The FLAG field is the sum of the set bits below — for example 99 = 1 + 2 + 32 + 64 (paired, proper pair, mate reverse strand, first in pair).

| Hex | Decimal | Meaning |
|---|---|---|
| 0x1 | 1 | Read paired |
| 0x2 | 2 | Read mapped in proper pair |
| 0x4 | 4 | Read unmapped |
| 0x8 | 8 | Mate unmapped |
| 0x10 | 16 | Read reverse strand |
| 0x20 | 32 | Mate reverse strand |
| 0x40 | 64 | First in pair |
| 0x80 | 128 | Second in pair |
| 0x100 | 256 | Secondary alignment |
| 0x200 | 512 | Fails QC |
| 0x400 | 1024 | PCR/optical duplicate |
| 0x800 | 2048 | Supplementary alignment |
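
Decoding a FLAG amounts to testing each bit with a bitwise AND. The sketch below hard-codes the table above in a dictionary; `FLAG_BITS` and `decode_flag` are hypothetical names chosen for this example.

```python
from typing import List

# Bit value -> meaning, in the same order as the FLAG table.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails QC",
    0x400: "PCR/optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
# ['read paired', 'read mapped in proper pair', 'mate reverse strand', 'first in pair']
```

The same test works for a single property, for example `flag & 0x10` is non-zero when the read aligns to the reverse strand.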

VCF

VCF (Variant Call Format) stores variant calls. A file begins with ## meta header lines, then a single #CHROM column-header line, then one record per variant. POS is 1-based.

| Col | Field | Meaning |
|---|---|---|
| 1 | CHROM | Chromosome / reference name |
| 2 | POS | 1-based position of the variant |
| 3 | ID | Variant identifier (e.g. rsID, or .) |
| 4 | REF | Reference allele |
| 5 | ALT | Alternate allele(s) |
| 6 | QUAL | Phred-scaled variant quality |
| 7 | FILTER | PASS or filters the variant failed |
| 8 | INFO | Semicolon-separated annotations |
| 9 | FORMAT | Genotype field keys (then one column per sample) |

The first eight columns are fixed; FORMAT and one column per sample appear only when genotypes are recorded.

##fileformat=VCFv4.3
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
chr1	1000	.	A	G	60	PASS	DP=42	GT	0/1
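
A record like the one above can be split with plain string operations. The following is a minimal sketch under a few assumptions: the function names are hypothetical, INFO keys without a value are treated as boolean flags, and per-sample values are assumed to be colon-separated in the order given by FORMAT (both conventions come from the VCF specification, not from this page). Header lines are not handled. The last helper shows how a 1-based POS and a REF allele map onto a 0-based half-open interval.

```python
from typing import Any, Dict, Tuple


def parse_info(info: str) -> Dict[str, Any]:
    """Split the INFO column into a dict; valueless keys become True."""
    if info == ".":
        return {}
    out: Dict[str, Any] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_vcf_record(line: str) -> Dict[str, Any]:
    """Parse one data line (not a ## or #CHROM header line)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 8:
        raise ValueError("a VCF record needs at least 8 fixed columns")
    rec: Dict[str, Any] = {
        "chrom": f[0],
        "pos": int(f[1]),               # 1-based
        "id": f[2],
        "ref": f[3],
        "alt": f[4].split(","),         # several ALT alleles are comma-separated
        "qual": None if f[5] == "." else float(f[5]),
        "filter": f[6],
        "info": parse_info(f[7]),
    }
    if len(f) > 9:                      # FORMAT plus at least one sample column
        keys = f[8].split(":")
        rec["samples"] = [dict(zip(keys, s.split(":"))) for s in f[9:]]
    return rec


def vcf_to_bed_interval(pos: int, ref: str) -> Tuple[int, int]:
    """0-based half-open interval covered by the REF allele."""
    return pos - 1, pos - 1 + len(ref)


rec = parse_vcf_record("chr1\t1000\t.\tA\tG\t60\tPASS\tDP=42\tGT\t0/1")
print(rec["pos"], rec["info"], rec["samples"])
print(vcf_to_bed_interval(rec["pos"], rec["ref"]))  # (999, 1000)
```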

BED

BED is a simple interval format. It is 0-based, half-open. Only the first three columns are required (BED3); BED6 adds name, score and strand, and BED12 adds the block columns that describe exon structure.

| Col | Field | Meaning |
|---|---|---|
| 1 | chrom | Chromosome / reference name |
| 2 | chromStart | Start (0-based) |
| 3 | chromEnd | End (exclusive) |
| 4 | name | Feature name |
| 5 | score | Score (0–1000) |
| 6 | strand | Strand (+/-) |
| 7 | thickStart | Start of thick drawing |
| 8 | thickEnd | End of thick drawing |
| 9 | itemRgb | Display colour (R,G,B) |
| 10 | blockCount | Number of blocks (exons) |
| 11 | blockSizes | Comma-separated block sizes |
| 12 | blockStarts | Comma-separated block starts |

BED3 = columns 1–3 · BED6 = columns 1–6 · BED12 = all 12 columns.
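
The block columns of a BED12 line can be expanded into one interval per exon. The sketch below assumes, following the UCSC BED description rather than anything stated on this page, that blockStarts are offsets relative to chromStart and that both lists may carry a trailing comma. The function name and the example transcript are made up for illustration.

```python
from typing import List, Tuple


def bed12_blocks(fields: List[str]) -> List[Tuple[int, int]]:
    """Expand BED12 block columns into 0-based half-open (start, end) pairs."""
    if len(fields) < 12:
        raise ValueError("BED12 needs 12 columns")
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # Block starts are offsets from chromStart (assumption, see lead-in).
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


# Hypothetical two-exon transcript spanning chr1:1000-2000 in BED coordinates.
line = "chr1\t1000\t2000\ttx1\t0\t+\t1000\t2000\t0\t2\t300,400,\t0,600,"
print(bed12_blocks(line.split("\t")))  # [(1000, 1300), (1600, 2000)]
```

The returned pairs stay in BED's 0-based half-open convention, so each block length is simply end minus start.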

GFF3 / GTF

GFF3 and GTF are 9-column, tab-delimited annotation formats. Both are 1-based and closed. GTF (a.k.a. GFF2.5) shares the first eight columns with GFF3 but differs in column 9, the attributes.

| Col | Field | Meaning |
|---|---|---|
| 1 | seqid | Sequence / chromosome name |
| 2 | source | Program or database that made the feature |
| 3 | type | Feature type (gene, mRNA, CDS, exon…) |
| 4 | start | Start (1-based) |
| 5 | end | End (inclusive) |
| 6 | score | Score (or .) |
| 7 | strand | Strand (+/-/.) |
| 8 | phase | Reading frame for CDS (0/1/2) |
| 9 | attributes | Key/value attributes (differs GFF3 vs GTF) |

The difference is entirely in column 9. GFF3 uses key=value;key=value pairs (with ID= and Parent= to link features):

chr1	ensembl	gene	1000	2000	.	+	.	ID=gene1;Name=EXMP

GTF uses key "value"; pairs (notably gene_id and transcript_id):

chr1	ensembl	CDS	1000	2000	.	+	0	gene_id "gene1"; transcript_id "tx1";
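
The two column 9 syntaxes can be told apart with a small parser for each. This is a simplified sketch with hypothetical function names: it does not decode percent-escaped characters or split comma-separated multi-values in GFF3, and for GTF it only recognises quoted values and keeps the last value when a key repeats.

```python
import re
from typing import Dict


def parse_gff3_attributes(col9: str) -> Dict[str, str]:
    """Parse key=value;key=value pairs from a GFF3 attributes column."""
    attrs: Dict[str, str] = {}
    for pair in col9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value
    return attrs


# key "value"; with the value in double quotes and a terminating semicolon.
GTF_PAIR = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(col9: str) -> Dict[str, str]:
    """Parse key "value"; pairs from a GTF attributes column."""
    return {key: value for key, value in GTF_PAIR.findall(col9)}


print(parse_gff3_attributes("ID=gene1;Name=EXMP"))
# {'ID': 'gene1', 'Name': 'EXMP'}
print(parse_gtf_attributes('gene_id "gene1"; transcript_id "tx1";'))
# {'gene_id': 'gene1', 'transcript_id': 'tx1'}
```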

Frequently asked questions

What is the difference between SAM and BAM?
They hold the same data — read alignments against a reference. SAM is tab-delimited plain text you can read directly; BAM is its compressed binary equivalent, smaller and faster for tools to process. Both use 1-based positions and the same 11 mandatory fields.
What does a SAM FLAG value mean and how do I decode it?
The FLAG (column 2) is the sum of set bit values, so a single integer encodes several true/false properties. For example 99 = 1 + 2 + 32 + 64, meaning the read is paired, mapped in a proper pair, has its mate on the reverse strand and is the first in the pair. Decode it by repeatedly subtracting the largest power of two that still fits, or by testing each bit against the FLAG-bit table above.
Is BED 0-based or 1-based?
BED is 0-based and half-open: chromStart counts from 0 and chromEnd is exclusive, so the interval is [start, end). A BED line of chr1 0 100 covers the first 100 bases, which a 1-based closed format would write as 1 to 100. SAM, VCF, GFF3 and GTF are all 1-based and fully closed; mixing the two conventions is the most common off-by-one trap when converting between formats.
What is the difference between GFF3 and GTF?
Both are 9-column, tab-delimited, 1-based annotation formats and share the first eight columns. They differ only in column 9, the attributes. GFF3 uses key=value pairs separated by semicolons (ID=…;Parent=…), while GTF (a.k.a. GFF2.5) uses key "value" pairs (gene_id "…"; transcript_id "…";).
What does a VCF file store?
VCF (Variant Call Format) stores genetic variants — SNPs, insertions, deletions and more — relative to a reference. It begins with ## meta header lines, then a #CHROM header line, then one record per variant with fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and optional FORMAT plus one genotype column per sample. Positions are 1-based.
