Genomics File Formats — SAM/BAM, VCF, BED and GFF/GTF
Once reads leave the FASTA/FASTQ stage, downstream analysis flows through a handful of tab-delimited formats. SAM/BAM hold read alignments, VCF holds variant calls, and BED and GFF3/GTF describe genomic intervals and annotations. This reference lists their columns, the SAM FLAG bits, and the coordinate system each one uses.
Coordinate systems — the #1 gotcha
The most common source of off-by-one errors is mixing coordinate conventions. BED is 0-based and half-open, [start, end): the start counts from 0 and the end is exclusive. Every other format here is 1-based and fully closed: positions count from 1 and both ends are included. The same interval therefore has a start that is one lower in BED than in the other formats, while the end value is identical.
| Format | Base | Interval |
|---|---|---|
| BED | 0-based | Half-open [start, end) |
| SAM / BAM | 1-based | Fully closed |
| VCF | 1-based | Fully closed |
| GFF3 / GTF | 1-based | Fully closed |
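The conversion implied by the table can be sketched as below. This is a minimal illustration that assumes intervals are plain integer pairs; the function names are hypothetical and do not come from any library.

```python
from __future__ import annotations


def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open [start, end) interval to 1-based closed."""
    # Only the start shifts: the exclusive 0-based end is numerically
    # equal to the inclusive 1-based end.
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based closed [start, end] interval to 0-based half-open."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Length of a 0-based half-open interval."""
    return end - start


def closed_length(start: int, end: int) -> int:
    """Length of a 1-based closed interval."""
    return end - start + 1


# BED "chr1 0 100" covers the first 100 bases, i.e. positions 1..100.
print(bed_to_one_based(0, 100))   # (1, 100)
print(one_based_to_bed(1000, 2000))  # (999, 2000)
```

Note that the length formulas differ as well: `end - start` for BED, `end - start + 1` for the closed formats.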
SAM / BAM
SAM (Sequence Alignment/Map) is a tab-delimited text format of read alignments against a reference; BAM is its compressed binary form. Each alignment line has 11 mandatory fields, optionally followed by TAG:TYPE:VALUE fields. POS is 1-based.
| Col | Field | Meaning |
|---|---|---|
| 1 | QNAME | Read name |
| 2 | FLAG | Bitwise flags |
| 3 | RNAME | Reference name |
| 4 | POS | 1-based leftmost position |
| 5 | MAPQ | Mapping quality |
| 6 | CIGAR | Alignment (M/I/D/S/…) |
| 7 | RNEXT | Mate reference |
| 8 | PNEXT | Mate position |
| 9 | TLEN | Template length |
| 10 | SEQ | Read sequence |
| 11 | QUAL | Phred quality (ASCII, like FASTQ) |
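As a rough sketch of how the 11 mandatory fields map onto a record, the snippet below splits one alignment line on tabs. The `SamRecord` type, the `parse_sam_line` helper and the example read are all made up for illustration; real pipelines would normally use a dedicated library rather than hand-parsing.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed here


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],
    )


# Hypothetical example line, invented for this sketch.
example = "read1\t99\tchr1\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar)
```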
SAM FLAG bits
The FLAG field is the sum of the set bits below — for example 99 = 1 + 2 + 32 + 64 (paired, proper pair, mate reverse strand, first in pair).
| Hex | Decimal | Meaning |
|---|---|---|
| 0x1 | 1 | Read paired |
| 0x2 | 2 | Read mapped in proper pair |
| 0x4 | 4 | Read unmapped |
| 0x8 | 8 | Mate unmapped |
| 0x10 | 16 | Read reverse strand |
| 0x20 | 32 | Mate reverse strand |
| 0x40 | 64 | First in pair |
| 0x80 | 128 | Second in pair |
| 0x100 | 256 | Secondary alignment |
| 0x200 | 512 | Fails QC |
| 0x400 | 1024 | PCR/optical duplicate |
| 0x800 | 2048 | Supplementary alignment |
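Decoding a FLAG amounts to testing each bit in the table with a bitwise AND. The sketch below assumes nothing beyond the table above; `SAM_FLAGS` and `decode_flag` are illustrative names, not part of any library.

```python
from typing import List

# Bit values and meanings copied from the FLAG table above.
SAM_FLAGS = {
    0x1: "Read paired",
    0x2: "Read mapped in proper pair",
    0x4: "Read unmapped",
    0x8: "Mate unmapped",
    0x10: "Read reverse strand",
    0x20: "Mate reverse strand",
    0x40: "First in pair",
    0x80: "Second in pair",
    0x100: "Secondary alignment",
    0x200: "Fails QC",
    0x400: "PCR/optical duplicate",
    0x800: "Supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS.items() if flag & bit]


for meaning in decode_flag(99):
    print(meaning)
# Read paired
# Read mapped in proper pair
# Mate reverse strand
# First in pair
```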
VCF
VCF (Variant Call Format) stores variant calls. A file begins with ## meta header lines, then a single #CHROM column-header line, then one record per variant. POS is 1-based.
| Col | Field | Meaning |
|---|---|---|
| 1 | CHROM | Chromosome / reference name |
| 2 | POS | 1-based position of the variant |
| 3 | ID | Variant identifier (e.g. rsID, or .) |
| 4 | REF | Reference allele |
| 5 | ALT | Alternate allele(s) |
| 6 | QUAL | Phred-scaled variant quality |
| 7 | FILTER | PASS or filters the variant failed |
| 8 | INFO | Semicolon-separated annotations |
| 9 | FORMAT | Genotype field keys (then one column per sample) |
The first eight columns are fixed; FORMAT and one column per sample appear only when genotypes are recorded.
##fileformat=VCFv4.3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr1 1000 . A G 60 PASS DP=42 GT 0/1
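The layout above can be read with a few lines of code. The sketch below assumes tab-delimited input, treats multiple ALT alleles as comma-separated and FORMAT/sample values as colon-separated, and keeps INFO values as strings; `parse_info` and `read_vcf` are hypothetical helper names, and header metadata is not interpreted.

```python
from typing import Dict, Iterable, Iterator, List


def parse_info(info: str) -> Dict[str, object]:
    """Split the INFO column into a dict; keys without '=' become True flags."""
    result: Dict[str, object] = {}
    if info == ".":
        return result
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def read_vcf(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per variant record from an iterable of VCF lines."""
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta header lines
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]  # sample names follow FORMAT
            continue
        fields = line.split("\t")
        record = {
            "CHROM": fields[0],
            "POS": int(fields[1]),  # 1-based
            "ID": fields[2],
            "REF": fields[3],
            "ALT": fields[4].split(","),
            "QUAL": None if fields[5] == "." else float(fields[5]),
            "FILTER": fields[6],
            "INFO": parse_info(fields[7]),
        }
        if len(fields) > 9:
            keys = fields[8].split(":")
            record["samples"] = {
                name: dict(zip(keys, column.split(":")))
                for name, column in zip(samples, fields[9:])
            }
        yield record


vcf_lines = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t1000\t.\tA\tG\t60\tPASS\tDP=42\tGT\t0/1",
]
for rec in read_vcf(vcf_lines):
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["samples"])
```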
BED
BED is a simple interval format. It is 0-based, half-open. Only the first three columns are required (BED3); BED6 adds name, score and strand, and BED12 adds the block columns that describe exon structure.
| Col | Field | Meaning |
|---|---|---|
| 1 | chrom | Chromosome / reference name |
| 2 | chromStart | Start (0-based) |
| 3 | chromEnd | End (exclusive) |
| 4 | name | Feature name |
| 5 | score | Score (0–1000) |
| 6 | strand | Strand (+/-) |
| 7 | thickStart | Start of thick drawing |
| 8 | thickEnd | End of thick drawing |
| 9 | itemRgb | Display colour (R,G,B) |
| 10 | blockCount | Number of blocks (exons) |
| 11 | blockSizes | Comma-separated block sizes |
| 12 | blockStarts | Comma-separated block starts (relative to chromStart) |
BED3 = columns 1–3 · BED6 = columns 1–6 · BED12 = all 12 columns.
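To illustrate how the three tiers nest, the sketch below parses BED3, BED6 or BED12 from one tab-delimited line and expands the BED12 block columns into absolute 0-based, half-open intervals. It assumes tab separators and that blockStarts are offsets from chromStart; `parse_bed` and the example line are invented for illustration.

```python
def parse_bed(line: str) -> dict:
    """Parse a BED3, BED6 or BED12 line into a dict (coordinates stay 0-based)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart, chromEnd")
    record = {
        "chrom": fields[0],
        "chromStart": int(fields[1]),  # 0-based, inclusive
        "chromEnd": int(fields[2]),    # exclusive
    }
    if len(fields) >= 6:
        # score is kept as text so that unusual values do not raise
        record.update(name=fields[3], score=fields[4], strand=fields[5])
    if len(fields) >= 12:
        # Lists may carry a trailing comma, so empty items are dropped.
        sizes = [int(x) for x in fields[10].split(",") if x]
        starts = [int(x) for x in fields[11].split(",") if x]
        base = record["chromStart"]
        record["blocks"] = [
            (base + offset, base + offset + size)
            for offset, size in zip(starts, sizes)
        ]
    return record


# Hypothetical two-exon feature.
bed12 = "chr1\t1000\t2000\ttx1\t0\t+\t1000\t2000\t0\t2\t100,200,\t0,800,"
print(parse_bed(bed12)["blocks"])  # [(1000, 1100), (1800, 2000)]
```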
GFF3 / GTF
GFF3 and GTF are 9-column, tab-delimited annotation formats. Both are 1-based and closed. GTF (a.k.a. GFF2.5) shares the first eight columns with GFF3 but differs in column 9, the attributes.
| Col | Field | Meaning |
|---|---|---|
| 1 | seqid | Sequence / chromosome name |
| 2 | source | Program or database that made the feature |
| 3 | type | Feature type (gene, mRNA, CDS, exon…) |
| 4 | start | Start (1-based) |
| 5 | end | End (inclusive) |
| 6 | score | Score (or .) |
| 7 | strand | Strand (+/-/.) |
| 8 | phase | CDS phase (0/1/2): bases before the first complete codon; . for other features |
| 9 | attributes | Key/value attributes (differs GFF3 vs GTF) |
The difference is entirely in column 9. GFF3 uses key=value;key=value pairs (with ID= and Parent= to link features):
chr1 ensembl gene 1000 2000 . + . ID=gene1;Name=EXMP
GTF uses key "value"; pairs (notably gene_id and transcript_id):
chr1 ensembl CDS 1000 2000 . + 0 gene_id "gene1"; transcript_id "tx1";
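The two attribute styles can be told apart and parsed with a simple split. The sketch below is deliberately naive: it assumes no semicolons inside quoted GTF values, does not decode percent-escapes in GFF3, and keeps only the last value for a repeated key. Both function names are hypothetical.

```python
from typing import Dict


def parse_gff3_attributes(column9: str) -> Dict[str, str]:
    """Parse GFF3-style 'key=value;key=value' attributes."""
    attrs: Dict[str, str] = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value
    return attrs


def parse_gtf_attributes(column9: str) -> Dict[str, str]:
    """Parse GTF-style 'key "value"; key "value";' attributes."""
    attrs: Dict[str, str] = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


print(parse_gff3_attributes("ID=gene1;Name=EXMP"))
# {'ID': 'gene1', 'Name': 'EXMP'}
print(parse_gtf_attributes('gene_id "gene1"; transcript_id "tx1";'))
# {'gene_id': 'gene1', 'transcript_id': 'tx1'}
```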
Frequently asked questions
- What is the difference between SAM and BAM?
- They hold the same data — read alignments against a reference. SAM is tab-delimited plain text you can read directly; BAM is its compressed binary equivalent, smaller and faster for tools to process. Both use 1-based positions and the same 11 mandatory fields.
- What does a SAM FLAG value mean and how do I decode it?
- The FLAG (column 2) is the sum of the set bit values, so a single integer encodes several true/false properties. For example, 99 = 1 + 2 + 32 + 64, meaning the read is paired, mapped in a proper pair, has its mate on the reverse strand, and is the first in the pair. Decode it by repeatedly subtracting the largest power of two that fits (99 - 64 = 35, 35 - 32 = 3, 3 - 2 = 1, 1 - 1 = 0), or by testing each bit in the FLAG-bit table above with a bitwise AND.
- Is BED 0-based or 1-based?
- BED is 0-based and half-open: chromStart counts from 0 and chromEnd is exclusive, so the interval is [start, end). A BED line of chr1 0 100 covers the first 100 bases, which a 1-based closed format writes as 1 to 100. SAM, VCF, GFF3 and GTF are all 1-based and fully closed, so forgetting to shift the start when converting to or from BED is the most common off-by-one trap.
- What is the difference between GFF3 and GTF?
- Both are 9-column, tab-delimited, 1-based annotation formats and share the first eight columns. They differ only in column 9, the attributes. GFF3 uses key=value pairs separated by semicolons (ID=…;Parent=…), while GTF (a.k.a. GFF2.5) uses key "value" pairs (gene_id "…"; transcript_id "…";).
- What does a VCF file store?
- VCF (Variant Call Format) stores genetic variants — SNPs, insertions, deletions and more — relative to a reference. It begins with ## meta header lines, then a #CHROM header line, then one record per variant with fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and optional FORMAT plus one genotype column per sample. Positions are 1-based.