How to Analyze an Unknown DNA Sequence

6 min read · Updated June 10, 2026

You've been handed a stretch of DNA — a synthesised fragment, a clone, an amplicon, a sequence from a paper — and you need to know what it is and what you can do with it. Rather than guessing, there's a quick, repeatable workflow that tells you most of what matters in a few minutes. This guide walks through it step by step.

Step 1 — Composition and a GC sanity check

Start with the basics: how long is it, and what is its GC content? Length tells you whether you're looking at an oligo, a gene-sized fragment or a whole construct. GC content is the first thing that affects downstream work — very high or very low GC changes primer behaviour, PCR conditions and how evenly the region sequences. A balanced GC (roughly 40–60%) is the easy case; values outside that are worth noting before you design anything.

Step 2 — Reading frames and ORFs (does it code?)

Next, ask whether the sequence could encode a protein. Scanning all six reading frames for open reading frames (ORFs) reveals the longest stretch that runs from a start codon to an in-frame stop without interruption. A long ORF in one frame strongly suggests a coding sequence; because a six-frame scan already covers both strands, the absence of any meaningful ORF points to a non-coding region, a regulatory element, or a fragment that is truncated or carries a sequencing error. Translating the longest ORF and checking that the protein looks plausible is a good next step.

Step 3 — Restriction sites for cloning

If you might clone or otherwise manipulate the fragment, scan it for restriction enzyme sites. The most valuable result is the list of enzymes that cut exactly once: a single cutter is ideal for linearising a plasmid, and two different single cutters set up directional cloning, whereas an enzyme that cuts your insert multiple times will fragment it. Knowing which common enzymes don't cut at all is equally useful when you need to leave the insert untouched.

Step 4 — Primers to amplify it

Finally, if you want to PCR the fragment, the sequence ends give you a first pair of primers. Taking ~20 nt from each end (the reverse primer being the reverse complement of the 3' end) and checking their melting temperatures tells you whether a simple amplification is feasible and whether the two Tm values are close enough to share an annealing step. Treat these as a starting point and refine for dimers and specificity before ordering.

Doing it all in one pass

Each of these steps has its own dedicated tool, but running them one by one for every new sequence is tedious. A sequence analyzer composes the workflow — composition, ORFs, single-cutter enzymes and end-primer Tm — into a single report you can copy or download, so characterising a new sequence becomes one structured pass rather than four separate checks.

Worked example: characterising an 81-nt fragment start to finish

The four steps above are easiest to trust once you've seen them run on an actual sequence with real numbers. Here's an 81-nt fragment, read 5'→3', taken through all four in order:

TTCGCTAGCATGGCTAAGCGTTTCGGCAACTGGATCCGTCAGAATTCTAAGCCGTTAGCTCGATTGGCCATGCGTAACGGT

Composition: 81 bases long, with 17 A, 22 T, 22 G and 20 C. GC content is (22 + 20) / 81 = 42 / 81 ≈ 51.9% — squarely inside the 40–60% range, so nothing here flags unusual primer or PCR behaviour before you've designed anything.
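The counts above can be reproduced with a few lines of Python. This is a minimal sketch, assuming an unambiguous A/C/G/T string; the names SEQ and composition are illustrative and not taken from any particular tool.

```python
from collections import Counter

# The 81-nt worked-example fragment, 5'->3'.
SEQ = ("TTCGCTAGCATGGCTAAGCGTTTCGGCAACTGGATCCGTC"
       "AGAATTCTAAGCCGTTAGCTCGATTGGCCATGCGTAACGGT")


def composition(seq):
    """Return length, per-base counts and GC% for an unambiguous DNA string."""
    seq = seq.upper()
    counts = Counter(seq)
    gc = counts["G"] + counts["C"]
    return {
        "length": len(seq),
        "counts": {base: counts[base] for base in "ATGC"},
        "gc_percent": 100 * gc / len(seq) if seq else 0.0,
    }


report = composition(SEQ)
print(report["length"], report["counts"], round(report["gc_percent"], 1))
# prints: 81 {'A': 17, 'T': 22, 'G': 22, 'C': 20} 51.9
```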

Reading frames: of the three forward frames, only one contains an ATG at all — it has two, at positions 10 and 70 — while the other two forward frames contain no ATG anywhere in this fragment. Reading on from position 10, that frame runs all the way to the end of the fragment without hitting an in-frame stop codon, so it doesn't count as a complete ORF — it may simply continue beyond this excerpt. Checking the three reverse frames (on the reverse complement) turns up a clean 33-nt ORF running from position 39 to 71: reverse-complementing that stretch gives ATGGCCAATCGAGCTAACGGCTTAGAATTCTGA, which translates in-frame to Met-Ala-Asn-Arg-Ala-Asn-Gly-Leu-Glu-Phe-Stop (MANRANGLEF*). That's the one worth reporting — and a reminder that the real reading frame is often on the strand you weren't handed.
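A six-frame scan like the one just described can be sketched as follows. The function names are hypothetical, the standard genetic code is assumed, and only complete ORFs (first ATG after the previous stop, through to the next in-frame stop) are reported, so an ATG that runs off the end of the fragment is deliberately left out. Positions are 1-based on the strand as given.

```python
from itertools import product

SEQ = ("TTCGCTAGCATGGCTAAGCGTTTCGGCAACTGGATCCGTC"
       "AGAATTCTAAGCCGTTAGCTCGATTGGCCATGCGTAACGGT")

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product("TCAG", repeat=3)), AMINO))


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(orf):
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))


def find_orfs(seq):
    """Return (strand, start, end, orf) for each complete ATG-to-stop ORF."""
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG since the last stop
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    lo, hi = start + 1, i + 3
                    if strand == "-":  # map back to given-strand coordinates
                        lo, hi = n - hi + 1, n - lo + 1
                    orfs.append((strand, lo, hi, s[start:i + 3]))
                    start = None
            # An ATG still open at the end of the frame is not reported.
    return orfs


for strand, lo, hi, orf in find_orfs(SEQ):
    print(strand, lo, hi, orf, translate(orf))
# prints: - 39 71 ATGGCCAATCGAGCTAACGGCTTAGAATTCTGA MANRANGLEF*
```

Reporting open-ended ATGs separately, or every nested ATG within an ORF, would be reasonable variations depending on what the scan is for.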

Restriction sites: a scan for six common cutters finds GAATTC (EcoRI) once, at positions 42–47, and GGATCC (BamHI) once, at positions 32–37; HindIII, NotI, XhoI and PstI don't appear at all. Both EcoRI and BamHI are single cutters here, so a double digest is a realistic directional-cloning option. Note, though, that the EcoRI site (42–47) falls inside the ORF just found (39–71), so cutting there would split the coding region. The BamHI site (32–37) lies just outside it; because the ORF reads on the reverse strand, that places BamHI downstream of the ORF's stop codon rather than upstream of its start.
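The site scan itself is a plain substring search, sketched below under stated assumptions: the six-enzyme panel is the one named in this example, the helper names are illustrative, and searching a single strand is enough only because all six recognition sequences are palindromic.

```python
SEQ = ("TTCGCTAGCATGGCTAAGCGTTTCGGCAACTGGATCCGTC"
       "AGAATTCTAAGCCGTTAGCTCGATTGGCCATGCGTAACGGT")

# Recognition sequences for the six enzymes used in the worked example.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
    "XhoI": "CTCGAG",
    "PstI": "CTGCAG",
}


def find_sites(seq, site):
    """Return 1-based (start, end) spans of every occurrence, overlaps included."""
    seq = seq.upper()
    spans = []
    pos = seq.find(site)
    while pos != -1:
        spans.append((pos + 1, pos + len(site)))
        pos = seq.find(site, pos + 1)
    return spans


hits = {name: find_sites(SEQ, site) for name, site in ENZYMES.items()}
single_cutters = {name: spans[0] for name, spans in hits.items() if len(spans) == 1}
non_cutters = [name for name, spans in hits.items() if not spans]

print(single_cutters)  # {'EcoRI': (42, 47), 'BamHI': (32, 37)}
print(non_cutters)     # ['HindIII', 'NotI', 'XhoI', 'PstI']
```

These spans mark where the recognition sequence matches, not the exact cut position within it.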

End primers: taking the first 20 nt as-is gives a forward primer of TTCGCTAGCATGGCTAAGCG, and reverse-complementing the last 20 nt gives a reverse primer of ACCGTTACGCATGGCCAATC. Both happen to be 11/20 = 55% GC, so the basic GC-count estimate (which has no explicit salt term) gives Tm = 64.9 + 41 × (11 − 16.4) / 20 ≈ 53.8 °C for each. The two values are identical because this formula depends only on primer length and GC count, so the pair can share one annealing temperature; dimer and specificity checks still apply before ordering.
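The primer picks and the 64.9 + 41 × (GC − 16.4) / N estimate can be checked with a short sketch. The function names are illustrative, and the formula is only a rough first-pass estimate for primers longer than about 13 nt, not a nearest-neighbour calculation.

```python
SEQ = ("TTCGCTAGCATGGCTAAGCGTTTCGGCAACTGGATCCGTC"
       "AGAATTCTAAGCCGTTAGCTCGATTGGCCATGCGTAACGGT")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def end_primers(seq, length=20):
    """Forward primer = first `length` nt; reverse = reverse complement of the last `length` nt."""
    seq = seq.upper()
    return seq[:length], reverse_complement(seq[-length:])


def gc_count_tm(primer):
    """GC-count Tm estimate in deg C: 64.9 + 41 * (GC - 16.4) / N."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 64.9 + 41 * (gc - 16.4) / len(primer)


forward, reverse = end_primers(SEQ)
for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, primer, round(gc_count_tm(primer), 1))
# prints:
# forward TTCGCTAGCATGGCTAAGCG 53.8
# reverse ACCGTTACGCATGGCCAATC 53.8
```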

ORF, CDS and gene: not the same claim

These three words get used interchangeably in conversation, but each one is a stronger claim than the last, and it's easy to blur them together right after finding a long ORF.

An ORF (open reading frame) is a purely sequence-level observation: a run of in-frame codons from a start codon to a stop, with no interruption. Finding one means the DNA is arranged so it could be translated — nothing about expression, function or biological reality is implied yet. Genomes are full of short ORFs that arise by chance and are never used.

A CDS (coding sequence) is the annotated claim that a specific ORF is actually the coding region of a real transcript, usually backed by evidence — homology to a known protein, RNA-seq or proteomic support, or conservation across related genomes.

A gene is broader again: the genomic locus that gives rise to a transcript, including untranslated regions and, in eukaryotes, introns that the ORF/CDS never covers. A gene can contain an ORF; an ORF alone doesn't make something a gene.

So when a workflow like this one reports a long ORF, treat it as a lead worth following up — translating it and checking whether the protein looks plausible, or comparing it against known sequences — not as a finished identification. As a rough rule of thumb, be more skeptical of short ORFs (well under ~100 codons) found in isolation, since these turn up by chance fairly often; but that's a heuristic, not a cutoff — genuine short peptides and regulatory upstream ORFs exist too.
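One way to put a rough number on "by chance" is sketched below. It rests on a deliberately simple assumption that real genomes do not satisfy exactly: random sequence with all four bases equally likely, so that 3 of the 64 codons are stops. Under that assumption the chance that n consecutive codons contain no stop is (61/64)^n. The function name is illustrative.

```python
def p_no_stop(n_codons, stop_fraction=3 / 64):
    """Probability that n consecutive random codons contain no stop codon.

    Assumes independent codons with uniform base composition (3 stops in 64).
    """
    return (1 - stop_fraction) ** n_codons


for n in (10, 30, 100, 300):
    print(n, p_no_stop(n))
```

Under this model a 10-codon stop-free run is more likely than not, while a 100-codon run has a chance below 1%, which is the intuition behind treating short isolated ORFs with more suspicion than long ones.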

Common mistakes

Counting any ATG as a start codon without checking that an in-frame stop follows before the sequence runs out. An ATG with no downstream in-frame stop within your fragment is only the possible start of an ORF that continues past what you have, not a complete one.
Stopping after the three forward reading frames. A coding fragment is just as likely to have its ORF on the reverse complement strand, especially when you don't already know which orientation you were handed.
Leaving Ns or other IUPAC ambiguity codes uncounted in the composition or restriction-site scan. A run of Ns silently skews GC% and can hide a real restriction site or fabricate a false one if it isn't excluded or flagged separately.
Assuming a text match for a recognition sequence means the enzyme will cut there. Type IIS enzymes (BsaI, BsmBI and similar, common in Golden Gate assembly) cut at a fixed distance outside a non-palindromic recognition site rather than within a palindrome, so a plain substring search doesn't tell the whole story for every enzyme.
Reading "no long ORF" as "definitely non-coding." A fragment cut off mid-gene, a single sequencing error that introduces a frameshift or a spurious stop, or a genuinely short coding peptide can all suppress a real ORF that a longer or cleaner read would reveal.
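Two of the mistakes above lend themselves to simple guards, sketched here with hypothetical helper names. The first computes GC% over unambiguous bases only and reports how many ambiguous symbols were set aside; the second searches both strands for a recognition sequence, which matters for non-palindromic sites such as BsaI's GGTCTC. Neither models where the enzyme actually cuts.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_excluding_ambiguous(seq):
    """Return (GC% over A/C/G/T only, number of ambiguous symbols)."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    gc = seq.count("G") + seq.count("C")
    gc_percent = 100 * gc / unambiguous if unambiguous else None
    return gc_percent, len(seq) - unambiguous


def site_hits_both_strands(seq, site):
    """Return (strand, 1-based start on the given strand) for each match."""
    seq = seq.upper()
    patterns = {"+": site.upper(), "-": reverse_complement(site)}
    if patterns["+"] == patterns["-"]:
        del patterns["-"]  # palindromic site: one search covers both strands
    hits = []
    for strand, pattern in patterns.items():
        pos = seq.find(pattern)
        while pos != -1:
            hits.append((strand, pos + 1))
            pos = seq.find(pattern, pos + 1)
    return hits


print(gc_excluding_ambiguous("GCNNNNAT"))             # (50.0, 4)
print(site_hits_both_strands("AAGAGACCTT", "GGTCTC"))  # [('-', 3)]
```

In the first example a naive count over all eight symbols would report 25% GC; in the second, a forward-only substring search would report no BsaI site at all.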

Frequently asked questions

What's the first thing to check on an unknown sequence?

Length and GC content. They immediately tell you the scale of the sequence and flag any GC extremes that will affect primer design and PCR before you commit to anything downstream.

How do I tell if a sequence is coding?

Scan all six reading frames for open reading frames. A long ORF running from a start codon to a stop in one frame strongly suggests a coding sequence; translating it and comparing the protein against known sequences strengthens the case, but an ORF on its own is a lead rather than proof.

Why do single-cutter enzymes matter?

An enzyme that cuts your sequence exactly once is ideal for linearising or for directional cloning, while an enzyme that cuts multiple times would fragment the insert. The single cutters are usually the ones you build a cloning strategy around.

What's the difference between an ORF and a gene?

An ORF is just a sequence pattern — a run of codons from a start codon to an in-frame stop with no interruption. A gene is the actual genomic locus that produces a transcript, including regulatory and untranslated sequence the ORF doesn't cover, and a CDS is the specific claim that a given ORF is the real coding region of that transcript. Finding a long ORF is a strong hint, not proof of either.

Why would the real ORF turn out to be on the reverse strand instead of the strand I was given?

DNA is double-stranded, and whoever handed you the sequence had to pick one strand to write down — that choice has nothing to do with which strand actually encodes anything. Scanning only the three forward frames misses any ORF that runs the other way, so a full check always covers all six frames: three on the sequence as given and three on its reverse complement.

The composition, ORF and enzyme checks all look fine — how do I find out what the sequence actually is, not just its properties?

Those checks describe the fragment's properties, not its identity. To confirm what it actually is, you need to compare it against something known — for example fetching a candidate reference by accession and running a pairwise or multiple sequence alignment against it to see how well they match. If you don't already have a specific candidate in mind, a full database-wide similarity search (BLAST-style) is a separate step this workflow doesn't replace.
