HGVS Sequence Variant Nomenclature — General Recommendations (continues)

  • combination the changes were found. An additional column can be used to mention additional findings and to make remarks.
  • to avoid confusion, the description of a variant should be preceded by a letter indicating the type of reference sequence used. Possible reference sequence types (see Figure);
    • “c.” for a coding DNA sequence (like c.76A>T)
    • “g.” for a genomic sequence (like g.476A>T)
    • “m.” for a mitochondrial sequence (like m.8993T>C, see Reference Sequence)
    • NEW “n.” for a non-coding RNA reference sequence (gene producing an RNA transcript but not a protein, see Community consultation 002)
    • “r.” for an RNA sequence (like r.76a>u)
    • “p.” for a protein sequence (like p.Lys76Asn)
  • NEW the DNA reference sequence used should preferably be an LRG (Locus Reference Genomic sequence, see Reference Sequence). Reporting using LRGs as a reference is possible for genomic DNA (e.g. LRG_1:g.8463G>C), coding DNA (e.g. LRG_1t1:c.372…), non-coding RNA (e.g. LRG_163t1:n.3C>T) and protein (e.g. LRG_1p1:p.Gly191Ala) variants. To describe coding DNA or non-coding RNA variants the transcript must be indicated (e.g. “t1”); for protein variants, the protein isoform (e.g. “p1”).
    • within one document only one DNA reference sequence should be used. When variants in more than one sequence (gene) are described, confusion should be prevented by including a unique indicator in the description. The indicator and the description of the change are separated by a colon (“:”), like in LRG_1t1:c.3G>T (or, but only when the reference sequence used is clearly defined elsewhere, COL1A1:c.3G>T).
      NOTE: this format is especially important for unequivocal descriptions of SNPs (see Discussion). NEW When both the HGNC-approved gene symbol and the database accession.version number are indicated, this should be done using the format NM_004006.1(DMD):c.3G>T.
    • NEW the coding DNA reference sequence used should represent the major and largest transcript of the gene. Alternatively spliced exons (5’-first, internal or 3’-terminal) derived from within the gene can be numbered as for intronic sequences. Variants from transcripts initiating or terminating outside this region can be described as upstream / downstream sequences (see Reference Sequence discussions).
    • protein reference sequences should represent the primary translation product, not a processed mature protein (see FAQ).
    • NEW when changes start or end in another sequence (gene), e.g. for large deletions, the nucleotide numbering for that endpoint is preceded by the identifier of that sequence (like c.827_NM_004004.3:c.233del). When the endpoint occurs on the opposite, non-transcribed strand (anti-sense strand), an “o” precedes the reference identifier (like c.827_oNM_004004.3:c.233del, see Discussion).
    • NEW when a variant affects more than one gene, to prevent confusion, the variant should be described in relation to all genes affected.
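The reference and prefix conventions above can be illustrated with a small parser. This is a minimal sketch, not an official HGVS tool: the regular expression, the function name `split_description` and the returned field names are assumptions made for illustration, and descriptions whose endpoint lies in another sequence (the “o” form) are not handled.

```python
import re

# Reference-sequence type prefixes listed in the recommendations above.
SEQUENCE_TYPES = {
    "c": "coding DNA",
    "g": "genomic",
    "m": "mitochondrial",
    "n": "non-coding RNA",
    "r": "RNA",
    "p": "protein",
}

# Hypothetical pattern: an optional "reference:" part, optionally carrying a
# gene symbol in parentheses, then the one-letter type, a dot and the change.
DESCRIPTION = re.compile(
    r"^(?:(?P<reference>[^:()\s]+)(?:\((?P<gene>[^()\s]+)\))?:)?"
    r"(?P<type>[cgmnrp])\.(?P<change>\S+)$"
)


def split_description(text):
    """Split a variant description into reference, gene, type and change."""
    match = DESCRIPTION.match(text)
    if match is None:
        raise ValueError(f"not a recognised description: {text!r}")
    parts = match.groupdict()
    parts["type_name"] = SEQUENCE_TYPES[parts["type"]]
    return parts
```

For example, `split_description("NM_004006.1(DMD):c.3G>T")` separates the accession.version number, the gene symbol, the sequence type and the change, while a bare `p.Lys76Asn` yields no reference at all.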
  • for a clear distinction, descriptions at DNA, RNA and protein level are unique;
    • DNA-level
      in capitals, starting with a number referring to the first nucleotide affected (like c.76A>T or g.476A>T)
    • RNA-level
      in lower-case, starting with a number referring to the first nucleotide affected (like r.76a>u)
    • protein level
      in capitals, starting with a letter referring to the first amino acid affected (like p.Lys76Asn)
  • nucleotide numbering (for details and examples see Reference Sequence and Numbering)
    • coding DNA Reference Sequence (see Figure and Numbering)
      • there is no nucleotide 0
      • nucleotide 1 is the A of the ATG-translation initiation codon
      • the nucleotide 5’ of the ATG-translation initiation codon is -1, the previous -2, etc.
        NEW NOTE: den Dunnen & Antonarakis (Hum.Mut. 15: 7-12) write “For genomic DNA and cDNA sequences, the A of the ATG of the initiator Methionine codon is denoted nucleotide +1”. This is an error; the correct statement is “In coding DNA reference sequences, the A of the ATG of the initiator Methionine codon is denoted nucleotide 1”.
      • NEW the nucleotide 3’ of the translation stop codon is *1, the next *2, etc.
      • intronic nucleotides
        • beginning of the intron; the number of the last nucleotide of the preceding exon, a plus sign and the position in the intron, like c.77+1G, c.77+2T, etc.
        • end of the intron; the number of the first nucleotide of the following exon, a minus sign and the position upstream in the intron, like c.78-1G.
        • in the middle of the intron, numbering changes from “c.77+..” to “c.78-..”; for introns with an uneven number of nucleotides the central nucleotide is the last described with a ”+” (see Reference Sequence discussions)
    • genomic Reference Sequence (see Figure)
      • nucleotide numbering is purely arbitrary and starts with 1 at the first nucleotide of the database reference file
        NEW NOTE: den Dunnen & Antonarakis (Hum.Mut. 15: 7-12) write “For genomic DNA and cDNA sequences, the A of the ATG of the initiator Methionine codon is denoted nucleotide +1”. This is an error; the correct statement is “In genomic reference sequences, the first nucleotide is nucleotide 1”.
      • no +, - or other signs are used
      • the sequence should include all nucleotides covering the sequence (gene) of interest and should start well 5’ of the promoter of a gene
      • when the complete genomic sequence is not known, a coding DNA reference sequence should be used
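The coding DNA numbering rules above (no nucleotide 0, negative numbers 5’ of the ATG, “*” numbers 3’ of the stop codon, and “+”/“-” offsets inside introns) can be sketched as two helper functions. The function names and the 0-based index inputs are assumptions of this sketch; genomic numbering needs no such helper because it simply counts from 1 at the first nucleotide of the reference file.

```python
def cds_position(index, cds_start, cds_end):
    """Label a 0-based transcript index relative to the coding region.

    cds_start is the index of the A of the ATG and cds_end the index of the
    last nucleotide of the stop codon (both hypothetical inputs).
    """
    if index < cds_start:
        return str(index - cds_start)      # -1, -2, ... (there is no 0)
    if index > cds_end:
        return f"*{index - cds_end}"       # *1, *2, ...
    return str(index - cds_start + 1)      # 1 is the A of the ATG


def intron_position(k, intron_length, last_exon_pos, next_exon_pos):
    """Label the k-th nucleotide (1-based) of an intron.

    The first half is counted from the preceding exon with "+", the second
    half from the following exon with "-". For an intron with an uneven
    number of nucleotides the central nucleotide still gets a "+".
    """
    if not 1 <= k <= intron_length:
        raise ValueError("k lies outside the intron")
    half = (intron_length + 1) // 2
    if k <= half:
        return f"{last_exon_pos}+{k}"
    return f"{next_exon_pos}-{intron_length - k + 1}"
```

With an intron of 5 nucleotides between c.77 and c.78, `intron_position(3, 5, 77, 78)` gives the central nucleotide as `77+3`, and the next one is `78-2`.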
  • specific changes
    • “>” indicates a substitution at DNA level (like c.76A>T)
    • “_” (underscore) indicates a range of affected residues, separating the first and last residue affected (like c.76_78delACT, see Discussion)
    • “del” indicates a deletion (like c.76delA)
    • “dup” indicates a duplication (like c.76dupA); NEW duplicating insertions are described as duplications, not as insertions (e.g. ACTTTGTGCC to ACTTTGTGGCC is described as c.8dupG, not as c.8_9insG, see Discussion)
    • “ins” indicates an insertion (like c.76_77insG)
    • NEW “inv” indicates an inversion (like c.76_83inv)
    • NEW “con” indicates a conversion (like c.123_678conNM_004006.1:c.123_678, see Recommendations)
    • “[]” indicates an allele (like c.[76A>T], see Recommendations)
    • NEW “()” is used when the exact position of a change is not known; the range of the uncertainty is described as precisely as possible and listed between parentheses (like c.(67_70)insG, see Uncertainties)
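As an illustration of how the symbols above distinguish the change types, the following sketch classifies the part of a DNA-level description that follows the “c.” or “g.” prefix. The patterns are deliberately simplified assumptions (plain integer positions only, no intronic offsets, alleles or uncertain ranges), and `classify_change` is a hypothetical name.

```python
import re

# Ordered (name, pattern) pairs for the DNA-level symbols listed above.
# Positions are simplified to plain integers, an assumption of this sketch.
CHANGE_PATTERNS = [
    ("substitution", re.compile(r"^\d+[ACGT]>[ACGT]$")),
    ("deletion", re.compile(r"^\d+(_\d+)?del[ACGT]*$")),
    ("duplication", re.compile(r"^\d+(_\d+)?dup[ACGT]*$")),
    ("insertion", re.compile(r"^\d+_\d+ins[ACGT]+$")),
    ("inversion", re.compile(r"^\d+_\d+inv$")),
    ("conversion", re.compile(r"^\d+_\d+con\S+$")),
]


def classify_change(change):
    """Return the change type for the text after the 'c.' or 'g.' prefix."""
    for name, pattern in CHANGE_PATTERNS:
        if pattern.match(change):
            return name
    return "unrecognised"
```

Descriptions with uncertain positions, such as the parenthesised range in the last bullet, fall through to `"unrecognised"` here; a fuller parser would need a richer position grammar.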
  • miscellaneous
    • for all descriptions the most 3’ position possible is arbitrarily assigned to have been changed; this is especially important in single residue (nucleotide or amino acid) stretches or tandem repeats (see Recommendations, see Discussion)
    • variability in the number of repeated sequences (e.g. in ATGCGATGTGTGCC) is described as c.123+74TG(3_6) (see Recommendations)
    • NEW triplications, quadruplications, etc. are described as alleles of variable short sequence repeats; c.87_93[3] describes a triplication of the 7 nucleotides from coding DNA position 87 to 93 (not as c.87_93tri, see Discussion)
    • two sequence variants in one individual
      • two sequence changes in different alleles (e.g. for recessive diseases) are listed between square brackets, separated by a ”;“-character; c.[76A>C];[87delG] (see Discussion)
      • two sequence variants in one allele are listed between square brackets, separated by a ”;“-character; c.[76A>C;83G>C] (see Discussion)
      • NEW two sequence changes with alleles unknown are listed between square brackets, separated by ”(;)”; c.[76A>C(;)83G>C] (see FAQ)
      • NEW descriptions of sequence changes in different genes (e.g. for recessive diseases) are listed between square brackets, separated by a ”;“-character and include a reference to the sequence (gene) changed; DMD:c.[76A>C];GJB1:c.[87delG] (see Discussion)
    • NEW mosaic cases: two different nucleotides at one position in one allele are listed between square brackets, separated by a “/”-character; c.[=/83G>C] (see Recent changes)
    • NEW chimeric cases: two different nucleotides at one position in one allele are listed between square brackets, separated by a ”//“-character; c.[=//83G>C] (see Recent changes)
  • a unique identifier should be assigned to each variant; when available, the OMIM-identifier can be used, otherwise database curators should assign a unique identifier.
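Two of the rules above interact: the most 3’ position possible is assigned to the change, and a duplicating insertion is described as a duplication. A minimal sketch of that logic follows, assuming the reference fragment starts at coding DNA position 1 and extends beyond the variant; the function name and argument convention are hypothetical.

```python
def describe_insertion(reference, after, inserted):
    """Describe inserting `inserted` after 1-based position `after`.

    The insertion point is shifted as far 3' as possible, and the result is
    reported as a duplication when the inserted sequence copies the
    nucleotides directly 5' of the shifted insertion point.
    """
    if not inserted or not 1 <= after <= len(reference):
        raise ValueError("expected a non-empty insert inside the reference")
    pos, ins = after, inserted
    # Shift 3': while the next reference nucleotide equals the first inserted
    # nucleotide, the same result is obtained one position further along.
    while pos < len(reference) and reference[pos] == ins[0]:
        ins = ins[1:] + ins[0]
        pos += 1
    size = len(ins)
    if pos >= size and reference[pos - size:pos] == ins:
        start = pos - size + 1
        span = str(pos) if size == 1 else f"{start}_{pos}"
        return f"c.{span}dup{ins}"
    return f"c.{pos}_{pos + 1}ins{ins}"
```

Applied to the example from the list, inserting a G after position 7 or after position 8 of ACTTTGTGCC both give ACTTTGTGGCC, and both calls return `c.8dupG`.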

Detailed recommendations

  • DNA level
  • RNA level
  • protein level


Copyright © HGVS 2010 All Rights Reserved Website Created by Rania Horaitis, Nomenclature by J.T. Den Dunnen - Disclaimer

Protein level


(suggestions extending the published recommendations are marked NEW, in italics)

NOTE: definitions of protein changes have been extensively reviewed (2013-Q2). This did not affect HGVS recommendations for descriptions but it did change under which category specific types are listed. For example, where a nonsense variant (p.W26*) was originally listed under Substitutions it is now listed under Deletions.

The recommendations for the description of protein variants explain how changes in the sequence of a protein should be described. Note that changes at protein level are a consequence of a variant at DNA level that may or may not have influenced the processing of the RNA. Reports of variants observed directly at protein level exist, e.g. from mass spectrometry or amino acid sequencing. In some cases indirect evidence might come from protein sizing (Western blot analysis) or localisation (immuno-histochemistry, in situ). Otherwise, protein level variants will be deduced only, predicted from the changes detected on DNA and/or RNA. Specific terms are used to describe the consequences of a change at protein level, like missense, nonsense, silent and frame shift. Usually these terms are not used in the descriptions given below. Missense is under substitution, nonsense under deletion, silent under no change and frame shift under deletion/insertion (indel).

General

sequence changes at protein level are described like those at the DNA level with the following modifications / additions;

  • descriptions at protein level may only be given in addition to a description at DNA (and RNA) level
  • descriptions at protein level should describe the changes observed on protein level and not try to incorporate any knowledge regarding the change at DNA-level (see FAQ)
    • NEW to indicate that the description at protein level is without any experimental evidence it is recommended that, when neither RNA nor protein has been analysed, the description is given between parentheses, like p.(Arg22Ser) (see Discussion 2012-10-12)
  • amino acids are described using either the 1- or the 3-letter amino acid code (see Important changes)
    • the 3-letter amino acid code is preferred to describe the amino acid residues (see Discussion)
    • with capital first letter (not as “trp26” or “Trp²⁶”)
  • for all descriptions the most C-terminal position possible is arbitrarily assigned to have been changed
  • for nonsense variants the description does not include the deletion at protein level from the site of the change to the C-terminal end of the protein (so p.Trp26Ter, not p.Trp26_Leu833del)
  • alleles are described using square brackets (“p.[]”)
  • Miscellaneous
    • unknown effect
      • p.? - protein has not been analysed, an effect is expected but difficult to predict
      • p.(=) - protein has not been analysed, but no change is expected
      • p.= - protein has not been analysed, RNA was, but no change is expected
    • no protein
      changes which affect the promoter of a gene, the transcription initiation site (cap site), the translation initiation site etc. may affect the amount of protein produced;
      • p.0 - no protein can be detected (experimental data should be available)
      • p.0? - probably no protein is produced
    • amount of protein
      changes which do not affect the protein sequence itself but only the amount of protein produced (other than no protein
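The protein-level conventions above (3-letter code preferred, capital first letter, parentheses when the change is only predicted) can be sketched as a small formatter for substitutions. The function name and its `predicted` flag are assumptions of this sketch; the amino acid table is the standard 1-letter to 3-letter mapping, with “*” (stop) written as “Ter”.

```python
# Standard 1-letter to 3-letter amino acid codes; "*" (stop) becomes "Ter".
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def protein_substitution(ref, position, alt, predicted=True):
    """Format a protein-level substitution from 1-letter codes.

    When `predicted` is true (neither RNA nor protein was analysed), the
    description is wrapped in parentheses, as recommended above.
    """
    body = f"{THREE_LETTER[ref]}{position}{THREE_LETTER[alt]}"
    return f"p.({body})" if predicted else f"p.{body}"
```

For instance, `protein_substitution("R", 22, "S")` gives the predicted form `p.(Arg22Ser)`, and `protein_substitution("W", 26, "*", predicted=False)` gives `p.Trp26Ter`.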
