
Giving every MAVE variant a precise, computable identity with VRS

Why this matters

MaveDB collects the results of multiplexed assays of variant effect — experiments that measure the functional impact of thousands of genetic variants at once. Every contributing lab describes its variants differently: against its own engineered target sequence, at the protein level or the DNA level, in whatever notation suited the experiment. That made it hard to know when two records described the same change, or to connect a measured variant to anything outside the original study. MaveDB now maps every variant it stores into a single shared standard, so each one carries a precise, computable identity that any other system can recognize without having to understand how the original experiment was written down.

The story

A multiplexed assay reports variants in the terms of its own experiment. A deep mutational scan of a protein names amino-acid changes against an engineered target; a saturation genome editing screen names nucleotide changes against a genomic window. The same biological change can therefore arrive in MaveDB in several notations, on several reference sequences, depending on who ran the assay. Stored as raw strings, those records can't be compared, searched precisely, or linked to the wider variant ecosystem.

MaveDB resolves this by mapping every variant into the GA4GH Variant Representation Specification (VRS) 2.0. The dcd-mapping pipeline takes each variant's HGVS description, aligns the assay's target to a standard reference sequence using cool-seq-tool and cdot, and produces a normalized VRS Allele. VRS then computes a content-addressed digest — a hash derived deterministically from the variant's location and state — and uses it as the allele's identifier (ga4gh:VA.…). Two records that describe the same change normalize to the same digest, no matter how the original experiments phrased them.
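The content-addressing idea can be sketched in a few lines of Python. The sha512t24u function below follows the GA4GH truncated-digest scheme (SHA-512, first 24 bytes, base64url). Everything else is an illustrative assumption: toy_identify and the flat record layout are hypothetical, and the real VRS serialization has additional rules about which fields are digested and how nested objects are handled, so this sketch will not reproduce real ga4gh:VA identifiers. In practice the identifier comes from vrs-python.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def toy_identify(obj: dict, prefix: str) -> str:
    """Hypothetical helper: serialize with sorted keys and no whitespace, then digest.

    This only illustrates the principle. It does not implement the real
    VRS serialization rules and will not match real ga4gh identifiers.
    """
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"ga4gh:{prefix}.{sha512t24u(blob)}"


# Two submissions describing the same change, with fields in a different order.
record_a = {"type": "Allele", "start": 5, "end": 6, "state": "G"}
record_b = {"state": "G", "end": 6, "start": 5, "type": "Allele"}
# A different substitution at the same position.
record_c = {"type": "Allele", "start": 5, "end": 6, "state": "A"}

id_a = toy_identify(record_a, "VA")
id_b = toy_identify(record_b, "VA")
id_c = toy_identify(record_c, "VA")

print(id_a == id_b)  # True: same content, same identifier
print(id_a == id_c)  # False: different state, different identifier
```

Because the identifier depends only on the serialized content, no registry or lookup service is needed to decide whether two records describe the same allele.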

MaveDB keeps both a pre-mapped allele (on the assay's own target sequence, preserving exactly what was measured) and a post-mapped allele (on a standard human reference), so nothing about the original experiment is lost while everything gains a shared identity. This VRS representation is the substrate for the rest of MaveDB's modern backend: it is how variants are stored precisely, how they are searched, and what every downstream annotation attaches to.

The data

The record below is a real post-mapped VRS 2.0 Allele for the UBE2I variant p.Leu6Gly, from the deep mutational scan in score set urn:mavedb:00000001-a-1 (Weile et al., 2017). The id and digest are computed from the location and state, so the same change would produce the same digest from any source:

VRS 2.0 Allele — UBE2I p.Leu6Gly
{
  "id": "ga4gh:VA.P39KFBT8kdyfg79JH7IBX-4JKXGrzCxb",
  "type": "Allele",
  "state": {
    "type": "LiteralSequenceExpression",
    "sequence": "G"
  },
  "digest": "P39KFBT8kdyfg79JH7IBX-4JKXGrzCxb",
  "location": {
    "id": "ga4gh:SL.o4bho24Xqm_HS5mD8-HDjtmtLCZ5XLez",
    "end": 6,
    "type": "SequenceLocation",
    "start": 5,
    "digest": "o4bho24Xqm_HS5mD8-HDjtmtLCZ5XLez",
    "sequenceReference": {
      "type": "SequenceReference",
      "label": "NP_003336.1",
      "refgetAccession": "SQ.hy5ErT-cGJovsPYIgzchb3BvYQ2MkKB3"
    }
  },
  "extensions": [
    {
      "name": "vrs_ref_allele_seq",
      "type": "Extension",
      "value": "L"
    }
  ],
  "expressions": [
    {
      "value": "NP_003336.1:p.Leu6Gly",
      "syntax": "hgvs.p"
    }
  ]
}

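The location points into a standard protein reference (NP_003336.1, addressed by its content-based refgetAccession). Its start of 5 and end of 6 are VRS inter-residue coordinates (zero-based, counting the gaps between residues), so together they span the sixth residue. The state records the substituted residue (G), the vrs_ref_allele_seq extension records the reference residue at that position (L), and the expressions block carries the human-readable HGVS (NP_003336.1:p.Leu6Gly) alongside the machine identifier.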
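As a worked check of how those fields fit together, the sketch below reads a trimmed copy of the record above and rebuilds the HGVS string from the location, state, and reference-residue extension. The describe_substitution helper and the two-entry AA3 table are hypothetical names introduced for illustration, and treating vrs_ref_allele_seq as the reference residue is inferred from this record rather than taken from a specification.

```python
import json

# Three-letter codes for the two residues in this example only; a full
# table would cover all twenty amino acids.
AA3 = {"L": "Leu", "G": "Gly"}

# Trimmed copy of the allele shown above (digests of nested objects omitted).
allele = json.loads("""
{
  "id": "ga4gh:VA.P39KFBT8kdyfg79JH7IBX-4JKXGrzCxb",
  "digest": "P39KFBT8kdyfg79JH7IBX-4JKXGrzCxb",
  "type": "Allele",
  "state": {"type": "LiteralSequenceExpression", "sequence": "G"},
  "location": {
    "type": "SequenceLocation",
    "start": 5,
    "end": 6,
    "sequenceReference": {"type": "SequenceReference", "label": "NP_003336.1"}
  },
  "extensions": [
    {"name": "vrs_ref_allele_seq", "type": "Extension", "value": "L"}
  ],
  "expressions": [
    {"value": "NP_003336.1:p.Leu6Gly", "syntax": "hgvs.p"}
  ]
}
""")


def describe_substitution(record: dict) -> str:
    """Hypothetical helper: rebuild an hgvs.p string for a one-residue substitution."""
    loc = record["location"]
    start, end = loc["start"], loc["end"]
    if end - start != 1:
        raise ValueError("this sketch only handles single-residue substitutions")
    # Assumption: the vrs_ref_allele_seq extension holds the reference residue.
    ref = next(
        ext["value"]
        for ext in record["extensions"]
        if ext["name"] == "vrs_ref_allele_seq"
    )
    alt = record["state"]["sequence"]
    position = start + 1  # zero-based inter-residue start -> one-based residue number
    label = loc["sequenceReference"]["label"]
    return f"{label}:p.{AA3[ref]}{position}{AA3[alt]}"


hgvs = describe_substitution(allele)
print(hgvs)  # NP_003336.1:p.Leu6Gly
print(hgvs == allele["expressions"][0]["value"])  # True
print(allele["id"] == f"ga4gh:VA.{allele['digest']}")  # True
```

The last check shows the relationship between the two identifier fields: the id is the digest with the ga4gh:VA. type prefix attached.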

The tools used

  • dcd-mapping — MaveDB's pipeline that aligns each assay's target to a reference and emits VRS alleles for every variant in a score set.
  • vrs-python (ga4gh.vrs 2.0.0-a6) — VRS 2.0 Allele/Haplotype models, normalization, and digest computation (ga4gh_identify).
  • cool-seq-tool 0.4.0.dev3 and cdot — transcript selection and alignment between assay targets and standard references.
  • seqrepo — sequence storage and refget accession resolution.
  • MaveDB API — stores the resulting VRS alleles and serves them as MaveDB's canonical variant representation.

How to reuse this pattern