9/27/2014

Maintaining stable ids between GRCh37 and GRCh38

http://www.ensembl.info/blog/2014/07/08/maintaining-stable-ids-between-grch37-and-grch38/

As mentioned in another post, due to the presence of patches in both GRCh37 and GRCh38, the assembly mapping has proven challenging.
Related to this, another novelty arises when assigning stable ids to genes.
Every time a gene set is updated for a species, we compare the newest gene set with the previous one.
If we find a perfect match between the two gene sets, the stable id assigned to the older model will be used for the new model.
Even if the model has changed slightly (longer UTR for example), we try to map the old stable id whenever possible, with a version change to indicate that it was not a perfect match.
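As a rough illustration only (this is not the actual Ensembl mapping pipeline), the three outcomes described above can be sketched as follows. The `Gene` type, the `carry_over_id` function, the exon-overlap score and the 0.5 threshold are all assumptions made for this example, and the identifiers are placeholders.

```python
from typing import NamedTuple, Tuple

# Exons as inclusive (start, end) pairs; assumed non-overlapping within one model.
Exons = Tuple[Tuple[int, int], ...]


class Gene(NamedTuple):
    stable_id: str
    version: int
    exons: Exons


def overlap_fraction(old: Exons, new: Exons) -> float:
    """Fraction of the old model's exonic bases also covered by the new model."""
    total = sum(end - start + 1 for start, end in old)
    shared = sum(
        max(0, min(e1, e2) - max(s1, s2) + 1)
        for s1, e1 in old
        for s2, e2 in new
    )
    return shared / total if total else 0.0


def carry_over_id(old: Gene, new_exons: Exons, fresh_id: str,
                  threshold: float = 0.5) -> Gene:
    """Decide which stable id and version the new model receives.

    The threshold is a hypothetical cut-off, not a value used by Ensembl.
    """
    if old.exons == new_exons:
        # Perfect match: same stable id, same version.
        return Gene(old.stable_id, old.version, new_exons)
    if overlap_fraction(old.exons, new_exons) >= threshold:
        # Slightly changed model: same stable id, version incremented.
        return Gene(old.stable_id, old.version + 1, new_exons)
    # No acceptable match: the model gets a new stable id.
    return Gene(fresh_id, 1, new_exons)


old_gene = Gene("GENE_OLD", 3, ((100, 200), (300, 400)))   # placeholder id
extended = carry_over_id(old_gene, ((100, 200), (300, 450)), "GENE_NEW")
print(extended.stable_id, extended.version)
```

Here the last exon has been extended, so the old id is kept and only the version changes.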
To provide a better comparison between the last GRCh37 gene set (e!75) and the new GRCh38 gene set (e!76), we decided to project the old set onto the new assembly. This allows overlap comparisons rather than simple sequence alignments. However, it also means that around 2% of the genes are lost, as they cannot be mapped onto the new assembly. If these gene models are still present in the new assembly, they are assigned a new stable id.
In the context of patch fixes integrated into the new reference, we also have cases where two genes in GRCh37 (one on the reference, one on the patch) both match the same gene on the GRCh38 reference.
In that case, we have decided to arbitrarily keep the longest-standing stable ID, which is likely to be the one on the reference.
The stable ID which was used on the patch is recorded as retired but a link is provided to its replacement. For example, searching for ENSG00000260384 (SERINC2 gene on HG989_PATCH) will redirect the user to ENSG00000168528 (SERINC2 on the primary assembly).
This resulted in the deletion of around 3% of our genes.
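The tie-break for two GRCh37 genes that match the same GRCh38 gene could be sketched like this. It is a minimal example, not the real implementation: the function name `resolve_collision` and the idea of ranking by a "first release" number are assumptions, and the release numbers below are invented purely for illustration.

```python
from typing import Dict, List, Tuple


def resolve_collision(candidates: List[Tuple[str, int]]) -> Tuple[str, Dict[str, str]]:
    """Pick the stable id to keep among ids that match the same new gene.

    candidates: (stable_id, first_release) pairs, where first_release is a
    hypothetical record of the release in which the id first appeared.
    Returns the kept id and a map from each retired id to its replacement.
    """
    # The longest-standing id is the one with the earliest first release.
    kept = min(candidates, key=lambda c: c[1])[0]
    redirects = {sid: kept for sid, _ in candidates if sid != kept}
    return kept, redirects


# SERINC2 example from the text; the release numbers are made up.
kept, redirects = resolve_collision([
    ("ENSG00000260384", 70),  # gene on HG989_PATCH (illustrative release)
    ("ENSG00000168528", 10),  # gene on the primary assembly (illustrative release)
])
print(kept)
print(redirects)
```

The retired id is not simply dropped: it stays in the redirect map so that a search for it can still reach its replacement.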
In other cases, the difference between the GRCh37 reference (without the patch) and the GRCh38 reference (with the GRCh37 patch fix integrated) is too large to project annotations from the reference. Only annotations from the patch are then kept, along with their stable ids. For these cases, if there is a known alt_allele to a gene on the GRCh37 reference, it is added as a link to its equivalent on the patch.
Consequently, searching for ENSG00000183678 (CTAG1A gene on the GRCh37 primary assembly) will redirect the user to ENSG00000268651 (CTAG1A gene on HG1497_PATCH in GRCh37, on the primary assembly in GRCh38).
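From the user's side, both kinds of redirect behave like a lookup in a table of retired ids. The sketch below assumes such a table exists as a plain dictionary; the `resolve` helper is hypothetical and says nothing about how the Ensembl website or database stores these links. The two entries are the examples given in this post.

```python
from typing import Dict

# Retired GRCh37 stable ids and their replacements, taken from the examples above.
REDIRECTS: Dict[str, str] = {
    "ENSG00000260384": "ENSG00000168528",  # SERINC2: patch id -> primary assembly id
    "ENSG00000183678": "ENSG00000268651",  # CTAG1A: reference id -> id kept from the patch
}


def resolve(stable_id: str, redirects: Dict[str, str]) -> str:
    """Follow redirects until reaching an id that has not been retired."""
    seen = set()
    while stable_id in redirects:
        if stable_id in seen:
            # Guard against a malformed table containing a loop.
            raise ValueError(f"redirect cycle at {stable_id}")
        seen.add(stable_id)
        stable_id = redirects[stable_id]
    return stable_id


print(resolve("ENSG00000183678", REDIRECTS))
print(resolve("ENSG00000168528", REDIRECTS))  # not retired, returned unchanged
```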
As mentioned in the blog post about the new gene set, a new assembly implies a number of underlying changes in the gene structure.
Despite this, 95% of all the gene stable ids have been assigned to the new gene models.
With this work, we aim to ensure that you will still be able to find your favourite gene using the same stable id as in GRCh37.
