Review
Annu Rev Genomics Hum Genet. 2020 Aug 31;21:55-79.
doi: 10.1146/annurev-genom-121119-083418. Epub 2020 May 18.

Progress, Challenges, and Surprises in Annotating the Human Genome

Daniel R Zerbino et al. Annu Rev Genomics Hum Genet.

Abstract

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.

Keywords: annotation; genes; genome; human; regulatory elements; variants.

Figures

Figure 1. Gene annotation process.
Gene annotation uses diverse orthogonal data types to determine first the structure and then the most likely functional class of the transcript and gene locus. Long transcriptomic data aligned to the reference genome identify the overall exon-intron structure of the transcript, while short RNA sequencing reads give confidence to the annotation of precise intron/exon boundaries and extensions at the ends of the transcripts (5′ and 3′ untranslated regions), especially where coverage from longer reads is low. Some transcript structures may be annotated entirely based on RNA sequencing data, again where coverage from longer reads is low. Terminal short-read data sets help define the 5′ and 3′ ends of transcripts, which is important from both a structural and a functional point of view: where the termini of a transcript can be identified with confidence, lending certainty to the structural annotation, the annotators gain greater confidence in their functional annotation. The presence of high-quality proteomic data and evidence of the evolutionary conservation of coding sequence inform the annotation of coding potential.
Figure 2. Organizations that support the GRC assembly and its gene annotations.
Abbreviations: e!, Ensembl Project; GRC, Genome Reference Consortium; HGNC, Human Genome Organisation (HUGO) Gene Nomenclature Committee; INSDC, International Nucleotide Sequence Database Collaboration; NCBI, National Center for Biotechnology Information; UCSC, University of California, Santa Cruz.
Figure 3. A locus whose identification was possible only through the analysis of recent orthogonal data types.
The locus lacks any support from transcript evidence deposited in INSDC databases, and as such, it is not represented in any reference annotation database. Only by identifying the intersection of PhyloCSF data (to identify conserved protein-coding potential), RNA-seq data (to provide evidence of transcription and tissue specificity), Intropolis RNA-seq-supported intron-spanning reads (to provide evidence for precise split junctions and support tissue specificity from other datasets), CAGE data (to define transcript 5′ ends and tissue specificity support), and polyA-seq data (to define transcript 3′ ends and tissue specificity support) could a correctly spliced transcript model be built and the correct coding sequence added. Given the expectation of conservation, protein-coding genes identified by this annotation process were also annotated in mouse to provide an additional check on their validity. Abbreviations: CAGE, cap analysis gene expression; INSDC, International Nucleotide Sequence Database Collaboration; PhyloCSF, Phylogenetic Codon Substitution Frequencies; polyA-seq, polyA sequencing; RNA-seq, RNA sequencing.
Figure 4. Progress in the annotation of gene loci in Ensembl/GENCODE.
(a) The number of protein-coding genes annotated has generally fallen over time but appears to have stabilized in recent years. The number of pseudogene loci increased rapidly during the annotation of the whole genome (2007–2012) and has maintained slow growth subsequently, while the number of lncRNA loci followed a similar pattern of increase and continues to rise. Small-RNA locus totals are generally stable, only changing when there is a significant update to their automated annotation pipeline, and the relatively few IG and TR segments have remained broadly stable since their initial annotation. (b) The number of transcripts continues to increase over time, particularly for protein-coding genes and lncRNA loci, and given the availability of high-quality long-read data sets, this trend is expected to continue. (c,d) The changes to protein-coding gene counts underlying the relatively stable headline totals for human and mouse, respectively, in three recent Ensembl/GENCODE annotation releases. Protein-coding genes were both added and removed in every human and mouse release, with a total of 33 additions and 48 removals in human and 80 additions and 188 removals in mouse, suggesting that the final gene annotation for protein-coding genes has not yet been settled. Abbreviations: IG, immunoglobulin; lncRNA, long noncoding RNA; TR, T cell receptor.
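
The addition and removal totals quoted for panels c and d imply a net loss of protein-coding genes in both species. The short Python sketch below only works through that arithmetic; the dictionary layout and the `net_change` helper are illustrative assumptions and are not part of any Ensembl/GENCODE tooling.

```python
# Additions and removals of protein-coding genes across the three releases,
# taken from the Figure 4 caption. The data layout is an illustrative assumption.
changes = {
    "human": {"added": 33, "removed": 48},
    "mouse": {"added": 80, "removed": 188},
}


def net_change(counts):
    """Return additions minus removals (a negative value is a net loss)."""
    return counts["added"] - counts["removed"]


net = {species: net_change(counts) for species, counts in changes.items()}

for species, value in net.items():
    print(f"{species}: net change {value:+d}")
```

On these figures the net change is 15 fewer protein-coding genes in human and 108 fewer in mouse, which is small relative to the headline totals but consistent with the caption's point that the annotation is not yet settled.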
