Genome Biol Evol. 2017 Feb 1;9(2):380-397.
doi: 10.1093/gbe/evw307.

Ribosomal RNA Genes Contribute to the Formation of Pseudogenes and Junk DNA in the Human Genome

Brent M Robicheau et al. Genome Biol Evol.

Abstract

Approximately 35% of the human genome can be identified as sequence devoid of a selected-effect function, and not derived from transposable elements or repeated sequences. We provide evidence supporting a known origin for a fraction of this sequence. We show that: 1) highly degraded, but near-full-length, ribosomal DNA (rDNA) units, including both 45S and Intergenic Spacer (IGS), can be found at multiple sites in the human genome on chromosomes without rDNA arrays, 2) these rDNA sequences have a propensity for being centromere proximal, and 3) the sequence at all human functional rDNA array ends is divergent from canonical rDNA to the point that it is pseudogenic. We also show that small sequence strings of rDNA (from 45S + IGS) can be found distributed throughout the genome and are identifiable as an "rDNA-like signal", representing 0.26% of the q-arm of HSA21 and ∼2% of the total sequence of other regions tested. The sizes of sequence strings found in the rDNA-like signal intergrade into the sizes of sequence strings that make up the full-length degrading rDNA units found scattered throughout the genome. We conclude that the displaced and degrading rDNA sequences are likely of a similar origin but represent different stages in their evolution towards random sequence. Collectively, our data suggest that over vast evolutionary time, rDNA arrays contribute to the production of junk DNA. The concept that the production of rDNA pseudogenes is a by-product of concerted evolution represents a previously under-appreciated process; here we demonstrate its importance.

Keywords: concerted evolution; degraded rDNA; genome evolution; vestigial centromere.

Figures

Fig. 1.—
Analysis of pseudogenes located at extreme edges of rDNA arrays (i.e., in Distal Junction [DJ] and Proximal Junction [PJ] regions). (A) Position of rDNA pseudogenes. A diagram of an acrocentric chromosome is provided; the green box represents a centromere and green circles represent telomeres. The insert shown with a dashed line indicates the orientation and size of DJ and PJ regions with associated cosmids and BACs from Floutsakou et al. (2013) shown directly beneath. Cosmid and BAC names are indicated in black, while the chromosome source for the sequence is shown in red. Repeating gray boxes, with a yellow line in the center, indicate the position of functional rDNA relative to DJ and PJ (the word “rDNA” is also given). (B) Visual representation of the degree of dissimilarity of DJ and PJ pseudogenes compared with the canonical rDNA unit. Mature rRNA and spacer regions of the rDNA unit are indicated at the top, and are labelled and coloured blue. DJ and PJ pseudogenes [listed to the left of sequences] are then compared with canonical rDNA, highlighting indels [black horizontal bars] and SNPs [black vertical bars]. All gray regions highlight identical nucleotide positions.
Fig. 2.—
rDNA sequence present along HSA21. The number of nucleotides within alignment hits with respect to position along Chromosome 21 is plotted. Centromere-related sequence is found in the region where histogram bars drop to zero after position 10 Mb. Alignment hits were found using megablast, Discontiguous BLAST and blastn algorithms in BLAST. Three features are shown: “A” indicates pseudogene sequence placed along the HSA21 build from DJ/PJ contigs, “B- -B” indicates centromere associated rDNA hits, and “C-…-C” shows the q-arm of HSA21 and how the level of rDNA-like sequence does not drop to zero (compare to centromere position after 10 Mb).
Fig. 3.—
Analysis of centromere associated rDNA sequence present on HSA21. (A) A subsample of 2 Mb to the left and right of centromere sequence presented in figure 2; it is therefore a more detailed look at feature “B- -B” (fig. 2). Megablast, discontiguous BLAST and blastn were used to obtain alignment hits. Peaks along histogram indicate regions of sequence with high amounts of rDNA-like sequence. (B) Distribution of alignment hit sizes in the 2-Mb regions that are left [red] and right [green] of the centromere of panel A. The data have been log transformed, Σ = sum of nucleotides, and diamonds denote the means of their respective boxplots.
Fig. 4.—
Large megablast alignment hits that match canonical human rDNA plotted along all human chromosomes. X-axes differ to allow all graph panels to be the same size; the y-axes have no value, data points are only vertically spread to make the data more legible. Megablast hits are indicated in blue, while the start/stop position of centromeres are highlighted in red [with their thresholds shown as dashed vertical lines].
Fig. 5.—
Identifying highly degraded rDNA units in regions far removed from rDNA arrays. (A) A closer look at the HSA20 data shown in figure 4; blue dots indicate megablast hits, red dots with accompanying dashed vertical bars indicate centromere thresholds. (B) The 1.6 Mb that contributes most to the assemblage of megablast hits present near the centromere at HSA20. The image is modified from the output produced by BLAST (Altschul et al. 1990; Camacho et al. 2009). Megablast hits are indicated along the 1.6-Mb region and color coded according to size. The genomic coordinates for 1.6 Mb of sequence are also provided. (C) A closer look at the megablast hits present at the 1.2-Mb position of the entire 1.6 Mb region using a dot plot. The ∼80 kb of sequence at 1.2 Mb is compared with the canonical rDNA unit. Using a sliding window size of 50/100 bp, there is a strong indication that this region along the 1.6 Mb of HSA20 contains a highly degraded rDNA unit [see diagonal lines that are highlighted in blue, and which represent matches for the rRNA specifying and spacer region (IGS) of canonical rDNA]. (D) The dot plot shows two diagonal lines as an artefact of which base in the repeated array is counted as “zero”. Since we use the standard base numbering scheme, the real sequence is artificially split. The diagram illustrates where the sequences align within a repeated array. Array is shown in black/gray. Pseudogene is shown in blue. The diagram in (D) is not to scale.
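The split-diagonal artefact described in panel (D) can be reproduced with a toy dot plot. The sketch below is only an illustration under stated assumptions: the 40-nt `unit` sequence, the rotation point, and the `window`/`min_matches` parameters are all hypothetical, and the "50/100 bp" setting in the caption is not interpreted here. It compares a repeat unit with a copy of itself whose numbering starts at a different base.

```python
def dot_plot(seq_a, seq_b, window=8, min_matches=8):
    """Return (i, j) pairs where the windows starting at seq_a[i] and seq_b[j]
    share at least `min_matches` identical positions."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


def diagonal_offsets(hits):
    """Each diagonal in a dot plot corresponds to one constant offset j - i."""
    return sorted({j - i for i, j in hits})


# Hypothetical 40-nt "repeat unit" (not real rDNA sequence).
unit = "ACGTTGCAAGCTTAGGCTAACCGTATGCGATCCATGGTCA"

# The same unit, but numbered from base 15 of the array instead of base 0.
rotated = unit[15:] + unit[:15]

offsets = diagonal_offsets(dot_plot(rotated, unit))
print(offsets)
```

Because the rotated copy begins partway through the unit, its first 25 bases match the unit at offset 15 and its last 15 bases match at offset -25, so one continuous sequence appears as two separate diagonals.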
Fig. 6.—
Determining if rDNA megablast hits occur closer to centromeres. (A) Size of megablast hits presented in figure 4 relative to their distance from centromeres. Black open circles represent megablast hits and a linear trend-line is indicated in blue. (B) The cumulative proportion of megablast hit sites as a function of distance from centromere [black open circles]. For comparison, the cumulative proportions were calculated for 1,000 random assignments of the observed hit counts to locations randomly generated from a uniform distribution. Triangles indicate median cumulative proportions and vertical lines give the 2.5th and 97.5th percentiles of the cumulative proportions over random draws.
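The random-placement comparison in panel (B) can be sketched as a small Monte Carlo simulation. This is a minimal illustration, not the authors' procedure: it assumes a single chromosome arm with the centromere at position 0, draws hit distances uniformly between 0 and a hypothetical `max_distance`, and uses a simple nearest-rank percentile. All function and parameter names are invented for the example.

```python
import random


def cumulative_proportions(distances, thresholds):
    """Fraction of hits lying at or below each distance threshold."""
    n = len(distances)
    return [sum(d <= t for d in distances) / n for t in thresholds]


def percentile(sorted_values, q):
    """Nearest-rank percentile (q between 0 and 100) of a sorted list."""
    idx = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[idx]


def uniform_null_envelope(n_hits, max_distance, thresholds,
                          n_draws=1000, seed=0):
    """For each threshold, return (2.5th percentile, median, 97.5th percentile)
    of the cumulative proportion over `n_draws` uniform random placements."""
    rng = random.Random(seed)
    per_threshold = [[] for _ in thresholds]
    for _ in range(n_draws):
        simulated = [rng.uniform(0, max_distance) for _ in range(n_hits)]
        props = cumulative_proportions(simulated, thresholds)
        for bucket, p in zip(per_threshold, props):
            bucket.append(p)
    envelope = []
    for bucket in per_threshold:
        bucket.sort()
        envelope.append((percentile(bucket, 2.5),
                         percentile(bucket, 50),
                         percentile(bucket, 97.5)))
    return envelope
```

Observed cumulative proportions that sit above the upper percentile at short distances would be consistent with hits clustering near the centromere more than uniform placement predicts.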
Fig. 7.—
Highly degraded rDNA unit present near vestigial centromere on HSA2. (A) Detail of the HSA2 panel from figure 4. The region shown is a close up of sequence to the right of the active centromere and extending to the edge of the vestigial centromere. Megablast hits [blue dots], HSA2’s active centromere [red dot with dashed gray vertical bars] and HSA2’s vestigial centromere threshold [dashed green vertical bars] are indicated along HSA2. The vestigial centromere threshold corresponds to 2q21.3 to 2q22.1 loci (Avarello et al. 1992; Baldini et al. 1993). Because this threshold is a best estimate of cytogenetic banding locations, two gene positions are also plotted as controls: HNMT [Gene ID: 3176] that occurs in 2q22.1 and ACMSD [Gene ID: 130013] that occurs in 2q21.3 [according to the Entrez Gene Database at NCBI]. (B) Megablast hits located at “*” in panel A, and which are associated with a degrading rDNA unit. About 107 kb have been subsampled from the 1 Mb surrounding the high number of hits in panel A. Similar to figure 5C, diagonal lines produced at a sliding window size of 50/100 bp indicate matches to rRNA specifying and IGS rDNA sequence.
Fig. 8.—
Regions of Homo sapiens and Mus musculus genomes that were used in over-representation of rDNA-like sequence analysis. “Synteny blocks” for each organism are indicated on chromosomes (see light blue boxes); a letter to the left indicates the identity we have assigned to a particular block. Blocks with corresponding letters between human and mouse share synteny (e.g., Block A H. sapiens vs. Block A M. musculus). Red circles indicate the position of rDNA clusters. Block A in M. musculus is the only synteny region that comes from a chromosome also harboring an rDNA cluster. Images are not to scale. Ideograms modified from Adler (1992) and Adler and Willis (1991).
Fig. 9.—
Over-representation of ribosomal DNA-like sequence [45S + IGS] distributed in human and mouse genomes. For each synteny block (A–E and X) both rDNA to genomic sequence similarity [see blue diamonds (A) and red triangles (B)] and rDNA to random sequence similarity [see open circles; replicated 10 times per synteny block (n =10)] were calculated. The X-axis in both (A) and (B) represents the percentage of nucleotides that are rDNA-like. Results from probing with combined mature rRNA specifying and spacer (ETS + IGS) sequences are shown (the tailored probe was used). Each synteny block is 200 kb long, with the exception of mouse synteny block A, which is 205,973 bp long. This block is longer because we included the anchoring gene’s sequence in the block. The difference between genomic sequence similarity and random sequence similarity can be interpreted as the “true” over-abundance of rDNA-like sequence (e.g., this value for block C in both human and mouse would be ∼2%).
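The over-abundance calculation described in this caption amounts to subtracting a random-sequence baseline from the genomic value. The sketch below assumes the baseline is the mean of the replicate random-sequence values; the caption does not say how the replicates are summarised, and the function name and the numbers in the usage line are hypothetical, not data from the paper.

```python
from statistics import mean


def over_abundance(genomic_pct, random_pcts):
    """Percentage of rDNA-like nucleotides in the genomic block minus the
    mean percentage found in the random-sequence replicates."""
    return genomic_pct - mean(random_pcts)


# Hypothetical values for one synteny block (percent rDNA-like nucleotides).
example = over_abundance(3.1, [1.0, 1.2, 0.8])
print(round(example, 2))
```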

References

    1. Adler D. 1992. Idiogram album: mouse. http://pathology.washington.edu/research/cytopages/.
    2. Adler D, Willis M. 1991. Idiogram album: human. http://pathology.washington.edu/research/cytopages/.
    3. Aldrup-MacDonald M, Sullivan B. 2014. The past, present, and future of human centromere genomics. Genes 5:33–50.
    4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410.
    5. Avarello R, Pedicini A, Caiulo A, Zuffardi O, Fraccaro M. 1992. Evidence for an ancestral alphoid domain on the long arm of human chromosome 2. Hum Genet. 89:247–249.