Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease

Background Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 ‘GGGGCC’ (G4C2) repeat that causes approximately 5–7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences’ (PacBio) and Oxford Nanopore Technologies’ (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G4C2 repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9.
Results Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinION was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8× coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained > 800× coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual’s repeat region was > 99% G4C2 content, though we cannot rule out small interruptions.
Conclusions Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear. Electronic supplementary material The online version of this article (10.1186/s13024-018-0274-4) contains supplementary material, which is available to authorized users.


Background
Many neurodegenerative diseases, including Huntington's disease [1][2][3][4], spinocerebellar ataxias [1,2], frontotemporal dementia (FTD) [3], and amyotrophic lateral sclerosis (ALS) [3] can be caused by nucleotide repeat expansions [1] that are historically challenging to sequence [4,5]. A repeat expansion is a specific multi-nucleotide DNA sequence that is repeated (i.e., expanded) significantly more times than normal. In 2011, a C9orf72 'GGGGCC' (G4C2) repeat expansion was discovered [3,6] that causes approximately 34% and 26% of familial ALS and FTD cases, respectively [7]. This finding genetically linked ALS and FTD, generating an exciting opportunity to better understand the etiology of both diseases, and potentially develop a therapeutic approach. Individuals with ALS and FTD caused by the G4C2 expansion generally have hundreds to thousands of G4C2 repeats [8], while healthy individuals typically have between 2 and 30 G4C2 repeats [6,9], though a precise cutoff for pathogenicity is unclear [9]. Additional diseases caused by repeat expansions include Fuchs' disease [10], myotonic dystrophy [11], Friedreich's ataxia [12], and Fragile X syndrome [13], among others, demonstrating the breadth of diseases caused by such expansions. Revealing the underlying etiology of these diseases, and discovering additional repeat expansions that directly cause or modify disease, or modify risk for disease, will likely be accelerated through long-read sequencing technologies capable of characterizing at least major portions of the repeat; characterizing these repeats at the nucleotide level will help determine, for example, whether the repeat is interrupted and whether such interruptions mitigate disease, as in other neurodegenerative disorders [14][15][16][17].
It is unclear whether third-generation long-read sequencing platforms such as Pacific Biosciences' (PacBio; RS II and Sequel) and Oxford Nanopore Technologies' (ONT; MinION) can traverse these challenging disease-causing repeats, nor is there a report of nucleotide-level sequencing data in a C9orf72 repeat expansion carrier. Likewise, it is unclear whether the C9orf72 repeat expansion is pure G4C2 repeat in affected carriers, or whether it is interrupted by non-G4C2 sequence. The C9orf72 G4C2 expansion may be the most challenging repeat to sequence, given its extreme length, "pure" GC content [4,5], and propensity to form G-quadruplexes in both RNA [18,19] and DNA [19].
Here, we demonstrate that both PacBio and ONT sequencing platforms can sequence through repeats cloned into plasmids, including the spinocerebellar ataxia type 36 (SCA36) disease-causing 'GGCCTG' repeat expansion [20] and the FTD- and ALS-causing G4C2 repeat expansion. We further report long-read sequencing data from the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using both whole-genome and no-amplification (No-Amp) targeted sequencing [21,22] on the PacBio Sequel. Our findings indicate that long-read sequencing is well suited to characterizing repeat expansions and that this technology has potential to accelerate future genetic discovery efforts across a broad range of diseases that may involve repeat expansions. These technologies may also have potential in clinical and genetic counseling environments for repeat-expansion and other structural variant disorders, generally. Structural mutations, and repeat expansions specifically, are challenging for short-read technologies. Thus, long-read sequencing technologies may be ideal for discovering new disease-causing or disease-modifying repeat expansions that have escaped detection with conventional short-read sequencing.

PacBio RS II and ONT MinION sequence through repeats cloned into plasmids
To generally assess the PacBio and ONT sequencing platforms, we cloned the SCA36 'GGCCTG' (Fig. 1b) and C9orf72 G4C2 (Fig. 1c and d) repeat expansions into plasmids and sequenced them on the PacBio RS II and ONT MinION (Fig. 2). We also included an EGFP-containing plasmid without a repeat expansion, as a control (Fig. 1a). The SCA36 plasmid was included as a secondary control because it has 1/6th lower GC content than the C9orf72 G4C2 repeat; we anticipated that the G4C2 repeat may be too challenging for these technologies, as sequencing GC-rich regions has historically been challenging for any technology [4,5]. We aimed to construct a plasmid containing 62 'GGCCTG' repeats, and two plasmids containing approximately 423 (C9-423) and 774 (C9-774) G4C2 repeats, respectively. Because these long repeat sequences are unstable, most bacterial colonies contained fewer than the targeted number of repeats (Additional file 1: Figure S1).
We also compared how well the platforms sequenced specifically through the repeat regions, first comparing repeat length distributions. Both platforms produced highly similar distributions for all plasmids (Fig. 4), but the repeat lengths varied widely within each plasmid, as expected based on gel intensity curves (Additional file 1: Figure S1). The C9-423 repeat length distribution is more variable than even the C9-774, perhaps because the C9-774 plasmid backbone is more tolerant of the repeat. The median repeat lengths (measured by repeat number, not bases) were highly similar between the two platforms (Fig. 4): 35, 148, and 395 for SCA36, C9-423, and C9-774, respectively, on the PacBio RS II, and 37, 172, and 406, respectively, on the ONT MinION. Repeat lengths for the SCA36 plasmid were confirmed by Sanger sequencing, estimating approximately 37 repeats, before sequence traces became indeterminate (Additional file 1: Figure S2). We also assessed the percentage of reads that extended through the repeats to determine whether the innate characteristics of the repeat affected sequencing performance. Approximately 95.9%, 66.8%, and 43.8% of PacBio RS II reads successfully extended through the repeat for SCA36, C9-423, and C9-774, respectively, while 99.5%, 97.7%, and 83.5% of ONT MinION reads extended through, respectively.
Fig. 1 Schematic diagrams for plasmids used to test PacBio and ONT long-read sequencing technologies. To minimize biases when comparing the PacBio RS II and ONT MinION, we constructed four plasmids, including three repeat-containing plasmids and a non-repeat-containing plasmid. Each plasmid map identifies estimated plasmid size, and the location and size of the repeat within the plasmid. a The first plasmid did not contain a repeat, as a control, but instead included the EGFP gene. The EGFP plasmid was linearized at position 2969 with the AvrII restriction enzyme. b We also constructed a plasmid with 62 repeats of the spinocerebellar ataxia type 36 (SCA36) 'GGCCTG' repeat, which was linearized at position 2873 with AvrII. c A third plasmid contained 423 C9orf72 'GGGGCC' repeats, and was linearized at position 6368 with MluI to maximize non-repeat sequence both upstream and downstream of the repeat, thus avoiding bias against reads in either direction; allowing the repeat to be too close to either end could compromise sequencing or downstream analyses. d We included an additional plasmid with 774 C9orf72 'GGGGCC' repeats to simulate the expansion size found in ALS- or FTD-affected expansion carriers. While 774 repeats is dramatically smaller than the expansion found in many affected carriers, it was the largest we were able to construct reliably, because these repeats are unstable in bacteria. Additionally, while we targeted the number of specified repeats for each plasmid, most colonies contained fewer than the targeted repeats because repeats are generally unstable in bacteria (Additional file 1: Figure S1). Thus, the targeted number of repeats serves as an estimated maximum number of repeats. Plasmids were visualized using AngularPlasmid
Base calling accuracy through the C9-774 repeat region differed dramatically between the PacBio RS II and ONT MinION. After aligning the repeat region for individual reads to the expected repeat sequence of the same length, using the global Needleman-Wunsch algorithm [23], the median PacBio RS II error rate for C9-774 was 7.4%, while the median ONT MinION error rate for C9-774 was 47.3%. The PacBio RS II consensus sequence (Additional file 1: Data S1) contained 774 repeats in the C9-774 plasmid, attaining approximately 99.8% accuracy. The ONT MinION consensus sequence (Additional file 1: Data S2) also contained 774 repeats in the C9-774 plasmid, but was only approximately 26.6% accurate, because many guanines and cytosines were erroneously called as adenine. Thus, in the ONT MinION consensus sequence, guanines and cytosines were represented as mixed nucleotides (e.g., R representing G or A, and M representing C or A). Exactly 553 (71.4%) of the 774 ONT MinION repeats were represented as either RRRRCM or RRRRMC.
Fig. 2 Workflow for linearizing, pooling, and sequencing plasmids on the PacBio RS II and ONT MinION long-read platforms. Each plasmid was cut with the restriction enzyme identified in the respective plasmid maps (Fig. 1), and at the specified location. After linearizing each plasmid independently, the plasmids were pooled and cleaned. We then sequenced the same pool on the PacBio RS II and Oxford Nanopore Technologies' (ONT) MinION using their respective library preparation protocols. After sequencing, reads from each plasmid were identified using BLAST and then aligned to their respective reference sequences using graphmap, as preparation for downstream comparisons

PacBio Sequel successfully sequences through the C9orf72 repeat expansion in affected carriers
Sequencing repeat expansions cloned into plasmids demonstrated that both platforms are capable of sequencing through these challenging repeats, but demonstrating this in a human expansion carrier is essential to determining whether long-read technologies can characterize the repeats in their entirety at the nucleotide level, and also to determining whether the technologies are suitable for discovering new disease-causing or disease-modifying repeat expansions. We identified and confirmed two C9orf72 G4C2 repeat expansion carriers through fluorescent PCR fragment analysis, repeat-primed PCR, and Southern blotting [24] using cerebellar tissue. The fluorescent PCR analysis demonstrated the individuals carried two and eight repeats in the unexpanded allele, respectively (Fig. 5a, d). We then determined the individuals were expansion carriers through repeat-primed PCR (Fig. 5b, e), and estimated the expansion sizes by Southern blot (Fig. 5c, f). The most abundant expansion sizes, averaged across multiple Southern blots, indicate repeat sizes of approximately 1083 repeats (8.8 kb, including flanking sequence) and 1933 repeats (13.9 kb), respectively.
We then performed whole-genome sequencing on the sample with the longer repeat (sample 2) on both the PacBio Sequel and the ONT MinION. We transitioned to the PacBio Sequel from the RS II because the Sequel's higher throughput is more amenable to whole-genome sequencing for large genomes. We purified high molecular-weight cerebellar DNA for sample 2 and generated approximately 7× median genome-wide coverage, and 8× coverage across the C9orf72 repeat locus from five PacBio Sequel SMRT cells. We sequenced the same sample on the ONT MinION, generating approximately 3× median coverage, and 2× across the C9orf72 repeat locus from 15 flow cells. All ONT MinION flow cells passed quality control before loading the library, with > 1000 active pores.
Fig. 3 Both the PacBio RS II and ONT MinION successfully sequence through repeats, but the RS II had more variable read lengths. After selecting only those reads that could be clearly identified for each plasmid (described in Fig. 1), there were 46,213, 67,339, 9012, and 11,535 PacBio RS II reads for EGFP, SCA36, C9-423, and C9-774, respectively. Likewise, there were 26,735, 39,059, 8276, and 8720 ONT MinION reads for the same respective plasmids. The PacBio RS II generally had more reads, but read length distributions are much tighter for the ONT MinION across all four plasmids, and more closely resemble expected read lengths. The median read length for each instrument is indicated by dashed lines, and the expected maximum read length is indicated by a solid gray line. Expected maximum read lengths for each plasmid were 6080 (EGFP), 5984 (SCA36), 8813 (C9-423), and 9731 (C9-774). Because these long repeat sequences are unstable in plasmids, however, most bacterial colonies contained fewer than the targeted number of repeats (Additional file 1: Figure S1). Thus we expect the read sizes to vary. The additional PacBio RS II read variability may be related to library preparation
Of the two ONT MinION reads, neither covered an expanded allele, and the repeat region for only one of the reads could be clearly defined. We excluded the read for which we could not clearly define the repeat region. Where the human reference genome (hg38) contains three G4C2 repeats (18 nucleotides; Fig. 6a), the read for which we could clearly define the repeat region had a total of 41 nucleotides within the repeat region (gain of 23 nucleotides compared to hg38); this equates to approximately seven total repeats (Additional file 1: Figure S3b; Data S3). While hg38 contains three repeats (Additional file 1: Figure S3a), this does not accurately represent what is observed in the population, as the most common non-pathogenic allele is two repeats, followed by eight repeats [3]. An allele with three repeats was not observed in the population [3]. The ONT MinION's measurement of seven repeats closely resembles the eight repeats measured by our fluorescent PCR fragment analysis (Fig. 5d). With 15 flow cells, the ONT MinION sequenced only the non-mutant allele in this study.
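The conversion between repeat-region length and repeat number used throughout this section is simple arithmetic on the six-nucleotide G4C2 unit. The following is a minimal sketch of that conversion, assuming the 18-nucleotide (three-repeat) hg38 repeat region stated above; the function names are hypothetical and are not part of the study's pipeline.

```python
# Minimal sketch: convert a repeat-region length (nucleotides) into a repeat
# count and a gain relative to the reference. Names here are illustrative only.

REPEAT_UNIT = "GGGGCC"     # the C9orf72 G4C2 hexanucleotide
REFERENCE_REPEATS = 3      # hg38 contains three repeats (18 nucleotides)


def repeats_from_length(region_nt: int, unit_len: int = len(REPEAT_UNIT)) -> float:
    """Return the (possibly fractional) number of repeat units in a region."""
    return region_nt / unit_len


def gain_vs_reference(region_nt: int,
                      reference_repeats: int = REFERENCE_REPEATS,
                      unit_len: int = len(REPEAT_UNIT)) -> int:
    """Return nucleotides gained relative to the reference repeat region."""
    return region_nt - reference_repeats * unit_len


if __name__ == "__main__":
    # Repeat-region lengths quoted in the text (in nucleotides).
    for label, nt in [("ONT MinION read", 41), ("PacBio unexpanded read", 48)]:
        print(label, nt, "nt ->", round(repeats_from_length(nt), 1), "repeats;",
              "gain vs hg38:", gain_vs_reference(nt), "nt")
```

Because sequencing errors add or remove individual bases, a measured length is rarely an exact multiple of six, which is why the read-level repeat counts in the text are reported as approximate.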

Whole-genome PacBio Sequel sequencing identifies repeat expansion and characterizes repeat length
To date, researchers have relied on Southern blots to measure an individual's repeat expansion size, but existing technologies have limited our ability to characterize nucleotide content, which may have critical implications for disease etiology, age of onset, duration, and other clinical phenotypes. The whole-genome PacBio Sequel reads enabled us to generally characterize repeat length and nucleotide content for this case's G4C2 repeat expansion, but read depth using this approach limited our ability to accurately assess G4C2 content.
Four of the eight PacBio Sequel reads capturing the C9orf72 repeat locus were clearly expanded and four were not. Repeat lengths for the four reads capturing the wild-type (non-pathogenic) allele ranged from 46 to 50 nucleotides, where two measured exactly 48 nucleotides (gain of 30 compared to hg38), equating to eight total repeats (Additional file 1: Figure S3c; Fig. 6b); these results matched our fragment analysis (Fig. 5d). Of the four reads capturing an expanded allele, three did not bridge the entire repeat region, where one captured 178 nucleotides in the repeat region (approximately 30 repeats; Figs. 6c, 7-red; Additional file 1: Data S5), another captured 419 nucleotides (approximately 69 repeats; Figs. 6c, 7-blue; Additional file 1: Data S6), and the third captured 5471 nucleotides (approximately 912 repeats; Figs. 6c, 7-green; Additional file 1: Data S7). It is possible that the read capturing 419 nucleotides bridged the repeat because the end of the read closely matches the sequence adjacent to the repeat region (Additional file 1: Figure S3d). The fourth read, however, spanned the entire repeat with 7941 nucleotides (approximately 1324 repeats; Figs. 6c, 7-brown; Additional file 1: Figure S3e; Data S8), which falls easily within the Southern blot's range (Fig. 5f). Mean GC content across the case's repeat region was 87.7% compared to 97.0% for the C9-774 plasmid repeat region. Based on the percentage of plasmid reads that spanned the C9-774 repeat region (43.8%), we would expect approximately half of the reads covering the expanded allele to span at least 774 repeats, which we observed in these data. These data demonstrate the PacBio Sequel is able to sequence through the challenging C9orf72 G4C2 repeat, and is well suited for genetic discovery efforts involving large repeat expansions with appropriate sequencing depth.
Fig. 4 Repeat length distributions for the PacBio RS II and ONT MinION were highly concordant. Both platforms produced highly similar distributions for all plasmids, but the repeat lengths varied widely within each plasmid, as expected based on gel intensity curves (Additional file 1: Figure S1). The C9-423 repeat length distribution is more variable than even the C9-774, perhaps because the C9-774 plasmid backbone is more tolerant of the repeat. The median number of repeats for the PacBio RS II were 35, 148, and 395 for SCA36, C9-423, and C9-774, respectively, while median repeat lengths for the ONT MinION were 37, 172, and 406, respectively. The percentage of reads that extended through the SCA36, C9-423, and C9-774 repeats were approximately 95.9%, 66.8%, and 43.8% for the PacBio RS II, respectively, while 99.5%, 97.7%, and 83.5% of ONT MinION reads extended through, respectively
Fig. 5 Characterization of the affected C9orf72 repeat expansion carriers using standard methodologies. a, d We first performed fluorescent PCR to determine the individuals' non-pathogenic repeat sizes. Genomic DNA was PCR-amplified with genotyping primers and one fluorescently labeled primer. Fragment length analysis of the PCR product was then performed on an ABI3730 DNA analyzer and visualized using GeneMapper software. A peak is observable at 129 bp (a) and 165 bp (d), indicating that the non-pathogenic alleles for samples 1 and 2 contain two and eight repeats, respectively. A single peak also indicates that the individual is either homozygous for the given allele, or also has an expansion. b, e To determine whether the individuals had a repeat expansion, we performed a repeat-primed PCR analysis. PCR products of a repeat-primed PCR were separated on an ABI3730 DNA analyzer and visualized by GeneMapper software, showing a stutter amplification characteristic for a C9orf72 repeat expansion. This does not indicate expansion size, however. c, f After determining the individuals were expansion carriers, we performed a Southern blot to estimate the size. The Southern blots reveal a long repeat expansion in other individuals for whom cerebellar tissue was available, including positive controls (POS CON; lanes 1-5, and 1 and 3, respectively) and our patients of interest (CASE; lanes six and two, respectively). DIG-labeled DNA Molecular Weight Markers (Roche) are shown to estimate the repeat expansion's size. Measurements were based on multiple separate Southern blots for each case; for simplicity one representative Southern blot is shown. The most abundant expansion sizes in samples 1 and 2 are estimated at around 1083 (8.8 kb) and 1933 repeats (13.9 kb), respectively. The smears ranged widely, demonstrating the heterogeneity (i.e., mosaicism) of this repeat expansion within a small piece of tissue. This demonstrates the importance of additional long-read sequencing studies to characterize the repeat at the nucleotide level

Greater read depth through targeted PacBio sequencing enables improved nucleotide content characterization
Because performing long-read, whole-genome sequencing is costly and excessive when investigating a small region, we also tested PacBio's relatively new No-Amp targeted sequencing method [21] across the C9orf72 repeat expansion in a case with a smaller repeat (sample 1; Fig. 5c). This method allowed us to achieve deeper read depth and more accurately assess nucleotide content compared to the whole-genome approach. We used a sample with a shorter expansion to maximize the number of sequencing passes for individual DNA molecules, thus increasing overall quality of the circular consensus sequences.
Fig. 6 PacBio Sequel reads traverse the repeat region for pathogenic and non-pathogenic alleles. The PacBio Sequel sequenced through both pathogenic and non-pathogenic alleles, demonstrating the platform is capable of characterizing repeat expansions. All of these reads were first aligned by graphmap, and then hand curated to determine the repeat region. a The human genome reference sequence (hg38) contains three G4C2 repeats (18 nucleotides). We identified specific "landmarks" before and after the repeat region in the reference sequence to properly locate the repeat region in the reads, and to hand curate the alignments. Landmarks are identified by red bars adjacent to the repeat region. b We obtained four PacBio Sequel reads covering the eight-repeat sequence, spanning 48 nucleotides. There was a net gain of 30 nucleotides within the defined repeat region, which equates to approximately 5 additional repeats; this concurs with our fragment analysis (Fig. 5d). c We also obtained four reads that covered an expanded allele, one of which bridged the entire repeat expansion, with approximately 1324 repeats (7941 nucleotides). The other three reads ended before bridging the repeat region, where one captured approximately 30 repeats (178 nucleotides), another captured approximately 69 repeats (419 nucleotides), and the third captured approximately 912 repeats (5471 nucleotides)
We obtained 828 circular consensus sequences for sample 1, where approximately 70% (576 of 828) of reads measured exactly two repeats (the individual's unexpanded allele; Fig. 5a), 14% (115 of 828) were within six nucleotides of two repeats, and 16% (134 of 828) were from expanded alleles. We excluded any sequences that did not read through the entire repeat, as determined by alignment (see Methods); thus, all included reads represent full-length repeat alleles. The repeat distribution from the expanded alleles suggests mosaicism (by length) with two modes at approximately 110 and 870 repeats (Fig. 8). Without prior estimates from the Southern blot (Fig. 5c), we likely would have estimated the primary populations of this individual's repeats at 2, 110, and 870 repeats. Because of the Southern blot, however, we know the primary population of the individual's expanded repeat is near 1000 repeats. The 97.5th and 99th percentiles of the distribution's probability density function (from the PacBio Sequel) are approximately 964 and 1011 repeats, which closely resemble estimates by Southern blot (Fig. 5c).
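The read categories and upper percentiles described above can be illustrated with a short sketch. It assumes each circular consensus sequence has already been reduced to a repeat-region length in nucleotides; the function names, the nearest-rank percentile rule, and the toy input values are all illustrative assumptions and are not data or code from the study. In particular, the percentiles reported above come from a probability density function, which this simple empirical calculation does not reproduce.

```python
import math
from collections import Counter

UNIT_LEN = 6  # length of one G4C2 repeat unit, in nucleotides


def classify_read(region_nt: int, unexpanded_repeats: int = 2,
                  tolerance_nt: int = 6) -> str:
    """Label a read by how far its repeat region is from the unexpanded allele."""
    expected = unexpanded_repeats * UNIT_LEN
    if region_nt == expected:
        return "unexpanded"
    if abs(region_nt - expected) <= tolerance_nt:
        return "near_unexpanded"
    return "expanded"


def nearest_rank_percentile(values, q: float):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Toy repeat-region lengths (nucleotides); NOT values from the study.
    toy_lengths = [12, 12, 12, 10, 14, 660, 5220, 5400]
    labels = Counter(classify_read(nt) for nt in toy_lengths)
    print(dict(labels))

    # Repeat counts for the reads labelled as expanded.
    expanded = [nt / UNIT_LEN for nt in toy_lengths
                if classify_read(nt) == "expanded"]
    print(nearest_rank_percentile(expanded, 97.5))
```

A fixed tolerance around the unexpanded allele is a simple way to absorb small insertion or deletion errors, but it would also absorb a genuine allele that differs by one repeat, so the threshold is a judgment call.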
Based on the most common simple sequence repeats (SSRs) in reads capturing the repeat expansion, nucleotide content from sample 1 is likely mostly pure G4C2 repeat. We used PERF (which stands for "PERF is an Exhaustive Repeat Finder") [25] to measure raw GC content and the most common SSRs found in the repeat region using all expanded circular consensus repeat reads with a minimum read quality of 0.9 and at least two passes around the DNA molecule. Raw GC content within the repeat region was 99.2%. We also found G4C2 SSR frequency was 81.6%, followed by G3C2 and G4C1 with 17.5% and 0.3% frequency, respectively.
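To make the two quantities above concrete, the sketch below computes raw GC content and tallies G/C repeat units in a sequence. It is not PERF and does not reproduce PERF's algorithm or output; it is a simplified stand-in that splits the sequence into runs of G followed by runs of C and labels each unit by its run lengths (e.g., G4C2, G3C2, G4C1). The function names and the toy sequence are assumptions for illustration only.

```python
import re
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def motif_frequencies(seq: str) -> dict:
    """Relative frequency of each G-run/C-run unit, labelled like 'G4C2'.

    Simplification: bases other than G and C are skipped by the pattern,
    so non-GC interruptions are not reported by this function.
    """
    units = re.findall(r"G+C+", seq.upper())
    counts = Counter(f"G{u.count('G')}C{u.count('C')}" for u in units)
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Toy sequence (not study data): nine G4C2 units and one G3C2 unit.
    toy = "GGGGCC" * 8 + "GGGCC" + "GGGGCC"
    print(gc_content(toy))
    print(motif_frequencies(toy))
```

In this representation, a single deleted G shows up as a G3C2 unit and a single deleted C as a G4C1 unit, which is the pattern of minority motifs reported above.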
Additionally, we performed the Long Amplicon Analysis (LAA2) using SMRTLink 5.1.0 to generate overall consensus sequences across all extended repeat reads and measure the likelihood that any of the non-G4C2 sequence motifs were real. LAA2 generated one consensus sequence with a predicted accuracy > 99.9% (Additional file 1: Data S9) where the sequence was supported by 301 subreads with an estimated 99.91% accuracy. Looking at only the repeat region, the consensus sequence of the repeat was approximately 894 repeats (5364 nucleotides) with 100% GC and 95.7% G4C2 content. LAA2 suggested 4.3% of the repeat consisted of G3C2 interruptions. Some of the other consensus sequences with predicted accuracies > 96% supported small non-GC interruptions. Estimating a consensus sequence in a region with such high mosaicism is not trivial; thus, these results should be interpreted cautiously, but they demonstrate the need for a larger study.
Fig. 7 Whole-genome PacBio Sequel reads aligned to hg38. Whole-genome reads generated using the PacBio Sequel were aligned to human reference genome hg38 using graphmap. We attained 7× genome-wide median coverage and 8× across the C9orf72 repeat locus. Four reads were from the individual's wild-type allele of eight repeats, while the other four were expanded. Three of the four reads capturing an expanded allele did not bridge the entire repeat region, where one captured 178 nucleotides in the repeat region (approximately 30 repeats; red), another captured 419 nucleotides (approximately 69 repeats; blue), and the third captured 5471 nucleotides (approximately 912 repeats; green). The read capturing 419 nucleotides may have bridged the repeat because the end of the read closely matches the sequence adjacent to the repeat region, but was ambiguous (Additional file 1: Figure S3d). The final read spanned the entire repeat with 7941 nucleotides (approximately 1324 repeats; brown), which falls easily within the Southern blot's range (Fig. 5f). Soft-clipped nucleotides (nucleotides at the end of a read that did not align to the reference) are shown for all reads, and are outlined in green for the read capturing 912 repeats. The approximate location for the repeat expansion is marked by the light-blue lines. A histogram showing read depth per nucleotide is included near the top of the figure. Alignments were visualized using the Integrative Genomics Viewer (IGV)
Because insertions and deletions (INDELs) are the most common error in PacBio sequencing [26,27], we suspect most or all of the G3C2 and G4C1 SSRs are sequencing or base calling errors. Thus, when all SSR motifs within a Levenshtein distance [28] of one are treated as the expected G4C2 motif, G4C2 accounted for 100% of all observed SSRs in the top consensus sequence. Levenshtein distance, which is closely related to the Hamming distance, measures the number of changes between two character sequences, including insertions, deletions, and substitutions. While most of the G3C2 SSRs are likely false, we cannot rule out that some may be real, potentially affecting repeat-associated non-ATG (RAN) translation [29][30][31], and disease development and progression. A single G3C2 interruption would cause a frameshift, resulting in translational transitions from the relatively benign poly(GP) dipeptide repeat to the highly toxic poly(GR) repeat [32][33][34][35], or stop a toxic poly(GR), transitioning to the less toxic poly(GA). A larger, deeper sequencing study across the C9orf72 repeat region in ALS/FTD cases and controls is merited to determine whether there is an association with clinical phenotypes.
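The motif-collapsing step described above can be sketched as follows. This is a minimal, illustrative implementation of the standard dynamic-programming Levenshtein distance, followed by a rule that relabels any motif within distance one of 'GGGGCC' as the expected motif. The function names and the example motifs are assumptions; this is not the code used in the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def collapse_to_expected(motif: str, expected: str = "GGGGCC",
                         max_distance: int = 1) -> str:
    """Relabel a motif as the expected motif if it is within max_distance edits."""
    return expected if levenshtein(motif, expected) <= max_distance else motif


if __name__ == "__main__":
    for motif in ["GGGGCC", "GGGCC", "GGGGC", "GGCCTG"]:
        print(motif, levenshtein(motif, "GGGGCC"), collapse_to_expected(motif))
```

Under this rule, G3C2 ('GGGCC') and G4C1 ('GGGGC') are each one deletion away from G4C2 and are therefore collapsed, whereas a more divergent motif such as the SCA36 'GGCCTG' unit is left as observed.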

Discussion
Here, we showed that both PacBio and ONT long-read sequencing technologies can sequence through the SCA36 'GGCCTG' and the C9orf72 G 4 C 2 repeat expansions in relatively controlled repeats in plasmids, and that the PacBio Sequel can sequence through a human C9orf72 repeat expansion, in its entirety, depending on length. Additionally, we demonstrated the PacBio No-Amp targeted sequencing method can identify the unexpanded allele and determine whether the individual carries a repeat expansion. These results demonstrate the potential these technologies offer in clinical testing, genetic counseling, and future structural mutation genetic discovery efforts-including those involving challenging repeat expansions. For example, the C9orf72 G 4 C 2 repeat expansion could have been discovered years earlier if long-read sequencing technologies had been available. Through great effort, using the best approaches available at the time, the G 4 C 2 repeat was discovered in 2011 [3,6], approximately 5 years after chromosome 9p was initially implicated in both ALS and FTD [36,37]. With current long-read sequencing technologies, we can decrease the time to discover and characterize such mutations, begin studying them at the molecular level, and potentially translate them for use in clinical and genetic counseling environments.
We found that both platforms are fully capable of sequencing through challenging repeats like the SCA36 'GGCCTG' and C9orf72 G4C2 repeat expansions when cloned into plasmids. It is unclear why the read length distributions for ONT's MinION were much tighter than those from the PacBio RS II, but it shows promise for future ONT MinION applications. Additionally, while median read lengths were highly similar between the PacBio RS II and ONT MinION, the MinION had a higher percentage of reads that extended all the way through the repeat regions for all three repeat plasmids, particularly the C9-423 and C9-774 plasmids.
Approximately 70% (576 of 828) of reads covered the individual's wild-type allele (two repeats), 14% (115 of 828) were within six nucleotides of two repeats, and 16% (134 of 828) were from expanded alleles. The repeat distribution from the expanded alleles shows two modes at approximately 110 and 870 repeats. Without prior estimates from the Southern blot (Fig. 5c), we likely would have estimated the primary populations of this individual's repeats at 2, 110, and 870 repeats. Because of the Southern blot, however, we know the primary population of the individual's expanded repeat is near 1000 repeats, though it is possible the C9orf72 repeat expansion runs artificially high by Southern blot because of methylation or high GC content. The 97.5th and 99th percentiles of this distribution's probability density function are approximately 964 and 1011 repeats, which closely resemble estimates by Southern blot (Fig. 5c).
Both the PacBio RS II and ONT MinION correctly identified the maximal expected number of repeats in the C9-774 plasmid, based on their consensus sequences, but base calling accuracy in the repeat regions was higher for the PacBio RS II. The PacBio RS II attained approximately 99.8% consensus accuracy, while the ONT MinION consensus sequence was only 26.6% accurate because of the mixed nucleotides in the consensus sequence. We believe this will be relatively easy to address in the ONT base calling algorithms because it appears systematic based on the RRRRCM and RRRRMC repeats in the consensus sequences, which demonstrates the same base calling errors occur consistently.
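The RRRRCM and RRRRMC consensus motifs noted above use IUPAC ambiguity codes, where R denotes A or G and M denotes A or C. The sketch below, which is illustrative and not the study's analysis code, shows why such a consensus scores poorly under strict base-by-base identity even though every position remains compatible with 'GGGGCC'. The function names are hypothetical, and the toy calculation is not intended to reproduce the 26.6% figure, which was derived from the full consensus sequence.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def is_compatible(consensus: str, motif: str) -> bool:
    """True if every motif base is permitted by the consensus code at that position."""
    if len(consensus) != len(motif):
        return False
    return all(base in IUPAC[code] for code, base in zip(consensus, motif))


def strict_identity(consensus: str, motif: str) -> float:
    """Fraction of positions where the consensus character equals the motif base exactly."""
    if len(consensus) != len(motif):
        raise ValueError("sequences must have equal length")
    matches = sum(code == base for code, base in zip(consensus, motif))
    return matches / len(motif)


for consensus in ("RRRRCM", "RRRRMC"):
    print(consensus,
          is_compatible(consensus, "GGGGCC"),
          round(strict_identity(consensus, "GGGGCC"), 3))
```

In both toy motifs only one of six characters is an unambiguous match, yet neither motif contradicts 'GGGGCC', which is consistent with the errors being systematic rather than random.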
After verifying both PacBio's and ONT's technologies were capable of sequencing repeats in plasmids, we tested the technologies' ability on two C9orf72 G4C2 expansion carriers and found the PacBio Sequel is capable of sequencing through challenging GC-rich repeat expansions, but throughput was problematic for the ONT MinION in this study. Newer chemistries and hardware from ONT are likely to alleviate this issue. During the timeline of our study, ONT released the GridION and PromethION sequencers that are based on the same nanopore technology and have greater throughput. The PromethION, in particular, can run many more flow cells concurrently, and each individual PromethION flow cell has significantly more nanopores than the MinION and GridION flow cells. We anticipate at least the PromethION will be suitable for large repeat studies, based on the MinION's performance in the plasmids, but we cannot be certain without further testing.
Using the PacBio Sequel, we attained 8× coverage across the C9orf72 G4C2 repeat region for sample 2 using the whole-genome approach: four reads covered the individual's expected eight-repeat (unexpanded) allele, three reads ended 30, 69, and 912 repeats into the expansion, respectively, and one read fully spanned an expanded repeat region with 1324 repeats. The read spanning 1324 repeats is on the lower end of the Southern blot, suggesting longer repeat alleles may have been inaccessible to the PacBio Sequel, perhaps simply because their size impedes loading into the zero-mode waveguide (ZMW) wells. Deeper sequencing is generally required to detect the mutation using a variant caller, but we demonstrate here that the technology is capable of generating such reads, as a proof of principle. We also could not reliably estimate G4C2 content for sample 2 because there were few reads and each read had only a single sequencing pass. These data do demonstrate, however, that the PacBio Sequel is capable of sequencing through at least a large portion of what may be the most challenging GC-rich repeat expansion known. Discovering whether a structural variant exists, its location, and its general nucleotide makeup is the critical first step to understanding its role in human disease.
Additional studies will be necessary to determine the maximum repeat size that these technologies can span, but our data reiterate that the PacBio Sequel is already adequate for genetic discovery efforts [38][39][40][41], and suggest it is capable of sequencing and identifying large repeats. With sufficient read depth, the reads do not necessarily need to bridge the entire repeat (or other large structural variant) to discover whether it exists and characterize the general nucleotide content. Additional experiments can clarify size and nucleotide content if the sequencing technology was unable to span the variant entirely or produced lower-quality base calls.
After verifying the PacBio Sequel was able to sequence through the C9orf72 G4C2 repeat expansion using whole-genome sequencing in a human case, we tested PacBio's No-Amp targeted sequencing approach to determine how well it can characterize nucleotide content with the increased read depth, and to assess how amenable the approach is for clinical and genetic counseling environments. Our results suggest the method can identify an individual's unexpanded allele, determine whether the individual carries a repeat expansion, and estimate size up to at least 5 kb, though a larger study is needed. While being able to perfectly determine an individual's expansion size regardless of its length would be ideal, knowing the exact repeat expansion size does not clarify prognosis for C9orf72 G4C2 repeat expansion carriers [8], mitigating the need to determine the expansion's precise size. Additionally, the repeat size is known to be highly variable throughout various body tissues, including different brain regions, and even within a small tissue piece from the same brain region [3,8][42][43][44]. This is further demonstrated by the smear within the Southern blots for both symptomatic carriers included in this study.
While repeat size is not informative for prognosis, being able to assess overall G4C2 content and detect repeat interruptions may be, but more information is required. We were able to more accurately assess G4C2 content using the targeted approach, though distinguishing between G4C2 and G3C2 motifs is likely unreliable at this stage. Treating all G3C2 motifs as G4C2, we estimate the G4C2 content for sample 1 is > 99%, and potentially 100%. There is some evidence supporting potential G3C2 and non-GC interruptions, but experimentally verifying these finer differences in low-complexity repeat regions is non-trivial. A larger study will be important to determine whether it is possible to identify more pronounced interruptions, or even to distinguish between G4C2 and G3C2. The ability to identify an individual's unexpanded allele, clearly indicate expansion status, and assess nucleotide content in a single experiment could have important implications in clinical and genetic counseling environments, and will certainly be investigated further in the research environment.
Existing challenges for the No-Amp targeted sequencing method include low throughput and an inherent selection for shorter reads (or shorter repeats, in this case), because of both loading bias (shorter fragments load preferentially) and the polymerase being less likely to traverse longer repeats as reliably as shorter repeats. This is likely why there is a statistical mode at approximately 110 repeats (Fig. 8), even though there is no observable band at that size in the Southern blot. We are confident the reads are real, however, as the Southern blot clearly shows size mosaicism, and the adjacent sequence on both sides of the repeat region aligned with ≥85% identity for all included reads. The bias towards shorter reads does misrepresent the primary size distribution, but we were still able to determine that the individual carries a repeat expansion, and we were able to accurately estimate the size in this case. Determining whether an individual carries a repeat expansion in an automated fashion would be relatively simple using this approach.
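As a rough illustration of how expansion status might be called automatically from per-read repeat counts, consider the minimal sketch below. It is not a validated clinical rule or the method used in this study: the 30-repeat cutoff, the minimum number of supporting reads, and the toy read counts are all hypothetical placeholders that would need to be set and validated in practice.

```python
MOTIF_LENGTH = 6  # nucleotides in one GGGGCC unit


def estimate_repeats(repeat_region_nt: int) -> int:
    """Approximate repeat count from the length of a read's repeat region."""
    return round(repeat_region_nt / MOTIF_LENGTH)


def call_expansion(repeat_counts, threshold=30, min_supporting_reads=5):
    """Summarize reads above a (hypothetical) repeat threshold and make a call."""
    expanded = [n for n in repeat_counts if n > threshold]
    total = len(repeat_counts)
    return {
        "total_reads": total,
        "expanded_reads": len(expanded),
        "expanded_fraction": len(expanded) / total if total else 0.0,
        "largest_repeat": max(repeat_counts) if total else None,
        "carries_expansion": len(expanded) >= min_supporting_reads,
    }


# Toy per-read repeat counts (hypothetical), mimicking a two-repeat
# unexpanded allele plus a mosaic of expanded reads.
toy_counts = [2] * 7 + [110, 115, 870, 880, 900, 950]
print(call_expansion(toy_counts))
```

Because loading bias favors shorter molecules, the fraction of expanded reads in such a summary should not be interpreted as the true allelic proportion; only the presence or absence of well-supported expanded reads is informative here.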
Given that several studies have shown the C9orf72 repeat expansion is variable across tissues within a given patient [8][42][43][44], we suggest that a large, deep long-read sequencing study across the C9orf72 repeat is important to better understand how repeat content affects disease onset and progression. Repeat interruptions are known to mitigate disease in other neurodegenerative disorders [14][15][16][17]. Fully characterizing the repeat at the nucleotide level in a large cohort may have critical implications for our understanding of disease etiology, development, duration, and for future therapy. A large, long-read sequencing study of affected C9orf72 G4C2 repeat expansion carriers would also allow us to characterize mosaicism within individuals; there may be expansion sub-species that explain the more aggressive forms of ALS and FTD but are not measurable through traditional methods, such as Southern blotting.
Cost is a limiting factor for long-read sequencing technologies, often making it impractical for large studies or for diagnostic use. Because of these limitations, researchers have made great efforts to maximize the utility of short-read sequencing technologies, employing the large amount of short-read sequencing data already generated across nearly every disease currently studied. An excellent example is the effort to identify repeat expansions based on evidence in existing short-read data [45,46]. These efforts offer researchers that have already generated short-read data for individuals the ability to determine whether an individual has a repeat expansion, but only if the repeat expansion and its location are already known. The approaches are also generally unable to estimate the repeat size. The limitations in these approaches reflect the limitations of short-read sequencing, because short reads cannot span even relatively small repeat expansions. Long-read sequencing, while having a much higher error rate, addresses these limitations, and may be more amenable to regular use in the future. For now, long-read sequencing may be ideal for small familial studies or for smaller studies intent on identifying repeat expansions that exist among a small cohort of cases. Researchers could then follow up with more cost-effective methods such as repeat-primed PCR or Southern blotting.
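The read-length limitation described above can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, a 150-nucleotide short read and 20 nucleotides of unique flanking sequence on each side of the repeat to anchor the alignment; neither value comes from this study, and the function names are hypothetical.

```python
MOTIF_LENGTH = 6  # nucleotides in one GGGGCC unit


def max_spannable_repeats(read_length: int, flank: int = 20) -> int:
    """Largest whole number of repeat units a read can span with both flanks anchored."""
    usable = read_length - 2 * flank
    return max(usable // MOTIF_LENGTH, 0)


def min_read_length(repeats: int, flank: int = 20) -> int:
    """Shortest read that spans the given number of repeats plus both flanks."""
    return repeats * MOTIF_LENGTH + 2 * flank


print(max_spannable_repeats(150))  # assumed short-read length
print(min_read_length(1000))       # an expansion of roughly 1000 repeats
```

Under these assumptions, a 150-nucleotide read can fully span only 18 repeat units, whereas spanning an expansion of about 1000 repeats requires a single read of roughly 6 kb.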
Knowing PacBio and ONT long-read sequencing technologies are fully capable of sequencing through challenging disease-causing repeats, such as the SCA36 'GGCCTG' and C9orf72 'GGGGCC' repeats, lays important groundwork for future sequencing studies to understand the nucleotide-level nature of all repeat-expansion disorders. It also demonstrates that long-read sequencing technologies offer great potential for future repeat expansion discovery efforts, and may be useful in clinical and genetic counseling environments for either the C9orf72 repeat expansion specifically, or for other large structural mutations; the ability to target specific regions will be particularly important in several settings. Further utilizing these technologies in larger studies will be critical to properly characterizing known repeats (e.g., C9orf72) and their allelic distributions (size and content) on the nucleotide level to better understand how they contribute to disease.

Conclusions
Our results demonstrate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. These results have important implications for future genetic discovery efforts, as many diseases are caused by repeat expansions or other large structural variants. Larger studies focusing on the C9orf72 expansion in ALS and FTD will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content. Such interruptions are likely to modify the disease course or age of onset, as shown in other repeat-expansion disorders [14][15][16][17]. These results have broad implications across all diseases where the genetic etiology remains unclear.

Study participants
Cerebellar samples included in this study were obtained from the Mayo Clinic Brain Bank, following Mayo Clinic's IRB protocols. Sample 1 was female and was diagnosed with FTD with an age of onset and duration of approximately 44 and 12 years, respectively. Sample 2 was male, and was diagnosed with FTD with an age of onset and duration of 70 and 2 years, respectively. Fluorescent and repeat-primed PCR [3], and Southern blot [24] techniques were previously described.

Repeat plasmids
To generate the G4C2 repeat 423 and 774 expression vectors, we used muscle or spleen DNA from an affected C9orf72 expansion carrier as a template in a nested PCR strategy. We used ThermalAce DNA Polymerase (Invitrogen) to amplify the 66 G4C2 repeat region of a previously constructed G4C2 repeat plasmid, including 113 and 99 nucleotides of 5′ and 3′ flanking sequence, respectively. We then used these intermediate plasmids to construct PCR products for the 423 and 774 sequence that were subsequently cloned into the pAG3 and pcDNA6 expression vectors containing 3 different C-terminal tags in alternate frames. The EGFP gene and SCA36 'GGCCTG' 66 repeat were cloned into pAAV expression vectors. The SCA36 clone was Sanger sequenced to determine the number of repeats, but the others were too long for Sanger sequencing.

Sequencing
Plasmid sequencing libraries were prepared by first linearizing plasmids individually and then pooling in equal concentration based on NanoDrop (ThermoFisher Scientific) measurements. Plasmids were linearized using the restriction site that provided the most non-repeat sequence upstream and downstream of the repeat itself. EGFP, SCA36, and C9-774 plasmids were linearized using the AvrII restriction enzyme, while C9-423 was linearized using MluI (Fig. 1). DNA was then purified using Agencourt AMPure XP beads, per the manufacturer's recommended protocol, and split equally for ONT MinION and PacBio RS II library preparation.
We used the ONT MinION "1D Genomic DNA by ligation" kit and protocol (SQK-LSK108) for MinION library preparation and sequencing, skipping optional DNA fragmentation and DNA repair steps to maximize DNA size and quantity, respectively. Briefly, we began with end repair and dA-tailing using all recommended reagents and steps, which include mixing the DNA, Ultra II end-prep reaction buffer, Ultra II end-prep enzyme mix, and nuclease-free water, and incubating at 20°C for 5 min and 65°C for 5 min. DNA was then purified using Agencourt AMPure XP beads, and eluted in nuclease-free water. The 1D sequencing adapter was then ligated by mixing the DNA, 1D adapter mix (AMX1D), and Blunt/TA ligation master mix and incubating for 10 min. DNA was then purified again using the Agencourt AMPure XP beads, but washing with the ONT Adapter Bead Binding (ABB) buffer and eluting in ONT's elution buffer (ELB). Flow cells were primed and loaded per recommended procedure and sequenced using the 48-h sequencing protocol in MinKNOW. We performed the recommended quality control run on all flow cells prior to priming and loading the library and only used flow cells that had > 1000 active pores and did not have existing air bubbles in the Application-Specific Integrated Circuit (ASIC) upon opening.
Plasmid libraries for the PacBio RS II were generated following PacBio's protocol for the SMRTbell Template Prep Kit 1.0 (Part #100-259-100) and PacBio's "Procedure & Checklist-10 kb Template Preparation and Sequencing". Briefly, DNA damage repair, end repair, blunt end hairpin adapter ligation, and final exonuclease treatment were performed using 5 μg of intact, non-sheared pooled plasmid DNA. AMPure PB magnetic beads (Pacific Biosciences) were used for all purification steps. Qualitative and quantitative analysis were performed using Advanced Analytical Fragment Analyzer (AATI) and Qubit fluorometer with Quant-iT dsDNA BR Assay Kits (Invitrogen). SMRTbell templates were annealed to v2 sequencing primers then bound to DNA polymerase P6 following PacBio's protocol using the DNA/ Polymerase Binding Kit P6 (part #: 100-356-300), as directed using Binding Calculator version 2.3.1.1. Polymerase-template complexes were purified per manufacturer's protocol (PacBio) using Pacific Biosciences Magbead Binding Kit (part #: 100-133-600) and setting up sample reaction as directed using Binding Calculator. Sequencing was carried out on the PacBio RS II (SMRT) sequencer, equipped with MagBead Station upgrade, using C4 DNA Sequencing Kit 2.0 (Part #: 100-216-400) reagents. The sample was loaded onto a single SMRT cell v3 (part #: 100-171-800) and the movie length was 360 min. Secondary Analysis was performed using Pacific Biosciences SMRT Portal, using SMRT Analysis System software (v2.3.0) for Sub-read filtering.
For sequencing of the affected C9orf72 repeat expansion carrier, we extracted DNA using the Agilent RecoverEase DNA Isolation Kit (Agilent Technologies) and a standard, previously described isolation protocol [8]. We followed the same ONT MinION "1D Genomic DNA by ligation" kit and protocol used for the plasmids, using 15 total flow cells. For PacBio whole-genome sequencing, we sequenced the DNA using the Sequel instead of the RS II because the Sequel provides greater throughput. We prepared the PacBio Sequel library per PacBio's recommended protocol "Procedure & Checklist-20kb Template Preparation Using BluePippin Size-Selection System", using the Megaruptor 2 (Diagenode, Denville, NJ, USA) for shearing and the Fragment Analyzer (Advanced Analytical, Ankeny, IA, USA) to size the DNA.
To summarize, DNA was sheared to 35 kb on the Megaruptor 2 and prepared using the SMRTbell Template Prep Kit 1.0-SPv3 (part #: 100-991-900). Qualitative and quantitative analysis were performed using Advanced Analytical Fragment Analyzer (AATI) and Qubit fluorometer with Quant-iT dsDNA BR Assay Kits (Invitrogen). SMRTbell templates were annealed to v3 sequencing primers then bound to DNA polymerase 2.0 following PacBio's protocol using the Sequel Binding and Internal Ctrl Kit 2.0 (part #: 101-400-900). Excess polymerase was removed from the binding reaction using the PacBio Loading Cleanup Bead Kit (part #: 100-715-300). MagBead binding was performed using the PacBio protocol for kit 100-125-900. Cleaned samples were loaded onto five SMRTcells with Sequel Sequencing Kit 2.0 following PacBio recommendations, and sequenced on the PacBio Sequel (SMRT) sequencer.
The PacBio No-Amp targeted sequencing procedure, currently in development, uses the CRISPR-Cas9 system to target and enrich a region of interest without PCR amplification ( Fig. 9) [21]. For sample 1, 20 μg non-sheared, genomic DNA was digested with high fidelity restriction enzyme EcoRI-HF (New England Biolabs, PN R3101S) to excise the target region. A SMRTbell library was prepared from the EcoRI-HF digested products by ligation with a hairpin adapter containing an overhang sequence complementary to the EcoRI-HF cut site using E. coli DNA ligase (New England Biolabs, PN M0205S). Genome complexity reduction was then performed by incubating each sample with high fidelity restriction enzymes KpnI-HF and SpeI-HF (New England Biolabs, PN R3142S and R3133S, respectively) and Exonuclease III and VII (Pacific Biosciences, part of SMRTbell Template Prep Kit 1.0, PN 100-259-100). Up to 1 μg of the complexity-reduced SMRTbell library was subjected to Cas9 digestion with a single guide RNA specific to sequence adjacent to the target region. Oligos comprising the guide RNA (crRNA and tracrRNA) were obtained from Integrated DNA Technologies containing an Alt-R modification to prevent RNase degradation. Cas9 Nuclease, S. pyogenes, was obtained from New England Biolabs (PN M0386S). Cas9-digested SMRTbell templates were ligated with a poly(A) hairpin adapter using T4 DNA ligase (Pacific Biosciences, part of SMRTbell Template Prep Kit 1.0, PN 100-259-100) producing asymmetric SMRTbell templates. Failed ligation products were removed by treatment with Exonuclease III and VII (Pacific Biosciences). Asymmetric SMRTbell templates were then enriched with MagBeads and buffers from MagBead Kit v2 (Pacific Biosciences, PN 100-676-500). In preparation for sequencing, a standard PacBio sequencing primer, lacking a poly(A) sequence, was annealed to the enriched SMRTbell templates in diluted Primer Buffer v2 (Pacific Biosciences, PN 001-560-849). 
Sequel DNA Polymerase 2.1 was bound to the primer-annealed SMRTbell templates with reagents from the associated Sequel Binding Kit 2.1 (Pacific Biosciences, PN 101-365-900). The sample complex was purified with a modified AMPure PB purification protocol (Pacific Biosciences, PN 100-265-900). The entire purified sample complex was loaded on a single SMRT Cell (Pacific Biosciences, PN 101-008-000) and sequenced on a Sequel System using an immobilization time of 4 h and movie time of 10 h.
Fig. 9 Schematic of PacBio no-amplification (No-Amp) targeted sequencing. We applied the PacBio no-amplification (No-Amp) Targeted Sequencing method to a C9orf72 G4C2 repeat expansion carrier to better characterize the repeat's nucleotide content. The No-Amp targeted sequencing method begins with typical SMRTbell library preparation after the target region is excised by restriction enzyme digestion. Cas9 digestion follows with a guide RNA specific to sequence adjacent (Cas9 Cutting Site) to the region of interest (green), leaving the SMRTbells blunt ended. In this case, the guide RNA was specific to sequence upstream (5′) of the G4C2 repeat expansion on the anti-sense strand. A new capture adapter (red) is then ligated to the blunt ends and captured using magnetic beads (magbeads). This process enriches the library for reads containing the region of interest to maximize read depth.

Availability of data and materials
All data are available upon reasonable request to the corresponding author.

Authors' contributions
ME, LP, and JF developed and designed the study, and wrote the manuscript. ME, BB, and MS performed analyses. SF, JS, KJW, TG, MP, IM, MDH, PB, and MvB performed necessary experiments. MvB and RR provided expansion status on samples. DD provided tissue from the Mayo Clinic Brain Bank and the pathology report. All authors read and approved the final manuscript.

Ethics approval and consent to participate
The Mayo Clinic Institutional Review Board (IRB) approved all procedures for this study and we followed all appropriate protocols.

Consent for publication
All participants were properly consented for this study.

Competing interests
IM, BB, and MS are full-time employees of Pacific Biosciences of California, Inc. IM's spouse is also a full-time employee of, and owns stock in, Pacific Biosciences of California, Inc. All other authors declare they have no conflicts of interest.