Identification of sixteen novel candidate genes for late onset Parkinson’s disease

Background Parkinson’s disease (PD) is a neurodegenerative movement disorder affecting 1–5% of the general population for which neither effective cure nor early diagnostic tools are available that could tackle the pathology in the early phase. Here we report a multi-stage procedure to identify candidate genes likely involved in the etiopathogenesis of PD. Methods The study includes a discovery stage based on the analysis of whole exome data from 26 dominant late onset PD families, a validation analysis performed on 1542 independent PD patients and 706 controls from different cohorts and the assessment of polygenic variants load in the Italian cohort (394 unrelated patients and 203 controls). Results Family-based approach identified 28 disrupting variants in 26 candidate genes for PD including PARK2, PINK1, DJ-1(PARK7), LRRK2, HTRA2, FBXO7, EIF4G1, DNAJC6, DNAJC13, SNCAIP, AIMP2, CHMP1A, GIPC1, HMOX2, HSPA8, IMMT, KIF21B, KIF24, MAN2C1, RHOT2, SLC25A39, SPTBN1, TMEM175, TOMM22, TVP23A and ZSCAN21. Sixteen of them have not been associated to PD before, were expressed in mesencephalon and were involved in pathways potentially deregulated in PD. Mutation analysis in independent cohorts disclosed a significant excess of highly deleterious variants in cases (p = 0.0001), supporting their role in PD. Moreover, we demonstrated that the co-inheritance of multiple rare variants (≥ 2) in the 26 genes may predict PD occurrence in about 20% of patients, both familial and sporadic cases, with high specificity (> 93%; p = 4.4 × 10− 5). Moreover, our data highlight the fact that the genetic landmarks of late onset PD does not systematically differ between sporadic and familial forms, especially in the case of small nuclear families and underline the importance of rare variants in the genetics of sporadic PD. Furthermore, patients carrying multiple rare variants showed higher risk of manifesting dyskinesia induced by levodopa treatment. Conclusions Besides confirming the extreme genetic heterogeneity of PD, these data provide novel insights into the genetic of the disease and may be relevant for its prediction, diagnosis and treatment. Supplementary Information The online version contains supplementary material available at 10.1186/s13024-021-00455-2.


Background
Parkinson's disease (PD) is a neurodegenerative movement disorder characterized by the loss of mesodiencephalic dopaminergic (mdDA) neurons of the substantia nigra pars compacta (SNpc), in association with the presence of Lewy bodies in some surviving neurons [1,2]. mdDA neurons play a crucial role in the control of motor, sensory-motor and motivated behavior [3]. Their degeneration causes the characteristic symptoms of PD, which include resting tremor, bradykinesia, rigidity, postural instability and a variety of other motor and non-motor symptoms [2]. It has been demonstrated that disease's symptoms are evident when patients have already lost 50-60% of their DA neurons, suggesting the need for early diagnosis of the disease [2]. Most PD cases are sporadic, with unknown etiology. Approximately 5-10% of PD cases appear to have monogenic forms of inheritance including autosomal dominant (SNCA, LRRK2, and VPS35) and recessive (PARK2, PINK1, DJ1) forms [4][5][6]. However, many additional loci associated to PD have recently been identified, even though for most of them the molecular function is still unknown [5]. Moreover, there is a high genetic heterogeneity at the basis of PD, with many different rare mutations usually detected only in a single family or in small populations [5]. Despite extensive ongoing studies, the pathogenic mechanisms underpinning PD remain largely elusive [6]. Moreover, although the Next Generation Sequencing (NGS) methods are widespread also in the diagnostic field, their application to PD is still limited to a small number of patients with a clear family history and to a small number of causative mutations, due to the high genetic heterogeneity associated with the disease and to the difficulty in interpreting test results.
In this study, we performed a multi-stage procedure aimed at the identification of potential PD-causative variants and then at demonstrating that the co-inheritance of multiple rare variants in Mendelian genes may increase the risk of PD in a non-Mendelian fashion, and that this may influence the occurrence of the disease and the manifestation of clinical signs.

Study participants PIB cohort
Twenty-three PD families with supposedly dominant transmission (Fig. 1, families numbered 1 through 23) were recruited at the "Parkinson Institute Biobank," member of the Telethon Network of Genetic Biobank (biobanknetwork.telethon.it/) (hereafter-called PIB cohort). Forty-seven PD patients (siblings or parent-offspring pairs) from these families underwent whole-exome sequencing (WES) analysis to identify novel PD candidate genes. Forty PD patients (of which only a small amount of DNA was available), from the same 23 families, were used for segregation analysis to confirm the variants identified through WES approach. All PD subjects (age at onset 61.7; Standard Deviation (SD) ± 8.57) were evaluated by neurologists of the Parkinson Study Group, according to published diagnostic criteria (Table S1) [7]. Written informed consent was obtained from all participants.

MNI cohort
Three PD families with supposedly dominant transmission were recruited at the IRCCS Mediterranean Neurological Institute (MNI) in Pozzilli (hereafter called MNI cohort) ( Fig. 1; families numbered 24 through 26). Two PD cases for each family underwent WES analysis to identify novel PD candidate genes. Segregation analysis to confirm WES results was performed on PD cases from the same families (Fig. 1).
Three hundred ninety-four independent and unrelated PD patients (243 males; 172 familiar and 222 sporadic cases), were used as validation cohort, as well as to assess the polygenic inheritance of rare variants as risk factor for PD occurrence and for endophenotypes manifestation (Table S1) [8]. The project was approved by the ethical committees of IRCCS Neuromed and written informed consent was signed by all the participants. All subjects were of European ancestry and were evaluated by qualified neurologist of the Parkinson Centre from June 2015 to December 2017, with a thorough protocol comprising neurological examination and evaluation of non-motor domains. Information about family history, demographic characteristics, anamnesis, and pharmacological therapy was also collected [8,9]. The mean age at diagnosis was 58.26 years (SD 9.74) [8,9].
The Movement Disorder Society revised version of the Unified Parkinson's Disease Rating Scale Part III (18 items, maximum score 72; hereafter called UPDRS) [10] was used to assess clinical motor symptoms. These included language, facial expressions, tremor, rigidity, agility in movements, stability, gait and bradykinesia. Cognitive abilities were tested through an Italian validated version of the Montreal Cognitive Assessment (MoCA) [11]. Cognitive domains assessed include shortterm memory (5 points); visuospatial abilities via clock drawing (3 points), and a cube copy task (1 point); executive functioning via an adaptation of Trail Making Test Part B (1 point), phonemic fluency (1 point), and verbal abstraction (2 points); attention, concentration, and working memory via target detection (1 point), serial subtraction (2 points), digits forward and backward (1 point each); language via confrontation naming with low-familiarity animals (3 points), and repetition of complex sentences (2 points); and orientation to time and place (6 points). The total score was given by the sum of these domains, and then divided by the maximum score that could be obtained (30 points). Where one or more domains could not be tested (e.g. visuospatial tasks, due to unavailability of optical devices), the corresponding score was subtracted from the total score obtainable.
Non-motor symptoms were assessed through an Italian validated version of Non Motor Symptoms Scale (NMMS) for Parkinson Disease [12]. This scale tests 9 items, including cardiovascular domain, sleep/ fatigue, mood/cognition, perceptual problems/hallucinations, attention/memory, gastrointestinal, urinary, sexual function, and ability to taste or smell. For each item, both severity and frequency of symptoms is measured, so that the scale accounts for both aspects. Here, the sleep domain was slightly modified by adding a further question on the occurrence of vivid dreams. This question was treated as all the others, i.e. the severity of impairment was scored from 0 (no symptoms) to 3 (severe impairment), and the frequency of impairment was scored from 0 (less than once a week) to 4 (daily impairment), then the total score of the sub-item was computed as the product of severity by frequency, and added to the scores of the other sub-items. In the endophenotypes analysis we used UPDRS score < 30 (corresponding to a mild phenotype: 1-2 of Hoehn and Yahr (HY) scale) and ≥ 30 (corresponding to a severe phenotype: 3-5 HY scale) [13], while in non-motor symptoms we used as cutoff ≤54 and > 54 (9 items × 2 (mild impairment) × 3 (weekly impairment)). Statistical Fig. 1 Graphical representation of the 26 families used as discovery cohort. Pedigrees of the 26 Italian families with supposedly dominant PD forms. Affected individuals are indicated with dark symbols. Patients who underwent WES analysis are shown with dark arrows. Additional family members for which DNA was available are indicated with an asterisk. Only affected subjects were used for segregation analysis. For each family, the rare deleterious variants reported in Table 1 were carried by the affected family members significance was set to α = 0.01, correcting for five phenotypes tested.

Italian control cohort
As general Italian population, we downloaded whole genome data of 107 samples of the Tuscan Italians (TSI) population of the 1000 Genome Project (phase 3 release) [14] and extracted the exome regions covered by the same capture kit used in our sequencing experiments (SureSelect All Exome kit v6; Agilent® Technologies, Santa Clara, CA, USA).

International cohort
WES data of 1148 young-onset unrelated PD cases (average age at onset 40.6 years; range 35-56 years) and 503 control participants of European ancestry from the International Parkinson's Disease Genomics Consortium (IPDGC) were used to replicate the observations made in the Italian cohort. Details on this dataset and on the QC carried out on these samples are reported elsewhere [16].

Next generation sequencing (NGS)
Genomic DNA was isolated from peripheral blood lymphocytes by PAX gene Blood DNA Midi Kit (QIAGEN, Hilden, Germany).

Whole-exome sequencing (WES)
WES was performed on 47 affected individuals (23 families) from PIB cohort, 6 affected individuals (3 families) and 106 independent and unrelated PD cases from MNI cohort. Exonic regions were enriched using the SureSelect All Exome kit v6 (Agilent® Technologies, Santa Clara, CA, USA) based on DNA fragmentation and capture. Exomes were barcoded and sequenced at Helmholtz Zentrum, München, Germany, using the Illumina® HiSeq2000 platform.

Targeted resequencing (NGS-TR)
NGS-TR was performed on 288 independent and unrelated PD cases from MNI cohort and on 96 Italian healthy individuals. Probes specific for 100 genes including the 26-targeted genes, identified in this study, and GBA were designed with Nimble Design software (Roche Diagnostics, Mannheim Germany). Targeting regions were enriched using the SeqCap kit based on DNA fragmentation (Kapa Hyper plus kit) and capture (Roche Diagnostics, Mannheim Germany). Targeted regions were barcoded and sequenced on MiSeq platform (Illumina, San Diego, CA, US).

Quality control and variant annotation
The alignments of the 100-bp paired-end reads to the human reference genome was performed by using the Burrows Wheeler Aligner (BWA) MEM v0.7.5 [17]. After removal of duplicate reads through Picard Mark-Duplicates command (with standard options), we called the single nucleotide variants (SNVs) and insertions/deletions (indels) for all samples using HaplotypeCaller (BP_RESOULTION option) and GenotypeGVCFs in Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, following the manufacturer best practice guidelines (available at https://software.broadinstitute.org/gatk/ best-practices/) [18]. Variants with Minor Allele Count (MAC) = 0, number of alternative alleles ≠ 2 and call rate < 95% were also filtered out, as well as samples with identical-by-descent (IBD) sharing and sex mismatches, and samples with call rate < 90%. Similarly, samples were checked for absence of outliers in terms of genome-wide homozygosity, of number of singleton variants and of genetic ancestry (through Multidimensional Scaling Analysis in PLINK) [19]. Variants passing quality control were annotated to genes (within 10 kb from transcription start/stop site) through ANNOVAR [20]. To discover novel PD causative variants we analyzed WES data from 53 PD patients (mean age at onset 61.71, SD 8.57) belonging to 26 late onset PD families with supposedly dominant transmission ( Fig. 1), by using a multi-step filtering approach as reported in our previous publications [21,22]. Prioritization of candidate genes was made through segregation analysis. STRING database analysis was used to identify network of PDinteracting genes products [23].

Polygenic variant load in unrelated Italian individuals
We identified 26 PD candidate genes in this study (AIMP2, CHMP1A, GIPC1, HMOX2, HSPA8, IMMT,  KIF21B, KIF24, MAN2C1, RHOT2, SLC25A39, SPTBN1,  TMEM175, TOMM22, TVP23A, ZSCAN21,  The co-inheritance of rare variants (MAF ≤ 0.001) was investigated by counting the number of variants laying in the 26 genes, for each sample analyzed, both cases and controls. Fisher's Exact tests were performed to test differences in the number of cases and controls carrying 0, 1 or ≥ 2 variants. Odds Ratios (ORs) and their corresponding 95% Confidence Intervals (95% CIs) were also estimated in R [24]. Bonferroni correction for two variant load contrasts was applied, resulting in a corrected α = 0.025.
Then an analysis of Receiver Operating Characteristic (ROC) curve was performed to establish the accuracy of the test.

Immunohistochemistry assays
Immunohistochemistry was performed on paraformaldehyde-fixed, wax-included brains as described previously [25]. Mouse and rat adult brains were obtained from healthy animals sacrificed in accordance with the recommendations of the European Commission. All the procedures related to animal treatments were approved by Ethic-Scientific Committee for Animal Experiments and Italian Health Ministry. Formalin-fixed and paraffin-embedded (FFPE) Substantia Nigra tissue sections from human adult normal brain were purchased at BioChain (

Identification of novel candidate genes for Parkinson's disease
The study aimed at the identification of genetic risk profiles for PD. The discovery cohort included 53 PD patients belonging to 26 families showing a dominant mode of inheritance of the disease (Fig. 1). As validation cohort, we used 1542 independent PD cases and 706 controls from different data sets of European ancestry (see Patients and Methods for details) (Table S1) [8,9,16,26].
We first analysed the WES data (without applying any filter) of the 53 PD patients belonging to 26 families to search for mutations in genes already known to be associated to dominant form of PD such as, SNCA, VPS35 and LRRK2. Subsequently, we searched for large insertion/deletions in SNCA and LRRK2 genes by using MLPA approach. All annotated variants were surveyed into the Parkinson's disease Mutation Database (https:// grenada.lumc.nl/LOVD2/TPI/home.php) to confirm their pathogenicity. One out of the 26 analyzed families carried a pathogenic mutation in LRRK2 gene (c.G4322A, p.R1441H).
To search for novel candidate-disease genes we used a multistep filtering approach. We selected splicing variants (regardless of their exonic or intronic position; minor allele frequency (MAF) ≤ 0.001) and exonic disruptive variants (MAF ≤ 0.001; CADD phred score ≥ 20; excluding synonymous changes), shared between affected relatives within each of the 26 families (siblings or parent-offspring pairs), assuming a dominant mode of transmission of the disease. The variants were selected assuming a MAF ≤ 0.001 in public exome databases including ESP6500si-v2, 1000 Genomes Project and ExAC. The cut off was set to 0.001, taking in account the prevalence of rare dominant diseases (1-2/10,000) and the MAF of the most frequent mutation associated with late onset PD p.(G2019S) in the LRRK2 gene, also known as rs34637584, which shows a MAF ranging from 0.001 to 0.0002 in different databases. Retained variants (6488) were filtered for CADD phred score ≥ 20 (3712 remaining variants), corresponding to those variants which are predicted to be amongst the 1% most deleterious variants in the genome. Subsequently we compared data of the affected relatives of each family to select shared variants. This analysis identified an average number of 10 variants shared between PD relatives per family. Segregation analysis was performed in additional affected relatives belonging to the same 26 PD families (1-3 individuals per family based on DNA availability) (Fig. 1). Considering the late age at onset of the disease unaffected family members were not investigated. This analysis disclosed 28 rare disruptive variants (23 nonsynonymous, 2 stop-gain, 1 frameshift, 2 non-frameshift deletions) laying in 26 genes, which were shared among familial PD cases in 18 out of the 26 analyzed families (Table 1). Among these, in 10 families we found single heterozygous deleterious variants in a single gene segregating with PD phenotype, supporting a dominant model of inheritance. Instead, we identified 2 variants in 6 families and 3 variants in 2 families in different genes segregating with PD phenotype suggesting a polygenic model of inheritance. Unfortunately, in the remaining eight families we were not able to identify potential causative variants, probably due to the highly stringent criteria used in the pipeline of analysis. Seven (PARK2, PINK1, DJ-1(PARK7), LRRK2, FBXO7, DNAJC6, DNAJC13) out of the 26 identified genes were already known as causing PD, including familial recessive forms, while three (HTRA2, EIF4G1, SNCAIP) genes were PD candidate genes/at risk factors for which a clear link with the disease is not completely established [5,6]. Moreover, it is interesting to note that in several families, single mutations in recessive genes such as PARK2, PINK1, PARK7, were co-inherited with variants in other candidate genes, suggesting that they might play a role as risk factors in the heterozygous status. The presence of a second mutation in these recessive genes including large deletions/insertions was excluded trough MLPA approach. Sixteen out of the 26 genes analyzed were novel PD candidate genes involved in pathways potentially deregulated in PD such as mitochondrial metabolism and oxidative stress (AIMP2, HMOX2, IMMT, MAN2C1, RHOT2, SLC25A39, TOMM22) [27][28][29][30][31][32][33][34][35], vesicular trafficking, microtubule dynamics, autophagy (CHMP1A, GIPC1, HSPA8, KIF21B, KIF24, SPTBN1, TMEM175, TVP23A) [36][37][38][39][40][41][42] or in SNCA gene expression (ZSCAN21) (Fig. 2a) [43,44]. STRING database analysis showed that nine out of the 16 novel genes (AIMP2, GIPC1, HSPA8, IMMT, RHOT2, SPTBN1, TMEM175, TOMM22, ZSCAN21) encoded for proteins interacting with known PD genes ( Fig. 2a and b) [39,45].
Only very recently, IMMT, ZSCAN21 and TMEM175 have been associated to PD [43,46,47] further supporting the discovery pipeline employed in this study.
To further support the genetic involvement of the 16 novel identified genes in PD pathogenesis, we extended the mutational analysis to 1542 independent PD patients (including MNI (394 cases) and IPDGC (1148 cases)) and 706 controls from different cohorts of European ancestry, to detect further disruptive variants within these genes and to compare the total load of these variants between cases and controls. Overall data identified 256 different variants (MAF ≤ 0.001; CADD phred score ≥ 20), of which 170 were present only in cases, 61 only in controls and 25 were shared between cases and controls. Remarkably, the 170 different deleterious variants (153 nonsynonymous, 5 stop-gain, 6 frameshifts, 3 nonframeshift deletions and 3 splicing) were identified in 90 Italian patients (MNI cohort) and 153 patients of the IPDGC cohort, confirming the extreme genetic heterogeneity at the basis of PD (Table S3). None of these variants was found in 706 healthy control subjects, including both IPDGC and Italian population controls (namely MNI healthy subjects and Tuscans Noteworthy, all the affected residues identified in PD patients occupy functionally important amino acid positions, which are highly conserved amongst vertebrates ( Fig. 3; Fig. S1). In summary, although functional data are necessary to substantiate the causative role of the variants identified, these results suggest a potential involvement of the candidate genes in PD etiopathogenesis and indicate an extreme genetic heterogeneity of mutations in PD patients.

Expression analysis in human, mouse and rat DA neurons
Expression analysis through quantitative PCR (qPCR) assays showed that the 16 novel PD genes were all transcribed in the mesencephalon of adult mice at post-natal day (P) 45 (Fig. 4). Then we studied through immunohistochemistry experiments whether five representative genes (TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8) were co-expressed with TH in adult DA neurons of the substantia nigra (SN) and ventral tegmental area (VTA). We first analyzed adult human SN neurons and we found that TH + neurons coexpressed all the five genes ( Fig. 5a and Fig. S2a). This analysis was repeated in P45 mouse (Fig. 5b-g and Fig.  S2b-d) and P60 rat (Fig. S3) brains. In mouse mdDA neurons belonging to the anterior SN located in the pretectum (Fig. 5b, c and Fig. S2b) or to the SN and VTA of mesencephalic origin (Fig. 5d-g and Fig. S2c, d) showed that the expression of TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8 colocalized with most of the TH + neurons (Fig. 5b-g and Fig. S2b-d). A similar result was observed when this expression analysis was performed in rat SN and VTA neurons of mesencephalic origin (Fig. S3). These data indicate that at least five out of the 16 novel PD genes were expressed in mdDA neurons supporting the possibility that they may be cellautonomously involved to provide mdDA neurons with functional features potentially controlling identity and/or survival and/or DA neurotransmission. Therefore, together with previous data, this expression analysis further supports the possibility that mutations identified in the 16 novel PD candidate genes may potentially predispose for PD occurrence.

Polygenic load analysis of rare exonic variants in familial and sporadic cases
Based on the observation that 8 out of the 18 families of the discovery cohort carried two or more variants in the 26 PD genes (Table 1), we hypothesized that the genetic combination of multiple rare variants in these genes may represent a specific genetic hallmark of PD risk. To test a b Fig. 2 Venn diagram and STRING analysis representing the logical relations between the 16 novel gene products and the network of PDinteracting genes products. a Venn diagram showing the 16 novel identified genes involved across different interconnected pathways: interactors of PD related genes, genes involved in mitochondrial metabolism and those involved in vesicular trafficking and autophagy. b Schematic representation of the network of interactions of the 16 novel PD gene products with known PD-associated genes. STRING database analysis identified 9 genes interacting with PD-associated genes this hypothesis, we analyzed in 394 unrelated and independent PD patients (including 172 familial and 222 sporadic cases from MNI data set) and in 203 controls (MNI, Moli-sani, and TSI data sets) [8,9,14,15] the coinheritance (per individual) of rare (MAF < 0.001) exonic (excluding synonymous) and splicing variants detected in the panel of the 26 genes (details in Patients and Methods). For this analysis, we selected rare variants based on a frequency threshold (MAF ≤ 0.001), without considering their degree of deleteriousness. This analysis identified 220 different variants among cases and 65 in the controls; only 18 of them were shared between case and controls (Table S4). The co-inheritance of these variants is reported in Table 2 and Fig. 6a Table 2). These differences remained statistically significant after Bonferroni correction for multiple testing of two contrasts (0 vs 1 and 0 vs ≥2 variants load classes) (α = 0.025). This finding corroborates the polygenic view of PD risk for both sporadic and familial cases. To measure the accuracy of the test, we applied the ROC curve and found a maximum sensitivity of 0.50 (based on the presence of at least one variant (specificity 0.66)) and a maximum specificity > 0.93 (based on the co-occurrence of two or more variants (sensitivity of 0.17)) (AUC = 0.59, SE = ± 0.020, p = 4.42 × 10 − 6 ) (Fig. 6b)  Moreover, the pool of rare variants identified in the PD cases was different from those detected in controls (Table S4). This suggests that, besides the load of rare variants (≥2), also the nature of the mutation is important in predisposing/causing PD.
Thus, the availability of a detailed database including variants detected only in PD cases or only in controls might be relevant to improve the operational characteristics of our predictive protocol in the general population.

Analysis of GBA gene
To test the involvement of GBA gene, that is the most important genetic risk factor for PD, alone and in combination with the 26 PD genes, we analyzed all GBA exonic variants annotated in the 26 late onset PD families and in the Italian validation cohort (394 independent PD cases and 203 controls) (Table S5). In the discovery cohort the GBA variant c.G1093A p.E365K (E326K, previous nomenclature) was present in the two affected siblings of family 7, co-inherited with single heterozygous variants in AIMP2, IMMT and PINK1 genes (Table 1), while the GBA variant c.C1223T p.T408M (T369M, previous nomenclature) was present in one of the two siblings of family 16 who also carried the c.G4322A p.R1441H mutation in LRRK2 gene and the deleterious variant c.A2710G p.K904E in SNCAIP. In the independent cohort of PD cases and controls we found a significant distribution of GBA variants (42 cases (10.6%) vs 8 controls (3.9%); p = 0.002, OR = 2.91 [1.34-6.32]).
Polygenic load analysis including multiple rare variants in the 26 genes as well as rare pathogenic variants in GBA gene showed that, approximately 20% of the PD patients carried two or more variants (cases 20.5% vs controls 7.2%; OR = 3.59 [1.97-6.90]; p = 3.4 × 10 − 6 ) ( Table 2). Remarkably, the increase in significance was mainly attributable to sporadic cases (sporadic cases increase from 13.9 to 18.5% with GBA mutations) ( Table 2).

Analysis of PD endophenotypes
To investigate the relation of the polygenic variant load in the selected 26 candidate genes with PD endophenotypes, including both motor and non-motor symptoms, we analyzed the cohort of 394 PD-MNI unrelated Fig. 4 Expression analysis of the novel identified Parkinson's disease genes. qPCR experiment showed that the 16 novel genes were expressed in mesencephalon of P45 adult mice. Pitx3 was used as an example of mdDA neurons restricted gene expression. Gapdh was used to normalize the results. Data were shown as absolute values with standard deviation calculated from three different experiments. Genes were grouped on the basis of their expression level in three different scales patients mentioned before, for which clinical assessments were available (see Patients and Methods for details on the scales used). More specifically, we tested how the presence of 0, 1 ≥ 2 rare variants in the 26 PD genes affected PD age at onset (AAO), motor symptoms (UPDRS scale) [10,13], non-motor symptoms (NMS scales) [12] and presence of L-dopa induced Dyskinesia (LID), a common motor side effect of levodopa therapy in PD patients [8,9]. Overall data show that the selected genes might influence preferentially LID occurrence, although the contrast would not survive correction for multiple testing of five phenotypes (p 0.038; Fig. 6c; Fig. 5 Expression analysis of TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8 in human and mouse DA neurons. a-g Immunohistochemistry experiments performed on human (a) and P45 mouse (b-g) adult brain sections showed that TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8 were co-expressed with TH in DA neurons of the human SN (a) and in mouse DA neurons of the SN (b-e) and VTA (f, g). Images in (c, e, g) correspond to magnification of the boxed area reported in (b, d, f). Scale bars correspond to 100 μm. Abbreviations: SN stands for substantia nigra and VTA for ventral tegmental area   Table S6A). When we took into account also GBA variants, this contrast was not significant anymore, while variant load was inversely associated with age at PD onset at the nominal significance level (p 0.044; Table  S6B; Fig. 6d).

Discussion
In this study, we discovered a set of 16 genes worthy of future studies to confirm their role in onset or progression in PD. We identified 170 rare and deleterious variants, absent in healthy individuals, in 15.7% of PD patients (243 PD cases out of 1542) supporting the extreme genetic heterogeneity of mutations in PD.
Although additional studies are needed to confirm the functional role of the novel identified genes in PD etiopathogenesis, a number of published studies support this hypothesis.
In particular, the transcription factor ZSCAN21 may directly control SNCA gene expression [28,37], and the c.C199T p.(R67W) mutation identified in this study affects a highly conserved arginine residue necessary for the formation of the ZSCAN21 functional homodimer [53]. TMEM175 is a lysosomal K+ channel that is required to maintain membrane potential and pH stability in lysosomes as well as to control α-synuclein aggregation [42,54]. These functions may be affected by knock down of the gene TMEM175 [42]. Although TMEM175 was previously identified in a genome wide association study of PD [47], detrimental mutations have not been identified before this study.
RHOT2 (MIRO2), IMMT and TOMM22 directly interact with PINK-1 and Parkin and influence both mitochondrial homeostasis and vulnerability of DA cells to toxins [27][28][29][30][31]55]. The aminoacyl-tRNA synthetase complex interacting multifunctional protein-2 (AIMP2) is a parkin substrate and its overexpression leads to a selective, age-dependent, progressive loss of DA neurons via activation of poly (ADP-ribose) polymerase-1 (PARP1) [35]. GIPC1 interacts with LRRK2 in DA cells in vitro and in the mouse ventral mesencephalon in vivo [45]. The Drosophila homolog of GIPC1 is expressed in the central brain of adult flies and its reduced expression significantly affects locomotive activity and longevity [56]. SLC25A39 is a mitochondrial carrier of manganese that promotes neuronal survival in Drosophila [34]. βIIspectrin, encoded by SPTBN1, is required for axon growth and axonal transport of synaptic cargo [57]. Moreover, it specifically interacts with α-synuclein and its overexpression induces neurite formation [39]. The chaperone-mediated autophagy (CMA) protein HSPA8/ HSC70 (heat shock protein 8) is essential in post-mitotic neurons for diluting toxic intracellular components including the disassembling of α-synuclein amyloid fibrils [58]. KIF21B is a processive kinesin whose depletion alters neuronal dendritic tree branching and spine formation [37]. Noteworthy, similarly to Kif21b knockout mice, which exhibit learning and memory deficits [37], our patients carrying KIF21B mutations (Table S3) showed mild cognitive impairment assessed with Montreal Cognitive Assessment (MoCA) test (mean 23 SD 2). KIF24 is a centriolar kinesin that specifically regulates ciliogenesis by dynamically controlling microtubule polymerization [38]. Single Nucleotide Polymorphisms (SNPs) located in the KIF24 gene have been associated to Frontotemporal Lobar Degeneration (FTLD) [59]. Interestingly, the patients of the MNI cohort carrying mutations affecting the motor domain of KIF24 protein (Fig. 3) showed mild-severe cognitive impairment (MoCA score ranging from 13 to 23).
HMOX exerts both pro-and anti-oxidant effects, which may influence the pathogenesis or progression of Parkinson's disease [32,60]. α-Mannosidase (MAN2C1) is the enzyme responsible for the partial demannosylation occurring in the cytosol. Man2c1-deficient mice revealed neuronal and glial degeneration [61]. Charged multivesicular body protein 1A (CHMP1A) is a member of the ESCRT-III (endosomal sorting complex required for transport-III) complex which localizes to the nuclear matrix and regulates chromatin structure [36,62]. CHMP1A regulates proliferation of neuronal progenitor cells [36,62]. TVP23A is a novel gene reported in gene database (Gene ID: 780776) as involved in intracellular vesicular transport.
Previous studies identified the p.G2019S mutation in LRRK2 as a major causative event in 1-4% of PD patients, depending on the analyzed population [63]. Within our cohort of patients, the p.G2019S mutation in LRRK2 was detected in 8 out of 394 PD patients (2%). Moreover, other studies also correlated the severity but not the occurrence of the PD to the cumulative effect of LRRK2 mutations and rare variants in other PD related genes (e.g. GBA) [26,[64][65][66].
In this study, we highlight the polygenic nature of both sporadic and familial forms of PD, especially when the familial forms come from small nuclear families. We identified a panel of 26 candidate genes carrying likely pathogenic variants and observed the presence of multiple rare variants (≥ 2) in these 26 genes in about 17% of PD patients, including familial and sporadic cases. This percentage increased to 20% of PD patients when we included rare pathogenic variants in GBA. Remarkably, this increase was mainly attributable to sporadic cases.
This means that the molecular diagnosis of PD should always take in account the analysis of multiple candidate genes. However, although the evidence reported here looks very promising, at this stage of advancement of our study we cannot offer any improvement in practicing medicine, until further confirmatory studies and functional validations are warranted. Undoubtedly, the extreme genetic heterogeneity observed in PD patients, everyone genetically and biologically unique, carrying different genetic combinations of rare variants not shared by the majority of PD patients, could cause great uncertainty in the interpretation of the results. Nevertheless, at a molecular level, these findings offer a rational support to address additional genetic studies testing this protocol in larger cohort of PD patients and controls. If these data were confirmed we could consider the possibility of using this protocol to identify, with high specificity, subjects potentially at risk of developing PD by counting per individual the number of rare variants in the panel of candidate genes.
Moreover, our data also underline the importance of rare variants in the genetics of sporadic PD and corroborate the proposed approach as complementary to genome wide association studies (GWAS) and to classical co-segregation analyses of NGS data to identify variants implicated in PD.
The present study also reports, for the first time, statistical evidence that the polygenic variant load in the PD candidate genes analyzed is associated with the occurrence of L-dopa induced Dyskinesia, an important motor side effect of PD treatment. Although it is tempting to speculate about the translational relevance of this finding, caution is suggested in the interpretation since the association was only nominally significant and disappeared when GBA variants were included in the analysis. Interestingly, in the latter setting a high variant burden was associated with earlier PD onset, in line with previous evidence [67,68].

Conclusions
Our findings encourage further studies to improve the predictive/diagnostic power of this protocol for PD and lead us to surmise that this strategy may hypothetically be extended to other polygenic disorders with largely unknown genetic etiology, such as dyslexia [69]. Therefore, in the future studies on PD genetic aetiology should be directed towards analyzing the polygenic burden of rare variants with presumably large effect size, and validating them at the functional level.
Additional file 1: Figure 1S. Evolutionary conservation study of the likely pathogenic variants identified in the 16 novel Parkinson's disease candidate genes. The affected residues are reported in bold underlined. Figure 2S. Single channel images of human and mouse immunostainings shown in Fig. 5. (a-d) Single channel images showing expression of TH in combination with TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8 in human SN neurons (a) and in pretectal SN (b) mesencephalic SN (c) and VTA (d) mouse neurons. Images in (a) correspond to those reported in Fig. 5a; images in (b-d) correspond to those reported in Fig. 5c, e, g, respectively. Figure 3S. Expression analysis of TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8 in rat DA neurons. ae Immunohistochemistry experiments performed on P60 rat brain sections showed that TOMM22, GIPC1, ZSCAN21, SLC25A39 and HSPA8 were co-expressed with TH in mesencephalic DA neurons of the SN (b, c) and VTA (d, e). Images in (c, e) correspond to magnification of the boxed area reported in (b, d). Sections in (a) are low magnification images showing only TH immunostaining and corresponding to the same sections reported in (b-e). Scale bars correspond to 100 μm. Abbreviations: SN stands for substantia nigra and VTA for ventral tegmental area. In (c, e) are shown merge and single channel images. Table S1. Cohort description. Table S2. List of primers used for gene expression analysis by quantitative PCR (qPCR). Table S3. Rare disruptive variants (MAF ≤ 0.001, CADD phred ≥ 20) identified in the 16 novel PD candidate genes in the validation cohorts. Table S4. Rare variants (MAF ≤ 0.001) in the 26 PD candidate genes in affected and healthy individuals of the Italian cohort. Table S5. GBA variants annotated in Italian cohort of patients and controls.