Deciphering cellular transcriptional alterations in Alzheimer’s disease brains

Large-scale brain bulk-RNAseq studies identified molecular pathways implicated in Alzheimer’s disease (AD), however these findings can be confounded by cellular composition changes in bulk-tissue. To identify cell intrinsic gene expression alterations of individual cell types, we designed a bioinformatics pipeline and analyzed three AD and control bulk-RNAseq datasets of temporal and dorsolateral prefrontal cortex from 685 brain samples. We detected cell-proportion changes in AD brains that are robustly replicable across the three independently assessed cohorts. We applied three different algorithms including our in-house algorithm to identify cell intrinsic differentially expressed genes in individual cell types (CI-DEGs). We assessed the performance of all algorithms by comparison to single nucleus RNAseq data. We identified consensus CI-DEGs that are common to multiple brain regions. Despite significant overlap between consensus CI-DEGs and bulk-DEGs, many CI-DEGs were absent from bulk-DEGs. Consensus CI-DEGs and their enriched GO terms include genes and pathways previously implicated in AD or neurodegeneration, as well as novel ones. We demonstrated that the detection of CI-DEGs through computational deconvolution methods is promising and highlight remaining challenges. These findings provide novel insights into cell-intrinsic transcriptional changes of individual cell types in AD and may refine discovery and modeling of molecular targets that drive this complex disease.


4
Background 62 Alzheimer's disease (AD) is a neurodegenerative disease that affects ~5.7 million 63 patients with annual cost of more than $230 billion in the US 1 . Effective disease-modifying 64 drugs are still elusive despite the urgent need and increasing global burden 2,3 . Pathologically, AD 65 is marked by amyloid-beta plaques and neurofibrillary tangles, along with neuronal loss and 66 gliosis in the affected brain regions. Transcriptome-wide expression levels have been analyzed 67 from bulk brain tissue of hundreds of AD patients and neuropathologically normal controls 4-8 to 68 discover genes and biological pathways that are perturbed in and/or lead to AD. Systems biology 69 and bioinformatics analysis of these data have implicated altered pathways in AD including 70 immune response 6 and myelin metabolism 4,5 . However, a fundamental knowledge gap remains 71 concerning whether disease-associated changes in brain gene expression are due to changes in 72 cellular composition of the AD samples secondary to disease neuropathology, or due to changes 73 in the intrinsic regulation/activity of genes in the central nervous system (CNS) cells. From a 74 clinical perspective, it is difficult to target changes in cellular composition secondary to 75 neuropathology, whereas intrinsic changes in gene expression that may drive AD progression are 76 potentially "druggable". 77 We expect that identifying cell-intrinsic differentially expressed genes (CI-DEGs) of 78 individual CNS cell types will reveal novel insights into the genes and pathways that could 79 potentially identify drug targets and lead to AD therapeutics. This approach circumvents the 80 influence of cell-composition changes that can impact disease associated DEGs obtained from 81 bulk tissue transcriptome analysis. RNA sequencing (RNAseq) studies from single cell, single 82 nucleus or purified adult human CNS cells 9-11 can be used to identify CI-DEGs. Even though 83 such single cell-type RNAseq data can effectively serve as a reference to annotate bulk-tissue transcriptome data 4 , such approaches remain costly, cumbersome and limited in sample sizes. On 85 the other hand, there exist large-scale bulk brain RNAseq datasets 5,8,12 , which can be leveraged to 86 identify CI-DEGs through analytic deconvolution of bulk tissue expression into signals of 87 individual cell types by using cell proportions or their proxies 13,14 . 88 In this study, we describe the design and application of a bioinformatics pipeline that uses 89 cell type marker genes to estimate cell proportion 15,16 to deconvolute bulk tissue transcriptome 90 data using three computational approaches 13,14,17 and to subsequently identify CI-DEGs. We 91 applied our pipeline to the analysis of three post-mortem brain datasets, one from dorsolateral 92 prefrontal cortex (DLPFC) 8 and two from temporal cortex (TCX) 4,12,18 regions, comprised of 685 93 unique samples. Consensus CI-DEGs common to both TCX and DLPFC regions were analyzed 94 for enrichment of gene ontology (GO) terms. We compared the results of consensus CI-DEGs to 95 consensus bulk-DEGs. In addition, for the DLPFC 8 dataset that had both bulk and single nucleus 96 RNAseq 19 (snRNAseq) data, we compared the CI-DEGs from the computational deconvolution 97 to CI-DEGs obtained from snRNAseq 19 . 98 To our knowledge, this is the first study to detect consensus CI-DEGs and their enriched 99 gene ontology (GO) terms from multiple brain regions using multiple computational 100 deconvolution algorithms for AD and control RNAseq samples. Our study illustrates the cell 101 proportion landscape of AD and control brain regions assessed in three independent RNAseq 102 studies 4,7,8,12 . We identify consensus CI-DEGs many of which are not observed in bulk-DEG 103 analysis and characterize their cell-type specificity. GO terms that are enriched for CI-DEGs 104 implicate cell intrinsic transcriptional alterations that may influence AD, rather than be a result 105 of cell-proportion changes in this disease. These CI-DEGs and their biological pathways may 106 serve as refined molecular targets for therapeutic discoveries and disease modeling in AD. Our study also demonstrates that detection of CI-DEGs through computational deconvolution 108 methods is promising while some challenges remain.

111
Cellular composition in three brain cohorts from two brain regions 112 We analyzed three cohorts each consisting of post-mortem brains from AD and control 113 subjects ( An inspection about the pairwise correlation between marker genes (Fig 1a) revealed that In addition, a computer simulation study (Fig S1) 129 demonstrated that the estimated proportions of OPC were not robust upon using different selection of marker genes. Therefore, we did not include OPC in downstream analyses in this 131 study.

132
In all three datasets, neuronal cell proportion estimates were significantly lower in AD 133 compared to controls (Fig 1b). The magnitude of this decrease was the greatest for TCX-Mayo (1.40 and 1.30 respectively) than in DLPFC (1.07 and 1.14 respectively) for both cell types.

142
Oligodendrocyte proportion is significantly higher in AD in DLPFC with AD:control ratio 1.14 143 and TCX-MSBB with AD:control ratio 1.27, although remains unchanged in TCX-Mayo with 144 the ratio 0.94. Collectively, these findings demonstrate that the proportions of CNS cell types are 145 different in post-mortem AD vs. control brains for most cell types. Although these proportional 146 changes with AD are mostly consistent across the different studies, their extent varies across 147 brain regions, with TCX tending towards higher magnitude of neuronal loss and microglia 148 proliferation than DLPFC. It needs to be emphasized that the cell proportion changes estimated 149 here are relative values, rather than absolute cell proportion changes between ADs and controls.

158
In this study, three computational approaches were applied to identify cell intrinsic 159 differential expression in individual cell types (CI-DEGs, Table S4-S6), namely CellCODE 14 , 160 PSEA 13 and our method WLC. Differentially expressed genes from bulk brain tissue (bulk-161 DEGs) were identified through linear regression without adjusting for cellular composition 162 (Table S7). For the DLPFC, TCX-Mayo and TCX-MSBB datasets, we obtained bulk-DEGs and CI-DEGs from the three computer algorithms for neuronal, oligodendrocytic, microglial, 164 astrocytic and endothelial cell types respectively. 165 We compared bulk-DEGs across the three datasets (Fig 2a, top panel). Similarly, CI-166 DEGs from CellCODE, PSEA and WLC are compared across datasets (Fig 2a, lower panels), 167 such that CI-DEGs shared between datasets are required to be consistent in the designated cell

185
To obtain the consensus CI-DEGs that are shared between DLPFC and TCX brain 186 regions, we selected those CI-DEGs that are detected in "DLPFC and TCX-Mayo" or in "DLPFC and TCX-MSBB" under any of the three algorithms (Fig S2). We combined all such 188 genes, which collectively comprised the consensus CI-DEGs for each cell type (Fig 2b). 189 Similarly, consensus bulk-DEGs were the combined set of bulk-DEGs shared between "DLPFC 190 and TCX-Mayo" or "DLPFC and TCX-MSBB".  With regards to consensus bulk-DEGs (Fig S3), 28.2% of them (885/3135) are cell type 200 markers; 10.4% neuronal markers, 5.6% oligodendrocyte, 3.4% microglia, 4.8% astrocyte and 201 4.0% endothelial markers. The above observations indicate that computational deconvolution 202 algorithms could identify CI-DEGs for both marker genes and non-marker genes. Importantly, 203 the proportion of non-marker CI-DEGs is greater than that in bulk-DEGs. This suggests that 204 compared to bulk-DEGs, CI-DEGs may be capturing a greater proportion of expression changes 205 that are not due to mere cell population changes. 206 We also compared the consensus bulk-DEGs with consensus CI-DEGs (Fig S4). We 207 determined that only a small fraction (15.0% or 29/193) of the up-regulated neuronal CI-DEGs 208 was also present in up bulk-DEGs although the overlap is still significant (Fig 2c). In DEGs were also present in down bulk-DEGs, whereas most of the down-regulated CI-DEGs of 212 the other four cell types were absent from this group. Since bulk-DEGs did not adjust for 213 neuronal loss and gliosis in AD (Fig 1b), its ability to identify up-regulated neuronal genes and 214 down-regulated glial genes is likely to be compromised. For the same reason, bulk-DEGs may 215 have a false inflation of detecting down-regulated neuronal and up-regulated glial genes.

218
To identify pathways implicated by CI-DEGs that are robust across brain regions, we 219 performed Gene Ontology (GO) enrichment analysis 20,21 for the consensus CI-DEGs, assessing 220 separately those that are up vs. down in AD subjects (Table S8- (Table S10). 249 Astrocytes, a cell type that plays a critical role in maintaining brain energy dynamics 33  Table S12). 262 Importantly, some CI-DEGs highlight protein translation as a top perturbed biological  (Table S15). Similarly, down-regulated endothelial consensus CI-DEGs also 266 harbor ribosomal protein encoding genes, with enrichment in protein translation related GO 267 processes (GO:0006413 and GO:0006613) ( Table S17).

Comparison of CI-DEGs from computational deconvolution vs. snRNAseq
We determined the extent to which each of the three computational deconvolution 276 algorithms could detect CI-DEGs from bulk tissue by comparison of their results with those 277 obtained in a published snRNAseq study 19 . The ROSMAP dataset utilized in our study has both 278 bulk RNAseq from DLPFC (bulk-DLPFC) as well as snRNAseq (snDLPFC) in a subset of its 279 participants 19 . We compared the bulk-DLPFC data deconvoluted with three different algorithms 280 with the published snDLPFC 19 data. Endothelial CI-DEGs were not available from the 281 snRNAseq study, therefore overlap of results could be assessed only for four cell types. 282 We tested the overlap between the top CI-DEGs for each cell type obtained from 283 deconvoluted bulk-DLPFC and those from snDLPFC ranked by their p values (Fig 4a). We 284 evaluated the overlap for a range of top CI-DEGs up to top 1,000 genes. Overlap for CI-DEGs 285 that are either up (Fig 4a, upper panel) or down (Fig 4a, lower panel) in AD were assessed 286 separately. Hence, overlapping genes had both similar ranks and direction of effect in both 287 deconvoluted bulk-DLPFC and snDLPFC analyses. We established the significance of overlap 288 using simulations for a range of top ranked genes (N=200, 600 and 1,000) ( Table S18).  Amongst these comparisons, we determined that the significance for overlap was best for 297 all algorithms for the top ranked 600 genes. Using WLC deconvoluted results, the overlap for the top 600 CI-DEGs from bulk-DLPFC and snDNPFC are statistically significant for all eight 299 comparisons (Fig 4a). For the top 600 genes, overlap with CellCODE results is significant for all 300 except down-regulated oligodendrocyte and up-regulated astrocyte CI-DEGs. For PSEA, none of 301 the microglia CI-DEGs had significant overlap. PSEA results for the top 600 genes were 302 otherwise significant for all but up-regulated oligodendrocyte and down-regulated astrocyte 303 genes. 304 We also performed a comparison of CI-DEGs identified at nominal significance (p-value 305 <0.05) with each algorithm from bulk-DLPFC to nominally significant snDLPFC results (Fig   306   4b, Table S19). As with the above comparison, genes that are either up or down in both 307 deconvoluted bulk-DLPFC and snDLPFC data were analyzed separately for each cell type.

308
Not surprisingly, down-regulated neuronal CI-DEGs have the greatest overlap (537/3732 309 or 14.4% for WLC, 292/3213 or 9.1% for CellCODE, 415/3516 or 11.8% for PSEA). These 310 overlaps are significant for all three algorithms (Table S19) (Table S19). There is an increasing number of large scale RNAseq-based transcriptome datasets 329 generated in bulk tissue for many diseases, including brain tissue from patients with AD, other 330 neurodegenerative diseases and controls 5-8,12 . Comparison of such transcriptome data from 331 patient and control individuals has been instrumental in the identification of genes and co-332 expression networks that are altered in and may therefore be risk factors for these diseases 4-7,44 .

333
The discovery that many of these transcriptional networks harbor genes with disease risk variants 334 provides support for the utility of this bulk-transcriptome approach in deciphering molecules and 335 pathways that are risk factors for these conditions. Nevertheless, there is clear evidence for There are two general approaches to decipher cell-specific transcriptional changes in AD.

351
One approach is to conduct single nucleus (snRNAseq), single cell (scRNAseq) or purified cell 352 RNAseq experimentally, followed by data analyses. The alternative approach is to design 353 relatively complex bioinformatics pipelines to decipher cell-intrinsic information of individual 354 cell types from bulk tissue microarray or RNAseq data. The first approach is limited due to

363
In this study, applying three different algorithms 13,14 including our own novel approach, 364 we estimated cell-intrinsic gene expression for deconvoluted cell types from three large bulk 365 RNAseq datasets 4,8,12,18 from two brain regions. We identified consensus cell-intrinsic 366 transcriptional alterations (CI-DEGs) in AD, which are conserved across cohorts and brain 367 regions. We also performed an in-depth comparison of these CI-DEGs with bulk brain RNAseq 368 data obtained from the same datasets, collectively comprised of 685 unique brain samples. To

374
The main findings from our study can be summarized as follows: 1) The direction of 375 change in cellular proportions in AD is consistent across two brain regions and three datasets for   Notably, the computational approach we describe can be extended to cellular sub-types.

434
To exemplify this, we utilized published snRNAseq 19 data to identify marker genes of excitatory 435 and inhibitory neurons. Using those and the previously applied markers for the glial cells, we ran 436 DSA algorithm [3] to estimate cell proportion (Fig S6). We applied PSEA, WLC and CellCODE 437 to identify consensus CI-DEGs in the same fashion as described (Table S20, Fig S7). As  down-regulated, respectively, in endothelia (Fig 3, Tables S12, S17). These findings highlight 467 the ability of the computational deconvolution approach to provide biological insights that could 468 be missed by snRNAseq approaches.

488
Comparison of CI-DEGs deconvoluted from bulk-DLPFC and those detected using an 489 orthogonal approach of snRNAseq 19 from the same cohort (ROSMAP) demonstrates the ability 490 of these computational approaches to identify true cell-specific expression changes in AD. Using 491 our in-house WLC algorithm, there was significant overlap with the results from snRNAseq and 492 CI-DEGs of most cell types (Fig 4, Tables S18-S19). CellCODE and PSEA also identified 493 significant overlaps, but to a lesser extent, especially for rarer cell types such as microglia.

494
Hence, our WLC algorithm demonstrates at least comparable performance to these two 495 algorithms 13,14 . This is also supported by the higher degree of overlap among different cohorts 496 for CI-DEGs detected by WLC than the other two algorithms (Fig 2a). Due to the challenges in 497 deconvoluting noisy data from human series, different computational approaches may be utilized 498 and combined, and that calls for a more devoted effort in developing such algorithms.

499
In summary, using three distinct computational approaches, we deconvoluted brain bulk-500 RNAseq data from three large and independent cohorts 8,12,18 . We detected cell population 501 changes that are observed consistently across cohorts, and congruent with the known disease 502 pathology. Although there is significant overlap between consensus CI-DEGs and consensus

511
Analysis datasets 512 We generated the TCX-Mayo data, which consists of temporal cortex RNAseq 513 measurement of 80 AD patients and 28 controls diagnosed according to neuropathologic 514 criteria 4,12 . RNAseq data were processed and quality control (QC) was conducted as It should be noted that the CERAD 55 neuritic plaque score as applied by the ROSMAP 525 study is defined such that high CERAD indicates lower neuritic plaque burden and decreased 526 probability of AD. In the MSBB study, higher CERAD indicates higher plaque burden.

527
Raw RNA read counts were normalized using conditional quantile normalization (CQN) 528 method 56 implemented in R cqn package, as previously described 4 . This normalization takes into 529 consideration library size, gene length and GC content. It also performs a log2 transformation so 530 that the resulting distribution for each gene is Gaussian-like. We also determined covariates that 531 contributed significantly to the variation of gene expression in these RNAseq cohorts (Fig S5) 532 for adjustment in the analyses.

543
In this study we identified CI-DEGs from deconvoluted bulk RNAseq data using three 544 different algorithms, namely PSEA 13 , CellCODE 14 and our in-house algorithm WLC. All 545 analyses adjusted for the following variables: Sex, RIN, age at death and batch for DLPFC and 546 TCX-MSBB datasets, and sex, RIN and age at death for TCX-Mayo dataset (Fig S5). 547 PSEA 13 applies model selection procedures to select cell type(s) that should be included 548 in baseline (control) or AD condition, and then estimate differential expression in specific cell 549 types (CI-DEGs). We used the R package PSEA to obtain CI-DEG results of PSEA approach, 550 through functions em_quantvg (to generate candidate models) and lmfitst (to fit all candidate 551 models and pick the best one). Expression values used in PSEA are in linear scale (non log-552 transformed).

553
CellCODE 14 assesses overall gene expression difference between AD and control groups 554 with adjustment of relative cell proportion, followed by assigning the cell type that correlates 555 best with the difference (CI-DEGs). R package CellCODE was used to obtain DEG results of CellCODE approach, through functions getAllSPVs (to construct surrogate variable using 557 marker genes through singular value decomposition) and lm.coef (to estimate difference between 558 AD and control groups). Strictly speaking, CellCODE measures the overall differential 559 expression rather than CI-DEGs but identifies the cell type that is most correlated with the 560 observed difference between AD and control using cellPopT function. However, for simplicity, In Eqs.1-3, is the observed expression of a gene in subject i; x i,t is the median marker gene 583 expression of cell type t in subject i; , is covarite in subject i. In Eq.1, is the overall 584 relative expression in cell type t. In Eq.2, β t is relative expression at the baseline condition in cell 585 type t; d i = 0 if subject i is in control group, and d i = 1 if subject i is in AD group; therefore, 586 β t ∆ is the difference of relative gene expression between baseline condition and AD condition in 587 cell type t. Of note, due to the biological meaning these coefficents, they must satisfy constraints 588 such that , , + ∆ be non-negative. In addition, y i is in linear scale rather than in log 589 scale 57 , and these non-   (Table S20). Excitatory and inhibitory CI-DEGs were 648 identified as described above.  from Mayo Foundation. This study was in part funded by NIH RF1 AG051504 and R01