Large-scale genomic and transcriptomic initiatives offer unprecedented insight into complex traits, but clinical translation remains limited by variant-level associations without biological context and lack of analytic resources. Our resource, PhenomeXcan, synthesizes 8.87 million variants from genome-wide association study summary statistics on 4091 traits with transcriptomic data from 49 tissues in Genotype-Tissue Expression v8 into a gene-based, queryable platform including 22,515 genes. We developed a novel Bayesian colocalization method, fast enrichment estimation aided colocalization analysis (fastENLOC), to prioritize likely causal gene-trait associations. We successfully replicate associations from the phenome-wide association studies (PheWAS) catalog Online Mendelian Inheritance in Man, and an evidence-based curated gene list. Using PhenomeXcan results, we provide examples of novel and underreported genome-to-phenome associations, complex gene-trait clusters, shared causal genes between common and rare diseases via further integration of PhenomeXcan with ClinVar, and potential therapeutic targets. PhenomeXcan (phenomexcan.org) provides broad, user-friendly access to complex data for translational researchers.
Unprecedented advances in genetic technologies over the past decade have identified over tens of thousands of variants associated with complex traits (1). Translating these variants into actionable targets for precision medicine or drug development, however, remains slow and difficult (2). Existing catalogs largely organize associations between genetic variants and complex traits at the variant level rather than by genes and often are confined to a narrow set of genes or traits (3). This has greatly limited development and application of large-scale assessments that account for spurious associations between variants and traits. As a result, only 10% of genes are under active translational research, with a strong bias toward monogenic traits (4, 5).
Complex diseases are generally polygenic, with many genes contributing to their variation. Concurrently, many genes are pleiotropic, affecting multiple independent traits (6). Phenome-wide association studies (PheWAS) aim to complement genome-wide association studies (GWAS) by studying pleiotropic effects of a genetic variant on a broad range of traits. Many PheWAS databases aggregate individual associations between a genetic variant and a trait, including GeneATLAS [778 traits from the UK Biobank (http://geneatlas.roslin.ed.ac.uk/trait/)] (7), GWAS Atlas [4155 GWAS examined over 2965 traits (https://atlas.ctglab.nl/)] (8), and PhenoScanner [more than 5000 datasets examined over 100 traits (www.phenoscanner.medschl.cam.ac.uk/)] (9). Other PheWAS databases are constructed on the basis of polygenic scores estimated from multiple variants per GWAS locus (10), latent factors underlying groups of variants (11), or variants overlapping between GWAS and PheWAS catalogs (12). By building associations directly from variants (most of which are noncoding), most PheWAS results lack mechanistic insight that can support proposals for translational experiments. Genes are primarily assigned to PheWAS results by genomic proximity to significant variants, which can be misleading (13). Some studies have attempted to improve translation of PheWAS results using gene sets and pathways (14) or networks of PheWAS variants and diseases (15, 16). However, these studies rely on the same variant-trait associations on which PheWAS are built and fall short of prioritizing likely actionable targets.
Integration of genomic, transcriptomic, and other regulatory and functional information offers crucial justification for therapeutic target identification efforts, such as drug development (17). Translational researchers also need access to this integrated information in a comprehensive platform that allows convenient investigation of complex relationships across multiple genes and traits.
To meet this need, we present PhenomeXcan, a massive integrated resource of gene-trait associations to facilitate and support translational hypotheses. Predicted transcriptome association methods test the mediating role of gene expression variation in complex traits and organize variant-trait associations into gene-trait associations supported by functional information (18–20). These methods can describe direction of gene effects on traits, supporting how up- or down-regulation may link to clinical presentations or therapeutic effects. We trained transcriptome-wide gene expression models for 49 tissues using the latest Genotype-Tissue Expression (GTEx; v8) data (21) and tested the predicted effects of 8.87 million variants across 22,515 genes and 4091 traits using an adaptation of the PrediXcan method (18), Summary-MultiXcan (S-MultiXcan), that uses summary statistics and aggregates results across tissues (22). We then prioritized genes with likely causal contributions to traits using colocalization analysis (23). To make computation feasible given the large scale of data in this study, we developed fastENLOC (fast enrichment estimation aided colocalization analysis), a novel Bayesian hierarchical colocalization method. We showed separately that this approach of combining an association and a colocalization method performs better than each method individually at prioritizing causal genes and is comparable to baselines such as the nearest gene while incorporating greater biological context (24). We demonstrate results from integrating this tool with a deeply annotated gene-trait dataset to identify associations; this integration can be performed in any deeply annotated database of genes and traits, including molecular or biological traits rather than disease traits. PhenomeXcan is the first massive gene-based (rather than variant-based) trait association resource. Our approach not only uses state-of-the-art techniques available to biologically prioritize genes with possible contributions to traits but also presents information regarding pleiotropy and polygenicity across all human genes in an accessible way for researchers. Below, we provide several examples that showcase the translational relevance and discovery potential that PhenomeXcan offers.
PhenomeXcan design and overall findings
We built a massive gene-to-phenome association resource that integrates GWAS results with gene expression and regulation data. We ran a version of PrediXcan (18), S-MultiXcan, designed to use summary statistics and aggregate effects across tissues (22) on publicly available GWAS. In total, we tested the predicted effects of 8.87 million variants across 22,515 genes and 4091 traits from publicly available GWAS summary statistics (see Supplementary Materials). Traits incorporate binary, categorical, or continuous data types and range from basic anthropometric measurements to clinical traits and biochemical markers. We inferred association statistics (P values and Z scores) between predicted gene-expression variation and traits using optimal prediction models trained using 49 tissues from GTEx v8 (21, 25). LD (linkage disequilibrium) contamination due to proximity between expression quantitative trait loci (eQTLs) and causal variants can produce noncausal, spurious gene-trait associations (21, 24). We therefore first performed Bayesian fine mapping using the DAP-1/fgwas algorithm in TORUS (26, 27). We then calculated the posterior probability of colocalization between GWAS loci and cis-eQTLs to prioritize possible causal genes via fastENLOC, a newly developed Bayesian hierarchical method that uses precomputed signal clusters constructed from fine mapping of eQTL and GWAS data to speed up colocalization calculations (see Supplementary Materials). The result is a matrix of 4091 traits and 22,515 genes in which each intersection contains a PrediXcan P value aggregated across 49 tissues and refined by a locus regional colocalization probability (locus RCP) (Fig. 1). While a given colocalization threshold may be arbitrary, to minimize false negatives given the conservative nature of colocalization approaches (24), we defined putative causal gene contributors as those genes with locus RCP >0.1.
We found 72,994 significant associations (Bonferroni-corrected P value of <5.49 × 10−10) across the entire genome/phenome space, where 22,219 (30.5%) had locus RCP >0.1 (table S1). We constructed a quantile-quantile plot of all associations, which did not show evidence of systematic inflation (fig. S1). These associations represent numerous potential targets for translational studies with biological support.
Replicating known gene-trait associations
We evaluated PhenomeXcan’s performance using three different independent validation approaches. For the first validation, we compared significant results from PhenomeXcan to significant results from the PheWAS catalog, which combines the NHGRI-EBI (National Human Genome Research Institute – European Bioinformatics Institute) GWAS catalog (as of 4/17/2012) and Vanderbilt University’s electronic health record to establish unique associations between 3144 variants and 1358 traits (https://phewascatalog.org/phewas) (12, 28). These gene-trait pairs, mapped to GWAS loci mostly by proximity, are likely enriched in but do not necessarily represent causal genes. We mapped traits from PhenomeXcan to those in the PheWAS catalog using the Human Phenotype Ontology (HPO) (29). After filtering for genes included in both PhenomeXcan and the PheWAS catalog, we tested 2202 gene-trait associations. At a nominal threshold (P < 0.01), 1005 PhenomeXcan gene-trait associations replicated with matched traits in the PheWAS catalog [area under the curve (AUC) = 0.62; Fig. 2A]. Considering different methods of gene assignments for each GWAS locus (PheWAS: proximity, PhenomeXcan: PrediXcan and Bayesian colocalization), we further evaluated our replication rate using random classifiers in a precision-recall (PR) curve (Fig. 2B) and found significant replicability between PhenomeXcan and PheWAS results (empirical P value of <0.01).
For the second validation, we identified a set of high-confidence gene-trait associations using the Online Mendelian Inheritance in Man (OMIM) catalog (30). We previously demonstrated that integrated analysis using PrediXcan (18) and colocalization (23) successfully predicts OMIM genes for matched traits (24). We mapped 107 traits from PhenomeXcan to those in OMIM using the HPO (29) and curated a list of 7809 gene-trait associations with support for causality. We compared gene-trait associations from this standard near GWAS loci (table S2) and found that both PrediXcan and fastENLOC in PhenomeXcan successfully predict OMIM genes (AUC = 0.64; Fig. 2C). The combination of PrediXcan and fastENLOC improves precision in this dataset (fig. S2). The limited precision seen here is expected in the setting of genes, such as those in OMIM, with large effects and rare variants (Fig. 2D). The conservative nature of colocalization analysis can lead to increased false negatives (24), which may contribute to decreased performance of fastENLOC.
For the third validation approach, we applied a “medium-throughput” approach to examine a disease trait with multiple functionally established gene-trait associations. The Accelerating Medicines Partnership: Type 2 Diabetes (AMP T2D) Knowledge Portal curates a list of genes with causal, strong, moderate, possible, and weak associations to type 2 diabetes based on functional data (table S3) (31). We tested the ability of both PrediXcan and fastENLOC in PhenomeXcan to successfully predict the causal, strong, and moderate genes curated by AMP T2D Knowledge Portal paired with seven UK Biobank traits: “type 2 diabetes,” “type 2 diabetes without complications,” “type 2 diabetes with ophthalmic complications,” “type 2 diabetes with peripheral circulatory complications,” “Self-reported type 2 diabetes,” “Non-insulin dependent diabetes mellitus,” and “Unspecified diabetes mellitus.” PhenomeXcan successfully predicted the causal gene list for type 2 diabetes (AUC = 0.67; Fig. 3, A and B).
Identifying novel and underreported gene-trait associations
PhenomeXcan provides a resource for hypothesis generation using gene-trait associations, with more than 22,000 potentially causal associations (P < 5.49 × 10−10, locus RCP > 0.1; table S1). As case studies, we discuss associations identified on the basis of trait [“Morning/evening person (chronotype)”] and gene (TPO).
We reviewed the 15 most significant genes associated with Morning/evening person (chronotype) (a UK Biobank trait) based on PrediXcan P values across the 49 tissues and locus RCP >0.1 (table S4). Three of 15 genes had not been previously reported in any GWAS involving UK Biobank participants related to sleep or chronotype: VIP, RP11-220I1.5, and RASL10B. Notably, a variant associated with VIP (P = 1.812 × 10−17, locus RCP = 0.26) is discussed in a GWAS of 89,283 individuals from the 23andMe cohort who self-report as “a morning person” (rs9479402 near VIP, 23andMe GWAS P = 3.9 × 10−11) (32). VIP produces vasoactive intestinal peptide, a neurotransmitter in the suprachiasmatic nucleus associated with synchronization of circadian rhythms to light cycles (33). The long noncoding RNA RP11-220I1.5 (P = 6.427 × 10−11, locus RCP = 0.20) and the gene RASL10B (P = 1.098 × 10−10, locus RCP = 0.15) have not been previously reported in any GWAS or functional/clinical studies associated with this trait. RASL10B produces a 23-kDa guanosine triphosphatase protein that demonstrates overexpression in the basal ganglia in GTEx (21), potentially representing a novel association. Besides VIP, three other genes in this set had clinical/functional studies associated with sleep or chronotype in PubMed: RAS4B, CLN5, and FBXL3. RAS4B (P = 1.660 × 10−19, locus RCP = 0.63) was linked to a transcriptional network regulated by LHX1 involved in circadian control (34). CLN5 (P = 5.248 × 10−18, locus RCP = 0.34) mutations are associated with neuronal ceroid lipofuscinosis, which can manifest with sleep-specific dysfunction (35). FBXL3 (P = 1.54 × 10−16, locus RCP = 0.35) assists with turnover of the CRY protein through direct interaction to regulate circadian rhythms (36). Our results were also significant for the overlapping genes PER3 (P = 1.65 × 10−17, locus RCP = 0.08) and VAMP3 (P = 7.317 × 10−18, locus RCP = 0.63). PER3 is one of the Period genes characterized as part of the circadian clock and described in numerous functional studies, animal models, and human polymorphism association studies (37), whereas VAMP3 has little research in chronotype or sleep. VAMP3, in this instance, is likely to be a false positive in the setting of the overlapping gene structure and coregulation.
We also reviewed PhenomeXcan’s performance in associating chronotype traits with well-established circadian rhythm genes that have been identified through functional approaches. In mammals, the transcription factors CLOCK and BMAL1 influence the expression of the Period genes (PER1 and PER2) and the Cryptochrome genes (CRY1 and CRY2). PER3 stabilizes PER1 and PER2 (38). NPAS2 acts as a paralog to CLOCK. All genes demonstrated nominal significance (P < 0.01) with at least one chronotype trait in PhenomeXcan except CRY2 (strongest association P = 0.11) and CLOCK (strongest association P = 0.08). Except for PER1 (locus RCP = 0.24) and NPAS2 (locus RCP = 0.12), all genes showed locus RCP <0.1.
PhenomeXcan, to our knowledge, is one of the first hypothesis-generating tools to provide unbiased links between a trait and associated genes for the researcher’s evaluation. In conjunction with rich knowledge obtained from functional studies, PhenomeXcan can be used to generate or support subsequent translational efforts.
We next evaluate PhenomeXcan as a platform to study novel and underreported gene-trait associations. Thyroid peroxidase (TPO) encodes a membrane-bound glycoprotein that plays a crucial role in thyroid gland function (39). The strongest associations in PhenomeXcan support the known role of TPO in thyroid hormone production: “Self-reported hypothyroidism or myxedema” (P = 1.40 × 10−14, locus RCP = 0.99) and “Treatment with levothyroxine” (P = 1.54 × 10−10, locus RCP = 0.99). Hypothyroidism has been clinically linked to increased respiratory symptoms. Although the mechanism for this is not well understood (40), our results suggest that these could be explained by common genetic factors; “Treatment with salmeterol” (a medication used to treat lung disease such as asthma or chronic obstructive pulmonary disease) showed moderate associations with TPO in PhenomeXcan (P = 7.45 × 10−5, locus RCP < 0.1). TPO is also contained in the National Institutes of Health (NIH) Biosystems Pathways for the development of pulmonary dendritic cells (41). “Time to complete round” (drawing as a measure of cognitive function) showed another moderate association in PhenomeXcan (P = 1.19 × 10−4, locus RCP < 0.1). Thyroid function has been clinically linked to time to draw a clock as a form of cognitive measurement (42). Other trait associations identified in PhenomeXcan with TPO include “Single major depression episode” (P = 2.48 × 10−4, locus RCP < 0.1) and “Treatment with doxazosin” (a medication used in the United Kingdom for hypertension) (P = 8.80 × 10−4, locus RCP < 0.1), both of which have demonstrated clinical association with thyroid abnormalities (43, 44). When reviewing thyroid dysfunction traits in PhenomeXcan, TPO is among the 35 most significantly associated genes, with the others primarily involved in immune regulation or the hypothalamic-pituitary-thyroid axis. To our knowledge, depression and doxazosin use have not been deeply investigated with TPO previously, highlighting how PhenomeXcan may be useful in expanding gene-trait association studies and functional studies through consideration of independent traits associated with a given gene.
Revealing complex clusters of pleiotropy and polygenicity for translational hypotheses
PhenomeXcan allows more complex investigation of associated genes and traits beyond individual queries. As an example, to study genes associated with white blood cell count, we can cluster related genes and traits. Starting from the trait “Lymphocyte percentage,” the top associated genes include PSMD3, CD69, KLF2, CXCL2, CREB5, CXCL3, ZFP36L2, JAZF1, NCOR1, and TET2. These genes represent pathways associated with chemokine and interleukin signaling as well as peptide ligand binding but are not specific to one particular pathway or genomic location (45). We can assess these genes’ associations with white blood cell traits (neutrophil count/percentage, lymphocyte count/percentage, eosinophil count/percentage, and monocyte and basophil percentages) and infer some understanding of their causal mechanism. PSMD3, for instance, demonstrates stronger associations with neutrophil and lymphocyte traits (mean P < 1 × 10−30, mean locus RCP = 0.50), whereas ZFP36L2 demonstrates consistent associations across white blood cell, platelets, and red blood cell traits (mean P < 1.54 × 10−24, mean locus RCP = 0.36) (Fig. 4). Disruption of ZFP36L2 results in defective hematopoiesis in mice (46), whereas PSMD3 has been identified in GWAS related to white blood cell count and inflammatory states (47). Clusters of associated genes and traits can support more robust translational hypotheses through similarities in associations and generate more nuanced experimental designs through differences between associations.
Discovering links between common traits and rare diseases
PhenomeXcan can also be integrated with any gene-trait databases to study pleiotropically linked traits and shared associated genes. We integrated PhenomeXcan with ClinVar, a publicly available archive of rare human diseases and associated genes (including OMIM) and one of the most widely used gene-trait databases in the clinical setting (48). We examined the associations between the 4091 GWAS-derived traits in PhenomeXcan and 5094 ClinVar diseases by (i) calculating PrediXcan Z scores for every gene-trait association in PhenomeXcan and (ii) for each PhenomeXcan/ClinVar trait pair, we computed the average squared PrediXcan Z score considering the genes reported in the ClinVar trait (see Materials and Methods). We then created a matrix of PhenomeXcan traits by ClinVar traits with mean squared Z scores (Fig. 5, A and B), where peaks represent shared genes. We defined significant associations between traits as those with Z score >6; this represents the equivalent of a Bonferroni-adjusted P value of 0.05 based on our map of the distribution of Z scores (fig. S3).
As an example, we found links between the ClinVar trait “Parkinson disease 15” and the following traits: mean corpuscular volume, mean reticulocyte volume, and mean spherical red cell volume (Fig. 5C). The gene linked to Parkinson disease 15 in ClinVar is FBXO7. The mean Z score across eight red blood cell traits was 21.14; the mean locus RCP was 0.84 with P values all <1 × 10−30. FBXO7 plays a role in the ubiquitin system; its entry in ClinVar is associated with an autosomal recessive, juvenile-onset form of Parkinson’s disease (49). Three GWAS [the HaemGen consortium, eMERGE (Electronic Medical Records and Genomics), and van der Harst et al.] link FBXO7 with red blood cell attributes including mean corpuscular volume and mean cell hemoglobin (50–52). At least one mouse model describes defective erythropoiesis and red blood cell changes due to induced mutations in FBXO7 (53). Through PhenomeXcan, we found a pleiotropic relationship between Parkinson’s disease and red blood cell traits mediated through FBXO7 that has not been studied in humans. The nearest adjacent genes, SYN3 and BPIFC, are unlikely to be separately affecting red blood cells; they have no published association to red blood cells and demonstrate mean locus RCPs with red blood cell traits in PhenomeXcan of 0.55 and 0, respectively. Validating this finding, one mouse model specifically studies the pleiotropy of FBXO7 on both parkinsonism and red blood cell traits (54). This case study demonstrates how this powerful variation on PhenomeXcan can substantially improve translational hypothesis generation by supporting genetic links between associated rare diseases and common traits across research platforms.
Identification of potential therapeutic drug targets and related adverse effects
PhenomeXcan offers direct translational applicability, providing genomic evidence to support therapeutic targets and associated side effects. As an example, PCSK9 is a genetically supported, clinically validated target for cardiac prevention through inhibition of its binding to the low-density lipoprotein (LDL) receptor and reduction of blood LDL cholesterol levels (55). We can study the cluster of genes and traits produced by PCSK9 in PhenomeXcan for relevant information about this target. Most of the traits with strongest associations to PCSK9 relate to diagnosis and treatment of elevated cholesterol or atherosclerosis, including familial heart disease. Because inherited PCSK9 variation is associated with increased likelihood of type 2 diabetes, there was concern that PCSK9 therapies could elevate risk to type 2 diabetes. The inhibiting drugs therefore required large substudies from clinical trials to confirm no association with worse diabetes (56, 57). While not at genome-wide significance, PCSK9 has a negative association with type 1 diabetes in PhenomeXcan (P = 8.2 × 10−4, locus RCP < 0.1), consistent with the clinical concern that down-regulation of the gene could lead to increased diabetes risk. We recognize that type 1 and type 2 diabetes have different clinical etiologies. For the purpose of drug development, though, assessing PCSK9 in PhenomeXcan produces both its primary target (blood cholesterol levels as related to atherosclerosis) and, through independently identified traits, potential adverse effects via diabetes. The most commonly represented genes associated with the strongest traits for PCSK9 include APOE, LDLR, APOB, PSRC1, CELSR2, SORT1, ABCG8, ABCG5, and HMGCOR. Unsurprisingly, all of these genes have all been implicated in genetic susceptibility to hypercholesterolemia (some, such as SORT1, may be the primary causative gene in their pathway) (58). Examining potential targets in PhenomeXcan could not only help anticipate side effects via independent traits but also identify related gene networks or alternative targets with therapeutic relevance.
Here, we introduce PhenomeXcan, an innovative, powerful resource that makes comprehensive gene-trait associations easily accessible for hypothesis generation. Using PrediXcan allows us to derive gene-based associations with traits in context by integrating GWAS summary statistics with transcriptome-wide predicted expression and regulatory or functional information. We previously demonstrated that integrated analysis using PrediXcan and colocalization improves precision and power for target gene identification (24). To build PhenomeXcan, we also develop a novel, rapid colocalization method, fastENLOC, that could handle data at this scale (4091 traits × 22,515 genes × 49 tissues) (see Materials and Methods). PhenomeXcan implements the best practices derived from applying GTEx v8 (21, 59) to biologically prioritize genes with possible causal contribution to a given trait.
PhenomeXcan’s flexible structure and adaptability allow translational researchers to easily explore clinically relevant questions. The resource can be queried by gene or trait and allows identification of novel and underrepresented associations. It offers exploration of polygenicity and pleiotropy dimensions by allowing for queries across multiple genes and traits. It can also be integrated with other gene-trait datasets to explore linked traits and report common associated genes. We offer ClinVar as an example, but any deeply annotated database of genes and traits, including molecular or biological traits, may be integrated in this manner. Other possible translational uses of PhenomeXcan include biomarker exploration, identification of clinically relevant disease modifiers, and polygenic score building (using genes associated with queried traits), as well as novel directions for basic science collaborations and clinical study of linked traits (using traits associated with queried genes).
We note some caveats. Diseases with variability not related to changes in gene expression (e.g., epigenetic regulation or traits with important environmental contributions) are not expected to be captured well by this method. With just expression levels, this resource is a starting point, and additional molecular traits, such as microRNA levels, protein levels, and alternative splicing structures, are a priority for us to incorporate as data become available in sufficiently large sample sizes. Our model also better captures common overall genetic contributors rather than genes identified from rare variants. We do note that our validation standards tend to favor larger-effect genes with monogenic etiology, while the PhenomeXcan association method itself is less biased. Regulatory pleiotropy is widespread across the genome (21). In our chronotype example, VAMP3 and PER3 demonstrate regulatory pleiotropy. VAMP3, from our findings associated with chronotype, is likely to be a false positive because of coregulation of both genes by causal variants. With that degree of proximity, large-scale tools are not able to well distinguish causal genes, exemplifying the need for additional functional data to determine the causality of the gene (21). We discuss this finding to acknowledge how PhenomeXcan encounters this phenomenon and show the benefit of performing these associations across all human genes. We offer colocalization as a possible means of prioritizing causal variants, but significance of association, colocalization, and coregulatory sites must be taken into account in our results. Work from large-scale statistical genetics tools, such as PhenomeXcan, and Mendelian genetics and functional studies must then be combined to best understand the breadth of genetic contributors to complex traits. We have favored a locus RCP threshold of 0.1 to limit false negatives related to colocalization. Poor RCP (locus RCP ~ 0) may reflect a lack of sufficient evidence with available data, particularly for understudied genes, rather than true lack of causality. We therefore reported traits in this paper that had a locus RCP <0.1 but had functional support for potential association. Similarly, the genome-wide threshold of significance is conservative, and we discuss associations with functional support even with less significant P values. GWAS summary statistics used in this project were for participants and patients of European ancestry. Improving the applicability of this type of work to global populations remains of paramount importance throughout genetic medicine, and we will continue to integrate more GWAS summary statistics from broader consortia.
Resources that translate biologically relevant genomic and transcriptomic information into gene-trait associations are already critical for hypothesis generation and clinically relevant research (60). We offer PhenomeXcan, an integrated mapping for the function of every human gene, as a publicly available resource to advance the investigation of complex human diseases by improving the accessibility of relevant links between the entire genome and the phenome.
MATERIALS AND METHODS
PrediXcan and S-MultiXcan
S-MultiXcan is a method in the PrediXcan family (18) that associates genes and traits by testing the mediating role of gene expression variation in complex traits but (i) requires only GWAS summary statistics and (ii) uses multivariate regression to combine expression information across tissues (22). First, linear prediction models of genotype in the vicinity of the gene to expression are trained in reference transcriptome datasets such as the GTEx project (21). Second, predicted expression based on actual genetic variation is correlated to the trait of interest to produce a gene-level association result for each tissue. In S-MultiXcan, the predicted expression is a multivariate regression of expression across multiple tissues. To avoid collinearity issues and numerical instability, the model decomposes the predicted expression matrix into principal components and keeps only the eigenvectors of non-negligible variance. We considered a principal components analysis regularization threshold of 30 to be a conservative choice. This approach improves detection of associations relative to use of one tissue type alone and offers a reduced false-negative rate relative to a Bonferroni correction. We used optimal prediction models based on the number and proportion of colocalized gene-level associations (24). These models select features based on fine mapping (25) and weights using eQTL effect sizes smoothed across tissues using mashr (59). The result of this approach is a genome-wide gene-trait association list for a given trait and GWAS summary statistic set.
Colocalization of GWAS and eQTL signals
Bayesian fine mapping was performed using TORUS (27). We estimated probabilities of colocalization between GWAS and cis-eQTL signals using Bayesian RCP, as described in the ENLOC (enrichment estimation aided colocalization analysis) methodology (23). For this particular study, given the large scale of the data, we developed a novel implementation, entitled fastENLOC. fastENLOC was applied for all trait-tissue pairs, and the maximum colocalization probability across all tissues was used, thus obtaining a single RCP value for each gene-trait pair. This aggregation of RCP values across tissues allowed us to combine results from fastENLOC and S-MultiXcan.
Validation of PhenomeXcan using PheWAS catalog, OMIM, and type 2 diabetes reported genes
We evaluated the accuracy of gene-trait associations in PhenomeXcan by using two different gene-trait association datasets (PheWAS catalog and OMIM) as well as genes linked with functional evidence with type 2 diabetes (T2D) according to the AMP T2D. We then derived the receiver operating characteristic curve (ROC) and PR curves for PrediXcan and fastENLOC independently and a combination of both.
We mapped traits from PhenomeXcan to those in either PheWAS catalog (28) or OMIM (30) by using the HPO (29) and the GWAS catalog as intermediates. For traits in the PheWAS catalog, we tested 2202 gene-trait associations that could be mapped in both PhenomeXcan and the PheWAS catalog, from a total 19,119 gene-traits associations consisting of all genes present in an LD block with GWAS signal. For the OMIM traits, we developed a standard (table S2) of 7809 high-confidence gene-trait associations that could be used to measure the performance of PhenomeXcan, of which 125 presented in the LD block of GWAS signal so those were included in the analysis. This standard, as described in our recent work (24), was obtained from a curated set of trait-gene pairs from the OMIM database by mapping traits in PhenomeXcan to those in OMIM. Briefly, traits in PhenomeXcan were mapped to the closest phecode using the GWAS catalog–to–phecode map proposed in (28). As disease description in OMIM has been mapped to the HPO (29), we created a map from phecodes to terms in HPO, which allowed us to link our GWAS traits to OMIM disease description by using phecodes and HPO terms as intermediate steps. For each gene-trait pair considered causal in this standard, we determined whether PhenomeXcan identified that association as significant on the basis of the resulting P value. The OMIM-based standard is publicly available through R package (https://github.com/hakyimlab/silver-standard-performance).
For T2D, we obtained a list of predicted effector transcripts identified by AMP T2D and used 76 genes categorized as “causal,” “strong,” or “moderate” as our gold standard for evaluation (table S3). As we did for OMIM and PheWAS catalogs, 20 of these causal genes could be mapped in PhenomeXcan, from a total of 5036 genes present in an LD block with GWAS signal. We used seven traits highly related to T2D: International Classification of Diseases 10 codes E11 and E14, Self-reported type 2 diabetes (data-field 20002 in UK Biobank with code 1223), and four phenotypes manually curated by the FinnGen Consortium (type 2 diabetes without complications, type 2 diabetes with ophthalmic complications, type 2 diabetes, and type 2 diabetes with peripheral circulatory complications); then, we took the maximum Z score obtained (for MultiXcan) and the maximum RCP (for fastENLOC) across the seven T2D traits for each gene evaluated. The results are shown in Fig. 3 and fig. S2. Notice that multiple testing is not an issue, since for the performance curves, we are not using a significance threshold, but all levels are assessed in terms of the false-positive and true-positive rates.
Supporting evidence for PhenomeXcan results
PhenomeXcan results for case studies were included on the basis of their P values and locus RCP. We defined putative causal gene contributors as those genes with P values less than 5.49 × 10−10 and locus RCP >0.1. Given these conservative measures, however, we did discuss associations that were less significant or had a lower locus RCP with functional evidence. We used the NHGRI-EBI GWAS catalog (21 October 2019) to identify GWAS results both using the UK Biobank (given the predominance of this dataset in PhenomeXcan) and other datasets. We performed systematic literature searches on PubMed using the gene name alone, with the specific trait category and trait name to identify functional studies relevant to a trait of interest.
Building PhenomeXcan × ClinVar
We examined links between 4091 PhenomeXcan traits and 5094 ClinVar traits and associated genes. ClinVar traits were excluded if they did not have known associated genes in PhenomeXcan. To compare a PhenomeXcan trait t and a ClinVar trait d, we calculated the mean squared Z score
where k is the number of genes reported in ClinVar for trait d and Z is the Z score of gene i obtained with S-MultiXcan for trait t. We then created a matrix of PhenomeXcan traits by ClinVar traits with mean squared Z scores. We defined significant associations between traits as those with Z score >6; this represents the equivalent of a Bonferroni-adjusted P value of 0.05 based on our map of the distribution of Z scores (fig. S3).
- Copyright © 2020 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).