%\VignetteIndexEntry{SKAT}
\documentclass[11pt]{article}


\setlength{\topmargin}{-0.25in}
\setlength{\oddsidemargin}{0.15in}

\setlength{\textwidth}{6.5in}
\setlength{\textheight}{8.6in}


\usepackage{amsmath}
\usepackage{amscd}
\usepackage[tableposition=top]{caption}
\usepackage{ifthen}
\usepackage[utf8]{inputenc}

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{SKAT Package}
\author{Seunggeun (Shawn) Lee}
\maketitle

\section{Overview}
The SKAT package provides functions to 1) test for associations between SNP sets
and continuous/binary phenotypes while adjusting for covariates and kinship, and 2) compute power/sample size for future studies.


\section{Association test}

The example dataset (SKAT.example) contains a genotype matrix (Z) of 2000 individuals and 67 SNPs,
vectors of continuous (y.c) and binary (y.b) phenotypes, and a covariates matrix (X).

<<data>>=
library(SKAT)
data(SKAT.example)
names(SKAT.example)

attach(SKAT.example)
@

To test for associations, the SKAT\_Null\_Model function should be run prior to SKAT to estimate parameters under the null model of no association.

<<SKAT1>>=
# continuous trait 
obj<-SKAT_Null_Model(y.c ~ X, out_type="C")
out.c<-SKAT(Z, obj)
out.c$p.value

# dichotomous trait 
obj<-SKAT_Null_Model(y.b ~ X, out_type="D")
out.b<-SKAT(Z, obj)
out.b$p.value

@


The object returned by SKAT contains additional information, such as the number of markers in Z and the number of markers used for the test. Since version 2.1.0, the element test.snp.mac stores the minor allele count (MAC) of each marker used in the test.
<<SKAT111>>=

out.c$param

out.c$test.snp.mac

@

When the trait is binary and the sample size is small,
SKAT can produce conservative results. We developed a moment-matching adjustment (MA)
that adjusts the asymptotic null distribution by estimating the empirical variance and kurtosis.
By default, SKAT conducts the MA adjustment when the sample size is $< 2000$.
In the following code, we use only 200 samples to run SKAT.

<<SKAT11>>=

IDX<-c(1:100,1001:1100)	
# With-adjustment
obj.s<-SKAT_Null_Model(y.b[IDX] ~ X[IDX,],out_type="D")
SKAT(Z[IDX,], obj.s, kernel = "linear.weighted")$p.value

@

If you do not want to use the adjustment, set Adjustment=FALSE in the SKAT\_Null\_Model function.

<<SKAT12>>=
# Without-adjustment
obj.s<-SKAT_Null_Model(y.b[IDX] ~ X[IDX,],out_type="D", Adjustment=FALSE)
SKAT(Z[IDX,], obj.s, kernel = "linear.weighted")$p.value
@


Resampling-based approaches to adjust for binary traits have been developed and
implemented in the SKATBinary function. When you use the SKATBinary function, Adjustment=TRUE in SKAT\_Null\_Model is not necessary.
The implemented methods are 1) efficient resampling (ER); 2) ER with adaptive resampling (ER.A);
3) quantile-adjusted moment matching (QA); 4) moment-matching adjustment (MA);
5) no adjustment (UA); and 6) Hybrid.
``Hybrid'' (the default method) selects a method based on the total minor allele count (MAC), the number of individuals with minor
alleles (m), and the degree of case-control imbalance.
A detailed description of these methods can be found in the following reference: \\

Lee, S., Fuchsberger, C., Kim, S., Scott, L. (2016)
An efficient resampling method for calibrating single and gene-based rare variant association analysis in case–control studies. \textit{Biostatistics}, 17(1): 1-15.
  
<<SKAT22>>=

# default hybrid approach 
out<-SKATBinary(Z[IDX,], obj.s, kernel = "linear.weighted")
out$p.value

@
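
The adjustment method can also be selected explicitly. A minimal sketch, assuming that the method.bin argument of SKATBinary selects the adjustment method (here ER, efficient resampling):

<<SKAT22b>>=

# Efficient resampling (ER), selected explicitly
out.er<-SKATBinary(Z[IDX,], obj.s, method.bin="ER")
out.er$p.value

@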

We recently developed a more scalable and accurate method for binary traits, which is implemented in the SKATBinary\_Robust function. A detailed description of this method can be found in the following reference: \\

Zhao, Z., Bi, W., Zhou, W., VandeHaar, P., Fritsche, L.G., Lee, S. (2020)
UK Biobank Whole-Exome Sequence Binary Phenome Analysis with Robust Region-based Rare-Variant Test. \textit{AJHG}, 106: 3-12, doi:https://doi.org/10.1016/j.ajhg.2019.11.012

 
<<SKAT23>>=

# Robust approach
out<-SKATBinary_Robust(Z[IDX,], obj.s, kernel = "linear.weighted")
out$p.value

@


\subsection{Assign weights for each SNP}

It is assumed that rarer variants are more likely to be causal variants with large effect sizes.
To incorporate this assumption, the linear weighted kernel uses a weighting scheme and is formulated as
$Z W W Z'$, where $Z$ is a genotype matrix, and $W = diag \{ w_1, \ldots, w_m \}$ is a weight matrix.
In the previous examples, we used the default beta(1,25) weight,
$w_i = dbeta(p_i, 1, 25) $, where $dbeta$ is the beta density function,
and $p_i$ is the minor allele frequency (MAF) of SNP $i$.
Different parameters for the beta weight can be used by changing weights.beta.
For example, weights.beta=c(0.5,0.5) corresponds to the Madsen and Browning weight.


<<SKAT3>>=
SKAT(Z, obj, kernel = "linear.weighted", weights.beta=c(0.5,0.5))$p.value
@
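
To see how the default beta(1,25) weight behaves, you can evaluate the beta density directly; a small illustration using base R's dbeta with arbitrarily chosen MAF values:

<<SKAT3a>>=
# Default beta(1,25) weights for a few MAF values;
# rarer variants receive larger weights
maf<-c(0.001, 0.01, 0.05, 0.2)
dbeta(maf, 1, 25)
@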

You can supply your own weight vector using the weights parameter. For the logistic weight, we provide a function to generate the weights.

<<SKAT4>>=
# Shape of the logistic weight

MAF<-1:1000/1000
W<-Get_Logistic_Weights_MAF(MAF, par1=0.07, par2=150)
par(mfrow=c(1,2))
plot(MAF,W,xlab="MAF",ylab="Weights",type="l")
plot(MAF[1:100],W[1:100],xlab="MAF",ylab="Weights",type="l")
par(mfrow=c(1,1))  # reset the plotting layout

# Use logistic weight
weights<-Get_Logistic_Weights(Z, par1=0.07, par2=150)
SKAT(Z, obj, kernel = "linear.weighted", weights=weights)$p.value
@


\subsection{SKAT-O: Combined test of the burden test and SKAT}

A test statistic of the combined test is
$$Q_{\rho} = (1-\rho) Q_S + \rho Q_B,$$
where $Q_S$ is a test statistic of SKAT, and $Q_B$ is a score test statistic of the burden test. 
The $\rho$ value can be specified by using the r.corr parameter (default: r.corr=0).

<<SKAT41>>=
#rho=0, SKAT
SKAT(Z, obj, r.corr=0)$p.value

#rho=0.9
SKAT(Z, obj, r.corr=0.9)$p.value

#rho=1, Burden test
SKAT(Z, obj, r.corr=1)$p.value
@


If method=``optimal.adj'' or ``SKATO'' (the two are equivalent), the SKAT-O method will be performed,
which computes p-values with eight different values of $\rho=(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1)$
and then uses the minimum p-value as a test statistic.
If you want to use the original implementation of SKAT-O, use method=``optimal'', which uses
eleven equally spaced $\rho$ values from 0 to 1 as a grid of $\rho$s.
We recommend using ``SKATO'' or ``optimal.adj'', since they provide better type I error control.

<<SKAT42>>=

#Optimal Test
SKAT(Z, obj, method="SKATO")$p.value

@
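
For comparison, the original implementation with the eleven-point grid can be run with method=``optimal'':

<<SKAT42b>>=

#Original SKAT-O implementation
SKAT(Z, obj, method="optimal")$p.value

@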

\subsection{Combined test of common and rare variants}

It is possible that both common and rare variants are associated with phenotypes.
To test for the combined effects of common and rare variants, the SKAT\_CommonRare function can be used.
A detailed description of the combined test can be found in the following reference: \\

Ionita-Laza, I., Lee, S., Makarov, V., Buxbaum, J., Lin, X. (2013).
Sequence kernel association tests for the combined effect of rare and common variants.
\emph{AJHG}, 92(6):841-53.  \\

<<SKAT43>>=
# Combined sum test (SKAT-C and Burden-C)

SKAT_CommonRare(Z, obj)$p.value
SKAT_CommonRare(Z, obj, r.corr.rare=1, r.corr.common=1 )$p.value

# Adaptive test (SKAT-A and Burden-A)

SKAT_CommonRare(Z, obj, method="A")$p.value
SKAT_CommonRare(Z, obj, r.corr.rare=1, r.corr.common=1, method="A" )$p.value

@


\subsection{Impute missing genotypes}

If there are missing genotypes, SKAT automatically imputes them
based on Hardy-Weinberg equilibrium.
You can choose among ``bestguess'', ``fixed'', and ``random''.
``bestguess'' imputes missing genotypes with the most likely genotype value (0, 1, or 2);
``fixed'' imputes missing genotypes with the mean genotype value ($2p$, where $p$ is the MAF);
and ``random'' imputes missing genotypes with Binomial(2, $p$) random draws. The default imputation
method is ``fixed'' for the SKAT function and ``bestguess'' for the SKATBinary function.

<<SKAT5>>=
# Assign missing 
Z1<-Z
Z1[1,1:3]<-NA

# bestguess imputation
SKAT(Z1,obj,impute.method = "bestguess")$p.value

# fixed imputation
SKAT(Z1,obj,impute.method = "fixed")$p.value

# random imputation
SKAT(Z1,obj,impute.method = "random")$p.value


@

\subsection{Resampling}

The SKAT package provides functions that carry out resampling to
compute empirical p-values and to control the family-wise error rate (FWER).
Two different resampling methods are implemented.
``bootstrap'' conducts a parametric bootstrap that resamples residuals from $H_0$
while adjusting for covariates; when there are no covariates, ``bootstrap'' is equivalent to
a permutation test. ``perturbation'' perturbs the residuals by multiplying them by
standard normal random variables. The default method is ``bootstrap''.
Note that since version 0.7 the ``perturbation'' method is no longer provided.


<<SKAT6>>=
# parametric bootstrap
obj<-SKAT_Null_Model(y.b ~ X, out_type="D", n.Resampling=5000, 
type.Resampling="bootstrap")

# SKAT p-value
re<- SKAT(Z, obj, kernel = "linear.weighted")
re$p.value	# SKAT p-value
Get_Resampling_Pvalue(re)	# get resampling p-value

detach(SKAT.example)
@

When there are many genes/SNP sets to test,
resampling methods can be used to control the family-wise error rate.
Examples are provided in the next section.

\subsection{Adjust for kinship}

If related individuals exist in your data, you need to adjust for kinship.
The SKAT\_NULL\_emmaX function uses a linear mixed model (EMMAX) to estimate the variance component,
which is subsequently used to adjust for kinship. For the kinship adjustment,
the SKAT\_NULL\_emmaX function should be used instead of SKAT\_Null\_Model.

<<SKATKin1>>=
data(SKAT.fam.example)
attach(SKAT.fam.example)

# K: kinship matrix 
obj<-SKAT_NULL_emmaX(y ~ X, K=K)
SKAT(Z, obj)$p.value

# SKAT-O
SKAT(Z, obj, method="SKATO")$p.value	

detach(SKAT.fam.example)
@

\subsection{X chromosome test}

Since males have only one copy of the X chromosome, special care is needed to test for associations on the X chromosome.
We have developed methods for region-based rare-variant tests on the X chromosome with and without X-inactivation.
To use them, fit the null model with SKAT\_Null\_Model\_ChrX and run the association tests with SKAT\_ChrX.
A detailed description of the X chromosome association tests can be found in the following reference: \\

Ma, C., Boehnke, M., Lee, S., the GoT2D Investigators (2015) Evaluating the Calibration and Power of Three Gene-based Association Tests of Rare Variants for the X Chromosome, \textit{Genetic Epidemiology}, 39 (7): 499-508.

<<SKATX1>>=
data(SKAT.example.ChrX)
attach(SKAT.example.ChrX)

Z = SKAT.example.ChrX$Z
#############################################################
#	Compute the P-value of SKAT 

# binary trait
obj.x<-SKAT_Null_Model_ChrX(y ~ x1 + x2 + Gender, SexVar="Gender",
out_type="D", data=SKAT.example.ChrX)

# run SKAT-O
SKAT_ChrX(Z, obj.x, method="SKATO")$p.value



@

For the Y chromosome, you can use the same null model function as for X with Model.Y=TRUE.
The p-value can be calculated with the SKAT\_ChrY function.
The following example uses the same genotype matrix as before to show how these functions can be used.

<<SKATY1>>=

#############################################################
#	Compute the P-value of SKAT 

# binary trait
obj.x<-SKAT_Null_Model_ChrX(y ~ x1 + x2 + Gender, SexVar="Gender",
out_type="D", data=SKAT.example.ChrX, Model.Y=TRUE)

# run SKAT-O
SKAT_ChrY(Z, obj.x, method="SKATO")$p.value
detach(SKAT.example.ChrX)
@


\section{Plink Binary format files}

For genome-wide data analysis, plink binary format files can be used with SKAT.
To use plink files, you need the plink bed, bim, and fam files, plus your own SetID file
that contains the SNP set information.
Example files can be found on the SKAT/MetaSKAT google group page.

<<SKAT_B1>>=
# To run this code, first download and unzip example files

##############################################
# 	Generate SSD file

# File names
File.Bed<-"./Example1.bed"
File.Bim<-"./Example1.bim"
File.Fam<-"./Example1.fam"
File.SetID<-"./Example1.SetID"
File.SSD<-"./Example1.SSD"
File.Info<-"./Example1.SSD.info"

# To use plink binary format files, you first have to generate an SSD file.
# If you already have an SSD file, you do not need to call this function.
Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
@

Now you can open the SSD and Info files and run SKAT.

<<SKAT_B2>>=
FAM<-Read_Plink_FAM(File.Fam, Is.binary=FALSE)
y<-FAM$Phenotype

# To use an SSD file, please open it first. After you finish using it, you must close it.
 
SSD.INFO<-Open_SSD(File.SSD, File.Info)

# Number of samples 
SSD.INFO$nSample 

# Number of Sets
SSD.INFO$nSets

obj<-SKAT_Null_Model(y ~ 1, out_type="C")
@

<<SKAT_B21,echo=TRUE, results=hide>>=
out<-SKAT.SSD.All(SSD.INFO, obj)
@

<<SKAT_B22>>=
out
@

If you have a plink covariate file, the Read\_Plink\_FAM\_Cov function can be used to read both the FAM and covariate files.

<<SKAT_B2Cov>>=
File.Cov<-"./Example1.Cov"
FAM_Cov<-Read_Plink_FAM_Cov(File.Fam, File.Cov, Is.binary=FALSE)

# First 5 rows
FAM_Cov[1:5,]

# Run with covariates
X1 = FAM_Cov$X1
X2 = FAM_Cov$X2
y<-FAM_Cov$Phenotype

obj<-SKAT_Null_Model(y ~ X1 + X2, out_type="C")
@

<<SKAT_B2Cov1,echo=TRUE, results=hide>>=
out<-SKAT.SSD.All(SSD.INFO, obj)
@

<<SKAT_B2Cov2>>=
out
@

To use custom weights, you need to make a weight file and read it using the
``Read\_SNP\_WeightFile'' function. The weight file should have two columns:
SNP ID and weight value. The object returned by ``Read\_SNP\_WeightFile''
can be passed to the SKAT.SSD functions via the obj.SNPWeight parameter.
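
For illustration, such a weight file could be created with write.table; this is only a sketch, and the SNP IDs below are hypothetical (they must match the SNP IDs in your bim file). A whitespace-delimited file without a header is assumed:

<<SKAT_B2WeightMake, eval=FALSE>>=
# Hypothetical weight file: two columns, SNP ID and weight, no header
w.df<-data.frame(SNPID=c("SNP_1","SNP_2","SNP_3"), Weight=c(1.5, 0.8, 2.0))
write.table(w.df, file="./Example1_Weight.txt", col.names=FALSE,
row.names=FALSE, quote=FALSE)
@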

<<SKAT_B2Weight>>=

# Custom weight
# File: Example1_Weight.txt
obj.SNPWeight<-Read_SNP_WeightFile("./Example1_Weight.txt")
@

<<SKAT_B2Weight1,echo=TRUE, results=hide>>=
out<-SKAT.SSD.All(SSD.INFO, obj, obj.SNPWeight=obj.SNPWeight)
@
<<SKAT_B2Weight2>>=
out
@

The object returned by SKAT.SSD.All contains a data frame ``results''.
You can save it using the write.table function.
<<SKAT_B2Save>>=

output.df = out$results
write.table(output.df, file="./save.txt", col.names=TRUE, row.names=FALSE)

@


If more than one gene/SNP set is tested,
multiple testing should be accounted for to control the family-wise error rate (FWER).
This can be done with the Bonferroni correction; if the gene/SNP sets are correlated, however,
this approach can be conservative. Alternatively, you can control the FWER directly using the resampling method.
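
For reference, the Bonferroni-corrected threshold mentioned above is simply the nominal level divided by the number of sets; a minimal sketch using the already opened SSD info object:

<<SKAT_B3a>>=
# Bonferroni-corrected significance threshold for all sets
alpha.bonf<-0.05/SSD.INFO$nSets
alpha.bonf
@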

<<SKAT_B3,echo=TRUE, results=hide>>=
obj<-SKAT_Null_Model(y ~ 1, out_type="C", n.Resampling=1000, type.Resampling="bootstrap")
out<-SKAT.SSD.All(SSD.INFO, obj)
@

<<SKAT_B31>>=
# No gene is significant when controlling FWER at 0.05
Resampling_FWER(out,FWER=0.05)

# 1 gene is significant when controlling FWER at 0.5
Resampling_FWER(out,FWER=0.5)
@

The ``SKAT.SSD.OneSet'' and ``SKAT.SSD.OneSet\_SetIndex'' functions can be used
to test a single gene/SNP set. Alternatively,
you can obtain a genotype matrix using the ``Get\_Genotypes\_SSD'' function and then run SKAT.

<<SKAT_B4>>=

obj<-SKAT_Null_Model(y ~ 1, out_type="C")

# test the second gene
id<-2
SetID<-SSD.INFO$SetInfo$SetID[id]
SKAT.SSD.OneSet(SSD.INFO,SetID, obj)$p.value
 
SKAT.SSD.OneSet_SetIndex(SSD.INFO,id, obj)$p.value

# test the second gene with the logistic weight.
Z<-Get_Genotypes_SSD(SSD.INFO, id)
weights = Get_Logistic_Weights(Z, par1=0.07, par2=150)
SKAT(Z, obj, weights=weights)$p.value

@

The SKAT\_CommonRare function can also be used with SSD files.

<<SKAT_B5,echo=TRUE, results=hide>>=

# test all genes in SSD file
obj<-SKAT_Null_Model(y ~ X1 + X2, out_type="C")
out<-SKAT_CommonRare.SSD.All(SSD.INFO, obj)
@

<<SKAT_B51>>=
out
@


After you finish using the SSD files, please close them.

<<SKAT_B6>>=
Close_SSD()
@


\subsection{Plink Binary format files: SKATBinary}

SKATBinary functions can also be used with plink formatted files. This section shows
example code. Example plink files can be found on the SKAT/MetaSKAT google group page.

<<SKAT_BB1>>=

# File names
File.Bed<-"./SKATBinary.example.bed"
File.Bim<-"./SKATBinary.example.bim"
File.Fam<-"./SKATBinary.example.fam"
File.Cov<-"./SKATBinary.example.cov"
File.SetID<-"./SKATBinary.example.SetID"
File.SSD<-"./SKATBinary.example.SSD"
File.Info<-"./SKATBinary.example.SSD.info"

# Generate SSD file, and read fam and cov files
# If you already have an SSD file, you do not need to call this function.
Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
FAM<-Read_Plink_FAM_Cov(File.Fam, File.Cov, Is.binary=TRUE, cov_header=FALSE)

# open SSD files
 
SSD.INFO<-Open_SSD(File.SSD, File.Info)

# No adjustment is needed
obj<-SKAT_Null_Model(Phenotype ~ COV1 + COV2, out_type="D", data=FAM, Adjustment=FALSE)
@

<<SKAT_BB2,echo=TRUE, results=hide>>=
# SKAT
out.skat<-SKATBinary.SSD.All(SSD.INFO, obj, method="SKAT")

# SKAT-O
out.skato<-SKATBinary.SSD.All(SSD.INFO, obj, method="SKATO")
@

<<SKAT_BB3>>=
# First 5 variant sets, SKAT
out.skat$results[1:5,]

@

The effective number of tests and QQ plots can be obtained using the minimum achievable p-values (MAP).


<<SKAT_BB4>>=

# Effective number of test is smaller than 30 (number of variant sets)
# Use SKAT results
Get_EffectiveNumberTest(out.skat$results$MAP, alpha=0.05)

# QQ plot
QQPlot_Adj(out.skat$results$P.value, out.skat$results$MAP)

@



\section{Power/Sample Size calculation}


\subsection{Dataset}
The SKAT package provides a haplotype dataset (SKAT.haplotypes)
that contains a haplotype matrix of 10,000 haplotypes over a 200kb region (Haplotype)
and a data frame with information on each SNP.
These haplotypes were simulated using a calibrated coalescent model (cosi), mimicking the
linkage disequilibrium structure of European ancestry samples.
If no haplotype data are available, this dataset can be used to compute power/sample size.

<<data2>>=
data(SKAT.haplotypes)
names(SKAT.haplotypes)

attach(SKAT.haplotypes)
@

\subsection{Power/Sample Size calculation}

The following example uses the haplotypes in SKAT.haplotypes with these parameters:

\begin{enumerate}
\item Subregion length = 5kb (SubRegion.Length=5000)
\item Causal percent = $20 \%$
\item Negative percent = $20 \%$
\item For continuous traits, $\beta = c |log_{10}(MAF)|$ (BetaType = ``Log'') with $\beta = 2$ at MAF = $10^{-4}$ (MaxBeta=2)
\item For binary traits, $log(OR) = c |log_{10}(MAF)|$ (OR.Type = ``Log'') with OR $= 7$ at MAF = $10^{-4}$ (MaxOR=7), and $50 \%$ of samples are cases and $50 \% $ of samples are controls
\end{enumerate}

<<SKAT_P1>>=
set.seed(500)
out.c<-Power_Continuous(Haplotype,SNPInfo$CHROM_POS, SubRegion.Length=5000,    
Causal.Percent= 20, N.Sim=10, MaxBeta=2,Negative.Percent=20)
out.b<-Power_Logistic(Haplotype,SNPInfo$CHROM_POS, SubRegion.Length=5000,   
Causal.Percent= 20, N.Sim=10 ,MaxOR=7, Negative.Percent=20)

out.c
out.b

Get_RequiredSampleSize(out.c, Power=0.8)
Get_RequiredSampleSize(out.b, Power=0.8)

@

In this example, N.Sim=10 was used to obtain results quickly.
When you run a real power calculation, please increase it to more than 100.
When BetaType = ``Log'' or OR.Type = ``Log'', the effect size for continuous traits
and the log odds ratio for binary traits are $c |log_{10}(MAF)|$,
where $c$ is determined by MaxBeta or MaxOR.
For example, $c = 2/4 = 0.5$ when MaxBeta = 2.
In this case, a causal variant with MAF = 0.01 has $\beta = 0.5 \times |log_{10}(0.01)| = 1$.
For binary traits, $c = log(7)/4 = 0.486$ with MaxOR = 7,
and thus a causal variant with MAF = 0.01 has log OR $= 0.972$.

The Power\_Continuous\_R and Power\_Logistic\_R functions can be used to compute power
with non-zero r.corr ($\rho$).
Since these functions use a slightly different method to compute power,
power estimates from Power\_Continuous\_R and Power\_Logistic\_R can be slightly different from
estimates from Power\_Continuous and Power\_Logistic, even when r.corr=0.
If you want to compute the power of SKAT-O by estimating the optimal r.corr, please use r.corr=2.
The estimated optimal r.corr is 
$$
r.corr = p_1^2 ( 2p_2-1)^2,
$$
where $p_1$ is the proportion of nonzero $\beta$s, and $p_2$ is the proportion of negative (or positive) $\beta$s 
among the non-zero $\beta$s. 
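For example, with the settings used below (Causal.Percent=20 and Negative.Percent=20), $p_1 = 0.2$ and $p_2 = 0.2$, so the estimated optimal value is $r.corr = 0.2^2 \times (2 \times 0.2 - 1)^2 = 0.04 \times 0.36 = 0.0144$.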

<<SKAT_P2>>=
set.seed(500)
out.c<-Power_Continuous_R(Haplotype,SNPInfo$CHROM_POS, SubRegion.Length=5000,    
Causal.Percent= 20, N.Sim=10, MaxBeta=2,Negative.Percent=20, r.corr=2)

out.c

Get_RequiredSampleSize(out.c, Power=0.8)

@


\end{document}