---
title: "RFlocalfdr A Simulated Data Example \n"
author: "Rob Dunne"
date: "Wednesday, January 29, 2025"
output: 
    rmarkdown::html_vignette:
      toc: true
      toc_depth: 2
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{RFlocalfdr Simulated Data Example}
  %\VignetteEncoding{UTF-8}	 
bibliography: paper.bib
header-includes:
  \usepackage{dcolumn}
  \usepackage{placeins}
  \usepackage{float}
  \usepackage{biblatex}
  \usepackage{caption}
  \usepackage{subcaption}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width=6, fig.height=6, dpi=300,echo = FALSE)
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")

```
\clearpage

# Introduction


Variable importance measures are often used to perform variable selection. Two final aims can be identified (@genuerVariableSelectionUsing2010):

1. find a small number of variables with maximal accuracy, or 
2. detect and rank all influential variables, as a focus for interpretation and further exploration with domain experts. 

It is the second objective that is more often our aim in biomolecular work. Once the important variables are selected, they
can be joined with other data from the literature, or from databases such as gene annotations. They can also inform further rounds of experiments.
In addition, the reduced data set can be used in other analysis methods, such as the extraction of structural causal models. 

The problem is characterized by:

1. a relatively small $n$ and a $p$ that is typically two to three orders of magnitude larger than $n$; and
2. a large amount of correlation between the variables.


# Discussion and Conclusion

We simulate a classification problem with 6350 variables, of which 127 are related to the class label, and with a strong correlation structure. 
We consider a number of variable ranking and selection methods. We give the results for variable selection, not for the performance of the resulting classifiers.

We consider two settings:

- ranking the variables by a variable importance measure. Here we have taken the point on the ROC curve that
gives the minimum cost (with equal costs for false positives and false negatives), so these are the best results that could be
obtained with these rankings;
- where a method provides a way of thresholding the importances to do variable selection, we have used it and we report
the accuracy of the selected set of variables.
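
The minimum-cost operating point used for the rankings can be computed with **ROCR**; a minimal sketch, using the same calls that recur in the chunks below (here `imp` stands for a vector of importances, and the first 127 variables are the non-null ones):

```{r eval=FALSE, echo=TRUE}
library(ROCR)
labels <- c(rep(1, 127), rep(0, 6350 - 127))  # ground truth: the first 127 variables are non-null
pred <- prediction(imp, labels)
cost_perf <- performance(pred, "cost")        # cost with equal weights for false positives and negatives
cutoff <- pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]
selected <- which(imp > cutoff)
```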

The results are given in the tables below.
Considering the second table, where variables are selected by thresholding an RF importance measure,
RFlocalfdr has done reasonably well on the FDR. Recursive Feature
Elimination achieved a lower FDR, but selected only a small number of
variables.


We could not get many of the SHAP implementations to converge within a 2 1/2 hour run time
limit. Shaff converged and returned one variable.
This is not surprising: the Shapley values are an "all subsets" calculation, which is NP-hard, and all of the 
available methods use some heuristic to get around the hardness of the problem.

As this is quite interesting, we also tried a number of SHAP implementations on the smaller data set that we used for visualizing the problem.
This goes beyond the original intent of a vignette for the **RFlocalfdr** package, but it seemed interesting to pursue.





\clearpage
**Ranking variables by importance**

| Method       | true positives | false positives | Sensitivity | Specificity | FDR   | AUC  |
|--------------|----------------|-----------------|-------------|-------------|-------|------|
| RF impurity  | 55             | 6               | 0.99        | 0.43        | 0.1   | 0.99 |
| RF perm.     | 61             | 3               | 0.99        | 0.48        | 0.045 | 0.74 |
| RF corrected | 87             | 37              | 0.99        | 0.68        | 0.3   | 0.99 |
| cforest      | 61             | 3               | 0.99        | 0.84        | 0.12  | 0.99 |
| SobolMDA     | 61             | 3               | 0.99        | 0.48        | 0.05  | 0.99 |
| Shaff        | 1              | 0               |             |             |       |      |
| fastshap     | -              | -               | -           | -           | -     | -    |
| iml Shapley  | -              | -               | -           | -           | -     | -    |

**Selecting variables based on an RF importance measure**

| Method       | true positives | false positives | Sensitivity | Specificity | FDR   | AUC  |
|--------------|----------------|-----------------|-------------|-------------|-------|------|
| RFlocalfdr   | 59             | 36              | 0.99        | 0.46        | 0.34  | 0.73 |
| Boruta       | 2              | 2               | 0.99        | 0.016       | 0.5   | 0.51 |
| AIR          | 127            | 273             | 0.96        | 1           | 0.68  | 0.98 |
| AIR+locfdr   | 123            | 186             | 0.97        | 0.97        | 0.61  | 0.97 |
| PIMP         | 39             | 556             | 0.91        | 0.31        | 0.58  | 0.98 |
| RFE          | 22             | 2               | 0.99        | 0.17        | 0.08  | 0.58 |

**The small data set (508 variables)**

| Method       | true positives | false positives | Sensitivity | Specificity | FDR   | AUC  |
|--------------|----------------|-----------------|-------------|-------------|-------|------|
| fastshap     | 109            | 18              | 0.97        | 0.858       | 0.1   | -    |
| iml Shapley  | -              | -               | -           | -           | -     | -    |



\clearpage


# The Simulation 

We simulate a small data set with a covariance structure to explore the actions of **RFlocalfdr**.
This package implements an empirical Bayes method of determining a significance level for the Random Forest 
MDI (Mean Decrease in Impurity) variable importance measure. 
See @dunneThresholdingGiniVariable2022 for details.

We 

- simulate the data set
- fit a Random Forest model
- use RFlocalfdr to estimate the significant variables
- for comparison, we also estimate the significant variables using

   - Boruta (@Kursa.and.Rudnicki.2010)
   - Recursive Feature Elimination
   - AIR (@Nembrini.et.al.2018)
   - PIMP (@Altmann.et.al.2010).


## data setup


This dataset consists of _bands_, each containing _blocks_ of $\{1, 2, 4, 8, 16, 32, 64\}$
identical variables. The variables take values in $\{0,1,2\}$, a common encoding for
genomic data where the numbers represent the number of copies of the minor allele. Only band 1 is used to calculate the
$y$ vector: $y$ is 1 if any of $X[, c(1, 2, 4, 8, 16, 32, 64)]$ (the first variable of each block in band 1) is equal to 2.
As $P(y = 0) = (2/3)^7 \approx 0.06$, $y$ is unbalanced, containing many more 1's than 0's. 
We plot a small data set of $300\times  508$ to explore the data generation method, see figure \@ref(fig:simulation2).

For the problem we generate a data set with 50 bands and 300 observations so $X$ is $300 \times 6350$ with 127 non-null variables.  We fit a
standard RF to this dataset and record the resulting MDI importance score.

```{r eval=FALSE,echo=TRUE}
library(ranger)
library(RFlocalfdr)
library(caret)
library(pROC)
library(ROCR)

```
```{r ,  eval=FALSE,echo=FALSE}
packageDescription("RFlocalfdr")$GithubSHA1
#source("/home/dun280/Dropbox/R_libraries/RF_localfdr/RFlocalfdr/R/my.pimp.s")

```
```{r eval=FALSE,echo=FALSE}
#just plot a small data set to show the structure
set.seed(13)
num_samples <- 300
num_bands <- 4
band_rank <- 6
num_vars <- num_bands * (2 ** (band_rank+1) -1)
print(num_vars)

# record (band, rank, index) for each variable
temp <- matrix(0, num_vars, 3)
var_index <- 1
for(band in 1:num_bands) {
     for (rank in 0:band_rank) {
         for (i in 1:2**rank) {
           temp[var_index,]<-c(band , rank, var_index)
             var_index <- var_index +1 
         }
     }
}

#png("./supp_figures/small_simulated_data_set.png")
plot(temp[,1],ylim=c(0,10),type="p") 
lines(temp[,2],type="b",col="red")
legend(0,10,c("band","rank"),pch=1,col=c(1,2))
#dev.off()
abline(v=c( 1,  2 , 4 , 8 ,16 ,32, 64 ))

table(temp[temp[,1]==1,2])
## # 0  1  2  3  4  5  6 
## # 1  2  4  8 16 32 64


X <- matrix(NA, num_samples , num_vars)
set.seed(123)
  
var_index <- 1
for(band in 1:num_bands) {
    for (rank in 0:band_rank) {
        # draw one variable and repeat it 2^rank times, giving a block of identical variables
        var <- sample(0:2, num_samples, replace=TRUE)
        for (i in 1:2**rank) {
            X[,var_index] <- var
            var_index <- var_index + 1 
        }
    }
}

y <- as.numeric(X[,1] > 1 |  X[,2] > 1  |  X[,4] > 1 |  X[,8] > 1 |  X[,16] > 1 |
                X[,32] > 1 | X[,64] > 1 )



data <- cbind(y,X)
colnames(data) <- c("y",make.names(1:num_vars))
    rfModel <- ranger(data=data,dependent.variable.name="y", importance="impurity",
                      classification=TRUE,  mtry=100,num.trees = 10000, replace=TRUE,
                      seed  = 17 )
zz2 <-log(importance(rfModel))

plot(zz2,type="b",col=temp[,1]+1)

roccurve <- roc(c(rep(1,127),rep(0,508-127)),zz2)
plot(roccurve)
auc(roccurve) # 0.993

```

```{r simulation2, echo=FALSE, fig.cap="A small simulated data set. Each band contains blocks of size {1, 2, 4, 8, 16, 32, 64}, and each block consists of correlated (identical) variables.", fig.align="center", out.width = '50%'}
knitr::include_graphics("./supp_figures/small_simulated_data_set.png")
``` 


```{r eval=FALSE,echo=TRUE}
# generate the data
set.seed(13)
num_samples <- 300
num_bands <- 50
band_rank <- 6
num_vars <- num_bands * (2 ** (band_rank+1) -1)
print(num_vars)

X <- matrix(NA, num_samples , num_vars)
set.seed(123)
  
var_index <- 1
for(band in 1:num_bands) {
    for (rank in 0:band_rank) {
        # draw one variable and repeat it 2^rank times, giving a block of identical variables
        var <- sample(0:2, num_samples, replace=TRUE)
        for (i in 1:2**rank) {
            X[,var_index] <- var
            var_index <- var_index + 1 
        }
    }
}

y <- as.numeric(X[,1] > 1 |  X[,2] > 1  |  X[,4] > 1 |  X[,8] > 1 |  X[,16] > 1 |
                X[,32] > 1 | X[,64] > 1 )


```

## Random forest

### parameter setting for a RF

- **ranger** sets **mtry** to \(\lfloor \sqrt{p} \rfloor \) by default
- **randomForest** sets **mtry** to \(\lfloor \sqrt{p} \rfloor\) for classification and \( \lfloor p/3 \rfloor \) for regression
- if there are only a few relevant variables out of many, which is the case in many genetic datasets, **mtry** should be set
high so that the algorithm can find the relevant variables (@goldsteinRandomForestsGenetic2011). A large **mtry**
ensures that there is (with high probability) at least one strong variable in the set of **mtry** candidate variables.
- for high-dimensional data, lower error rates are observed for higher **mtry** values, for both classification and
  regression, corroborating @goldsteinRandomForestsGenetic2011
- **min.node.size** specifies the minimum number of observations in a terminal node. Setting it lower
leads to deeper trees, with more splits before the terminal nodes are reached. Both **ranger**
and **randomForest** set the default value to 1 for classification and 5 for regression.
- **num.trees**: we have set this to a large value. 
- **replace=FALSE**: we sample without replacement when building each tree.

We fit a Random Forest (@Breiman.2001) model and recover the Mean Decrease in Impurity variable importance. 

```{r eval=FALSE,echo=TRUE}
data <- cbind(y,X)
colnames(data) <- c("y",make.names(1:num_vars))

rfModel <- ranger(data=data,dependent.variable.name="y", importance="impurity",
             classification=TRUE,num.trees = 10000, replace=FALSE, mtry=100, seed  = 17 )


imp <-log(ranger::importance(rfModel))
t2 <-count_variables(rfModel)
plot(density(imp))

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) # 0.993

palette("default")
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )
plot(1:1016,imp[1:1016],type="p",col=rep(col,8),pch=16,cex=0.8,
     xlab="variable number",ylab="log importances")
plot(imp,type="p",col=rep(col,8),pch=16,cex=0.8,xlab="variable number",
     ylab="log importances")


predictions <- imp
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
cost_perf = performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]
#     X22 
#-3.485894 

predicted_values <-rep(0,6350);predicted_values[ which(imp> -3.485894) ]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6217   72
#               1    6   55
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.09836066 FDR
sensitivity(conf_matrix) # 0.9990358 TP/(TP+FN)
specificity(conf_matrix) # 0.4330709 TN/(FP+TN)


```

We have set **num.trees** to a large value and **mtry** to a value larger than the default, because of the large number of variables.


We plot the log importances for the first 8 bands (figure \@ref(fig:simulation2zz2)); the plot for all 50 bands is too compressed.
The blocks are shown in different colors. The bias described by @Strobl.et.al.2007 is clearly visible:
blocks 1 to 7 of band 1 should have equal expected influence on $y$, but the importance decreases as the number of 
variables in the block increases. 



```{r simulation2zz2, echo=FALSE, fig.cap="The log importances for the first 8 bands.", fig.align="center", out.width = '50%'}
knitr::include_graphics("./supp_figures/simulation2_zz2.png")

```

## RFlocalfdr 
### determine cutoff
```{r eval=FALSE,echo=TRUE}
cutoffs <- c(0,1,4,10)
#png("./supp_figures/simulated_data_determine_cutoff.png")
res.con<- determine_cutoff(imp,t2,cutoff=cutoffs,plot=c(0,1,4,10))
#dev.off()

``` 
We plot a kernel density estimate of the data $y$, and the skew-normal fit, $t_1$, using the data up to the
quantile $Q$, shown in red.

```{r , echo=FALSE, fig.cap="For this data set, the selected cutoff value is 0.", fig.align="center", out.width = '50%'}
knitr::include_graphics("./supp_figures/simulated_data_determine_cutoff.png")
```

```{r eval=FALSE,echo=TRUE}
plot(cutoffs,res.con[,3])
cutoffs[which.min(res.con[,3])]

```
By plotting $\max(|y - t_1|)$ against the cutoff values, we determine the appropriate cutoff. In this case it is 
just $t_2 > 0$.


### recover variables
```{r eval=FALSE,echo=TRUE}
temp<-imp[t2 > 0]
palette("R3")

qq <- plotQ(temp,debug.flag = 1)
ppp<-run.it.importances(qq,temp,debug=0)
aa <- significant.genes(ppp,temp,cutoff=0.05,do.plot=2)
length(aa$probabilities) # 95

tt1 <- as.numeric(gsub("X([0-9]*)","\\1",names(aa$probabilities)))
tt2 <- table(ifelse((tt1 <= 127),1,2))   # variables 1 to 127 are non-null
tt2
# 1  2 
# 59 36 
tt2[2]/(tt2[1]+tt2[2])  # 36/(36+59) = 0.3789474, the false discovery rate

predicted_values<-rep(0, 6350);predicted_values[tt1]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) # 0.3789474 FDR 
sensitivity(conf_matrix) #0.994215 TP/(TP+FN)
specificity(conf_matrix) #0.4645669 TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #0.7294

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy # 0.983622

```
The FDR is 0.379. We can also calculate the sensitivity, specificity and accuracy.
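
Since the same confusion-matrix summaries recur throughout this vignette, they can be collected in a small helper function. This is a convenience sketch (not part of **RFlocalfdr**), assuming the `table(predicted_values, truth)` layout used above:

```{r eval=FALSE, echo=TRUE}
# convenience wrapper for the summaries used in this vignette
# conf_matrix is table(predicted_values, truth), with 0/1 rows (predicted) and 0/1 columns (truth)
selection_summary <- function(conf_matrix) {
  c(FDR         = conf_matrix[2, 1] / sum(conf_matrix[2, ]),
    sensitivity = caret::sensitivity(conf_matrix),
    specificity = caret::specificity(conf_matrix),
    accuracy    = sum(diag(conf_matrix)) / sum(conf_matrix))
}
selection_summary(conf_matrix)
```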


```{r eval=FALSE,echo=FALSE}
temp <- temp - min(temp) + .Machine$double.eps

palette(topo.colors(n = 7))
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )

plot(1:127,temp[1:127],type="p",col=rep(col,2),pch=16,cex=0.8,
     xlab="variable number",ylab="log importances")


plot(1:1016,temp[1:1016],type="p",col=rep(col,8),pch=16,cex=0.8,
     xlab="variable number",ylab="log importances")
lines(1:1016,temp[1:1016],col = "gray62",lwd=0.5)

abline(h=3.699622,col="red")
abline(v=127,col="green")
legend("topright",legend=c("RFlocalfdr cutoff","non-null variables"),lty=1,col=c("red","green" ))

```


In order to make the comparisons with other methods, the following packages may need to be installed.

```{r eval=FALSE,echo=TRUE}
install.packages("Boruta")
install.packages("locfdr")
install.packages("vita") 
install.packages("locfdr")
devtools::install_github("silkeszy/Pomona")

```

```{r eval=FALSE,echo=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("twilight")

```

## Boruta

We try the Boruta algorithm (@Kursa.and.Rudnicki.2010).

```{r eval=FALSE,echo=TRUE}
library(Boruta)
set.seed(120)  
system.time(boruta1 <- Boruta(X,as.factor(y), num.threads = 7,getImp=getImpRfGini,
                              classification=TRUE,num.trees = 10000, replace=FALSE, mtry=100, seed  = 17))
print(boruta1)
#Boruta performed 99 iterations in 19.54727 secs.
#4 attributes confirmed important: X4859, X58, X6132, X7;
# 6346 attributes confirmed unimportant: X1, X10, X100, X1000, X1001 and 6341 more;
plotImpHistory(boruta1)
aa <- which(boruta1$finalDecision=="Confirmed") 
bb <- which(boruta1$finalDecision=="Tentative") 
predicted_values <-rep(0,6350);predicted_values[c(aa,bb)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix

conf_matrix[2,1]/sum(conf_matrix[2,]) # 0.5 FDR 
sensitivity(conf_matrix) #0.9996786 TP/(TP+FN)
specificity(conf_matrix) #0.01574803 TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #0.5077

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #0.98

```
```{verbatim, eval=FALSE,echo=FALSE}
  D  !D
+ TP FP
- FN TN

- Sensitivity=[TP/(TP+FN)]×100
- Specificity=[TN/(FP+TN)]×100
- Positive predictive value(PPV)= [TP/(TP+FP)]×100
- Negative predictive value(NPV)=[TN/(FN+TN)]×100.
```


## Recursive Feature Elimination

- provided by the package **Pomona** (@Pomona.2022)
- recalculates the importance scores at each step
- does not recognize the **ranger** option **replace=FALSE**

```{r eval=FALSE,echo=TRUE}

library(Pomona)
colnames(X) <- make.names(1:dim(X)[2])
set.seed(111)
res <- var.sel.rfe(X, y, prop.rm = 0.2,  recalculate = TRUE, tol = 10, 
    ntree = 500, mtry.prop = 0.2, nodesize.prop = 0.1, no.threads = 7, 
    method = "ranger", type = "classification", importance = "impurity", 
    case.weights = NULL) 
 res$var
#[1] "X1"    "X106"  "X11"   "X12"   "X127"  "X13"   "X15"   "X16"   "X2"   
#[10] "X23"   "X24"   "X3"    "X4"    "X44"   "X46"   "X4885" "X5"    "X54"  
#[19] "X5474" "X6"    "X7"    "X72"   "X9"    "X91"  
tt <- as.numeric(gsub("X([0-9]*)","\\1", res$var))
table(ifelse((tt <= 127),1,2))
#  1  2 
# 22  2      FDR = 2/24 = 0.0833

res<-c(1,106, 11, 12, 127, 13, 15, 16,  2, 23, 24,  3,  4, 44, 46,  4885,  5, 54, 5474,  6,
       7, 72,  9, 91)
predicted_values <-rep(0,6350);predicted_values[c(res)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6221  105
#               1    2   22

conf_matrix[2,1]/sum(conf_matrix[2,]) # 0.083 FDR 
sensitivity(conf_matrix) # 0.9996786 TP/(TP+FN)
specificity(conf_matrix) # 0.1732283 TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #0.5865
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #  0.9831496

```

## AIR

See @Nembrini.et.al.2018. 

- AIR is provided in the package _ranger_, using the option _impurity_corrected_.
- the suggested approach is to apply the **janitza** method and select the variables with $p < 0.05$

```{r eval=FALSE,echo=TRUE}

rfModel2 <- ranger(data=data,dependent.variable.name="y", importance="impurity_corrected",
       classification=TRUE,  mtry=100,num.trees = 10000, replace=TRUE, seed  = 17 )
ww<- importance_pvalues( rfModel2, method = "janitza")

p <- ww[,"pvalue"]
cc <- which(p< 0.05)  
predicted_values <-rep(0,6350);predicted_values[cc]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 5950    0
#               1  273  127
#FDR is 273/(127+273) = 0.6825

sensitivity(conf_matrix) #0.9561305 TP/(TP+FN)
specificity(conf_matrix) #1          TN/(FP+TN)
roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) # 0.9781
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #  0.9570079




```
As the null values, as modelled by the AIR procedure, are symmetric around 0, the
question arises as to whether we can apply the locfdr (@locfdr.2015) procedure.
In this case, it reduces the FDR from 0.682 to 0.601.


```{r eval=FALSE,echo=TRUE}
plot(density(ww[,"importance"]))
imp <- ww[,"importance"]
#imp <-imp/sqrt(var(imp))
#plot(density(imp))
library(locfdr)

aa<-locfdr(imp,df=13)
which( (aa$fdr< 0.05) & (imp>0))
cc2 <-  which( (aa$fdr< 0.05) & (imp>0))
length(cc2) # [1] 309
length(intersect(cc,cc2)) #[1] 309

(length(cc2)  - length(which(cc2 <= 127)))/length(cc2) #[1] 0.6019417  fdr
predicted_values <-rep(0,6350);predicted_values[cc2]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #  0.6019417 FDR 
sensitivity(conf_matrix) # 0.9701109 TP/(TP+FN)
specificity(conf_matrix) #0.9685039         TN/(FP+TN)
roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) # 0.9693
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #   0.9700787

```

## PIMP


Note that PIMP:

- permutes the response vector $y$ $S$ times to give $y^{*s}$
- for each permutation, grows a new RF and computes the permutation
  variable importance measure (VarImp$^{*s}$) for all predictor variables $X$
- uses the vector `perVarImp` of $S$ VarImp measures for every predictor variable to approximate the null importance distribution
  for each variable (see **PimpTest**)
- involves $p$ tests, hence a multiple testing correction is in order.

We base our work on the PIMP implementation provided by @vita.2015, _Variable Importance Testing Approaches_.
PIMP as described by Altmann is only applicable to permutation importance; however, we see no impediment to
extending the method to Gini importance. We have done that in our code and have also

    - fixed a small error
	- extended the code to use the **ranger** as well as the **randomForest** implementation of RF 

```{r eval=FALSE,echo=TRUE}
#vita: Variable Importance Testing Approaches
library(vita) 
y<-factor(y)
X<-data.frame(X)
set.seed(117)
cl.ranger <- ranger(y~. , X,mtry = 100,num.trees = 10000, classification=TRUE, replace=FALSE,
                    importance = 'impurity') 
system.time(pimp.varImp.cl<-my_ranger_PIMP(X,y,cl.ranger,S=10, parallel=TRUE, ncores=10)) 
pimp.t.cl <- vita::PimpTest(pimp.varImp.cl,para = FALSE)
aa <- summary(pimp.t.cl,pless = 0.1)

tt <- as.numeric(gsub("X([0-9]*)","\\1",  names(which(aa$cmat2[,"p-value"]< 0.05))))
table(ifelse((tt <= 127),1,2))
#   1    2 
# 127  175      175/(127+175) = 0.579

predicted_values <-rep(0,6350);predicted_values[which(aa$cmat2[,"p-value"]< 0.05)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.5794 FDR
sensitivity(conf_matrix) #0.971 TP/(TP+FN)
specificity(conf_matrix) #1          TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #  0.9859
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #  0.9724409

```

```{r eval=FALSE,echo=FALSE}
# twilight uses a stochastic downhill search algorithm for estimating the local false discovery rate
library(twilight)
p.values <- aa$cmat2[,"p-value"]
ans <- twilight(p.values) #Twilight cannot run properly.
fdr <- ans$result$fdr
sum(fdr < 0.05) #[1] 0

```

\clearpage





# Other Methods

## permutation importance

```{R, echo=TRUE, eval=FALSE}
rfModel2 <- ranger(data=data,dependent.variable.name="y", importance="permutation",
             classification=TRUE,  mtry=100,num.trees = 10000, replace=FALSE, seed  = 17 )
imp <-ranger::importance(rfModel2)
plot(density(imp))

palette(topo.colors(n = 7))
plot(1:1016,imp[1:1016],type="p",col=rep(col,2),pch=16,cex=0.8,
     xlab="variable number",ylab="log importances")
lines(1:1016,imp[1:1016],col = "gray62",lwd=0.5)
abline(v=127,col="green")

# could apply Breiman and Cutler's argument that the permutation importance is Gaussian,
# or could use empirical Bayes

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) #  0.9924


library(ROCR)
predictions <- imp
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
cost_perf = performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(imp> 0.0001063063  ,c(rep(1,127),rep(0,6350-127)))
#          0    1
#  FALSE 6220   66
#  TRUE     3   61

predicted_values <-rep(0,6350);predicted_values[which(imp> 0.0001063063 )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6220   66
#               1    3   61
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1]  0.046875 FDR
sensitivity(conf_matrix) # 0.9995179 TP/(TP+FN)
specificity(conf_matrix) # 0.480315         TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #   0.7399
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy # (6220+61)/6350 = 0.9891339




```
## impurity corrected

```{R, echo=TRUE, eval=FALSE}
rfModel3 <- ranger(data=data,dependent.variable.name="y", importance="impurity_corrected",
             classification=TRUE,  mtry=100,num.trees = 10000, replace=FALSE, seed  = 17 )
imp <-ranger::importance(rfModel3)
plot(density(imp))

palette(topo.colors(n = 7))
plot(1:1016,imp[1:1016],type="p",col=col,pch=16,cex=0.8,
     xlab="variable number",ylab="log importances")
lines(1:1016,imp[1:1016],col = "gray62",lwd=0.5)
abline(v=127,col="green")

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) # 0.9951

library(ROCR)
predictions <- imp
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
cost_perf = performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(imp> 0.0106588   ,c(rep(1,127),rep(0,6350-127)))
#          0    1
#  FALSE 6186   40
#  TRUE    37   87
#best FDR is
37/(37 +  87) #[1] 0.2983871
#
predicted_values <-rep(0,6350);predicted_values[ which(imp> 0.0106588)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.2983871 FDR
sensitivity(conf_matrix) #0.9940543 TP/(TP+FN)
specificity(conf_matrix) #0.6850394          TN/(FP+TN)



```


## cforest

We consider the cforest forest-fitting procedure (@hothorn_unbiased_2006) and the Conditional Permutation Importance (CPI)
(@Strobl.et.al.2008). The implementation of CPI provided by **party::varimp** is excessively slow, so
we use the implementation provided by the **permimp** package. 


```{R, echo=TRUE, eval=FALSE}
library(party)
library(permimp) #
data <- data.frame(y,X)

system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                           control = party::cforest_unbiased(ntree = 10,  mtry = 100)))
# 12.848   0.000  12.885 
system.time(aa<-party::varimp(mod1.cf, conditional = TRUE))
#    user   system  elapsed 
#3089.301    9.484 3100.554
system.time(aa3<-permimp(mod1.cf, conditional = TRUE))
#   user  system elapsed 
#  4.685   0.000   4.707 

system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                            control = party::cforest_unbiased(ntree = 100,  mtry = 100)))
#  27.233   0.464  27.703 
system.time(aa<-party::varimp(mod1.cf, conditional = TRUE))
#    user   system  elapsed 
#29380.980    81.511 29475.138 
save(aa,file="aa.Rdata")
system.time(aa3<-permimp(mod1.cf, conditional = FALSE))
#  user  system elapsed 
# 6.084   0.000   6.089 


system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                       control = party::cforest_unbiased(ntree = 1000,  mtry = 100)))
#35.438   0.636  36.095 
system.time(aa4<-permimp(mod1.cf, conditional = FALSE))
#  user  system elapsed 
# 47.135   0.429  47.578 

palette("default")
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )
plot(1:1016,aa4$values[1:1016],type="p",col=rep(col,8),pch=16,cex=0.8,
      xlab="variable number",ylab="importance")



system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                           control = party::cforest_unbiased(ntree = 10000, replace=FALSE, mtry = 100)))
#35.438   0.636  36.095 
system.time(aa4<-permimp(mod1.cf, conditional = FALSE))
#  user  system elapsed 
# 47.135   0.429  47.578


roccurve <- roc(c(rep(1,127),rep(0,6350-127)),aa4$values)
plot(roccurve)
auc(roccurve) #0.9486

system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                                    control = party::cforest_unbiased(ntree = 10000,  mtry = 100)))
#Process R killed at Mon Feb 26 07:18:23 2024 on robubuntu
#   user  system elapsed 
# 98.673   2.832 101.977  on petra, seems to have used 26GB of RAM

system.time(aa4<-permimp(mod1.cf, conditional = TRUE))
#  user  system elapsed 
#352.681   2.262 356.232 
head(aa4$values)
#          X1           X2           X3           X4           X5           X6 
#2.600188e-05 9.454988e-05 1.217855e-04 1.682538e-04 1.572708e-04 1.706311e-04 

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),aa4$values)
plot(roccurve)
auc(roccurve) # 0.9978


library(ROCR)
predictions <- aa4$values
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)

cost_perf <- performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(predictions> 3.264138e-05  ,c(rep(1,127),rep(0,6350-127)))
#           0    1
#  FALSE 6208   20
#  TRUE    15  107
predicted_values <-rep(0,6350);predicted_values[ which(predictions>  3.264138e-05   )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix

conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.1229508 FDR
sensitivity(conf_matrix) # 0.9975896 TP/(TP+FN)
specificity(conf_matrix) #  0.8425197         TN/(FP+TN)

#https://stats.stackexchange.com/questions/61521/cut-off-point-in-a-roc-curve-is-there-a-simple-function
library(pROC)
rocobj <- roc(labels,predictions)
plot(rocobj)
coords(rocobj, "best")
coords(rocobj, x="best", input="threshold", best.method="youden") # Sa
auc(rocobj) # 0.9978

predicted_values <-rep(0,6350);predicted_values[ which(predictions>  8.976665e-06   )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix

#see also
#https://stats.stackexchange.com/questions/132547/roc-curves-for-unbalanced-datasets
#https://cran.r-project.org/web/packages/qeML/vignettes/Unbalanced_Classes.html
#https://juandelacalle.medium.com/how-and-why-i-switched-from-the-roc-curve-to-the-precision-recall-curve-to-analyze-my-imbalanced-6171da91c6b8
#https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/
            
```
## sobolMDA
See @benardMDARandomForests2022.

```{R, echo=FALSE, eval=FALSE}
#library(remotes)
#install_gitlab("drti/sobolmda")
library(sobolMDA)
data <- data.frame(y,X)
system.time(rng.sobolmda <- sobolMDA::ranger(dependent.variable.name = "y", classification = TRUE, data = data, replace=FALSE, mtry = 100,
                                             num.trees = 10000, importance = "sobolMDA",   seed = 125))

palette("default")
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )
plot(1:1016,rng.sobolmda$variable.importance[1:1016],type="p",col=rep(col,8),pch=16,cex=0.8,
      xlab="variable number",ylab="importance")
plot(rng.sobolmda$variable.importance,type="p",col=rep(col,50),pch=16,cex=0.8,
      xlab="variable number",ylab="importance")

imp<- rng.sobolmda$variable.importance

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) #0.9954

library(ROCR)
predictions <- rng.sobolmda$variable.importance
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)

#perf <- performance(pred, "acc")
#plot(perf, avg= "vertical", spread.estimate="boxplot", lwd=3,col='blue',
#     show.spread.at= seq(0.1, 0.9, by=0.1),
#      main= "Accuracy across the range of possible cutoffs")

#plot(0,0,type="n", xlim= c(0,0.001), ylim=c(0,10000),
#      xlab="Cutoff", ylab="Density",
#      main="How well do the predictions separate the classes?")
#
#lines(density(pred@predictions[[1]][pred@labels[[1]]=="1"]), col= "red")
#lines(density(pred@predictions[[1]][pred@labels[[1]]=="0"]), col="green")

cost_perf = performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(rng.sobolmda$variable.importance> 0.001048671 ,c(rep(1,127),rep(0,6350-127)))
#           0    1
#  FALSE 6220   66
#  TRUE     3   61
predicted_values <-rep(0,6350);predicted_values[ which(imp> 0.001048671 )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6220   66
#               1    3   61

conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.046875 FDR
sensitivity(conf_matrix) # 0.9995179 TP/(TP+FN)
specificity(conf_matrix) # 0.480315          TN/(FP+TN)

#best FDR is
3/(61+3) #[1]  0.046875

```

```{R, echo=FALSE, eval=FALSE}
# example from  Variable Importance in Random Forests: Traditional Methods and New Developments
#https://towardsdatascience.com/variable-importance-in-random-forests-20c6690e44e0

library(kernlab)
library(drf) #An implementation of distributional random forests as introduced in Cevid & Michel & Meinshausen & Buhlmann (2020) 
library(Matrix)
library(DescTools)
library(mice)
library(sobolMDA)
source("./simulated2/compute_drf_vimp.R") 
source("./simulated2/evaluation.R")


##Simulate Data that experiences both a mean as well as sd shift
n <- 200
# Simulate from X
x1 <- runif(n,-1,1)
x2 <- runif(n,-1,1)
X0 <- matrix(runif(7*n,-1,1), nrow=n, ncol=7)
x3 <- x1+ runif(n,-1,1)
X <- cbind(x1,x2, x3, X0)

# Simulate dependent variable Y
Y <- as.matrix(rnorm(n,mean = 0.8*(x1 > 0), sd = 1 + 1*(x2 > 0)))
colnames(X)<-paste0("X", 1:10)

head(cbind(Y,X))


Xfull <-X
## Variable importance for conditional Expectation Estimation
XY <- as.data.frame(cbind(Xfull, Y))
colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')
num.trees <- 500
forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'sobolMDA') #was this a classification model? check
sobolMDA <- forest$variable.importance
names(sobolMDA) <- colnames(X)

sort(sobolMDA, decreasing = T)


forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'permutation')
MDA <- forest$variable.importance
names(MDA) <- colnames(X)
sort(MDA, decreasing = T)

MMDVimp <- compute_drf_vimp(X=X,Y=Y)
sort(MMDVimp, decreasing = T)


load("~/Downloads/wage_benchmark.Rdata")
##Define the training data

n<-1000

Xtrain<-X[1:n,] 
Ytrain<-Y[1:n,]
Xtrain<-cbind(Xtrain,Ytrain[,"male"])
colnames(Xtrain)[ncol(Xtrain)]<-"male"
Ytrain<-Ytrain[,1, drop=F]


##Define the test data
ntest<-2000
Xtest<-X[(n+1):(n+ntest),]  
Ytest<-Y[(n+1):(n+ntest),]
Xtest<-cbind(Xtest,Ytest[,"male"])
colnames(Xtest)[ncol(Xtest)]<-"male"
Ytest<-Ytest[,1, drop=F]

# Calculate variable importance for both measures
# 1. Sobol-MDA
XY <- as.data.frame(cbind(Xtrain, Ytrain))
colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')
num.trees <- 500
forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'sobolMDA')
SobolMDA <- forest$variable.importance
names(SobolMDA) <- colnames(Xtrain)

# 2. MMD-MDA
MMDVimp <- compute_drf_vimp(X=Xtrain,Y=Ytrain,silent=T)



print("Top 10 most important variables for conditional Expectation estimation")
sort(SobolMDA, decreasing = T)[1:10]
print("Top 5 most important variables for conditional Distribution estimation")
sort(MMDVimp, decreasing = T)[1:10]

# Remove variables one-by-one according to the importance values saved in SobolMDA
# and MMDVimp.
evallistSobol<-evalall(SobolMDA, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c("MSE"), num.trees )
evallistMMD<-evalall(MMDVimp, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c("MMD"), num.trees )    #slow

plot(1:79,evallistSobol$evalMSE, type="l", lwd=2, cex=0.8, col="darkgreen", main="MSE and (MMD+0.8) loss" , xlab="Number of Variables removed", ylab="Values")
lines(1:79,evallistMMD$evalMMD+0.8, type="l", lwd=2, cex=0.8, col="blue")

#Random Forests in 2023: Modern Extensions of a Powerful Method
#https://towardsdatascience.com/random-forests-in-2023-modern-extensions-of-a-powerful-method-b62debaf1d62

```



# Shap values

SHapley Additive exPlanations (SHAP) values are used as a measure of variable importance. They
are based on Shapley values, which use game theory to assign credit for a model's prediction to each feature or feature
value.

The Shapley value is defined for any value function; SHAP is the special case obtained by taking the
value function to be the conditional expectation function of our model. A SHAP value is the average marginal
contribution of a feature value across all possible sets of features.
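
For a value function $v$, the Shapley value of feature $j$ is the average marginal contribution of $j$ over all subsets $S$ of the remaining features:

$$\phi_j = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!} \left( v(S \cup \{j\}) - v(S) \right).$$

The sum runs over all $2^{p-1}$ subsets, which is why the exact calculation is intractable for large $p$ and all practical implementations approximate it.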

- TreeSHAP (@lundbergConsistentIndividualizedFeature2019) is a very fast, polynomial-time algorithm for computing SHAP values
  for tree ensemble models such as decision trees, random forests, and gradient boosted trees
- the R library *treeshap* only accepts regression models (not classification)
- other options:
   
   - *fastshap* uses Monte Carlo sampling to approximate SHAP values
   - *shapr* and *kernelshap* provide implementations of KernelSHAP
   - *shapper* provides a port of Python's "shap" package
   
**Global explanations.** Aside from explaining individual predictions (local explanations), it can be useful to aggregate the results of
several predictions (for example, all of the training predictions) into an overall global summary of the model (a global
explanation).
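
A common global summary, and the one used in the **fastshap** chunk below, is the mean absolute SHAP value of each feature; a minimal sketch, assuming `shap` is a matrix of SHAP values with one row per observation and one column per feature:

```{r eval=FALSE, echo=TRUE}
# global importance: mean absolute SHAP value for each feature
SHAP.global <- apply(shap, 2, function(x) mean(abs(x)))
head(sort(SHAP.global, decreasing = TRUE))  # the most important features
```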


The interpretation of the Shapley value for feature value $j$ is: the value of the $j$-th feature contributed $\phi_j$ to the prediction of this particular instance, compared to the average prediction for the dataset.
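
This interpretation rests on the efficiency property of Shapley values: for an instance $x$, the feature contributions sum to the difference between the prediction for $x$ and the average prediction,

$$\sum_{j=1}^{p} \phi_j = \hat{f}(x) - \mathbb{E}\big[\hat{f}(X)\big].$$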




## SHAFF: Fast and consistent SHApley eFfect estimates via random Forests

See @benardSHAFFFastConsistent2022. SHAFF:

1. improves the Monte Carlo approach by using importance sampling to focus on the most relevant subsets of variables identified by the forest;
2. uses a projected random forest algorithm to compute fast and accurate estimates of the conditional expectations for any variable subset (the same algorithm as SobolMDA?);
3. sets **mtry** to $p/3$ in the initial forest, much larger than the value we are using.

In our experiment SHAFF finds one variable.

```{R, echo=FALSE, eval=FALSE}
# devtools::install_gitlab("drti/shaff")
library(shaff)
library("nnls") #needs this but does not load it
data <- cbind(y,X)
names(data)[1] <- "Y"
system.time(rng <- shaff(data = data, mtry = 100, num.trees = 10000, K=500))
which(rng>0)
#[1] 17

```

## iml: Interpretable Machine Learning

- See @Molnar.2021
- did not finish within a 3 hour run time limit (20 processors were available, but only one appears to be used; it is unclear whether the computation can be parallelized)


```{R, echo=FALSE, eval=FALSE} 
library(iml)
library(ranger)
set.seed(314)
data <- data.frame(y,X)
rfModel2 <- ranger(data=data,dependent.variable.name="y", importance="permutation",
             classification=TRUE,mtry=50,num.trees = 1000, replace=TRUE, seed  = 17 )
X <- data.frame(data[,-which(names(data) == "y")])
mod <- Predictor$new(rfModel2, data = X, type = "response")
shapley <- Shapley$new(mod, x.interest = X[1, ]) 
plot(shapley)

#full data set killed -- using a lot of RAM 
#  dim(X) [1] 300 508 OK


shapley$plot(sort = TRUE) 
shapley$y.hat.interest

phis <- shapley$results$phi
names(phis) <- shapley$results$feature
par(mar=c(5, 10, 4, 2)+0.1) # for wide enough left margin
barplot(sort(phis), horiz=TRUE, las=1)
# ## pick features e.g. by a limit for the absolute value of phis
# values <- shapley$results

N <- 10 # number of features to show
# Capture the ggplot2 object
p <- plot(shapley, sort = TRUE)
# Modify it so it shows only top N features
print(p + scale_x_discrete(limits=rev(p$data$feature.value[order(-p$data$phi)][1:N])))


```


## fastshap

- see https://bgreenwell.github.io/fastshap/articles/fastshap.html
- did not finish in 2 1/2 hours with 20 processors on the full data set
- we also tried the small data set (508 variables)

```{R, echo=FALSE, eval=FALSE}
#Source: vignettes/fastshap.Rmd

library(fastshap)
library(ranger)
set.seed(2053)  # for reproducibility

pfun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

library(doParallel)

# With parallelism
#salloc --account=OD-225217 --mem=100GB --nodes=1 --ntasks-per-node=1  --cpus-per-task=20  -J interactive -t 6:00:00 srun --pty /bin/bash -l

registerDoParallel(cores = 20)  # use forking 
set.seed(5038)
system.time({  # estimate run time
  ex.ames.par <- explain(rfModel2, X = X, pred_wrapper = pfun, nsim = 50, adjust = TRUE, parallel = TRUE)
})


######## a test data set from the fastshap documentation
# Load the sample data; see ?datasets::mtcars for details
data(mtcars)

# Fit a projection pursuit regression model
fit <- ppr(mpg ~ ., data = mtcars, nterms = 5)
# Prediction wrapper
pfun <- function(object, newdata) {  # needs to return a numeric vector
    predict(object, newdata = newdata)  
}

# Compute approximate Shapley values using 10 Monte Carlo simulations
set.seed(101)  # for reproducibility
shap1 <- explain(fit, X = subset(mtcars, select = -mpg), nsim = 10,  pred_wrapper = pfun)
head(shap1)
apply(shap1, 2, function(x) mean(abs(x)))

shap <- explain(fit, X = subset(mtcars, select = -mpg), nsim = 10,  pred_wrapper = pfun,adjust=TRUE)



############################################################################################
######## try the small data set
#salloc --account=OD-225217 --mem=100GB --nodes=1 --ntasks-per-node=1  --cpus-per-task=20  -J interactive -t 6:00:00 srun --pty /bin/bash -l
dim(data) #[1] 300 509
rfModel2 <- ranger(data=data,dependent.variable.name="y", importance="none",
                   classification=TRUE,mtry=50,num.trees = 1000, replace=TRUE, seed  = 17 )

pfun <- function(object, newdata) {  # needs to return a numeric vector
    predict(object, data = newdata)$predictions  
}
system.time(shap <- fastshap::explain(rfModel2, X = subset(data, select = -y), nsim = 10,  pred_wrapper = pfun))
#   user  system elapsed 
#201.161  31.122 153.928 
head(shap)
plot(apply(shap, 2, function(x) mean(abs(x))))



library(doParallel)

registerDoParallel(cores = 20)  # use forking 
set.seed(5038)
system.time({  # estimate run time
  shap <- fastshap::explain(rfModel2, X = subset(data, select = -y), nsim = 1000,  pred_wrapper = pfun, adjust = TRUE, parallel = TRUE)
})
#     user     system    elapsed 
#23285.927  5089.726  3168.814 



predictions <- SHAP.global <- apply(shap, 2, function(x) mean(abs(x)))

palette(topo.colors(n = 7))
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )
plot(1:508,predictions,type="p",col=rep(col,4),pch=16,cex=0.8,
     xlab="variable number",ylab="mean absolute SHAP value")
lines(1:508,predictions,col = "gray62",lwd=0.5)
abline(v=127,col="green")

roccurve <- roc(c(rep(1,127),rep(0, 508-127)),predictions)
plot(roccurve)
auc(roccurve) # 0.9806

library(ROCR)
labels <- c(rep(1,127),rep(0,508-127))
pred <- prediction(predictions, labels)
cost_perf = performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]
#0.0002388418 
table(SHAP.global> 0.0002388418    ,c(rep(1,127),rep(0,508-127)))
#          0   1
#  FALSE 363  11
#  TRUE   18 116

predicted_values <-rep(0,508);predicted_values[ which(predictions> 0.0002388418    )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,508-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.1343284 FDR
sensitivity(conf_matrix) # 0.9527559 TP/(TP+FN)
specificity(conf_matrix) #  0.9133858         TN/(FP+TN)

```