Introduction to ‘drglm’

The ‘drglm’ package lets users fit GLMs to big data sets that fit into memory. It uses the popular “Divide and Recombine” method: the data are divided into subsets, a GLM is fitted to each subset, and the subset-level results are recombined into a single set of estimates. Let's generate a data set that is not that big but serves our purpose.
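
Before moving on to the data, the divide and recombine idea can be sketched in base R. This is only an illustration under stated assumptions: the toy data, the variable names, and the recombination rule (inverse-variance weighting of the subset estimates) are choices made for this sketch, and the rule used inside ‘drglm’ may differ.

```r
# Sketch of divide and recombine for a Gaussian GLM, base R only.
# Assumption: subset estimates are combined by inverse-variance
# (precision) weighting; this is not taken from the drglm source.
set.seed(1)
n <- 10000
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 3 * toy$x + rnorm(n)

k <- 5                                        # number of subsets
chunk_id <- rep(seq_len(k), length.out = n)   # assign each row to a subset
fits <- lapply(split(toy, chunk_id),
               function(d) glm(y ~ x, data = d, family = gaussian()))

# Precision (inverse covariance) matrix of each subset's estimate
prec <- lapply(fits, function(f) solve(vcov(f)))
total_prec <- Reduce(`+`, prec)
weighted <- Reduce(`+`, Map(function(P, f) P %*% coef(f), prec, fits))

dr_coef <- drop(solve(total_prec, weighted))  # recombined estimates
dr_se <- sqrt(diag(solve(total_prec)))        # recombined standard errors

# Compare with a single fit on the full data
full_coef <- coef(glm(y ~ x, data = toy, family = gaussian()))
print(cbind(divide_recombine = dr_coef, full_data = full_coef))
```

For this toy problem the recombined estimates should be very close to the full-data fit, while each individual fit only ever sees one subset.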

Generating a Data Set

set.seed(123)
#Number of rows to be generated
n <- 1000000
#creating dataset
dataset <- data.frame( 
Var_1 = round(rnorm(n, mean = 50, sd = 10)), 
Var_2 = round(rnorm(n, mean = 7.5, sd = 2.1)), 
Var_3 = as.factor(sample(c("0", "1"), n, replace = TRUE)), 
Var_4 = as.factor(sample(c("0", "1", "2"), n, replace = TRUE)), 
Var_5 = as.factor(sample(0:15, n, replace = TRUE)), 
Var_6 = round(rnorm(n, mean = 60, sd = 5))
)

This data set contains six variables: three are continuous (Var_1, Var_2 and Var_6, generated from normal distributions and rounded), two are categorical (Var_3 and Var_4), and one (Var_5) is a count variable taking values from 0 to 15, stored here as a factor. We shall now fit different GLMs to this data set.

Fitting Multiple Linear Regression Model

Now, we shall fit a multiple linear regression model to the data set, taking Var_1 as the response variable and all other variables as independent ones. Here k = 10 sets the number of subsets the data are divided into.

nmodel= drglm::drglm(Var_1 ~ Var_2+ Var_3+ Var_4+ Var_5+ Var_6,  
                     data=dataset, family="gaussian", 
                     fitfunction="speedglm", k=10)
#Output
print(nmodel)
##                  Estimate standard.error      Z_value  Pr...z..
## (Intercept) 49.9938921629    0.132222414 378.10451704 0.0000000
## Var_2       -0.0045648136    0.004721587  -0.96679652 0.3336458
## Var_31       0.0140777358    0.020007935   0.70360764 0.4816772
## Var_41      -0.0070996373    0.024495862  -0.28983006 0.7719462
## Var_42       0.0031706649    0.024509469   0.12936490 0.8970689
## Var_51      -0.0572865412    0.056620740  -1.01175896 0.3116533
## Var_52      -0.0110496857    0.056615948  -0.19516914 0.8452605
## Var_53      -0.0448607620    0.056694044  -0.79127821 0.4287817
## Var_54      -0.0268086198    0.056646008  -0.47326582 0.6360235
## Var_55       0.0466234380    0.056633526   0.82324801 0.4103670
## Var_56      -0.0270480470    0.056580123  -0.47804857 0.6326156
## Var_57       0.0433609651    0.056668648   0.76516675 0.4441723
## Var_58      -0.0297390739    0.056712763  -0.52438062 0.6000138
## Var_59       0.0432931453    0.056669237   0.76396203 0.4448899
## Var_510      0.0672852618    0.056660012   1.18752643 0.2350200
## Var_511     -0.0583903308    0.056635285  -1.03098856 0.3025462
## Var_512      0.0091184125    0.056623543   0.16103571 0.8720653
## Var_513      0.0049039721    0.056698929   0.08649144 0.9310758
## Var_514     -0.0151239426    0.056584592  -0.26728023 0.7892534
## Var_515      0.0548865463    0.056635643   0.96911668 0.3324870
## Var_6        0.0004866552    0.001996329   0.24377510 0.8074050
##                     normal.CI
## (Intercept) [ 49.73 , 50.25 ]
## Var_2           [ -0.01 , 0 ]
## Var_31       [ -0.03 , 0.05 ]
## Var_41       [ -0.06 , 0.04 ]
## Var_42       [ -0.04 , 0.05 ]
## Var_51       [ -0.17 , 0.05 ]
## Var_52        [ -0.12 , 0.1 ]
## Var_53       [ -0.16 , 0.07 ]
## Var_54       [ -0.14 , 0.08 ]
## Var_55       [ -0.06 , 0.16 ]
## Var_56       [ -0.14 , 0.08 ]
## Var_57       [ -0.07 , 0.15 ]
## Var_58       [ -0.14 , 0.08 ]
## Var_59       [ -0.07 , 0.15 ]
## Var_510      [ -0.04 , 0.18 ]
## Var_511      [ -0.17 , 0.05 ]
## Var_512       [ -0.1 , 0.12 ]
## Var_513      [ -0.11 , 0.12 ]
## Var_514       [ -0.13 , 0.1 ]
## Var_515      [ -0.06 , 0.17 ]
## Var_6               [ 0 , 0 ]
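
To make the columns of this output concrete, the following sketch recomputes the Var_2 row from its printed Estimate and standard.error. It assumes a Wald z-test and a 95% normal interval; the 95% level is an assumption, chosen because it reproduces the printed values, and the variable names are illustrative.

```r
# Values copied from the Var_2 row of the output above
est <- -0.0045648136
se  <- 0.004721587

z  <- est / se                              # Z_value
p  <- 2 * pnorm(-abs(z))                    # two-sided p-value
ci <- est + c(-1, 1) * qnorm(0.975) * se    # assumed 95% normal interval

print(c(z = z, p = p))
print(round(ci, 2))                         # rounded as in the normal.CI column
```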

Fitting Binomial Regression (Logistic Regression) Model

Now, we shall fit a logistic regression model to the data set, taking Var_3 as the response variable and all other variables as independent ones.

bmodel=drglm::drglm(Var_3~ Var_1+ Var_2+ Var_4+ Var_5+ Var_6, 
                    data=dataset, family="binomial",
                    fitfunction="speedglm", k=10)
#Output

print(bmodel)
##                  Estimate Odds.Ratio standard.error    t.value   Pr...z..
## (Intercept)  0.0498850493  1.0511503   0.0281923787  1.7694516 0.07681854
## Var_1        0.0001406428  1.0001407   0.0001999858  0.7032641 0.48189121
## Var_2       -0.0010289335  0.9989716   0.0009441471 -1.0898021 0.27580035
## Var_41      -0.0009157951  0.9990846   0.0048982015 -0.1869656 0.85168762
## Var_42       0.0008660010  1.0008664   0.0049009500  0.1767006 0.85974354
## Var_51      -0.0090198819  0.9910207   0.0113218905 -0.7966763 0.42563905
## Var_52      -0.0103609021  0.9896926   0.0113209121 -0.9152003 0.36008649
## Var_53      -0.0111773346  0.9888849   0.0113364057 -0.9859681 0.32414876
## Var_54      -0.0051583819  0.9948549   0.0113269975 -0.4554059 0.64881723
## Var_55      -0.0166414412  0.9834963   0.0113247263 -1.4694784 0.14170306
## Var_56      -0.0170752441  0.9830697   0.0113137869 -1.5092422 0.13123691
## Var_57      -0.0115591956  0.9885074   0.0113313552 -1.0201071 0.30767768
## Var_58      -0.0190175646  0.9811621   0.0113399851 -1.6770361 0.09353542
## Var_59      -0.0024879742  0.9975151   0.0113313423 -0.2195657 0.82620940
## Var_510     -0.0039725724  0.9960353   0.0113297226 -0.3506328 0.72586385
## Var_511     -0.0189525009  0.9812260   0.0113250085 -1.6735088 0.09422718
## Var_512     -0.0080661323  0.9919663   0.0113222078 -0.7124169 0.47620665
## Var_513     -0.0167293199  0.9834098   0.0113376220 -1.4755581 0.14006256
## Var_514     -0.0270868122  0.9732767   0.0113146115 -2.3939675 0.01666723
## Var_515     -0.0148850714  0.9852252   0.0113248937 -1.3143674 0.18872258
## Var_6       -0.0006315246  0.9993687   0.0003991918 -1.5820079 0.11364778
##                    normal.CI
## (Intercept) [ -0.01 , 0.11 ]
## Var_1              [ 0 , 0 ]
## Var_2              [ 0 , 0 ]
## Var_41      [ -0.01 , 0.01 ]
## Var_42      [ -0.01 , 0.01 ]
## Var_51      [ -0.03 , 0.01 ]
## Var_52      [ -0.03 , 0.01 ]
## Var_53      [ -0.03 , 0.01 ]
## Var_54      [ -0.03 , 0.02 ]
## Var_55      [ -0.04 , 0.01 ]
## Var_56      [ -0.04 , 0.01 ]
## Var_57      [ -0.03 , 0.01 ]
## Var_58         [ -0.04 , 0 ]
## Var_59      [ -0.02 , 0.02 ]
## Var_510     [ -0.03 , 0.02 ]
## Var_511        [ -0.04 , 0 ]
## Var_512     [ -0.03 , 0.01 ]
## Var_513     [ -0.04 , 0.01 ]
## Var_514        [ -0.05 , 0 ]
## Var_515     [ -0.04 , 0.01 ]
## Var_6              [ 0 , 0 ]
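
The Odds.Ratio column appears to be the exponential of the Estimate column, which is the usual way to read logistic regression coefficients. A minimal check using the Var_514 row follows. The values are copied from the output above, the variable names are illustrative, and the interpretation in the comments assumes the usual R conventions (the first factor level is the reference).

```r
est <- -0.0270868122        # Estimate for Var_514 (log-odds scale)
odds_ratio <- exp(est)      # multiplicative change in the odds
print(odds_ratio)

# Percentage change in the odds of Var_3 = "1" for Var_5 = "14",
# relative to the assumed reference level Var_5 = "0"
print(100 * (odds_ratio - 1))
```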

Fitting Poisson Regression Model

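Now, we shall fit a Poisson regression model to the data set, taking Var_5 as the response variable and all other variables as independent ones. Note that the call below passes family="binomial", so the output printed here (with its Odds.Ratio column) comes from a binomial fit; a Poisson fit would need family="poisson" and, since Var_5 is stored as a factor, presumably a numeric version of the response.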

pmodel=drglm::drglm(Var_5~ Var_1+ Var_2+ Var_3+ Var_4+ Var_6, 
                    data=dataset, family="binomial", 
                    fitfunction="speedglm", k=10)

#Output
print(pmodel)
##                  Estimate Odds.Ratio standard.error      t.value    Pr...z..
## (Intercept)  2.544047e+00 12.7310943   0.0562502046 45.227344328 0.000000000
## Var_1       -3.472601e-06  0.9999965   0.0004138377 -0.008391215 0.993304858
## Var_2        3.258381e-03  1.0032637   0.0019538879  1.667639724 0.095387268
## Var_31      -1.273949e-02  0.9873413   0.0082797401 -1.538634126 0.123893642
## Var_41      -3.959107e-03  0.9960487   0.0101398385 -0.390450669 0.696203326
## Var_42      -2.863191e-03  0.9971409   0.0101476530 -0.282153069 0.777826142
## Var_6        2.539528e-03  1.0025428   0.0008261139  3.074064806 0.002111636
##                    normal.CI
## (Intercept)  [ 2.43 , 2.65 ]
## Var_1              [ 0 , 0 ]
## Var_2           [ 0 , 0.01 ]
## Var_31         [ -0.03 , 0 ]
## Var_41      [ -0.02 , 0.02 ]
## Var_42      [ -0.02 , 0.02 ]
## Var_6              [ 0 , 0 ]

Fitting Multinomial Logistic Regression Model

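Now, we shall fit a multinomial logistic regression model to the data set, taking Var_4 as the response variable and all other variables as independent ones. The iteration trace printed after the call comes from the fitting function, one block per subset (k = 10).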

mmodel=drglm::drglm(Var_4~ Var_1+ Var_2+ Var_3+ Var_5+ Var_6, 
              data=dataset,family="multinomial",
              fitfunction="multinom", k=10)
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## final  value 109861.228162 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109842.503510
## iter  20 value 109840.273128
## final  value 109838.002508 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109850.296686
## iter  20 value 109846.528490
## final  value 109842.945823 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109847.393856
## iter  20 value 109841.079169
## final  value 109840.175418 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109842.805655
## iter  20 value 109840.979230
## iter  30 value 109838.911934
## final  value 109838.864166 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109841.472994
## iter  20 value 109839.598647
## final  value 109837.733262 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109851.271296
## iter  20 value 109846.660324
## iter  30 value 109839.769091
## iter  40 value 109838.903624
## iter  40 value 109838.903182
## iter  40 value 109838.903178
## final  value 109838.903178 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109840.806578
## iter  20 value 109837.263429
## final  value 109834.528438 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109850.031314
## iter  20 value 109849.169972
## final  value 109846.685488 
## converged
## # weights:  63 (40 variable)
## initial  value 109861.228867 
## iter  10 value 109848.501910
## iter  20 value 109846.077070
## final  value 109845.048526 
## converged
#Output
print(mmodel)
##                Estimate.1    Estimate.2 Odds.Ratio.1 Odds.Ratio.2
## (Intercept)  4.081904e-02  2.071676e-03    1.0416636    1.0020738
## Var_1       -9.984185e-05  1.415146e-05    0.9999002    1.0000142
## Var_2        1.402186e-03  2.012445e-04    1.0014032    1.0002013
## Var_31      -1.835696e-03 -5.230905e-05    0.9981660    0.9999477
## Var_51      -2.570995e-03  6.045345e-03    0.9974323    1.0060637
## Var_52       2.589983e-03  7.659461e-03    1.0025933    1.0076889
## Var_53      -4.951806e-03 -1.604007e-02    0.9950604    0.9840879
## Var_54       1.456459e-03  1.530690e-02    1.0014575    1.0154247
## Var_55      -2.225580e-02 -2.838295e-02    0.9779900    0.9720161
## Var_56      -1.001576e-02 -1.472764e-02    0.9900342    0.9853803
## Var_57       3.229535e-03 -1.157117e-03    1.0032348    0.9988436
## Var_58       2.181392e-05 -1.234939e-03    1.0000218    0.9987658
## Var_59      -1.823170e-02 -1.626911e-02    0.9819335    0.9838625
## Var_510     -1.050656e-02 -1.295762e-02    0.9895484    0.9871260
## Var_511     -1.114918e-02  6.444328e-03    0.9889127    1.0064651
## Var_512     -5.482693e-03  1.265131e-03    0.9945323    1.0012659
## Var_513     -1.979504e-02 -2.113650e-02    0.9803996    0.9790853
## Var_514     -3.300604e-02 -1.611510e-02    0.9675327    0.9840141
## Var_515     -8.855361e-03  3.537469e-03    0.9911837    1.0035437
## Var_6       -6.124825e-04 -1.379973e-05    0.9993877    0.9999862
##             standard.error.1 standard.error.2    Z_value.1   Z_value.2
## (Intercept)     0.0344340641     0.0344561368  1.185426192  0.06012503
## Var_1           0.0002448509     0.0002449696 -0.407765881  0.05776822
## Var_2           0.0011559414     0.0011565234  1.213025485  0.17400812
## Var_31          0.0048983192     0.0049007854 -0.374760305 -0.01067361
## Var_51          0.0138744940     0.0138774417 -0.185303723  0.43562392
## Var_52          0.0138717808     0.0138809960  0.186708754  0.55179480
## Var_53          0.0138678622     0.0139049681 -0.357070594 -1.15354944
## Var_54          0.0138888131     0.0138830238  0.104865617  1.10256256
## Var_55          0.0138490644     0.0138778135 -1.607025597 -2.04520340
## Var_56          0.0138454752     0.0138710823 -0.723395603 -1.06175109
## Var_57          0.0138747222     0.0139001413  0.232763905 -0.08324495
## Var_58          0.0138865421     0.0139071506  0.001570868 -0.08879882
## Var_59          0.0138691698     0.0138837951 -1.314548921 -1.17180582
## Var_510         0.0138664395     0.0138884550 -0.757696897 -0.93297782
## Var_511         0.0138833413     0.0138709237 -0.803061741  0.46459254
## Var_512         0.0138716161     0.0138773671 -0.395245419  0.09116504
## Var_513         0.0138717368     0.0138919857 -1.427005454 -1.52148900
## Var_514         0.0138574025     0.0138463110 -2.381834365 -1.16385533
## Var_515         0.0138783001     0.0138751272 -0.638072450  0.25495039
## Var_6           0.0004887467     0.0004889809 -1.253169669 -0.02822140
##             Pr...z...1 Pr...z...2    Lower.CI.1    Lower.CI.2    Upper.CI.1
## (Intercept) 0.23584898 0.95205606 -0.0266704840 -0.0654611110  0.1083085670
## Var_1       0.68344556 0.95393325 -0.0005797408 -0.0004659801  0.0003800571
## Var_2       0.22512008 0.86185908 -0.0008634171 -0.0020654997  0.0036677897
## Var_31      0.70783874 0.99148386 -0.0114362249 -0.0096576719  0.0077648337
## Var_51      0.85299082 0.66310961 -0.0297645039 -0.0211539403  0.0246225131
## Var_52      0.85188899 0.58108895 -0.0245982079 -0.0195467909  0.0297781737
## Var_53      0.72103896 0.24868494 -0.0321323162 -0.0432933050  0.0222287046
## Var_54      0.91648244 0.27021717 -0.0257651146 -0.0119033243  0.0286780325
## Var_55      0.10804875 0.04083481 -0.0493994683 -0.0555829659  0.0048878664
## Var_56      0.46943687 0.28834870 -0.0371523886 -0.0419144585  0.0171208768
## Var_57      0.81594474 0.93365677 -0.0239644213 -0.0284008929  0.0304234903
## Var_58      0.99874663 0.92924180 -0.0271953084 -0.0284924528  0.0272389363
## Var_59      0.18866155 0.24127502 -0.0454147755 -0.0434808504  0.0089513711
## Var_510     0.44863246 0.35083142 -0.0376842803 -0.0401784920  0.0166711639
## Var_511     0.42193905 0.64222327 -0.0383600292 -0.0207421833  0.0160616687
## Var_512     0.69266178 0.92736145 -0.0326705607 -0.0259340090  0.0217051753
## Var_513     0.15357832 0.12813717 -0.0469831484 -0.0483642950  0.0073930604
## Var_514     0.01722664 0.24448265 -0.0601660472 -0.0432533737 -0.0058460277
## Var_515     0.52342652 0.79876141 -0.0360563293 -0.0236572804  0.0183456074
## Var_6       0.21014397 0.97748557 -0.0015704084 -0.0009721847  0.0003454434
##                Upper.CI.2
## (Intercept)  0.0696044634
## Var_1        0.0004942831
## Var_2        0.0024679886
## Var_31       0.0095530538
## Var_51       0.0332446313
## Var_52       0.0348657137
## Var_53       0.0112131685
## Var_54       0.0425171290
## Var_55      -0.0011829367
## Var_56       0.0124591850
## Var_57       0.0260866597
## Var_58       0.0260225757
## Var_59       0.0109426265
## Var_510      0.0142632510
## Var_511      0.0336308387
## Var_512      0.0284642705
## Var_513      0.0060912882
## Var_514      0.0110231681
## Var_515      0.0307322187
## Var_6        0.0009445853
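
To show how these coefficients translate into category probabilities, the sketch below evaluates the fitted model for one hypothetical covariate profile. It assumes that Estimate.1 and Estimate.2 are the log-odds of Var_4 = "1" and Var_4 = "2" against the baseline Var_4 = "0" (the usual convention when the first factor level is the reference); the profile values and variable names are illustrative.

```r
# Estimates copied from the output above (columns Estimate.1, Estimate.2)
beta1 <- c(intercept = 4.081904e-02, Var_1 = -9.984185e-05,
           Var_2 = 1.402186e-03, Var_6 = -6.124825e-04)
beta2 <- c(intercept = 2.071676e-03, Var_1 = 1.415146e-05,
           Var_2 = 2.012445e-04, Var_6 = -1.379973e-05)

# Hypothetical profile: Var_3 = "0" and Var_5 = "0" (assumed reference
# levels), so their dummy variables are zero and drop out
x <- c(1, 50, 8, 60)    # intercept, Var_1, Var_2, Var_6

eta <- c(0, sum(beta1 * x), sum(beta2 * x))   # baseline category has eta = 0
probs <- exp(eta) / sum(exp(eta))             # softmax over the three categories
names(probs) <- c("Var_4=0", "Var_4=1", "Var_4=2")
print(round(probs, 4))
```

Since Var_4 was simulated independently of the other variables, all three probabilities come out close to one third.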

Note that the function ‘drglm’ is designed for fitting GLMs to data sets that fit into memory. To fit a data set that is larger than memory, the function ‘big.drglm’ can be used. Users are encouraged to check the respective vignette.