Regression Models Using R

Useful Links

Encyclopedia of Math
Wikipedia
Simple Linear Regression
R Functions for Regression Analysis
Practical Regression and Anova using R
F values and degrees of freedom explained (video)
leaps (for covariate subset selection based on AIC)
Multivariate Analysis Tutorial
Another Multivariate Regression Tutorial
Using R Markdown with RStudio
Multiple (Linear) Regression


Linear Regression

Linear regression basic assumptions

  • Variance is constant (homoscedasticity).
  • There are no large, influential "real" outliers.
  • Linearity can be observed in exploratory plots.
  • All the relevant terms are included in the model.
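
These assumptions can be checked with R's built-in diagnostic plots. A minimal sketch, assuming the built-in airquality data set (the model choice is purely illustrative):

data(airquality)
fit <- lm(Ozone ~ Wind, data=airquality)  ## simple linear model
par(mfrow=c(2,2))
plot(fit)  ## residuals vs. fitted (linearity), Q-Q (normality), scale-location (constant variance), leverage (outliers)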

Linear regression with factors
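
A minimal sketch, assuming the built-in InsectSprays data: when a covariate is a factor, lm() automatically creates a dummy variable for each level, with the first level as the reference.

data(InsectSprays)
fitFac <- lm(count ~ spray, data=InsectSprays)  ## spray is a factor; level A is the reference
summary(fitFac)$coef  ## each coefficient is the difference in mean count relative to spray A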

Stepwise linear regression

AIC

BIC
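
A sketch of stepwise selection using step() from base R, which penalizes with AIC by default; setting k = log(n) gives the BIC penalty instead (the swiss data set here is just for illustration):

data(swiss)
full <- lm(Fertility ~ ., data=swiss)  ## model with all covariates
stepAIC <- step(full, direction="both")  ## stepwise search with the AIC penalty (k = 2)
stepBIC <- step(full, direction="both", k=log(nrow(swiss)))  ## same search with the BIC penalty (k = log(n))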


Logistic Regression

Odds

Log odds
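
A minimal sketch of logistic regression with glm(), assuming the built-in mtcars data: the fitted coefficients are on the log-odds scale, so exponentiating them gives odds and odds ratios.

data(mtcars)
logfit <- glm(am ~ wt, data=mtcars, family=binomial)  ## am is a 0/1 outcome
summary(logfit)$coef  ## intercept is a log odds; slope is a log odds ratio
exp(coef(logfit))  ## exponentiate to get the odds scale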


Count Regression
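
A minimal sketch of count regression using a Poisson GLM on simulated data (illustrative only); with the default log link, coefficients act multiplicatively on the expected count.

set.seed(1)
x <- rnorm(100)
y <- rpois(100, lambda=exp(1 + 0.5*x))  ## simulated counts
countfit <- glm(y ~ x, family=poisson)  ## Poisson regression with the default log link
exp(coef(countfit))  ## multiplicative effect of x on the expected count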


Decision Trees
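
A minimal sketch of a classification tree, assuming the rpart package (other tree packages would work similarly); iris is used here since it reappears in the random forest section below.

library(rpart)
tree1 <- rpart(Species ~ Petal.Width + Petal.Length, data=iris)  ## grow a classification tree
plot(tree1); text(tree1)  ## draw the tree and label the splits
predict(tree1, iris[1:5,], type="class")  ## predicted classes for the first five flowers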


Smoothing

First download the data from http://daniandphy.wikidot.com/local--files/wiki:regression-models/cd4.data

download.file("http://daniandphy.wikidot.com/local--files/wiki:regression-models/cd4.data",
              destfile="./cd4.data",method="curl")
cd4Data <- read.table("./cd4.data", 
                      col.names=c("time", "cd4", "age", "packs", "drugs", "sex",
                                  "cesd", "id"))
cd4Data <- cd4Data[order(cd4Data$time),]

Now we can smooth the data.

Filter

filtTime <- as.vector(filter(cd4Data$time,filter=rep(1,200))/200)  ## moving average over a 200-point window
filtCd4 <- as.vector(filter(cd4Data$cd4,filter=rep(1,200))/200)  ## the same window applied to the cd4 counts
plot(cd4Data$time,cd4Data$cd4,pch=19,cex=0.1); lines(filtTime,filtCd4,col="blue",lwd=3)

Local Regression (LOESS)

Local Regression

plot(cd4Data$time,cd4Data$cd4,pch=19,cex=0.1,ylim=c(500,1500))
lines(cd4Data$time,loess(cd4 ~ time,data=cd4Data,span=0.1)$fitted,col="blue",lwd=3)
lines(cd4Data$time,loess(cd4 ~ time,data=cd4Data,span=0.25)$fitted,col="red",lwd=3)
lines(cd4Data$time,loess(cd4 ~ time,data=cd4Data,span=0.76)$fitted,col="green",lwd=3)

The span argument indicates what fraction of the data is used for each local fit; a larger span gives a smoother curve.

Splines

library(splines)
ns1 <- ns(cd4Data$time,df=3) ## df is the degrees of freedom, i.e., the number of basis functions applied to the covariate
par(mfrow=c(1,3))
plot(cd4Data$time,ns1[,1]); plot(cd4Data$time,ns1[,2]); plot(cd4Data$time,ns1[,3]) ### plot each basis function (degree of freedom) against time
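
A sketch of the natural next step (continuing the example above): use the spline basis as covariates in lm() and overlay the fitted curve.

lm1 <- lm(cd4Data$cd4 ~ ns1)  ## regress cd4 on the three basis functions
par(mfrow=c(1,1))
plot(cd4Data$time, cd4Data$cd4, pch=19, cex=0.1)
points(cd4Data$time, lm1$fitted, col="blue", pch=19, cex=0.5)  ## fitted spline curve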

Bootstrapping

The bootstrap is a useful tool for estimating standard errors, BUT it does not reduce the bias (AKA systematic error).
Good Introduction to Bootstrap

Single variable bootstrapping

library(simpleboot)
data(airquality)
set.seed(33833)
quantile_75 <- function(x) {return(quantile(x,0.75))}  ## statistic of interest: the 75th percentile
boot_wind_quantile.75 <- one.boot(airquality$Wind,quantile_75,R=1000)  ## 1000 bootstrap resamples
STD_error <- sd(boot_wind_quantile.75$t)  ## bootstrap estimate of the standard error

Linear models
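
A minimal sketch using lm.boot() from simpleboot, which refits the model on bootstrap resamples of the rows (continuing with airquality from above):

lmFit <- lm(Ozone ~ Wind, data=airquality)  ## fit once on the observed data
lmBoot <- lm.boot(lmFit, R=1000)  ## refit on 1000 bootstrap resamples
summary(lmBoot)  ## bootstrap standard errors for the coefficients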

Bagging trees

Basic idea:
1. Resample the data with replacement
2. Refit the tree on each resample
3. Average the predictions (regression) or take the majority vote (classification); see the sketch below
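
A sketch of bagged classification trees, assuming the ipred package's bagging() function (nbagg sets the number of bootstrap trees):

library(ipred)
bagTree <- bagging(Species ~ Petal.Width + Petal.Length, data=iris, nbagg=50)  ## 50 bootstrap trees
predict(bagTree, iris[1:5,])  ## majority vote over the trees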

Random Forest

Random Forest by Leo Breiman and Adele Cutler
Basic idea:
1. Bootstrap samples
2. At each split, bootstrap the variables as well (this is the extra step compared to the bagging procedure)
3. Grow multiple trees and vote

library(randomForest)
forestIris <- randomForest(Species~ Petal.Width + Petal.Length,data=iris,prox=TRUE)
forestIris  ### shows the out-of-bag error, the confusion (misclassification) matrix, etc.
getTree(forestIris,k=4)  ### look at the 4th tree in the random forest
iris.p <- classCenter(iris[,c(3,4)], iris$Species, forestIris$prox)  ### calculate the class centers (prototypes)
plot(iris[,3], iris[,4], pch=21, xlab=names(iris)[3], ylab=names(iris)[4],
bg=c("red", "blue", "yellow")[as.numeric(factor(iris$Species))],main="Iris Data with Prototypes")  ### plot the data
points(iris.p[,1], iris.p[,2], pch=21, cex=2, bg=c("red", "blue", "yellow"))  ### plot the class centers

Combining random forests

Since random forests can be computationally demanding, this option can speed things up: grow smaller forests on separate machines (or cores) in parallel and combine them.

forestIris1 <- randomForest(Species~Petal.Width + Petal.Length,data=iris,prox=TRUE,ntree=50) ###calculation on computer 1
forestIris2 <- randomForest(Species~Petal.Width + Petal.Length,data=iris,prox=TRUE,ntree=50) ###calculation on computer 2
forestIris3 <- randomForest(Species~Petal.Width + Petal.Length,data=iris,prox=TRUE,ntree=50) ###calculation on computer 3
combine(forestIris1,forestIris2,forestIris3) ## combining results

Predicting new values

### Generate new data from Gaussian distributions matched to each covariate
### (use = rather than <- inside data.frame so the columns get the right names for predict())
newdata <- data.frame(Sepal.Length = rnorm(1000,mean(iris$Sepal.Length),sd(iris$Sepal.Length)),
 Sepal.Width = rnorm(1000,mean(iris$Sepal.Width), sd(iris$Sepal.Width)),
 Petal.Width = rnorm(1000,mean(iris$Petal.Width),sd(iris$Petal.Width)),
 Petal.Length = rnorm(1000,mean(iris$Petal.Length),sd(iris$Petal.Length)))
pred <- predict(forestIris,newdata)
### plot results
plot(newdata[,4], newdata[,3], pch=21, xlab="Petal.Length",ylab="Petal.Width",
bg=c("red", "blue", "yellow")[as.numeric(pred)],main="newdata Predictions")

Selecting a Regression Model
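
A minimal sketch of comparing candidate models: anova() gives an F test between nested lm fits, while AIC()/BIC() can also compare non-nested ones (the variable choices here are illustrative).

fit1 <- lm(Ozone ~ Wind, data=airquality)
fit2 <- lm(Ozone ~ Wind + Temp, data=airquality)
anova(fit1, fit2)  ## F test: does adding Temp significantly improve the fit?
AIC(fit1, fit2); BIC(fit1, fit2)  ## information criteria: smaller is better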

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License