R

Useful links

R enforces the lexical scoping rules (be careful!!)
FITTING DISTRIBUTIONS with R
GRAPHICAL PARAMETERS
LECTURES
MANUALS and GUIDES
SPECIAL SYMBOLS in R
SUPERIMPOSE-HISTOGRAM
One Page R: A Survival Guide to Data Science with R
USING COLORs in R
Kickstarting R
Data Types in R
Data Types conversion in R
Animation in R
Multiple (Linear) Regression
How to upgrade R in ubuntu?


Control Structures

if, else: testing a condition 
for: execute a loop a fixed number of times 
while: execute a loop while a condition is true 
repeat: execute an infinite loop break: 
break the execution of a loop 
next: skip an iteration of a loop 
return: exit a function

Functions

function_name <- function(<arguments>) { 
## Commands
}

R is lazy, i.e., arguments are evaluated only as needed:

f <- function(x, y) {
x^3
} 
>f(2)
8

Using R in terminal

##in linux/mac terminal
R CMD BATCH <R script>   # running R script in terminal

Installing packages

Install from CRAN directly

> install.packages("mypkg", lib="/my/own/R-packages/")
#loading package
#temporarily inside the R:
> library("mypkg", lib.loc="/my/own/R-packages/")
# or permanently by adding this line to your  ~/.bashrc file:
# export R_LIBS=$R_LIBS:"~/usr_libs/R"

or through the terminal
$ R CMD INSTALL mypkg -l /my/own/R-packages/

or

mkdir $HOME/.Renviron
mkdir ~/R ~/R/library
###open $HOME/.Renviron
###type this in the .Renviron : R_LIBS_USER="~/R/library"
R
> .libPaths()

or

dev_mode(TRUE, path = "anywhere-you-want-to-install")
install.packages("anything-that-you-want-to-install")


SIMULATION

generating random number

Probability distribution functions usually have four functions associated with them. The functions are prefixed with a
d for density
r for random number generation
p for cumulative distribution
q for quantile function
set.seed(3) #setting the seed for pseudo random numbers
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
If Φ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = Φ(q) and qnorm(p) = Φ−1(p).

> x <- rnorm(8, 50, 2)
>x
[1] 52.00365 51.23664 52.12788 46.29200 48.81412 45.10396 52.31309 52.00741
> summary(x)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  45.10   48.18   51.62   49.99   52.04   52.31

functions:
  • runif for generating flat distribution.
  • rnorm etc. for normal distribution.
  • rpois etc. for Poisson distribution.
  • rbinom etc. for binary distribution.
  • sample for generating sample from a given vector with/without replacement.


FITTING

Fitting nonlinear function on a histogram

Examle:
fitting Maxwell-Boltzmann distribution on kinetic energy histogram in order to find kBT value.

> library(stats)
> K_ene<- read.table("energy.dat")
> K_ene_hist<-hist(K_ene$V1,breaks=100)
> 
> fit_data=data.frame(K_ene_hist$mids,K_ene_hist$density)
> maxw_dist <- function(x,kbt){ 2/sqrt(pi * kbt^3) * sqrt (x) *exp (-x/kbt)}
> fit <- nls( K_ene_hist.density ~  maxw_dist(K_ene_hist.mids,kbt),start=list(kbt=0.5),data=fit_data)
> summary(fit)

Formula: K_ene_hist.density ~ maxw_dist(K_ene_hist.mids, kbt)

Parameters:
    Estimate Std. Error t value Pr(>|t|)    
kbt 0.580014   0.004514   128.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.02237 on 111 degrees of freedom

Number of iterations to convergence: 3 
Achieved convergence tolerance: 4.243e-06 
>  hist(K_ene$V1,breaks=100,prob=T,xlab="K (kcal/mol)",main = "Histogram of Kinetic Energy")
>  lines(fit_data$K_ene_hist.mids,predict(fit),col="red")
Maxwell_fit.png


PLOTTING

  • graphics: contains plotting functions for the “base” graphing systems, including plot, hist, boxplot and many others.
  • lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
  • grid: implements a di↵erent graphing system independent of the “base” system; the lattice package builds on top of grid; we seldom call functions from the grid package directly.
  • grDevices: contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.

Base graphics parameters

The par function is used to specify global graphics parameters that affect all plots in an R session. These parameters can often be overridden as arguments to specific plotting functions.

  • pch: the plotting symbol (default is open circle i.e., pch=1)
  • lty: the line type (default is solid line), can be dashed, dotted, etc.
  • lwd: the line width, specified as an integer multiple
  • col: the plotting color, specified as a number, string, or hex code; the colors function gives you a vector of colors by name
  • las: the orientation of the axis labels on the plot
  • bg: the background color
  • mar: the margin size
  • oma: the outer margin size (default is 0 for all sides)
  • mfrow: number of plots per row, column (plots are filled row-wise)
  • mfcol: number of plots per row, column (plots are filled column-wise)
  • cex: change the font size in the graph (e.g., cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
> par("bg") 
[1] "transparent"
>par("lty")
[1] "solid"

Useful graphics devices

  • pdf: useful for line-type graphics, vector format, resizes well, usually portable
  • postscript: older format, also vector format and resizes well, usually portable, can be used to create encapsulated postscript files, Windows systems often don’t have a postscript viewer
  • xfig: good of you use Unix and want to edit a plot by hand
  • png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
  • jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
  • bitmap: needed to create bitmap files (png, jpeg) in certain situations (uses Ghostscript), also can be used to create a variety of other bitmapped formats not mentioned
  • bmp: a native Windows bitmapped format

Copying plots to another device

Copying a plot to another device can be useful because some plots require a lot of code and it can be a pain to type all that in again for a different device.

  • dev.new: open a new graphical device
  • dev.copy: copy a plot from one device to another
  • dev.copy2pdf: copy a plot to a Portable Document Format (PDF) file
  • dev.list: show the list of open graphics devices
  • dev.next: switch control to the next graphics device on the device list
  • dev.set: set control to a specific graphics device
  • dev.off: close the current graphics device

NOTE: Copying a plot is not an exact operation!


Lattice plotting functions

  • Lattice functions behave differently from base graphics functions in one critical way.
  • Base graphics functions plot data directly the graphics device
  • Lattice graphics functions return an object of class trellis.
  • The print methods for lattice functions actually do the work of plotting the data on the graphics device.
  • Lattice functions return “plot objects” that can, in principle, be stored (but it’s usually better to just save the code + data).
  • On the command line, trellis objects are auto-printed so that it appears the function is plotting the data.

useful plotting functions in lattice mode:

  • xyplot: this is the main function for creating scatterplots
  • bwplot: box-and-whiskers plots (“boxplots”)
  • histogram: histograms
  • stripplot: like a boxplot but with actual points
  • dotplot: plot dots on “violin strings”
  • splom: scatterplot matrix; like pairs in base graphics system
  • levelplot, contourplot: for plotting “image” data


USEFUL COMMANDS

The “…” Argument

it indicates a variable number of arguments that passed to the other function:
applications:

  • is used when extending another function and you don't want to copy the entire list of original function:
myplot <- function(x, y, type = "l", ...) { 
plot(x, y, type = type, ...)
}
  • is used in Generic function application (more to come later …)
  • is used when the number of passed arguments is not priori known.
> args(paste) 
function (..., sep = " ", collapse = NULL)

***note that argument after "…" in the argument list must be passed explicitly (No partial match), i.e.,
> paste("x", "y", sep = "::") 
[1] "x::y"
> paste("x", "y", se = "::")
[1] "a b ::"

looping functions (apply familiy)

  • lapply: Loop over a list and evaluate the function on each element of the list.
> str(lapply)
function (X, FUN, ...)

it take data( usually a list), a FUNCTION and other arguments of FUNCTION using "…".
It always return a list.

> x <- list(a = 1:8, b = rnorm(6))
> lapply(x,mean)
$a
[1] 4.5

$b
[1] 0.3712558
  • sapply: the same as lapply but it tries to simplify the output as if possible, for example: if the result is a list that each element has the length of 1, then the output will be a vector, e.g.,
> x <- list(a = 1:8, b = rnorm(6))
> sapply(x,mean)
   a              b
4.5 0.3712558

#OR
> x<-list(rnorm(100), runif(100),rpois(100,1))
> sapply(x,quantile)
             [,1]        [,2] [,3]
0%   -1.989606246 0.007519909    0
25%  -0.468642621 0.245061830    0
50%  -0.001321715 0.536464307    1
75%   0.861023510 0.748914418    1
100%  2.684075873 0.990305343    5
  • apply: evaluates a FUNCTION over the margins of an array.
>str(apply)
function (X, MARGIN, FUN, ...)
 x <- matrix(rnorm(50), 5, 10)
> apply(x,2,mean)
 [1] -0.18721244  0.55309260  0.01923375  0.90735669 -0.59145445 -0.41153130
 [7]  0.18867043  0.37575356  0.07717792  0.76396592
>apply(x, 1, sum)
[1] -0.06645109  2.85445136  4.65403831 -0.24152826  1.27475306

Here is some shortcut functions and their "apply" command equivalents, remember shortcuts are much faster:
rowSums = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums = apply(x, 2, sum)
colMeans = apply(x, 2, mean)
> a <- array(rnorm(3 * 4 * 5), c(3, 4, 5))
> apply(a, c(1, 2), mean)
            [,1]       [,2]        [,3]       [,4]
[1,]  0.03696901  0.2252630  0.97482887 -0.6309632
[2,]  0.45403204 -0.1001198 -0.01561331  0.4672563
[3,] -0.12243215 -0.7474585 -0.17272954  0.1722274
> rowMeans(a, dims = 2)
            [,1]       [,2]        [,3]       [,4]
[1,]  0.03696901  0.2252630  0.97482887 -0.6309632
[2,]  0.45403204 -0.1001198 -0.01561331  0.4672563
[3,] -0.12243215 -0.7474585 -0.17272954  0.1722274
  • tapply: applies a FUNCTION over a subset of a vector.
> str(tapply) 
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)

X is a vector INDEX is a factor or a list of factors (or else they are coerced to factors), FUN is a function to be applied, "…" contains other arguments to be passed FUN and simplify tries to simplify the result?

> library(datasets)
> data(mtcars)
>tapply(mtcars$mpg, mtcars$cyl, mean)
       4        6        8 
26.66364 19.74286 15.10000
  • mapply: is a multivariate apply
> str(mapply) function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,
USE.NAMES = TRUE)

FUN is a function to apply, "…" contains arguments to apply over, MoreArgs is a list of other arguments to FUN and SIMPLIFY indicates whether the result should be simplified.
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
  • split:split takes a vector or other objects and splits it into groups determined by a factor or list of factors.
> str(split)
function (x, f, drop = FALSE, ...)

x is a vector (or list) or data frame, f is a factor (or coerced to one) or a list of factors, drop indicates whether empty factors levels should be dropped.

> library(datasets)
> s <- split(airquality, airquality$Month)
> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000

args

shows the arguments of functions

> args(sd)
function (x, na.rm = FALSE)

debugging tools in R

  • traceback: prints out the function call stack after an error occurs; does nothing if there’s no error
  • debug: flags a function for “debug” mode which allows you to step through execution of a function one line at a time
  • browser: suspends the execution of a function wherever it is called and puts the function in debug mode
  • trace: allows you to insert debugging code into a function a specific places recover: allows you to modify the error behavior so that you can browse the function call stack

get or set working directory

getwd()
setwd(dir)

Histogram/Weighted Histogram

"hist" description

hist(x, breaks = "Sturges",
     freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE,
     density = NULL, angle = 45, col = NULL, border = NULL,
     main = paste("Histogram of" , xname),
     xlim = range(breaks), ylim = NULL,
     xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE,
     nclass = NULL, warn.unused = TRUE, ...)
hist(test$V2)
library(plotrix)
weighted.hist(test$V2,exp(-1/0.6*(test$V3-test$V4))/sum(exp(-1/0.6*(test$V3-test$V4))),density=F)

plotting Density/Weighted Density

plot(density(test_T$V2),col='blue')
plot(density(test$V2,weights=exp(-1/0.6*(test$V3-test$V4))/sum(exp(-1/0.6*(test$V3-test$V4)))),col='red')
###note: weights should be normalized!!

scan

Read data into a vector or list from the console or file.
more information on scan

f_names<-scan(what=" ") #From prompt read a list of string into a f_names
data_list<-scan(file = "data_file", what = double()) #read from data_file into the data_list which double precision

trunc

return the integer part of float number

>trunc(5.7)
[1] 5


HANDY SCRIPTS in R

Converting a String to a Variable Name and Vice-versa

Recently, I had a professor ask me how to take a string and convert it to an R variable name on-the-fly. One possible way is:

x <- 42 eval(parse(text = "x")) [1] 42

Now, suppose we want to go the other way. The trick is just as simple:

x <- 42 deparse(substitute(x)) [1] "x"

converting to matrix

c1<-1:5
c2<-6:10
c_all<-cbind(c1,c2)
c_all_matrix<-as.matrix(c_all)
c_all

        c1   c2
[1,]    1     6
[2,]    2     7
[3,]    3     8
[4,]    4     9
[5,]    5    10

multi-density plot +legend

ene <- read.table("PE_exPE_oPE_oexPE.out")
#names(ene)
postscript(file="PE_exPE_oPE_oexPE_2.eps",horizontal=TRUE,paper="special", width=11, height=8)
#cyl.f <- factor(ene, levels= c(ene$V1,ene$V2,ene$V3,ene$V4),labels = c("4 cylinder", "6 cylinder", "8 cylinder","8 cylinder")) 

plot(density(ene$V1),xlim=range(c(ene$V1,ene$V2,ene$V3,ene$V4)),main="",xlab="ptential energy",col=1)
lines(density(ene$V2),col=2)
lines(density(ene$V3),col=3)
lines(density(ene$V4),col=4)
legend(-1900,0.01,c("H(","expe","ope","oexpe"),col=c(1,2,3,4),lty=c(1,1,1,1),text.col="black")

#plot(pe,expe,ope,oexpe)

 box()

 dev.off()

Reading multiple files in R

#x<-1:num_files

#f_names<-paste(format,x,sep="")# if the file names have any particular format 
line_num_list<- numeric(0)
all_data<-data.frame()
f_names <- scan(file="files_list.out",what="")

for (i in f_names){

x<-read.table(i)
f_line_num<-length(x)
line_num_list<-c(line_num_list,f_line_num)
all_data<-append(all_data,x)
}
min_line_num<-min(line_num_list) #find the lowest number of lines in the input files
all_data<-as.data.frame(sapply(all_data,"[",1:min_line_num))

writing histogram data in a file

y<-hist(x,breaks=100,…)
output<-cbind(y$mids,y$counts) # or output<-cbind(y$mids,y$density)
write.table(output,"datafile",row.names=F,col.names=F)

plotting multiple histogram in one plot

the fourth number in "rgb(1,0,0,1/4)" determines the transparency of color

pdf(file="RMSD_pocket_hist.pdf")
apo<-read.table('apo/run/trajrmsd_pocket_apo.dat',skip=1)
apo_ions<-read.table('apo_ions/run/trajrmsd_pocket_apo_ions.dat',skip=1)
apo_chol<-read.table('apo_cholesterol/run/trajrmsd_pocket_apo_chol.dat',skip=1)

apo_hist<-hist(apo$V2,col=rgb(0,0,1,1/4),xlim=c(0.5,2.7),breaks=20,probability=T,main='',xlab='RMSD')
apo_ions_hist<-hist(apo_ions$V2,add=T,col=rgb(1,0,1,1/4),breaks=20,probability=T)
apo_chol_hist<-hist(apo_chol$V2,add=T,col=rgb(1,0,0,1/4),breaks=20,probability=T)
legend(1.8,3.3,c("apo","apo_ions","apo_cholesterol"),col=c(rgb(0,0,1,1/4),rgb(1,0,1,1/4),rgb(1,0,0,1/4)),lty=c(1,1,1),lwd=c(5,5,5),text.col="black")
box()
dev.off()
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License