# Lesson 4: Simulation Using R

Simulation is one of the most common applications of programming. In statistics, simulation is almost necessary if the results depend on a model that may not be totally correct or an asymptotic conclusion when small sample properties are unknown. In 2006 September issue of JSAS, there are 25 theory and methods articles, 16 of them used simulation.

In R Functions & datasets, there are many functions to generate random deviates. Most of them start with r. So it is easy to find them. Here are some of them

• rbeta (for the beta random variable)
• rbinom (for the binomial random variable)
• rexp (for the exponential random variable)
• rf (for the F random variable)
• rgamma (for the gamma random variable)
• rgeom (for the geometric random variable)
• rhyper (for the hypergeometric random variable)
• rlnorm (for the lognormal random variable)
• rlogis (for the logistic random variable)
• rmvbin (for the multivariate binary random variable)
• rnbinom (for the negative binomial random variable)
• rmvnorm (for the multivariate normal random variable, nonexisting?)
• rnorm (for the normal random variable)
• rpois (for the Poisson random variable)
• rweibull (for the weibull random variable)
• runif (for the uniform random variable)

One very important step in simulation is set the initial seed for your simulation. It is the function rngseed(x), where x should be a large integer. The set.seed() function works the same way.

Here are the two most important random deviates.

1. The uniform distribution (random numbers): runif(n, min=, max=), the default value for min is 0 and max is 1, i.e., runif(n) will generate n random numbers.
2. The normal distribution (random numbers): rnorm(n,mean=,sd=) with default mean=0 and sd=1.

### Simple simulations

Let us starts with 4 different distributions

```postscript("Four_distr.ps")
par(mfrow=c(2,2))
set.seed(12345)
x<-runif(1000)
hist(x)
y<-rnorm(1000)
hist(y)
z<-rgamma(1000,shape=4,scale=2.5) # f(x) = C x^(4-1)exp(-x/2.5)
hist(z)
w<-rbeta(1000,5,2)  # f(x) = C x^(5-1)*(1-x)^(2-1)
hist(w)
dev.off()
<\PRE>

Four_distr.ps

Here are first order autoregressive time series with different correlation
coefficient. (The model is given in the comments of the program.)

# Different shapes of a autoregressive process
#                                             Saved as autoreg3.s

# Various forms of  (1-0.8B)(1+0.3B)Y(t)=Z(t) #1-#3.
# 1 Y(t)=0.5Y(t-1)+Z(t)
# 2 Y(t)=0.5Y(t-1)+0.24Y(t-1)+Z(t)
# 3 Y(t)=0.8Y(t-1)+Z(t)
# 4 Y(t)=-0.8Y(t-1)+Z(t)

postscript("timeseries.ps")
par(mfrow=c(2,2))          # Four figures in the output

t<-c(1:200);               # Generate t from 1 to 200
set.seed(12345);           # Set seed from random number generatots

par(mfrow=c(2,2))

try.ar1<-function(n,a1,a2)
{
y<-c(1:n); y[1]<-0; y[2]<-0
for(j in c(3:n)) y[j]<-a1*y[j-1]+a2*y[j-2]+rnorm(1)
y}
n<-200
a1<- 0.5
a2<- 0.0
y1<-try.ar1(n,a1,a2)
ts.plot(y1,main="AR(1), 0.5")           # Plot the time series y1

a1<- 0.5
a2<- 0.24
y2<-try.ar1(n,a1,a2)
ts.plot(y2,main="AR(2), 0.5, 0.24")

a1<- 0.9
a2<- 0.0
y3<-try.ar1(n,a1,a2)
ts.plot(y3,main="AR(1), 0.9")

a1<- -0.8
a2<- 0.0
y4<-try.ar1(n,a1,a2)
ts.plot(y4,main="AR(1), -0.8")
dev.off()

timeseries.ps

Suppose we are interested in how robust is the least squares estimate
b=(X'X)^(-1)X'Y when the error terms are not normal. The two other error
terms are uniform (short tail distribution) and exponential (long tail
distribution). Both of them are adjusted so that the errors have mean 0
and variance 1.

postscript("robustness.ps")
par(mfrow=c(2,2))
set.seed(123456)
n=20
x<-runif(n)*5
y1<-c(1:n)
e1<-rnorm(n)  # normal residuals
for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i]
result1<-lm(y1~x)
summary(result1)
plot(x,y1)
abline(lsfit(x,y1),lty=2)

c=sqrt(12.0)
e2<-(runif(n)-0.5)*c   # uniform residuals
y2<-c(1:n)
for(i in c(1:n)) y2[i]=1+0.5*x[i]+e2[i]
result2<-lm(y2~x)
summary(result2)
plot(x,y2)
abline(lsfit(x,y2),lty=2)

e3<-rexp(n)-1   # exponential residuals
y3<-c(1:n)
for(i in c(1:n)) y3[i]=1+0.5*x[i]+e3[i]
result3<-lm(y3~x)
summary(result3)
plot(x,y3)
abline(lsfit(x,y3),lty=2)
dev.off()

robust1.ps

The outputs from the summary all show significance of the slope and
intercept. The p-values for the slopes are 0.015, 0.0002, and 0.015 for
normal, uniform and exponential errors respectively.
From the range of the residuals, we can see that y3 has much longer tail
than the other two. This also shows the merit of mixture of graphics and
data analysis. The when the seed is changed to set.seed(1234567), we have
the following graphs. The p-values are 0.00004, 0.0001 and 0.0005 for
normal, uniform and exponential errors respectively. It is surprising it
have such a large variation.

robust2.ps

A typical output from summary() is

>summary(result1)

Residuals:
Min       1Q   Median       3Q      Max
-1.18925 -0.55789  0.07722  0.54001  1.80446

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.7557     0.3935   4.461 0.000302 ***
x             0.3194     0.1184   2.699 0.014688 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7897 on 18 degrees of freedom
Multiple R-Squared: 0.2881,	Adjusted R-squared: 0.2485
F-statistic: 7.284 on 1 and 18 DF,  p-value: 0.01469

Repeated simulations

The previous example tells us that we need to do more than one experiment
to check the accuracy of an estimate.
In other words, we need to check the distribution of the estimates.
To do this, we need to know how to get the R lm() output and summary them.
The manual usually provided the information. For the lm example,
see the statement  slope[j]=fit\$coef[2].

robust<-function(ns,n,error,a,c) {
slope<-c(1:ns)
x<-runif(n)*10
y1<-c(1:n)
for (j in c(1:ns))    {
e1<-(error(n)-a)*c  # normalizing residuals
for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i]
fit<-lm(y1~x)
slope[j]=fit\$coef[2]  }
mean<-mean(slope)
std<-var(slope)^0.5
list(mean,std)
}
set.seed(12345)
test1<-robust(1000,10,rnorm,a=0,c=1)  # Note the error
# function in the function statement
test2<-robust(1000,10,runif,a=0.5,c=3.464)  # 3.464 = sqrt(12)
test3<-robust(1000,10,rexp,a=1,c=1)

The results shows

mean   std
normal     0.515  0.125
uniform    0.490  0.158
exp        0.502  0.110

<\PRE>

Apparently, the regression estimate is not sensitive to the error
distribution.  It would be better if we can check the relation between
the estimated slope and its standard error. Hopefully, the worse the estimate,
the larger the standard error. In other words, the confidence interval coverage is
robust. Unfortunately, there is no
way to recover the standrad error from the R output (see summary(result1).