1+2[1] 3
\[ \renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \]
Links
R is a programming language widely used for statistical computing, (big) data analysis, machine learning, and other areas aimed to retrieve, clean, analyze, visualize, and present data.
It is a standard (often, default) tool for statisticians, actuaries, biologists, and many others. R is free software, you can download it from www.r-project.org and install on your computer/laptop (Windows/Mac OSX/Linux).
RStudio is an Integrated Development Environment (IDE) for R that includes various tools for effective and comfortable work with R.
You can download the free version of RStudio Desktop from posit.co
You can also use a cloud version of RStudio at posit.cloud.
Arithmetic operations
1+2[1] 3
or
2*3[1] 6
2/3[1] 0.6666667
2**3[1] 8
and BODMAS works
5*2**3[1] 40
Note that a**b and a^b mean the same.
4**2 == 2**4[1] TRUE
or
24 ** 23 > 23 ** 24[1] FALSE
Remark. Remember that TRUE is just 1 and FALSE is just 0.
Basic operations with vectors
c(...):c(2,7,4,1,3)[1] 2 7 4 1 3
<- instead of = for the assigning (though the latter will not lead to a mistake in the most of cases). The RStudio shortcut for this is Alt -.a <- c(2,7,4,1,3)2*a[1] 4 14 8 2 6
b <- c(5,10,20,15,10)then
a*b[1] 10 70 80 15 30
a**2[1] 4 49 16 1 9
c(1,2) < c(2,3)[1] TRUE TRUE
and
c(1,2) < c(1,3)[1] FALSE TRUE
Moreover, one can compare vector with a number, in this case, we compare each vector’s component with that number:
a**2 > 10[1] FALSE TRUE TRUE FALSE FALSE
Patterns
seq(7,94) [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
[26] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[51] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
[76] 82 83 84 85 86 87 88 89 90 91 92 93 94
seq:seq(7,94,2) [1] 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55
[26] 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93
or
seq(1,10,0.2) [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
[16] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8
[31] 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
[46] 10.0
Strictly speaking, all arguments of the function seq (and any other function in R) have special names:
seq(from=1, to=10, by=0.2) [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
[16] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8
[31] 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
[46] 10.0
However, if we keep the order of arguments, we may omit their names.
On the other hand, we may change the order if needed, but then we need to specify the names:
seq(to=10, from=1, by=0.2) [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
[16] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8
[31] 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
[46] 10.0
whereas seq(10,1,0.2) would lead to an error (as R understands the latter as seq(from=10, to=1, by=0.2)).
Also this function has other arguments. You may read more about this (and any other function) by typing ?seq in RStudio console.
by is equal to \(\pm1\), there is a shorter form:1:10 [1] 1 2 3 4 5 6 7 8 9 10
or
10:1 [1] 10 9 8 7 6 5 4 3 2 1
rep function:rep(1,10) [1] 1 1 1 1 1 1 1 1 1 1
We can also repeat a vector:
rep(1:3,5) [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
or
rep(seq(2,3,0.5), 4) [1] 2.0 2.5 3.0 2.0 2.5 3.0 2.0 2.5 3.0 2.0 2.5 3.0
each:rep(1:3, each=5) [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
or
rep(seq(2,3,0.5), each=4) [1] 2.0 2.0 2.0 2.0 2.5 2.5 2.5 2.5 3.0 3.0 3.0 3.0
Summation
sum:a <- seq(1,10)
b <- sum(a)
b[1] 55
i.e. \(a=(a_1,\ldots,a_{10})=(1,2,\ldots,10)\) and
\[ b = \sum_{i=1}^{10} a_i \]
a <- seq(1,10)
c <- cumsum(a)
c [1] 1 3 6 10 15 21 28 36 45 55
i.e. \(c =(c_1,\ldots,c_{10})\), where
\[ c_i = \sum_{k=1}^i a_k. \]
Let \(X\) be the random variable that is the result of throwing a (fair) dice, i.e. \(X\) may take either of values \(1,2,3,4,5,6\) with equal probabilities \(\frac16\).
Define vector \[
x=(x_1,x_2,\ldots,x_6)=(1,2,\ldots,6)
\] of values of \(X\) and assign it to R-variable x.
Define vector \[
p=(p_1,p_2,\ldots,p_6)=\Bigl(\frac16,\frac16,\ldots,\frac16\Bigr)
\] of the probabilities, and assign it to R-variable p.
Recall that the expectation of \(X\) is \[ \E(X)=\sum_{i=1}^6 x_ip_i=x_1p_1+\ldots x_6p_6=x\cdot p. \] Calculate \(\E(X)\) (using the previously considered functions).
Check the answer. (Unfold the code.)
x <- seq(1,6)
p <- rep(1/6, 6)
sum(x*p)[1] 3.5
Recall that the variance of a random variable can be found by the following formulas: \[ \begin{aligned}\mathrm{Var}(X) &= \mathbb{E}\Bigl(\bigl(X-\mathbb{E}(X)\bigr)^2\Bigr)\\ &= \sum_{i=1}^6 \bigl(x_i-\mathbb{E}(X)\bigr)^2p_i\end{aligned}. \] Find \(\mathrm{Var}(X)\) (using the previously considered functions).
Check the answer:
sum(p*(x-sum(x*p))**2)[1] 2.916667
Let \(X=(1,2,\ldots,2024)\) and \(Y=(Y_1,\ldots,Y_{2024})\) with
\[ Y_k = X_1+\ldots+X_k, \qquad 1\leq k\leq 2024. \]
Find how many components of \(Y\) are bigger than \(10^6\). (Hint: remember that True stands for 1).
Check the answer:
x <- 1:2024
y <- cumsum(x)
sum(y>10**6)[1] 611
Slicing
x is a vector, to access its component with index i one needs to use x[i]:a <- seq(2,20,2)
a[3][1] 6
Remark. Note that the indexation in R starts with 1 (in contrast to e.g. Python).
a[3:5][1] 6 8 10
a[c(3,7,1)][1] 6 14 2
R language includes functions to deal with various probability distributions.
Some specific distributions require external libraries to be used, we will discuss this in next Labs.
Usually, a distribution has short “nickname”:
Exponential - exp
Gamma - gamma
Normal - norm
… and we will discuss others in next Labs.
Each distribution has several functions in R with a similar structure of names, e.g.:
Functions dexp, dgamma, dnorm stand for the PDF (density function) \(f_X(x)\) of the random variables \(X\) with the distributions exp, gamma, norm, respectively.
Functions pexp, pgamma, pnorm stand for the corresponding CDF (distribution functions) \(F_X(x)=\P(X\leq x)\).
Functions qexp, qgamma, qnorm stand for the quantiles, i.e. the inverse functions to \(F_X\) (see examples below).
Functions rexp, rgamma, rnorm are used to generate random variables distributed according to the corresponding distributions exp, gamma, norm, respectively.
If you type in console ? followed by either of functions above, you will get a detailed help about all functions for the corresponding distribution, try e.g. ?dexp.
Consider examples and tasks.
Exponential distribution
\[ F_X(x) = \P(X\leq x) = 1- e^{-\lambda x}, \qquad x\geq0, \]
and hence, its density is
\[ f_X(x) = \lambda e^{-\lambda x}, \qquad x\geq0. \]
\[ \boxed{\ \P(a\leq x\leq b)= F_X(b)- F_X(a)\ \vphantom{\biggl(}} \tag{1}\]
For the PDF \(f_X(x)\), we use R-function dexp(x, lambda) where lambda is the value of \(\lambda\). The full code is dexp(x, rate = lambda) but as soon as lambda is on the second position, the word rate can be omitted. The default value of rate is \(1\). The following code gives the value of the PDF for \(\lambda=3\) at \(x=5\), i.e. the value \(f_X(5) = 3\cdot e^{-3\cdot 5}=3e^{-15}\):
dexp(5, 3) # or, the same, dexp(5, rate = 3)[1] 9.17707e-07
Note that R uses the so-called exponential form for decimals, it stands here for
0.000000917707
One can check that the value is as expected:
dexp(5, 3) == 3 * exp(- 3*5)[1] TRUE
pexp(x, lambda) (with the same comments about rate = lambda). For example, for \(X\sim Exp(3)\), the probability \(\P(X\leq 0.5)\) can be found using the codepexp(0.5, 3)[1] 0.7768698
Let \(X\sim Exp(0.1)\). Find \(\P(1<X\leq3)\).
Solution. We know that (see Equation 1 above)
\[ \P(1<X\leq3) = F_X(3) - F_X(1) \]
Therefore, we can use the code
pexp(3, 0.1) - pexp(1, 0.1)[1] 0.1640192
lower.tail = FALSE (or lower.tail = F) for function pexp:pexp(0.5, 3, lower.tail = F) [1] 0.2231302
that coincides with
1 - pexp(0.5, 3)[1] 0.2231302
at least in the first 5 decimal figures.
(Surely, for the exponential distribution one has an explicit simple formula: \(S_X(x)= e^{-\lambda x}\).)
Gamma distribution
\[ f_X(x)=\frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1} e^{-\lambda x}, \quad x\geq0. \]
dgamma, pgamma etc., should hence have at least 3 arguments. \(\alpha\) corresponds to shape (second parameter) and \(\lambda\) corresponds to rate (third parameter), i.e. e.g., for \(X\sim\Gamma(0.1,2)\), \(f_X(3)\) can be calculated as follows:dgamma(3, 0.1, 2)[1] 0.000103893
It may be more convenient (though not necessary) to write explicitly: dgamma(3, shape = 0.1, rate = 2).
\[ F_X(x) = \P(X\leq x) = p, \]
i.e. \(x=F_X^{-1}(p)\). The corresponding R-function has prefix q. For example, for \(X\sim\Gamma(0.1,2)\), to find \(x\) such that \(F_X(x)=0.2\) we can use the following code:
x <- qgamma(0.2, shape = 0.1, rate = 2)
x[1] 3.109401e-08
and we can check that, indeed, \(F_X(x)=0.2\) for the found value of \(x\):
pgamma(x, shape = 0.1, rate = 2)[1] 0.2
Let \(X\sim \Gamma(3, 2)\). Find \(a\) such that \(\P(a\leq X < 5)=0.4\). (Hint: find first \(F_X(a)\) and then find \(a\).) Assign the answer to R variable a.
Solution. We have that
\[ F_X(5)-F_X(a) = 0.4, \]
hence,
\[ F_X(a)=F_X(5)-0.4. \]
# F(5) - F(a) = 0.4
# F(a) = F(5) - 0.4
fa <- pgamma(5, shape = 3, rate = 2) - 0.4
a <- qgamma(fa, shape = 3, rate = 2)To generate n (independent) random values distributed according to \(\Gamma(\alpha, \lambda)\), we use the command rgamma(n, shape = alpha, rate = lambda). The output is a vector (of the length n).
Since the values are random, if you run the same rgamma command again and again, the output will be different all the time:
rgamma(2, shape = 0.1, rate = 0.2)[1] 1.619667 1.364637
and another output for the same input:
rgamma(2, shape = 0.1, rate = 0.2)[1] 0.001953884 0.702667760
To fix the output, we should “set a seed”: write the code set.seed(m) with any natural number instead of m. For each m, there will be a fixed output. If you change m, the output will also changes.
Set the seed \(123\), i.e. write set.seed(123) before your code. Generate \(10000\) values of \(X\sim\Gamma(2,3)\). Find the mean (average) of these values and assign it to the variable m. You may either use sum function discussed above (and average it) or mean function (use ?mean if experience difficulties).
set.seed(123)
x <- rgamma(10000, shape = 2, rate = 3)
m <- mean(x)Check the answer:
m[1] 0.6592689
You may check that this answer will remain the same for each new run, as the randomness is “frozen” now.
abs function for an absolute value of a number):abs(m - 2/3)[1] 0.00739773