[1] 3
Lab 1: From basics to probability distributions
\[ \renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \]
1 Introduction
Links
R is a programming language widely used for statistical computing, (big) data analysis, machine learning, and other areas aimed to retrieve, clean, analyze, visualize, and present data.
It is a standard (often, default) tool for statisticians, actuaries, biologists, and many others. R is free software, you can download it from www.r-project.org and install on your computer/laptop (Windows/Mac OSX/Linux).
RStudio is an Integrated Development Environment (IDE) for R that includes various tools for effective and comfortable work with R.
You can download the free version of RStudio Desktop from posit.co
You can also use a cloud version of RStudio at posit.cloud.
Arithmetic operations
- R can be used as a calculator:
or
[1] 6
- R provide rounded results with certain number of significant figures (that can be increased)
[1] 0.6666667
- We can also use powers:
[1] 8
and BODMAS works
[1] 40
Note that a**b and a^b mean the same.
- We can also check some equalities or inequalities with R:
[1] TRUE
or
[1] FALSE
Remark. Remember that TRUE is just 1 and FALSE is just 0.
2 Vectors
Basic operations with vectors
- One of the basic and widely used data types in R is vectors. To create a vector, one uses function
c(...):
[1] 2 7 4 1 3
- One can, of course, assign the result to a variable. A good style is to use
<-instead of=for the assigning (though the latter will not lead to a mistake in the most of cases). The RStudio shortcut for this isAlt -.
- Operations with vectors are made component-wise. For multiplication on a scalar, it is as expected:
[1] 4 14 8 2 6
- However, multiplication of vectors is also component-wise: if
then
[1] 10 70 80 15 30
- Naturally, then one has that
[1] 4 49 16 1 9
- We can compare vectors component-wise:
[1] TRUE TRUE
and
[1] FALSE TRUE
Moreover, one can compare vector with a number, in this case, we compare each vector’s component with that number:
[1] FALSE TRUE TRUE FALSE FALSE
Patterns
- If coordinates of a vector follow some pattern, one can use special function to define it
[1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
[26] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[51] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
[76] 82 83 84 85 86 87 88 89 90 91 92 93 94
- To change the step-size, we use the third argument of function
seq:
[1] 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55
[26] 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93
or
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
[16] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8
[31] 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
[46] 10.0
Strictly speaking, all arguments of the function seq (and any other function in R) have special names:
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
[16] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8
[31] 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
[46] 10.0
However, if we keep the order of arguments, we may omit their names.
On the other hand, we may change the order if needed, but then we need to specify the names:
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
[16] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8
[31] 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
[46] 10.0
whereas seq(10,1,0.2) would lead to an error (as R understands the latter as seq(from=10, to=1, by=0.2)).
Also this function has other arguments. You may read more about this (and any other function) by typing ?seq in RStudio console.
- If the step
byis equal to \(\pm1\), there is a shorter form:
[1] 1 2 3 4 5 6 7 8 9 10
or
[1] 10 9 8 7 6 5 4 3 2 1
- To get a vector with identical coordinates, we use
repfunction:
[1] 1 1 1 1 1 1 1 1 1 1
We can also repeat a vector:
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
or
[1] 2.0 2.5 3.0 2.0 2.5 3.0 2.0 2.5 3.0 2.0 2.5 3.0
- Finally, if we want to repeat each coordinate of a vector, we use argument
each:
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
or
[1] 2.0 2.0 2.0 2.0 2.5 2.5 2.5 2.5 3.0 3.0 3.0 3.0
Summation
- We may summarize all vector coordinates by using function
sum:
[1] 55
i.e. \(a=(a_1,\ldots,a_{10})=(1,2,\ldots,10)\) and
\[ b = \sum_{i=1}^{10} a_i \]
- The cumulative sum, in contrast, is a vector:
[1] 1 3 6 10 15 21 28 36 45 55
i.e. \(c =(c_1,\ldots,c_{10})\), where
\[ c_i = \sum_{k=1}^i a_k. \]
2.1
Let \(X\) be the random variable that is the result of throwing a (fair) dice, i.e. \(X\) may take either of values \(1,2,3,4,5,6\) with equal probabilities \(\frac16\).
Define vector \[ x=(x_1,x_2,\ldots,x_6)=(1,2,\ldots,6) \] of values of \(X\) and assign it to R-variable
x.Define vector \[ p=(p_1,p_2,\ldots,p_6)=\Bigl(\frac16,\frac16,\ldots,\frac16\Bigr) \] of the probabilities, and assign it to R-variable
p.Recall that the expectation of \(X\) is \[ \E(X)=\sum_{i=1}^6 x_ip_i=x_1p_1+\ldots x_6p_6=x\cdot p. \] Calculate \(\E(X)\) (using the previously considered functions).
Check the answer.
[1] 3.5
2.2
Recall that the variance of a random variable can be found by the following formulas: \[ \begin{aligned}\mathrm{Var}(X) &= \mathbb{E}\Bigl(\bigl(X-\mathbb{E}(X)\bigr)^2\Bigr)\\ &= \sum_{i=1}^6 \bigl(x_i-\mathbb{E}(X)\bigr)^2p_i\end{aligned}. \] Find \(\mathrm{Var}(X)\) (using the previously considered functions).
Check the answer:
[1] 2.916667
2.3
Let \(X=(1,2,\ldots,2024)\) and \(Y=(Y_1,\ldots,Y_{2024})\) with
\[ Y_k = X_1+\ldots+X_k, \qquad 1\leq k\leq 2024. \]
Find how many components of \(Y\) are bigger than \(10^6\). (Hint: remember that True stands for 1).
Check the answer:
[1] 611
Slicing
- If
xis a vector, to access its component with indexione needs to usex[i]:
[1] 6
Remark. Note that the indexation in R starts with 1 (in contrast to e.g. Python).
- We can also get a slice of a vector by specifying the range of indexes, e.g.
[1] 6 8 10
- The indexes do not need to follow some order:
[1] 6 14 2
3 Probability distributions
R language includes functions to deal with various probability distributions.
Some specific distributions require external libraries to be used, we will discuss this in next Labs.
Usually, a distribution has short “nickname”:
Exponential -
expGamma -
gammaNormal -
norm… and we will discuss others in next Labs.
Each distribution has several functions in R with a similar structure of names, e.g.:
Functions
dexp,dgamma,dnormstand for the PDF (density function) \(f_X(x)\) of the random variables \(X\) with the distributionsexp,gamma,norm, respectively.Functions
pexp,pgamma,pnormstand for the corresponding CDF (distribution functions) \(F_X(x)=\P(X\leq x)\).Functions
qexp,qgamma,qnormstand for the quantiles, i.e. the inverse functions to \(F_X\) (see examples below).Functions
rexp,rgamma,rnormare used to generate random variables distributed according to the corresponding distributionsexp,gamma,norm, respectively.
If you type in console
?followed by either of functions above, you will get a detailed help about all functions for the corresponding distribution, try e.g.?dexp.
Consider examples and tasks.
Exponential distribution
- Recall that continuous random variable \(X\) has the exponential distribution with a rate (a parameter) \(\lambda>0\) (we denote this \(X\sim Exp(\lambda)\)) if its cumulative distribution function is
\[ F_X(x) = \P(X\leq x) = 1- e^{-\lambda x}, \qquad x\geq0, \]
and hence, its density is
\[ f_X(x) = \lambda e^{-\lambda x}, \qquad x\geq0. \]
- Recall also that since this is a continuous random variable, \(\P(X=x)=0\) for each \(x\) and
\[ \boxed{\ \P(a\leq x\leq b)= F_X(b)- F_X(a)\ \vphantom{\biggl(}} \]
- For the PDF \(f_X(x)\), we use R-function
dexp(x, lambda)wherelambdais the value of \(\lambda\). The full code isdexp(x, rate = lambda)but as soon aslambdais on the second position, the wordratecan be omitted. The default value ofrateis \(1\). The following code gives the value of the PDF for \(\lambda=3\) at \(x=5\), i.e. the value \(f_X(5) = 3\cdot e^{-3\cdot 5}=3e^{-15}\):
[1] 9.17707e-07
Note that R uses the so-called exponential form for decimals, it stands here for
0.000000917707
One can check that the value is as expected:
[1] TRUE
- Similarly, for the CDF \(F_X(x)=\P(X\leq x)\), we use the code
pexp(x, lambda)(with the same comments aboutrate = lambda). For example, for \(X\sim Exp(3)\), the probability \(\P(X\leq 0.5)\) can be found using the code
[1] 0.7768698
3.1
Let \(X\sim Exp(0.1)\). Find \(\P(1<X\leq3)\).
Check the answer.
[1] 0.1640192
- The survival function \(S_X(x) = \P(X>x)\) can be calculated, of course, from the relation \(S_X(x) = 1- F_X(x)\). There is, however, more numerically exact way by using the argument
lower.tail = FALSE(orlower.tail = F) for functionpexp:
[1] 0.2231302
that coincides with
[1] 0.2231302
at least in the first 5 decimal figures.
(Surely, for the exponential distribution one has an explicit simple formula: \(S_X(x)= e^{-\lambda x}\).)
Gamma distribution
Recall that, for \(X\sim \Gamma(\alpha,\lambda)\),
\[ f_X(x)=\frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1} e^{-\lambda x}, \quad x\geq0. \]
The corresponding R functions, e.g.
dgamma,pgammaetc., should hence have at least 3 arguments. \(\alpha\) corresponds toshape(second parameter) and \(\lambda\) corresponds torate(third parameter), i.e. e.g., for \(X\sim\Gamma(0.1,2)\), \(f_X(3)\) can be calculated as follows:
[1] 0.000103893
It may be more convenient (though not necessary) to write explicitly: dgamma(3, shape = 0.1, rate = 2).
By the definition, for any \(p\in(0,1)\), the \(p\)-quantile is the value of \(x\), such that
\[ F_X(x) = \P(X\leq x) = p, \]
i.e. \(x=F_X^{-1}(p)\). The corresponding R-function has prefix
q. For example, for \(X\sim\Gamma(0.1,2)\), to find \(x\) such that \(F_X(x)=0.2\) we can use the following code:
[1] 3.109401e-08
and we can check that, indeed, \(F_X(x)=0.2\) for the found value of \(x\):
[1] 0.2
3.2
Let \(X\sim \Gamma(3, 2)\). Find \(a\) such that \(\P(a\leq X < 5)=0.4\). (Hint: find first \(F_X(a)\) and then find \(a\).) Assign the answer to R variable a.
Check the answer:
a[1] 1.546295
To generate
n(independent) random values distributed according to \(\Gamma(\alpha, \lambda)\), we use the commandrgamma(n, shape = alpha, rate = lambda). The output is a vector (of the lengthn).Since the values are random, if you run the same
rgammacommand again and again, the output will be different all the time:
[1] 0.0005612651 0.0299092083
and another output for the same input:
[1] 0.05120926 0.14020223
To fix the output, we should “set a seed”: write the code set.seed(m) with any natural number instead of m. For each m, there will be a fixed output. If you change m, the output will also changes.
3.3
Set the seed \(123\), i.e. write set.seed(123) before your code. Generate \(10000\) values of \(X\sim\Gamma(2,3)\). Find the mean (average) of these values and assign it to the variable m. You may either use sum function discussed above (and average it) or mean function (use ?mean if experience difficulties).
Check the answer:
m[1] 0.6592689
You may check that this answer will remain the same for each new run, as the randomness is “frozen” now.
- Note that, by LLN (the Low of Large Numbers), the obtained mean (i.e. a sample mean) should be close to \(\E(X)=\frac{\alpha}{\lambda}\). And indeed, the received value is close to \(\frac23\approx0.66\), we may calculate this in R, of course (using
absfunction for an absolute value of a number):
[1] 0.00739773