Lab 1: From basics to probability distributions

1 Introduction

Links

R is a programming language widely used for statistical computing, (big) data analysis, machine learning, and other areas aimed to retrieve, clean, analyze, visualize, and present data.
It is a standard (often, default) tool for statisticians, actuaries, biologists, and many others. R is free software, you can download it from www.r-project.org and install on your computer/laptop (Windows/Mac OSX/Linux).
RStudio is an Integrated Development Environment (IDE) for R that includes various tools for effective and comfortable work with R.
You can download the free version of RStudio Desktop from posit.co
You can also use a cloud version of RStudio at posit.cloud.

Arithmetic operations

R can be used as a calculator:

[1] 3

[1] 6

R provide rounded results with certain number of significant figures (that can be increased)

[1] 0.6666667

We can also use powers:

[1] 8

and BODMAS works

[1] 40

Note that a**b and a^b mean the same.

We can also check some equalities or inequalities with R:

[1] TRUE

[1] FALSE

Remark. Remember that TRUE is just 1 and FALSE is just 0.

2 Vectors

Basic operations with vectors

One of the basic and widely used data types in R is vectors. To create a vector, one uses function c(...):

[1] 2 7 4 1 3

One can, of course, assign the result to a variable. A good style is to use <- instead of = for the assigning (though the latter will not lead to a mistake in the most of cases). The RStudio shortcut for this is Alt -.

Operations with vectors are made component-wise. For multiplication on a scalar, it is as expected:

[1]  4 14  8  2  6

However, multiplication of vectors is also component-wise: if

then

[1] 10 70 80 15 30

Naturally, then one has that

[1]  4 49 16  1  9

We can compare vectors component-wise:

[1] TRUE TRUE

and

[1] FALSE  TRUE

Moreover, one can compare vector with a number, in this case, we compare each vector’s component with that number:

[1] FALSE  TRUE  TRUE FALSE FALSE

Patterns

If coordinates of a vector follow some pattern, one can use special function to define it

 [1]  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
[26] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[51] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
[76] 82 83 84 85 86 87 88 89 90 91 92 93 94

To change the step-size, we use the third argument of function seq:

 [1]  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55
[26] 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93

 [1]  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2  3.4  3.6  3.8
[16]  4.0  4.2  4.4  4.6  4.8  5.0  5.2  5.4  5.6  5.8  6.0  6.2  6.4  6.6  6.8
[31]  7.0  7.2  7.4  7.6  7.8  8.0  8.2  8.4  8.6  8.8  9.0  9.2  9.4  9.6  9.8
[46] 10.0

Strictly speaking, all arguments of the function seq (and any other function in R) have special names:

 [1]  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2  3.4  3.6  3.8
[16]  4.0  4.2  4.4  4.6  4.8  5.0  5.2  5.4  5.6  5.8  6.0  6.2  6.4  6.6  6.8
[31]  7.0  7.2  7.4  7.6  7.8  8.0  8.2  8.4  8.6  8.8  9.0  9.2  9.4  9.6  9.8
[46] 10.0

However, if we keep the order of arguments, we may omit their names.

On the other hand, we may change the order if needed, but then we need to specify the names:

 [1]  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2  3.4  3.6  3.8
[16]  4.0  4.2  4.4  4.6  4.8  5.0  5.2  5.4  5.6  5.8  6.0  6.2  6.4  6.6  6.8
[31]  7.0  7.2  7.4  7.6  7.8  8.0  8.2  8.4  8.6  8.8  9.0  9.2  9.4  9.6  9.8
[46] 10.0

whereas seq(10,1,0.2) would lead to an error (as R understands the latter as seq(from=10, to=1, by=0.2)).

Also this function has other arguments. You may read more about this (and any other function) by typing ?seq in RStudio console.

If the step by is equal to \(\pm1\), there is a shorter form:

 [1]  1  2  3  4  5  6  7  8  9 10

 [1] 10  9  8  7  6  5  4  3  2  1

To get a vector with identical coordinates, we use rep function:

 [1] 1 1 1 1 1 1 1 1 1 1

We can also repeat a vector:

 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

 [1] 2.0 2.5 3.0 2.0 2.5 3.0 2.0 2.5 3.0 2.0 2.5 3.0

Finally, if we want to repeat each coordinate of a vector, we use argument each:

 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

 [1] 2.0 2.0 2.0 2.0 2.5 2.5 2.5 2.5 3.0 3.0 3.0 3.0

Summation

We may summarize all vector coordinates by using function sum:

[1] 55

i.e. \(a=(a_1,\ldots,a_{10})=(1,2,\ldots,10)\) and

\[ b = \sum_{i=1}^{10} a_i \]

The cumulative sum, in contrast, is a vector:

 [1]  1  3  6 10 15 21 28 36 45 55

i.e. \(c =(c_1,\ldots,c_{10})\), where

\[ c_i = \sum_{k=1}^i a_k. \]

2.1

Let \(X\) be the random variable that is the result of throwing a (fair) dice, i.e. \(X\) may take either of values \(1,2,3,4,5,6\) with equal probabilities \(\frac16\).

Define vector \[ x=(x_1,x_2,\ldots,x_6)=(1,2,\ldots,6) \] of values of \(X\) and assign it to R-variable x.
Define vector \[ p=(p_1,p_2,\ldots,p_6)=\Bigl(\frac16,\frac16,\ldots,\frac16\Bigr) \] of the probabilities, and assign it to R-variable p.
Recall that the expectation of \(X\) is \[ \E(X)=\sum_{i=1}^6 x_ip_i=x_1p_1+\ldots x_6p_6=x\cdot p. \] Calculate \(\E(X)\) (using the previously considered functions).

Check the answer.

[1] 3.5

2.2

Recall that the variance of a random variable can be found by the following formulas: \[ \begin{aligned}\mathrm{Var}(X) &= \mathbb{E}\Bigl(\bigl(X-\mathbb{E}(X)\bigr)^2\Bigr)\\ &= \sum_{i=1}^6 \bigl(x_i-\mathbb{E}(X)\bigr)^2p_i\end{aligned}. \] Find \(\mathrm{Var}(X)\) (using the previously considered functions).

Check the answer:

[1] 2.916667

2.3

Let \(X=(1,2,\ldots,2024)\) and \(Y=(Y_1,\ldots,Y_{2024})\) with

\[ Y_k = X_1+\ldots+X_k, \qquad 1\leq k\leq 2024. \]

Find how many components of \(Y\) are bigger than \(10^6\). (Hint: remember that True stands for 1).

Check the answer:

[1] 611

Slicing

If x is a vector, to access its component with index i one needs to use x[i]:

[1] 6

Remark. Note that the indexation in R starts with 1 (in contrast to e.g. Python).

We can also get a slice of a vector by specifying the range of indexes, e.g.

[1]  6  8 10

The indexes do not need to follow some order:

[1]  6 14  2

3 Probability distributions

R language includes functions to deal with various probability distributions.
Some specific distributions require external libraries to be used, we will discuss this in next Labs.
Usually, a distribution has short “nickname”:
- Exponential - exp
- Gamma - gamma
- Normal - norm
- … and we will discuss others in next Labs.
Each distribution has several functions in R with a similar structure of names, e.g.:
- Functions dexp, dgamma, dnorm stand for the PDF (density function) \(f_X(x)\) of the random variables \(X\) with the distributions exp, gamma, norm, respectively.
- Functions pexp, pgamma, pnorm stand for the corresponding CDF (distribution functions) \(F_X(x)=\P(X\leq x)\).
- Functions qexp, qgamma, qnorm stand for the quantiles, i.e. the inverse functions to \(F_X\) (see examples below).
- Functions rexp, rgamma, rnorm are used to generate random variables distributed according to the corresponding distributions exp, gamma, norm, respectively.
If you type in console ? followed by either of functions above, you will get a detailed help about all functions for the corresponding distribution, try e.g. ?dexp.

Consider examples and tasks.

Exponential distribution

Recall that continuous random variable \(X\) has the exponential distribution with a rate (a parameter) \(\lambda>0\) (we denote this \(X\sim Exp(\lambda)\)) if its cumulative distribution function is

\[ F_X(x) = \P(X\leq x) = 1- e^{-\lambda x}, \qquad x\geq0, \]

and hence, its density is

\[ f_X(x) = \lambda e^{-\lambda x}, \qquad x\geq0. \]

Recall also that since this is a continuous random variable, \(\P(X=x)=0\) for each \(x\) and

\[ \boxed{\ \P(a\leq x\leq b)= F_X(b)- F_X(a)\ \vphantom{\biggl(}} \]

For the PDF \(f_X(x)\), we use R-function dexp(x, lambda) where lambda is the value of \(\lambda\). The full code is dexp(x, rate = lambda) but as soon as lambda is on the second position, the word rate can be omitted. The default value of rate is \(1\). The following code gives the value of the PDF for \(\lambda=3\) at \(x=5\), i.e. the value \(f_X(5) = 3\cdot e^{-3\cdot 5}=3e^{-15}\):

[1] 9.17707e-07

Note that R uses the so-called exponential form for decimals, it stands here for

0.000000917707

One can check that the value is as expected:

[1] TRUE

Similarly, for the CDF \(F_X(x)=\P(X\leq x)\), we use the code pexp(x, lambda) (with the same comments about rate = lambda). For example, for \(X\sim Exp(3)\), the probability \(\P(X\leq 0.5)\) can be found using the code

[1] 0.7768698

3.1

Let \(X\sim Exp(0.1)\). Find \(\P(1<X\leq3)\).

Check the answer.

[1] 0.1640192

The survival function \(S_X(x) = \P(X>x)\) can be calculated, of course, from the relation \(S_X(x) = 1- F_X(x)\). There is, however, more numerically exact way by using the argument lower.tail = FALSE (or lower.tail = F) for function pexp:

[1] 0.2231302

that coincides with

[1] 0.2231302

at least in the first 5 decimal figures.

(Surely, for the exponential distribution one has an explicit simple formula: \(S_X(x)= e^{-\lambda x}\).)

Gamma distribution

Recall that, for \(X\sim \Gamma(\alpha,\lambda)\),

\[ f_X(x)=\frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1} e^{-\lambda x}, \quad x\geq0. \]
The corresponding R functions, e.g. dgamma, pgamma etc., should hence have at least 3 arguments. \(\alpha\) corresponds to shape (second parameter) and \(\lambda\) corresponds to rate (third parameter), i.e. e.g., for \(X\sim\Gamma(0.1,2)\), \(f_X(3)\) can be calculated as follows:

[1] 0.000103893

It may be more convenient (though not necessary) to write explicitly: dgamma(3, shape = 0.1, rate = 2).

By the definition, for any \(p\in(0,1)\), the \(p\)-quantile is the value of \(x\), such that

\[ F_X(x) = \P(X\leq x) = p, \]

i.e. \(x=F_X^{-1}(p)\). The corresponding R-function has prefix q. For example, for \(X\sim\Gamma(0.1,2)\), to find \(x\) such that \(F_X(x)=0.2\) we can use the following code:

[1] 3.109401e-08

and we can check that, indeed, \(F_X(x)=0.2\) for the found value of \(x\):

[1] 0.2

3.2

Let \(X\sim \Gamma(3, 2)\). Find \(a\) such that \(\P(a\leq X < 5)=0.4\). (Hint: find first \(F_X(a)\) and then find \(a\).) Assign the answer to R variable a.

Check the answer:

[1] 1.546295

To generate n (independent) random values distributed according to \(\Gamma(\alpha, \lambda)\), we use the command rgamma(n, shape = alpha, rate = lambda). The output is a vector (of the length n).
Since the values are random, if you run the same rgamma command again and again, the output will be different all the time:

[1] 0.0005612651 0.0299092083

and another output for the same input:

[1] 0.05120926 0.14020223

To fix the output, we should “set a seed”: write the code set.seed(m) with any natural number instead of m. For each m, there will be a fixed output. If you change m, the output will also changes.

3.3

Set the seed \(123\), i.e. write set.seed(123) before your code. Generate \(10000\) values of \(X\sim\Gamma(2,3)\). Find the mean (average) of these values and assign it to the variable m. You may either use sum function discussed above (and average it) or mean function (use ?mean if experience difficulties).

Check the answer:

[1] 0.6592689

You may check that this answer will remain the same for each new run, as the randomness is “frozen” now.

Note that, by LLN (the Low of Large Numbers), the obtained mean (i.e. a sample mean) should be close to \(\E(X)=\frac{\alpha}{\lambda}\). And indeed, the received value is close to \(\frac23\approx0.66\), we may calculate this in R, of course (using abs function for an absolute value of a number):

[1] 0.00739773