Lab 3 - Solutions

1 Discrete probability distributions in Python

Module scipy.stats of the famous library scipy provides various tools to work with probability and statistics in Python. Today, we consider several discrete distributions; to work with them will use the same structure of commands. In particular, any distribution will have the following methods:

rvs - to generate (several) values of a random variable \(X\)
pmf - to calculate probability mass function at any \(k\in\mathbb{R}\)

\[ p_X(k)=\mathbb{P}(X=k) \]

cdf - to calculate the cumulative distribuion function

\[ F_X(x)=\sum_{k\leq x} p_X(k)=\mathbb{P}(X\leq x) \]

recall that

\[ \mathbb{P}(a<X\leq b) = F_X(b)-F_X(a) \]

mean - to calculate the mathematical expectation (mean) \(\mathbb{E}(X)\)
var - to calculate the varaince \(\mathrm{Var}(X)\)
std - to calculate the standard deviation \(\sigma(X)=\sqrt{\mathrm{Var}(X)}\)

2 Binomial distribution

First, one has to import binom class.

from scipy.stats import binom

For \(X\sim Bin(n, p)\), one can use binom.pmf(k, n, p) to calculate,

\[ \mathbb{P}(X=k) = \binom{n}{k}p^k(1-p)^{n-k}, \]

and one can use binom.cdf(k, n, p) to calculate

\[ \mathbb{P}(X\leq k) = \sum_{j\leq k}\mathbb{P}(X=j). \]

Remark. Note that \(X\sim Bin(n, p)\) can take only values \(0, 1, 2, \ldots, n\). In particular, for example, for \(X\sim Bin(5, 0.1)\),

[binom.pmf(6, 5, 0.1), binom.pmf(-1, 5, 0.1), binom.pmf(2.5, 5, 0.1)]

[0.0, 0.0, 0.0]

since neither of 6, -1, 2.5 belongs to \(\{0, 1, 2, 3, 4, 5\}\).

However, cdf is the cumulative probability, e.g.

\[ \mathbb{P}(X\leq 1.5)=\mathbb{P}(X=0)+\mathbb{P}(X=1), \]

and hence, it’s not \(0\), namely,

binom.cdf(1.5, 5, 0.1)

0.9185399999999999

Remark. Note also that, e.g.

\[ \mathbb{P}(X\leq 1)=\mathbb{P}(X\leq 1.5)=\mathbb{P}(X<2) \]

since in all three cases it will be \(X=0\) or \(X=1\).

We know from lectures that

\[ \mathbb{E}(X) = np, \qquad \mathrm{Var}(X) = np(1-p). \]

We may also get these numbers from binom.mean(n, p) and binom.var(n, p). We can also get

\[ \sigma(X)=\sqrt{\mathrm{Var}(X)}=\sqrt{np(1-p)} \]

from binom.std(n, p).

2.1

Let \(X\sim Bin(27, 0.45)\).

Calculate \(\mathbb{P}(X\leq 20)\).

Solution. \(\mathbb{P}(X\leq 20) = F_X(20)\), hence, we calculate binom.cdf at k=20 with parameters n=27 and p=0.45.
Code
binom.cdf(20, 27, 0.45)
```
0.9994575733214396
```
Calculate \(\mathbb{P}(X < 10)\).

Solution. Since \(X\) takes only integer values, \(X<10\) means that \(X\leq 9\). Hence,

\[ \mathbb{P}(X < 10) = \mathbb{P}(X \leq 9) = F_X(9). \]
Code
binom.cdf(9, 27, 0.45)
```
0.15256903022954094
```
Calculate \(\mathbb{P}(7<X\leq 13)\).

Solution. We use the formula

\[ \mathbb{P}(7<X\leq 13) = F_X(13)-F_X(7). \]
Code
binom.cdf(13, 27, 0.45)-binom.cdf(7, 27, 0.45)
```
0.666671672666459
```
Calculate \(\mathbb{P}(7\leq X\leq 13)\).

Solution. To apply the same formula, we need to rewrite the inequality in the form \(6<X\leq 13\).
Code
binom.cdf(13, 27, 0.45)-binom.cdf(6, 27, 0.45)
```
0.6879613472859423
```

2.2

A company manufactures light bulbs, and \(95\)% of them are of good quality, while the rest are defective. If a customer buys \(50\) light bulbs, what is the probability that:

Exactly \(5\) of them are defective?

Solution. Since we are looking for defective bulbs, we consider \(p=0.05\) that is the probabilioty that a bulb is defective (\(5\)% of all bulbs). Hence, we need to find \(p_X(5)\) for \(X\sim Bin(50,0.05)\).
Code
binom.pmf(5, 50, 0.05)
```
0.06584063715436628
```
At most \(5\) of them are defective?

Solution. We need to find (again, for \(X\sim Bin(50,0.05)\))

\[ \mathbb{P}(X\leq 5) = F_X(5). \]
Code
binom.cdf(5, 50, 0.05)
```
0.9622238270102227
```
At least \(5\) of them are defective?

Solution. We need to find

\[ \mathbb{P}(X\geq 5) = 1- \mathbb{P}(X < 5)=1- \mathbb{P}(X \leq {\color{red}4})=1-F_X(4). \]
Code
1-binom.cdf(4, 50, 0.05)
```
0.10361681014414348
```
Find the expected value of the number of defective bulbs and the standard deviation of this quantity.

Solution. Here we just use the corresponding methods to calculate mean and std.
Code
[binom.mean(50, 0.05), binom.std(50, 0.05)]
```
[2.5, 1.541103500742244]
```

3 Geometric distribution

Recall that, for \(X\sim Geom(p)\),

\[ \mathbb{P}(X = k) = (1-p)^{k-1} p, \]

and this the probability that the first “success” happened in the \(k\)-th Bernoulli trial.

To work with the geometric distribution in Python, one has to import geom class.

from scipy.stats import geom

The functions are similar to the binomial case: geom.pmf(k, p), geom.cdf(k, p), geom.mean(p), geom.var(p), geom.std(p).

3.1

Let \(X\sim Geom(0.4)\). Calculate \(\mathbb{P}(7\leq X <10)\).

Solution. Similarly to the previous examples, we rewrite

\[ \mathbb{P}(7\leq X <10)=\mathbb{P}(6 <X \leq 9)=F_X(9)-F_X(6). \]

Code

geom.cdf(9, 0.4)-geom.cdf(6, 0.4)

0.03657830400000006

3.2

A lazy student has to take a quiz, where each question may have as an answer an integer number from \(1\) to \(100\) (different questions may have equal answers). Instead of preparation, the student is going to guess the answers. The quiz contains \(20\) questions. As soon as the student gives a correct answer, the quiz stops, and it is considered as a passed one. What is the probability that the student will pass the quiz?

Hint: the quiz may stop after either of \(1,2,3,\ldots,20\) questions.

Solution. The students needs to guess a number from \(100\) possible numbers, so the proability to answer a question correctly is \(\frac1{100}=0.01\). If \(X\sim Geom(0.01)\), then \(X\) models the number of the question when the first correct answer is made, and the quiz stops. Hence, \(X\) may be either of \(1,2,3,\ldots,20\), i.e. we need to find \(\mathbb{P}(X\leq 20)\).

Code

geom.cdf(20, 0.01)

0.18209306240276918

4 Negative Binomial Distribution

Recall that \(X\sim NB(r, p)\) describes the number \(k\) of failures in a series of trials before the \(r\)-th success occurs for the first time. The total number of trials is then, hence, \(n=r+k\). We know that

\[ \mathbb{P}(X = k) = \binom{k+r-1}{k} p^r (1-p)^{k}. \]

To work with the negative binomial distribution in Python, one has to import nbinom class.

from scipy.stats import nbinom

The previously discussed functions can be used similarly: nbinom.pmf(k, r, p), nbinom.cdf(k, r, p), nbinom.mean(r, p), nbinom.var(r, p), nbinom.std(r, p).

4.1

Let \(X\sim NB(5, 0.3)\). Find \(\mathbb{P}(X>10)\).

Solution. We have, by using the probability of the complement event,

\[ \mathbb{P}(X>10)=1-\mathbb{P}(X\leq 10)=1-F_X(10). \]

Code

1-nbinom.cdf(10, 5, 0.3)

0.5154910592268431

4.2

A lazy student has to take a quiz, where each question may have as an answer an integer number from \(1\) to \(100\) (different questions may have equal answers). Instead of preparation, the student is going to guess the answers. The quiz contains \(20\) questions. The quiz stops as soon as the student answers correctly \(3\) questions. What is the probability to pass the test for this student?

Hint: think on how many wrong answers could be made by the student to still pass the test.

Solution. Let \(X\sim NB(3, 0.01)\) be the number of wrong answers made by the student before he made the \(3\)rd correct answer. Then the total number of answered questions will be \(X+3\). One needs to have \(X+3\leq20\) (as there are only \(20\) questions), i.e. \(X\leq 17\). Hence, we need to find

\[ \mathbb{P}(X\leq 17)=F_X(17). \]

Code

nbinom.cdf(17, 3, 0.01)

0.0010035761681001162

5 Poisson Distribution

Recall that \(X\sim Po(\lambda)\) models the number of independent events occurring in a fixed interval of time or space. It is characterized by \(\lambda>0\) that the average rate of events per interval of the same size.

We know that

\[ P(X = k) = \frac{\lambda^k}{k!}e^{-\lambda}. \]

To work with the Poisson distribution in Python, one has to import poisson class.

from scipy.stats import poisson

and the functions are poisson.pmf(n, lambda), poisson.cdf(n, lambda), poisson.mean(lambda) and so on.

5.1

Let \(X\sim Po(0.4)\). Find

\[ \sum_{n=4}^\infty \mathbb{P}(X=n). \]

Solution. Note that

\[ \begin{aligned} \sum_{n=4}^\infty \mathbb{P}(X=n)&=\mathbb{P}(X\geq 4) =1-\mathbb{P}(X< 4)\\&=1-\mathbb{P}(X\leq 3)=1-F_X(3). \end{aligned} \]

Code

1 - poisson.cdf(3, 0.4)

0.0007762513762070711

5.2

In an insurance company, customers’ claims are raised at an average rate of \(5\) claims per working day. Calculate the probability that

Exactly \(30\) claims will be raised in one working week (Monday – Friday).

Solution. The average rate of \(5\) claims per working day means, on average, \(5\cdot 5=25\) claims per \(5\) working days (that is the working week). Hence we consider now \(X\sim Po(25)\) and we need to find \(p_X(30)\).
Code
poisson.pmf(30, 25)
```
0.04541278513011904
```
At least \(8\) claims will be raised in the next \(2\) working days.

Solution. The average rate per \(2\) working days is \(2\cdot 5=10\), i.e. we deal now with \(X\sim Po(10)\). We need to find

\[ \mathbb{P}(X\geq8)=1-\mathbb{P}(X<8)=1-\mathbb{P}(X\leq7)=1-F_X(7). \]
Code
1 - poisson.cdf(7, 10)
```
0.7797793533983011
```

6 Generated data

We can also generate data, in particular, random data. For example, we can use for this scipy library. Recall that each of the random variable classes in this library has method rvs which can be used to generate (several) random values distributed according to the corresponding probability distribution. The general form of the command is distribution_name.rvs(distribution_parameters, size = number_of_genereted_values), where distribution_name is one of e.g. binom, geom, nbinom, poisson, and instead of number_of_genereted_values you write a number, whereas you keep size = as it is (see below). The distribution_parameters will be all parameters needed for the given distribution, i.e. to generate \(5\) binomial random variables you would use binom.rvs(n, p, size = 5) with certain n and p (and similarly for the negative binomial), but for \(5\) geometrical random variables you would use geom.rvs(p, size = 5) with certain p only since there is only one parameter for the geometrical distribution.

For example, in Task 2.2 above we considered a company which manufactures light bulbs, and \(95\)% of them are of good quality, while the rest are defective. If a customer buys \(50\) light bulbs, the random number \(X\) of defective bulbs there has a binomial distribution, namely, \(X\sim Bin(50,0.05)\), as \(p=0.05\) is the probability of a defective bulb. To generate e.g. \(7\) random values of \(X\), we use the commands

Code

from scipy.stats import binom
binom.rvs(50, 0.05, size = 7)

array([8, 3, 0, 3, 1, 3, 2], dtype=int64)

If you run this command several times, you will probably get each time a different result, as the numbers are random. To fix an output, we should set a random_state:

Code

from scipy.stats import binom # If you kept the previous cell, there is no need in this line
c = binom.rvs(50, 0.05, size = 7, random_state = 123)
c

array([3, 2, 1, 3, 3, 2, 6], dtype=int64)

Now you may run this command again and again, with the same output. If you change 123 by another natural number, you will get another output (which will remain fixed unless you change the value of random_state again).

6.1

Let \(X\sim Po(3)\) be a Poisson random variable with the parameter \(\lambda=3\). Generate \(100\) random values of \(X\) fixing random_state = 111, and assign the resulting Numpy array to a variable f. Calculate the mean and the (population) variance of f. Check your answer. Don’t forget to load numpy first.

Code

import numpy as np
from scipy.stats import poisson
f = poisson.rvs(3, size = 100, random_state = 111)
[np.mean(f), np.var(f)]

[2.9, 2.87]

We know that the theoretical mean (expected value) \(\mathbb{E}(X)\) and variance \(\mathrm{Var}(X)\) for a Poisson random variable are equal to \(\lambda\):

\[ \mathbb{E}(X)=\mathrm{Var}(X)=\lambda. \]

Clearly, \(2.9\neq 3\neq 2.87\). To make the statistics more “matching” the probability, we need to increase the size of the data: let’s generate \(10^6\) random variables (keeping random_state = 111). Check the answers.

Code

f = poisson.rvs(3, size = 10**6, random_state = 111)
[np.mean(f), np.var(f)]

[2.998107, 2.9899274165509997]