Pasha Zusmanovich - teaching

Homeworks for 6PAS2 Probability and Statistics 2, summer semester 2025/2026.

Scores

Each homework involves writing R code, which should be demonstrated in class.

Homework 1 (up to 10 points)
Take data from the internet (say, from the Czech Hydrometeorological Institute) on air pollution and temperature at one of the Ostrava stations over a 24-hour period. Draw a histogram of the air pollution data in two ways: using the built-in R function, and using your own function. Plot air pollution against temperature. Try to draw conclusions.
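A possible starting point for the "own function" part is sketched below. The data are made-up stand-ins (the names `temperature` and `pm10` and all values are assumptions, not real measurements), and `my_hist` is a hypothetical helper that only counts values in right-closed bins and draws rectangles.

```r
# Hypothetical stand-in data: 24 made-up hourly readings, NOT real measurements.
set.seed(1)
temperature <- rnorm(24, mean = 10, sd = 4)                     # degrees C
pm10 <- pmax(0, 40 - 1.5 * temperature + rnorm(24, sd = 6))     # ug/m^3

# Hand-made histogram: count the values in right-closed bins (a, b],
# putting the smallest break into the first bin, then draw rectangles.
my_hist <- function(x, breaks, ...) {
  k <- length(breaks) - 1
  counts <- integer(k)
  for (i in seq_len(k)) {
    above_lower <- if (i == 1) x >= breaks[i] else x > breaks[i]
    counts[i] <- sum(above_lower & x <= breaks[i + 1])
  }
  plot(range(breaks), c(0, max(counts)), type = "n",
       xlab = "value", ylab = "frequency", ...)
  rect(breaks[-(k + 1)], 0, breaks[-1], counts, col = "grey")
  invisible(counts)
}

breaks <- pretty(pm10)                  # shared break points for both versions
builtin <- hist(pm10, breaks = breaks, main = "hist()")
own <- my_hist(pm10, breaks, main = "my_hist()")

# Air pollution against temperature.
plot(temperature, pm10, xlab = "temperature", ylab = "PM10")
```

Using the same `breaks` in both calls makes the two pictures directly comparable; with real data the choice of break points is itself worth discussing.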

Homework 2 (up to 10 points)
Take some dataset of considerable size -- either from the internet, or from an R package. This should be data with a "bell-shaped" form, for example biometric data (height, weight), data from financial markets, temperature, etc.
1. Try to fit the normal distribution to this dataset.
2. Try other distributions (look, for example, at Wikipedia, or at the list of distributions available in R), try to play with parameters, compare various distributions -- how well do they approximate this dataset?

Homework 3 (up to 5 points)
Write an R function that accepts a vector (a discrete statistical distribution) as its argument and returns the mode of this distribution.
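One way this could look is sketched below. The name `stat_mode` is hypothetical, and returning all tied values (rather than just one) is a design assumption; an empty input is not handled.

```r
# Return the most frequent value(s) of x; ties are all returned,
# in order of first appearance.
stat_mode <- function(x) {
  values <- unique(x)
  counts <- tabulate(match(x, values))   # how often each distinct value occurs
  values[counts == max(counts)]
}

stat_mode(c(3, 1, 2, 3, 2, 3))       # a single mode
stat_mode(c("a", "b", "a", "b"))     # two modes
```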

Homework 4 (up to 10 points)
Demonstrate in R the validity of the Central Limit Theorem when taking means of a large number of independent draws from the same distribution, for different distributions. Also demonstrate that the condition that the distributions be identical is necessary.

Homework 5 (up to 10 points)
Write an R function calculating the standard deviation of a vector ${\bf x} = (x_1, \dots, x_n)$ (discrete statistical distribution) as discussed earlier in the class, i.e. \[ \sigma({\bf x}) = \sqrt{\frac{\big(x_1 - \bar{\bf x}\big)^2 + \dots + \big(x_n - \bar{\bf x}\big)^2}{n}} . \] Compare the result with the built-in R function. Comment on the result.
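A minimal sketch of the formula above, with the hypothetical name `pop_sd`; the test vector is made up. The last line prints the ratio to the built-in `sd`, which is a convenient thing to stare at when commenting on the result.

```r
# Standard deviation with divisor n, exactly as in the formula above.
pop_sd <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up example data
pop_sd(x)
sd(x)
pop_sd(x) / sd(x)                # compare with sqrt((n - 1) / n)
```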

Homework 6 (up to 10 points)
Write R functions computing skewness and kurtosis of a given vector of numerical data (discrete statistical distribution). Demonstrate how the functions work.
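A sketch under one common convention, which is an assumption here and may differ from the one used in the class: skewness and kurtosis are the third and fourth central moments (with divisor $n$) normalized by the appropriate power of the second one, and kurtosis is not "excess" (no subtraction of 3). The function names are hypothetical.

```r
# k-th central moment with divisor n.
central_moment <- function(x, k) mean((x - mean(x))^k)

skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

skewness(c(-1, 0, 1))   # symmetric data
skewness(c(0, 0, 1))    # long right tail
kurtosis(c(-1, 1))      # two-point symmetric data

set.seed(1)
z <- rnorm(10000)       # large normal sample for comparison
c(skewness(z), kurtosis(z))
```

For a constant vector both functions return `NaN`, since the second central moment is zero.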

The next two homeworks are, strictly speaking, not about computations in R, so they can be submitted as a clearly (!) written piece of (mathematical) text, either on paper or by email. However, you can involve computer calculations as you see fit.

Homework 7 (up to 10 points)
Suppose some physical measurement always takes values in the interval $[0,2a]$, where $a$ is some positive number. The constant dataset of $2n$ such measurements, each with the same value $a$, obviously has mean $a$ and standard deviation $0$, while its skewness and kurtosis are not defined (why?). Now let us "deform" this constant dataset by replacing $k$ (where $0 \le k \le n$) values by $a - \varepsilon$, and $k$ values by $a + \varepsilon$ (where $0 \le \varepsilon \le a$). The "deformed" dataset \[ (\underbrace{a, \dots, a}_{2n-2k}, \underbrace{a - \varepsilon, \dots, a - \varepsilon}_k, \underbrace{a + \varepsilon, \dots, a + \varepsilon}_k) \] has, obviously, the same mean $a$, but (provided $k \ge 1$ and $\varepsilon > 0$) the standard deviation is no longer zero, and skewness and kurtosis are defined.
1. Assuming $a$ and $n$ are fixed, express the standard deviation, skewness and kurtosis as functions of $\varepsilon$ and $k$.
2. What are the maximum and minimum possible values of the standard deviation, skewness, and kurtosis? For which values of $\varepsilon$ and $k$ are these extreme values attained?

Homework 8 (up to 10 points)
Compute skewness and kurtosis of the uniform distribution taking (with an equal frequency) $n+1$ values $0,\frac 1n, \frac 2n, \dots, \frac{n-1}n, 1$.

Homework 9 (up to 10 points)
Provide in R numerical evidence for the statement considered in the class: if $X_1, \dots, X_n$ are independent identically distributed random variables with mean $m$ and standard deviation $\sigma$, then their mean, $\frac{X_1 + \dots + X_n}{n}$, is distributed with mean $m$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
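A simulation along the following lines could serve as such evidence. The choice of the exponential distribution, the values of `n` and `reps`, and all variable names are assumptions made for illustration; other distributions and sample sizes should be tried as well.

```r
set.seed(42)
n    <- 25       # number of variables averaged in each experiment
reps <- 20000    # number of repeated experiments
rate <- 2        # Exp(rate) has mean 1/rate and standard deviation 1/rate

# Each row is one experiment: n independent draws from the same distribution.
draws <- matrix(rexp(n * reps, rate = rate), nrow = reps, ncol = n)
means <- rowMeans(draws)

c(observed_mean = mean(means), expected_mean = 1 / rate)
c(observed_sd   = sd(means),   expected_sd   = (1 / rate) / sqrt(n))
```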

Homework 10 (up to 10 points)
Take a text of considerable length and count the number of occurrences of each letter of the alphabet. Perform as many statistical analyses as possible of the resulting distribution of letters.
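The counting step could be sketched as follows, assuming the text is available as a single character string and only the 26 letters of the English alphabet are of interest (both are assumptions; the short `sample_text` is a placeholder, not a text of considerable length, and `count_letters` is a hypothetical name).

```r
# Count occurrences of each letter a-z, ignoring case and all other characters.
count_letters <- function(text) {
  chars <- strsplit(tolower(text), "")[[1]]
  chars <- chars[chars %in% letters]
  table(factor(chars, levels = letters))   # letters that never occur get count 0
}

sample_text <- "The quick brown fox jumps over the lazy dog"
freq <- count_letters(sample_text)

sort(freq, decreasing = TRUE)
barplot(freq, xlab = "letter", ylab = "count")
```

For a text read with `readLines`, the lines can be glued together first, e.g. with `paste(lines, collapse = " ")`.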

Homework 11 (up to 10 points)
Count the number of your "friends" on Facebook, or in any environment of this sort, and for each friend count the number of their friends. Plot the histogram of the resulting distribution. Where are you located in this distribution -- above or below the mean? Invent a notion of weight appropriate in this case, and compute the weighted mean of the number of friends among all your friends. Compare it with the unweighted mean. Comment on the results.

Homework 12 (up to 10 points)
Write R code generating numerical examples of Simpson's paradox discussed at Class 3 (rate of admission to two or more departments among men and women, see, for example, an old synopsis).
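To fix ideas, here is one hand-picked instance of the paradox with made-up numbers (two departments `A` and `B`; all counts are assumptions chosen for illustration). It only checks a single example; generating many such examples automatically is the actual task.

```r
# Made-up admission counts, NOT real data.
admissions <- data.frame(
  dept     = c("A", "A", "B", "B"),
  sex      = c("men", "women", "men", "women"),
  applied  = c(100, 20, 20, 100),
  admitted = c(80, 18, 2, 20)
)

# Admission rate within each department.
admissions$rate <- admissions$admitted / admissions$applied
admissions

# Admission rate over both departments together.
totals <- aggregate(cbind(applied, admitted) ~ sex, data = admissions, FUN = sum)
totals$rate <- totals$admitted / totals$applied
totals
```

In this example women have the higher admission rate in each department, yet the lower rate overall, because most women apply to the department that admits few applicants.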

Homework 13 (up to 5 points)
Write an R function computing the confidence interval for a sample from a normal distribution. (Figure out yourself which arguments this function should accept and what it should return.)
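One possible reading of the task, stated as an assumption: the interval is for the mean, the variance is unknown, and hence Student's t quantiles are used. The name `normal_ci`, its arguments, and the shape of the return value are hypothetical choices, not the required interface.

```r
# Confidence interval for the mean of a normal sample with unknown variance.
normal_ci <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

set.seed(7)
x <- rnorm(30, mean = 10, sd = 2)   # made-up sample
normal_ci(x)
normal_ci(x, level = 0.99)
t.test(x)$conf.int                  # built-in interval for comparison
```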

Homework 14 (up to 10 points)
Test for normality (using the methods discussed in Class 4, i.e., normal scores and Q-Q plots) some (relatively large) dataset you have encountered before in your homework.

Homework 15 (up to 5 points)
Is each of the following matrices a correlation matrix? (Substantiate your answer.)
\[ \left(\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 1 & 0.1 & 0.5 \\ 0.1 & 1 & -1.1 \\ 0.5 & -1.1 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 0.9 & 0.1 & 0.5 \\ 0.1 & 1 & 0.1 \\ 0.5 & 0.1 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 1 & -0.1 & 0.5 \\ 0.1 & 1 & 0.1 \\ 0.5 & 0.1 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{matrix}\right) \]

Homework 16 (up to 20 points)
Explore the convergence pattern of iterated correlation matrices.

Homework 17 (up to 5 points)
Build a linear regression between two datasets of your choice. Test how good the regression is. Comment on the results.
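The mechanics could look as follows, using the built-in `cars` dataset (stopping distance against speed) purely as a placeholder for data of your own choice; which diagnostics to report is a judgment call and the ones below are only a suggestion.

```r
# Fit dist = b0 + b1 * speed by least squares on the built-in 'cars' data.
model <- lm(dist ~ speed, data = cars)

coef(model)                  # intercept and slope
summary(model)$r.squared     # share of variance explained

# Data with the fitted line.
plot(dist ~ speed, data = cars)
abline(model)

# Residuals against fitted values: visible structure here hints at a poor fit.
plot(fitted(model), resid(model), xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 2)
```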

Homework 18 (up to 10 points) (Pruim, Exercise 6.24)
The object returned by lm() includes a vector named effects. (If you call the result model, you can access this vector with model$effects). What are the values in this vector?
(Hint: Think geometrically, make a reasonable guess, and then do some calculations to confirm your guess. You may use one of the data sets used earlier, or design your own data set if that helps you figure out what is going on.)

Homework 19 (up to 5 points)
What is the problem with the function icorr demonstrated in the class for the computation of iterated correlation matrices? Fix it.
(Hint: try to apply it iteratively to the matrix $ \left(\begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix}\right) $.)

Homework 20 (up to 5 points)
When discussing generalized additive models in the class, we looked at an example from a book demonstrating superiority/flexibility of generalized additive models over linear models. Devise further example(s) of this sort.

Homework 21 (up to 10 points)
Demonstrate the clustering capabilities of R on data where the notion of distance is not so obvious. This can be the Facebook "friends" from Homework 11, or any other "interesting" data. Try to use different functions (kmeans, hclust, etc.) and to compare them.

Homework 22 (up to 5 points) (Pruim, Exercise 2.51)
A child's game includes a spinner with four colors on it. Each color is one quarter of the circle. You want to test the spinner to see if it is fair, so you decide to spin the spinner 50 times and count the number of blues. You do this and record 8 blues. Carry out a hypothesis test, carefully showing the four steps. Do it "by hand" (using R but not binom.test). Then check your work using binom.test.
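The p-value part of the "by hand" computation might be sketched as below; stating the hypotheses and drawing the conclusion is left out. The two-sided rule used here (add up the probabilities of all outcomes no more likely than the observed one) and the small relative tolerance guarding against floating-point ties are assumptions intended to agree with what binom.test reports; variable names are hypothetical.

```r
n  <- 50      # number of spins
x  <- 8       # observed number of blues
p0 <- 0.25    # probability of blue for a fair spinner

# Null distribution of the number of blues: Binomial(n, p0).
probs <- dbinom(0:n, size = n, prob = p0)
p_obs <- dbinom(x, size = n, prob = p0)

# Two-sided p-value: total probability of outcomes at most as likely as x.
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

# One-sided p-value (too few blues), for comparison.
p_lower <- pbinom(x, size = n, prob = p0)

c(two_sided = p_two_sided, lower = p_lower)
binom.test(x, n, p0)$p.value   # check against the built-in test
```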

Homework 23 (up to 15 points)
Try to extend the "null vs. alternative hypothesis" paradigm to a choice among several (more than two) hypotheses. Can we use existing R functions for this?

Created: Wed Feb 11 2026
Last modified: Wed Apr 29 2026 14:32:25 CEST