Pasha Zusmanovich - teaching

Homeworks for VAPMS (Selected Applications of Mathematical Statistics), summer semester 2018/2019.

Homework 1 (1 point)
Suppose some physical measurement always takes values in the interval $[0,2a]$, where $a$ is some positive number. The constant dataset of $2n$ such measurements, each equal to $a$, obviously has mean $a$ and standard deviation $0$. Now let us "deform" this constant dataset by replacing $k$ (where $0 \le k \le n$) values by $a - \varepsilon$, and $k$ values by $a + \varepsilon$ (where $0 \le \varepsilon \le a$). The "deformed" dataset \[ (\underbrace{a, \dots, a}_{2n-2k}, \underbrace{a - \varepsilon, \dots, a - \varepsilon}_k, \underbrace{a + \varepsilon, \dots, a + \varepsilon}_k) \] has, obviously, the same mean $a$, but the standard deviation is no longer zero.
1. Assuming $a$ and $n$ are fixed, express this standard deviation as a function of $\varepsilon$ and $k$.
2. What is the maximum possible value of the standard deviation? For which values of $\varepsilon$ and $k$ is this maximum value attained?
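A numerical check can help validate the answer to question 1. The following is a minimal sketch in R; the helper names `deformed` and `pop_sd` and the sample values of $a$, $n$, $k$, $\varepsilon$ are illustrative assumptions, not part of the assignment.

```r
# Build the "deformed" dataset for given a, n, k, eps (hypothetical helper).
deformed <- function(a, n, k, eps) {
  stopifnot(k >= 0, k <= n, eps >= 0, eps <= a)
  c(rep(a, 2 * n - 2 * k), rep(a - eps, k), rep(a + eps, k))
}

# Standard deviation with division by the number of observations.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

d <- deformed(a = 1, n = 5, k = 2, eps = 0.5)
mean(d)    # stays equal to a
pop_sd(d)  # compare with the value given by your formula
```

Trying several values of `k` and `eps` this way gives a quick sanity check of a candidate formula before proving it.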

Homework 2 (up to 2 points)
Write an R function calculating the standard deviation of a vector $\bar x$ (discrete statistical distribution) as given in the class, i.e. \[ \sigma(\bar{x}) = \sqrt{\frac{\big(x_1 - m(\bar{x})\big)^2 + \dots + \big(x_n - m(\bar{x})\big)^2}{n}} . \] Compare the result with the built-in R function. Comment on the result.
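One possible shape of such a function is sketched below. The name `pop_sd` and the sample vector are assumptions made for illustration; the formula is the one displayed above, with division by $n$.

```r
# Standard deviation with division by n, following the displayed formula.
pop_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / n)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative data
pop_sd(x)  # division by n
sd(x)      # built-in sd() divides by n - 1 instead
```

The two results differ by the factor $\sqrt{(n-1)/n}$, which is a natural starting point for the requested comment.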

Homework 3 (1 point)
Write an R function accepting as an argument a vector (discrete statistical distribution), and returning the mode of this statistical distribution. Note that, in general, there could be several modes (if they occur in the given distribution with the same maximal frequency), so the function should return, in general, a vector and not just a single value.
For example: for the distribution (1,1,1,2,2,3,4) the return value will be a single value 1, for the distribution (1,1,1,2,2,2,3,4) the return value will be the vector (1,2), and for the distribution (1,2,3,4,5), the return value will be the same vector (1,2,3,4,5) (as each value occurs once and hence is the mode).
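A minimal sketch of one way to do this, using only base R. The function name `stat_mode` is an assumption; other approaches (for example, via `table()`) work as well.

```r
# Return all values that occur with the maximal frequency.
stat_mode <- function(x) {
  ux <- unique(x)                     # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))    # frequency of each distinct value
  ux[counts == max(counts)]
}

stat_mode(c(1, 1, 1, 2, 2, 3, 4))     # 1
stat_mode(c(1, 1, 1, 2, 2, 2, 3, 4))  # 1 2
stat_mode(c(1, 2, 3, 4, 5))           # 1 2 3 4 5
```

Working with `unique()` and `match()` keeps the type of the input, so the result is numeric for numeric input and character for character input.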

Homework 4 (2 points)
Take a text of considerable length and count the number of occurrences of each letter of the alphabet. (You may do this using any method or software you wish. An additional 1 point will be given if you do the counting in R). For the distribution of letters so obtained, compute the mean, median, mode, and standard deviation, and plot its bar chart.
Along with R code and results, do not forget to submit by email either the text itself, or link to it.
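The counting step could look roughly as follows in R. This is only a sketch: the short sample sentence stands in for a real text, and restricting to the 26 letters of the English alphabet is an assumption.

```r
# A short sample sentence stands in for a text of considerable length.
sample_text <- "The quick brown fox jumps over the lazy dog"

chars <- strsplit(tolower(sample_text), "")[[1]]  # split into single characters
letters_only <- chars[chars %in% letters]         # keep a-z only

# Using a factor with fixed levels keeps letters with zero occurrences.
counts <- table(factor(letters_only, levels = letters))
counts
barplot(counts, main = "Letter frequencies")
```

For a real text, `readLines()` followed by `paste(..., collapse = " ")` is one way to obtain the input string.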

Homework 5
Take some dataset of considerable volume - either from the internet, or from an R package. This should be a kind of data which has a "bell-shaped" form, for example biometric data (height, weight), data from financial markets, temperature, etc.
1. Try to fit the normal distribution to this dataset. Worth 1 point.
2. Try other distributions (look, for example, at Wikipedia, or at the list of distributions available in R), try to play with parameters, and compare how well the various distributions approximate this dataset. This may bring you up to 4 additional points.
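As an illustration of the first step, the sketch below fits a normal distribution by estimating its two parameters and overlays the fitted density on a histogram. The choice of the built-in `iris$Sepal.Width` data is an assumption made only to keep the example self-contained; it is rather small for the purposes of the homework.

```r
w <- iris$Sepal.Width   # built-in data, used here only as a stand-in

m <- mean(w)            # estimate of the mean
s <- sd(w)              # estimate of the standard deviation

hist(w, freq = FALSE, breaks = 15,
     main = "Sepal width with fitted normal density")
curve(dnorm(x, mean = m, sd = s), add = TRUE)  # fitted normal density
```

The same pattern applies to other candidate distributions: estimate their parameters, then overlay the corresponding `d...()` density on the histogram and compare.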

Homework 6
Write R code demonstrating the validity of the Central Limit Theorem when taking means of a large number of independent draws from the same distribution, for different distributions -- similarly to how it was done in the class for the uniform distribution ("throwing dice"). For each significantly different distribution, you will get 1 point, up to 4 points in total.
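A possible skeleton of such a demonstration, here for the exponential distribution. The distribution, sample size, and number of repetitions are assumptions chosen for illustration and are not taken from the class.

```r
set.seed(1)    # for reproducibility
n <- 50        # size of each sample
reps <- 10000  # number of sample means

# Means of many independent samples from the exponential distribution
# with rate 1 (mean 1, standard deviation 1).
means <- replicate(reps, mean(rexp(n, rate = 1)))

hist(means, breaks = 50, freq = FALSE,
     main = "Means of exponential samples")
# Normal density predicted by the Central Limit Theorem.
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)
```

Replacing `rexp()` by another generator, and the mean and standard deviation by those of the new distribution, gives the same experiment for a different distribution.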

Homework 7 (up to 3 points)
Write R functions computing the skewness and kurtosis of a given vector of numerical data (discrete statistical distribution). Use them to compute the skewness and kurtosis of some datasets you have encountered before, either in the class or in your homework. Comment on the results.
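Several conventions for skewness and kurtosis are in use, so the sketch below should be checked against the definitions given in the class. It assumes the moment-based versions with division by $n$, and kurtosis without subtracting 3; the function names are illustrative.

```r
# k-th central moment, with division by the number of observations.
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: third central moment over sd cubed.
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Moment-based kurtosis (not "excess" kurtosis): fourth central moment over sd^4.
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

x <- c(1, 2, 3, 4, 10)  # illustrative data with a long right tail
skewness(x)             # positive for right-skewed data
kurtosis(x)
```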

Homework 8 (2 points)
Compute kurtosis of the uniform distribution taking (with an equal frequency) $n+1$ values $0,\frac 1n, \frac 2n, \dots, \frac{n-1}n, 1$.
Note that this homework is not about computations in R, so it should be submitted as a clearly (!) written piece of (mathematical) text, either on paper or by email.

Homework 9 (2 points)
Compute the weighted population density of the Czech Republic when it is divided into:
1. Bohemia, Moravia, and Czech Silesia.
2. 13 regions (kraje) + Prague.
(All the relevant data -- population and area -- can be found on the internet). Compare these two densities with each other and with the unweighted population density of the Czech Republic. Comment on the results.
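The computation itself is short once the data are collected. The sketch below assumes that "weighted" means weighting the density of each part by its population; the two regions and their numbers are made up purely for illustration and are not real data.

```r
# Made-up regions: population and area in arbitrary units (not real data).
population <- c(A = 100, B = 300)
area       <- c(A = 100, B = 10)

dens <- population / area                        # density of each region
unweighted <- sum(population) / sum(area)        # overall density
weighted <- weighted.mean(dens, w = population)  # population-weighted density

unweighted
weighted
```

With real data, only the two input vectors need to be replaced.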

Homework 10 (3 points)
Count the number of your "friends" on Facebook, and for each friend count the number of their friends. Plot the histogram of the distribution so obtained. Where are you located in this distribution -- above or below the mean? Invent a notion of weight appropriate in this case, and compute the weighted mean of the number of friends among all your friends. Compare it with the unweighted mean. Comment on the results.

Homework 11 (1 point)
Give a numerical example of an instance of Simpson's paradox discussed in Class 3 (rate of admission to two departments among men and women).

Homework 12 (up to 3 points)
Provide, in R, numerical evidence for the following statement: if $X_1, \dots, X_n$ are independent normally distributed random variables with mean $m$ and standard deviation $\sigma$, then their mean, $\frac{X_1 + \dots + X_n}{n}$, is normally distributed with mean $m$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
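One way to gather such evidence is by simulation, as in the following sketch. The particular values of $m$, $\sigma$, $n$ and the number of repetitions are assumptions chosen for illustration.

```r
set.seed(42)   # for reproducibility
m <- 10        # mean of each X_i
sigma <- 2     # standard deviation of each X_i
n <- 25        # number of variables averaged
reps <- 10000  # number of simulated means

sample_means <- replicate(reps, mean(rnorm(n, mean = m, sd = sigma)))

mean(sample_means)  # should be close to m
sd(sample_means)    # should be close to sigma / sqrt(n)
sigma / sqrt(n)

# Visual check of normality of the simulated means.
qqnorm(sample_means)
qqline(sample_means)
```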

Homework 13 (1 point)
Write an R function computing the confidence interval for a sample from a normal distribution. (Figure out yourself which arguments this function should accept and what it should return).
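A sketch of one possible design follows. It assumes the interval is for the mean with unknown standard deviation (hence Student's t quantiles), takes the sample and the confidence level as arguments, and returns the two endpoints; all of these choices, as well as the name `ci_normal`, are assumptions rather than the expected answer.

```r
# Confidence interval for the mean of a normal sample, sigma unknown.
ci_normal <- function(x, level = 0.95) {
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)                        # standard error of the mean
  q <- qt(1 - (1 - level) / 2, df = n - 1)     # t quantile
  c(lower = m - q * se, upper = m + q * se)
}

x <- c(4.9, 5.1, 5.0, 5.3, 4.7, 5.2)  # illustrative sample
ci_normal(x)
t.test(x)$conf.int                    # built-in interval for comparison
```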

Homework 14 (up to 2 points)
Test for normality (using one of the methods discussed in Class 4, i.e., normal scores or Q-Q plots) some (relatively big) dataset you have encountered before, either in the class or in your homework.
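In base R, a Q-Q plot against the normal distribution takes two calls. The built-in `precip` dataset below is only a stand-in assumed for the sake of a runnable example.

```r
dat <- precip  # built-in dataset, used here only as a stand-in

# Sample quantiles against theoretical normal quantiles.
qq <- qqnorm(dat, main = "Normal Q-Q plot")
qqline(dat)    # reference line through the first and third quartiles
```

Points lying close to the reference line suggest approximate normality; systematic curvature suggests skewness or heavy tails.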

Homework 15
Can each of the following matrices be a correlation matrix? Explain your answer. (0.3 points for each matrix).
\[ \left(\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 1 & 0.1 & 0.5 \\ 0.1 & 1 & -1.1 \\ 0.5 & -1.1 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 0.9 & 0.1 & 0.5 \\ 0.1 & 1 & 0.1 \\ 0.5 & 0.1 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 1 & -0.1 & 0.5 \\ 0.1 & 1 & 0.1 \\ 0.5 & 0.1 & 1 \end{matrix}\right) \quad \left(\begin{matrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{matrix}\right) \]
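Some standard necessary conditions can be checked numerically before writing an explanation. The helper below is a sketch under the assumption that the relevant conditions are symmetry, ones on the diagonal, and non-negative eigenvalues; the name `is_corr_candidate` and the two test matrices are hypothetical and deliberately not taken from the list above.

```r
# Check necessary conditions for a matrix to be a correlation matrix.
is_corr_candidate <- function(m, tol = 1e-8) {
  if (!is.matrix(m) || nrow(m) != ncol(m)) return(FALSE)
  if (!isSymmetric(m)) return(FALSE)                    # must be symmetric
  if (any(abs(diag(m) - 1) > tol)) return(FALSE)        # ones on the diagonal
  ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(ev >= -tol)                                       # no negative eigenvalues
}

ok  <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # hypothetical example
bad <- matrix(c(1, 2, 2, 1), nrow = 2)      # entry outside [-1, 1]

is_corr_candidate(ok)   # TRUE
is_corr_candidate(bad)  # FALSE
```

A numerical check of this kind does not replace the requested explanation.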

Homework 16 (up to 2 points)
Compute correlation between two datasets of your choice. Compare the answer with other methods of statistical comparison of two datasets we used before. Comment on the results.

Homework 17 (up to infinity points)
Try to explore the convergence pattern of iterative correlation matrices.

Homework 18 (up to 2 points)
Build a linear regression between two datasets of your choice. Test how good the regression is. Comment on the results.
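The basic calls could look as follows. The built-in `cars` dataset is an assumption used only to make the sketch runnable; any two numeric datasets of equal length would do.

```r
# Simple linear regression of stopping distance on speed (built-in data).
fit <- lm(dist ~ speed, data = cars)

coef(fit)                 # intercept and slope
summary(fit)$r.squared    # proportion of variance explained

# For a regression with one predictor, R^2 equals the squared correlation.
cor(cars$speed, cars$dist)^2

plot(dist ~ speed, data = cars)
abline(fit)               # fitted regression line
```

Plotting the residuals, for example with `plot(fit, which = 1)`, is another common way to judge how good the fit is.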

Homework 19 (up to 3 points) (Pruim, Exercise 6.24)
The object returned by lm() includes a vector named effects. (If you call the result model, you can access this vector with model$effects). What are the values in this vector?
(Hint: Think geometrically, make a reasonable guess, and then do some calculations to confirm your guess. You may use one of the data sets used earlier, or design your own data set if that helps you figure out what is going on.)

Homework 20 (up to 2 points)
This is a remake of the first question from Homework 15. Is the identity $3 \times 3$ matrix \[ \left(\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}\right) \] a correlation matrix? Note that not every symmetric matrix with ones on the main diagonal, and all elements between -1 and 1, is a correlation matrix, so in order to answer the question you should either prove that there are 3 vectors $\overline x, \overline y, \overline z$ such that their pairwise correlations are zero (most probably, by explicitly constructing such vectors), or prove that such vectors do not exist.

Homework 21 (up to 4 points)
Try to cluster your Facebook friends (Homework 10) using an R function of your choice (kmeans, hclust, etc.).

Homework 22 (up to 2 points) (Pruim, Exercise 2.51)
A child's game includes a spinner with four colors on it. Each color is one quarter of the circle. You want to test the spinner to see if it is fair, so you decide to spin the spinner 50 times and count the number of blues. You do this and record 8 blues. Carry out a hypothesis test, carefully showing the four steps. Do it "by hand" (using R but not binom.test()). Then check your work using binom.test().
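The "by hand" p-value computation can be organized as in the sketch below. It assumes a two-sided test in which all outcomes no more probable than the observed one are counted, which is one of several conventions; the function name and the coin example (40 successes in 100 trials with probability 0.5) are hypothetical and differ from the spinner data on purpose.

```r
# Two-sided binomial p-value: total probability of all outcomes
# that are no more likely than the observed one under the null hypothesis.
binom_p_two_sided <- function(x, n, p) {
  probs <- dbinom(0:n, size = n, prob = p)   # null distribution
  observed <- dbinom(x, size = n, prob = p)  # probability of the observed count
  # Small relative tolerance guards against floating-point ties.
  sum(probs[probs <= observed * (1 + 1e-7)])
}

binom_p_two_sided(40, 100, 0.5)
binom.test(40, 100, p = 0.5)$p.value  # built-in result for comparison
```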

Created: Tue Oct 6 2015
Last modified: Sun Jun 9 21:17:14 CEST 2019