Homework for
VAPMS Selected Applications of Mathematical Statistics, summer semester 2018/2019.
Homework 1 (1 point)
Suppose some physical measurement always takes a value in the interval
\([0,2a]\), where \(a\) is some positive number. The constant dataset of
\(2n\) such measurements, each equal to \(a\), obviously has
mean \(a\) and standard deviation \(0\). Now let us "deform" this constant
dataset by replacing \(k\) (where \(0 \le k \le n\)) values by
\(a - \varepsilon\), and \(k\) values by \(a + \varepsilon\) (where
\(0 \le \varepsilon \le a\)). The "deformed" dataset
\[
(\underbrace{a, \dots, a}_{2n-2k},
\underbrace{a - \varepsilon, \dots, a - \varepsilon}_k,
\underbrace{a + \varepsilon, \dots, a + \varepsilon}_k)
\]
has, obviously, the same mean \(a\), but the standard deviation is no longer
zero.
1. Assuming \(a\) and \(n\) are fixed, express this standard deviation as a
function of \(\varepsilon\) and \(k\).
2. What is the maximum possible value of the standard deviation? For which values of
\(\varepsilon\) and \(k\) is this maximum attained?
Homework 2 (up to 2 points)
Write an R function calculating the standard deviation of a vector \(\bar x\)
(discrete statistical distribution) as given in the class, i.e.
\[
\sigma(\bar{x}) =
\sqrt{\frac{\big(x_1 - m(\bar{x})\big)^2 + \dots + \big(x_n - m(\bar{x})\big)^2}{n}} .
\]
Compare the result with the built-in R function. Comment on the result.
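The formula above can be sketched in R roughly as follows. This is only one possible starting point: the name sd_n and the toy vector are assumptions made for illustration, not part of the assignment.

```r
# Standard deviation with denominator n, following the formula above.
# sd_n is a hypothetical name, not a built-in R function.
sd_n <- function(x) {
  m <- mean(x)
  sqrt(sum((x - m)^2) / length(x))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # toy data, assumed for illustration
sd_n(x)   # 2
sd(x)     # built-in; it divides by n - 1, so the value is slightly larger
```

The built-in sd() uses the denominator n - 1 rather than n, which is the first thing to look at when the two results disagree.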
Homework 3 (1 point)
Write an R function that accepts a vector (discrete statistical
distribution) as an argument and returns the mode of this distribution. Note that, in general,
there could be several modes (if they occur in the given distribution with the
same maximal frequency), so the function should return, in general, a vector and
not just a single value.
For example:
for the distribution (1,1,1,2,2,3,4) the return value will be a single value 1,
for the distribution (1,1,1,2,2,2,3,4) the return value will be the vector (1,2),
and for the distribution (1,2,3,4,5), the return value will be the same vector
(1,2,3,4,5) (as each value occurs once and hence is the mode).
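The three examples above can be reproduced with a short sketch like the one below. The function name modes is hypothetical, and the approach (counting each distinct value with tabulate) is just one of several reasonable choices; it assumes a non-empty input vector.

```r
# modes is a hypothetical name; it returns every value that occurs
# with the maximal frequency, in order of first appearance.
modes <- function(x) {
  ux <- unique(x)                    # distinct values
  freq <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[freq == max(freq)]
}

modes(c(1, 1, 1, 2, 2, 3, 4))      # 1
modes(c(1, 1, 1, 2, 2, 2, 3, 4))   # 1 2
modes(1:5)                         # 1 2 3 4 5
```

Working with unique() and match() instead of table() keeps the original type of the values, so a numeric input gives a numeric result.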
Homework 4 (2 points)
Take a text of considerable length and count the number of occurrences of each
letter of the alphabet. (You may do this using any method or software you wish.
An additional 1 point will be given if you do the counting in R.)
For the distribution of letters so obtained, compute the mean, median, mode, and
standard deviation, and plot its bar chart.
Along with the R code and results, do not forget to submit by email either the
text itself or a link to it.
Homework 5
Take some dataset of considerable size, either from the internet or from an R
package. This should be data with a "bell-shaped" form, for
example biometric data (height, weight), data from financial markets,
temperature, etc.
1. Try to fit a normal distribution to this dataset. Worth 1 point.
2. Try other distributions (look, for example, at Wikipedia, or at the
list of distributions available in R),
try to play with parameters, and compare how well the various distributions
approximate this dataset.
This may bring you up to 4 additional points.
Homework 6
Write R code demonstrating the validity of the Central Limit Theorem when
taking means of a large number of independent draws from the same distribution,
for different distributions, similarly to how it
was done in class for the uniform distribution
("throwing dice").
For each significantly different distribution, you will get 1 point, up to
4 points in total.
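For orientation, the dice case mentioned above might be sketched as follows. This is not the code from class; the seed, the number of dice per experiment, and the number of repetitions are all assumed values chosen for illustration.

```r
set.seed(1)        # assumed seed, only for reproducibility
n_throws <- 50     # dice averaged in each experiment (assumed)
n_rep <- 10000     # number of repeated experiments (assumed)

# Each row is one experiment of n_throws fair dice; rowMeans gives its mean.
throws <- matrix(sample(1:6, n_throws * n_rep, replace = TRUE), nrow = n_rep)
means <- rowMeans(throws)

# Mean and standard deviation (denominator n) of a single fair die.
mu <- mean(1:6)
sigma <- sqrt(mean((1:6 - mu)^2))

# Empirical versus theoretical mean and spread of the sample means.
c(empirical = mean(means), theoretical = mu)
c(empirical = sd(means), theoretical = sigma / sqrt(n_throws))

# Histogram of the means with the limiting normal density on top.
hist(means, breaks = 40, freq = FALSE, main = "Means of dice throws")
curve(dnorm(x, mu, sigma / sqrt(n_throws)), add = TRUE)
```

Replacing the sample() call with a generator such as rexp() or rbinom(), and adjusting mu and sigma accordingly, gives the same experiment for another distribution.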
Homework 7 (up to 3 points)
Write R functions computing skewness and kurtosis of a given vector of numerical
data (discrete statistical distribution). Use them to compute skewness and
kurtosis of some datasets you have encountered before, either in class or in
your homework. Comment on the results.
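A minimal sketch under one common convention is given below: central moments with denominator n, and kurtosis without subtracting 3. The definitions used in class may differ, and the names central_moment, skewness_m, and kurtosis_m are hypothetical.

```r
# r-th central moment with denominator n (assumed convention).
central_moment <- function(x, r) mean((x - mean(x))^r)

# Moment-based skewness and (non-excess) kurtosis.
skewness_m <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis_m <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

x <- c(1, 2, 3, 4, 5)   # symmetric toy data, assumed for illustration
skewness_m(x)   # 0, as expected for symmetric data
kurtosis_m(x)   # 1.7
```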
Homework 8 (2 points)
Compute the kurtosis of the uniform distribution taking (with equal frequency)
\(n+1\) values \(0,\frac 1n, \frac 2n, \dots, \frac{n-1}n, 1\).
Note that this homework is not about computations in R, so it should
be submitted as a clearly (!) written piece of (mathematical) text, either on
paper or by email.
Homework 9 (2 points)
Compute the weighted population density of the Czech Republic when it is divided into:
1. Bohemia, Moravia, and Czech Silesia.
2. 13 regions (kraje) + Prague.
(All the relevant data, population and area, can be found on the internet.)
Compare these two densities with each other and with the unweighted population
density of the Czech Republic. Comment on the results.
Homework 10 (3 points)
Count the number of your "friends" on Facebook, and for each friend count the
number of their friends. Plot the histogram of the distribution so obtained. Where
are you located in this distribution: above or below the mean?
Invent a notion of weight appropriate in this case, and compute the
weighted mean of the number of friends among all your friends. Compare it with
the unweighted mean. Comment on the results.
Homework 11 (1 point)
Give a numerical example of an instance of Simpson's paradox discussed in
Class 3 (rate of admission to two
departments among men and women).
Homework 12 (up to 3 points)
Provide numerical evidence in R for the following statement:
if \(X_1, \dots, X_n\) are independent normally distributed random variables with mean
\(m\) and standard deviation \(\sigma\), then their mean,
\(\frac{X_1 + \dots + X_n}{n}\), is normally distributed with mean \(m\) and
standard deviation \(\frac{\sigma}{\sqrt{n}}\).
Homework 13 (1 point)
Write an R function computing the confidence interval for a sample from a
normal distribution. (Figure out yourself which arguments this function should
accept and what it should return).
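One possible design, sketched under explicit assumptions: the interval is for the mean, the standard deviation is unknown and estimated from the sample, and the function takes the sample and a confidence level. The name ci_mean and the simulated data are hypothetical; other designs (for example, known standard deviation with normal quantiles) are equally legitimate.

```r
# ci_mean is a hypothetical name. Assumes an interval for the mean with
# unknown standard deviation, hence Student's t quantiles.
ci_mean <- function(x, level = 0.95) {
  n <- length(x)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

set.seed(1)                          # assumed seed, for reproducibility
x <- rnorm(30, mean = 10, sd = 2)    # simulated sample, assumed parameters
ci_mean(x)
t.test(x)$conf.int                   # built-in result for comparison
```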
Homework 14 (up to 2 points)
Test for normality (using one of the methods discussed in
Class 4, i.e., normal scores or Q-Q plots) some
(relatively big) dataset you have encountered before, either in class or in
your homework.
Homework 15
Can each of the following matrices be a correlation matrix?
Explain your answer. (0.3 points for each matrix).
\[
\left(\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{matrix}\right)
\quad
\left(\begin{matrix}
1 & 0.1 & 0.5 \\
0.1 & 1 & -1.1 \\
0.5 & -1.1 & 1
\end{matrix}\right)
\quad
\left(\begin{matrix}
0.9 & 0.1 & 0.5 \\
0.1 & 1 & 0.1 \\
0.5 & 0.1 & 1
\end{matrix}\right)
\quad
\left(\begin{matrix}
1 & -0.1 & 0.5 \\
0.1 & 1 & 0.1 \\
0.5 & 0.1 & 1
\end{matrix}\right)
\quad
\left(\begin{matrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1
\end{matrix}\right)
\]
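Some necessary conditions can be checked numerically before writing the explanation. The sketch below uses a hypothetical helper check_corr and a test matrix that is deliberately not one of the five above. Passing all four checks does not by itself settle the question, and the explanation still has to be given in words.

```r
# check_corr is a hypothetical helper reporting four necessary conditions.
check_corr <- function(m, tol = 1e-8) {
  sym <- isSymmetric(m)
  c(symmetric = sym,
    unit_diagonal = all(abs(diag(m) - 1) < tol),
    entries_in_range = all(abs(m) <= 1 + tol),
    # eigenvalues are computed only when the matrix is symmetric
    positive_semidefinite = sym &&
      all(eigen(m, symmetric = TRUE, only.values = TRUE)$values >= -tol))
}

# Assumed example: entries look plausible, yet one eigenvalue is negative.
m <- matrix(c( 1.0, 0.9, -0.9,
               0.9, 1.0,  0.9,
              -0.9, 0.9,  1.0), nrow = 3, byrow = TRUE)
check_corr(m)
```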
Homework 16 (up to 2 points)
Compute the correlation between two datasets of your choice. Compare the answer
with other methods of statistical comparison of two datasets we used before.
Comment on the results.
Homework 17 (up to infinity points)
Try to explore the convergence pattern of iterative correlation matrices.
Homework 18 (up to 2 points)
Build a linear regression between two datasets of your choice. Test how good
the regression is. Comment on the results.
Homework 19 (up to 3 points)
(Pruim, Exercise 6.24)
The object returned by lm() includes a vector named effects.
(If you call the result model, you can access this vector with
model$effects). What are the values in this vector?
(Hint: Think geometrically, make a reasonable guess, and then do some
calculations to confirm your guess. You may use one of the data sets used
earlier, or design your own data set if that helps you figure out what is going
on.)
Homework 20 (up to 2 points)
This is a remake of the first question from Homework 15. Is the identity
\(3 \times 3\) matrix
\[
\left(\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{matrix}\right)
\]
a correlation matrix? Note that not every symmetric matrix with ones on the
main diagonal and all elements between -1 and 1 is a correlation matrix,
so in order to answer the question you should either prove that there are 3
vectors \(\overline x, \overline y, \overline z\) such that their pairwise
correlations are zero (most probably, by explicitly constructing such vectors),
or prove that such vectors do not exist.
Homework 21 (up to 4 points)
Try to cluster your Facebook friends (Homework 10) using an R function
of your choice (kmeans, hclust, etc.).
Homework 22 (up to 2 points)
(Pruim, Exercise 2.51)
A child's game includes a spinner with four colors on it. Each color is one
quarter of the circle. You want to test the spinner to see if it is fair, so you
decide to spin the spinner 50 times and count the number of blues. You do this
and record 8 blues. Carry out a hypothesis test, carefully showing the four
steps. Do it "by hand" (using R but not binom.test()). Then check your
work using binom.test().
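The mechanics of the "by hand" computation can be illustrated on a different, assumed experiment (a coin flipped 20 times with 5 heads, null hypothesis p = 0.5), leaving the spinner itself as the exercise. The two-sided p-value below sums the probabilities of all outcomes that are no more likely than the observed one, which is the convention binom.test() follows; other definitions of a two-sided p-value exist.

```r
n <- 20; x_obs <- 5; p0 <- 0.5   # assumed toy experiment, not the spinner data

# Null distribution of the number of successes.
probs <- dbinom(0:n, size = n, prob = p0)

# Two-sided p-value: total probability of outcomes at most as likely as x_obs.
# The small relative tolerance guards against floating-point ties.
p_value <- sum(probs[probs <= dbinom(x_obs, n, p0) * (1 + 1e-7)])
p_value

binom.test(x_obs, n, p = p0)$p.value   # built-in check
```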
Created: Tue Oct 6 2015
Last modified: Sun Jun 9 21:17:14 CEST 2019