1. Data and Models Flashcards

Summarising numerical data, summarising attribute data, fitting a model

1
Q

Population

Definition

A

-a collection of individuals/items of interest

2
Q

Sample

Definition

A

-the subset of the population for which observations are available

3
Q

Variable/Variate

Definition

A

-a quantity or attribute whose value varies between individuals

4
Q

Observation

Definition

A

-a recorded value of a variate for an individual

5
Q

Data

Definition

A

-a collection of observations

6
Q

Statistic

Definition

A

-a function of the data

7
Q

Summarising Numerical Data

min and max

A

-the minimum and maximum values of the data

8
Q

Summarising Numerical Data

Measures of Location

A
  • summary statistics which try to capture the location of the centre of the sample
    1) sample mean
    2) mode
    3) median
9
Q

Summarising Numerical Data

Sample Mean

A

-the sample mean or sample average of x1,…,xn∈R is given by:
x̄ = (1/n) Σ xi, where the sum runs over i = 1,…,n
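As a minimal sketch of this formula in R, the sample mean can be computed by hand and compared with the built-in `mean()`. The data vector `x` is made up for illustration.

```r
# Hypothetical sample of n = 5 observations
x <- c(2, 4, 4, 5, 10)
n <- length(x)

mean_by_hand <- sum(x) / n   # (1/n) times the sum of the xi
mean_builtin <- mean(x)      # base R equivalent

print(c(mean_by_hand, mean_builtin))
```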

10
Q

Summarising Numerical Data

Mode

A
  • the mode of a sample x1,…,xn is the value of the variate which occurs most frequently
  • in cases where different values occur with the same frequency the mode may not be unique
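Base R has no built-in function for the mode of a sample, so one possible sketch counts occurrences with `table()` and keeps every value that attains the maximum count. The data are made up and chosen so that the mode is not unique.

```r
# Hypothetical sample in which 2 and 3 both occur twice
x <- c(1, 2, 2, 3, 3, 5)

counts <- table(x)                                   # frequency of each distinct value
modes <- as.numeric(names(counts)[counts == max(counts)])

print(modes)   # all values with the highest frequency
```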
11
Q

Summarising Numerical Data

Median

A

-a median of x1,…,xn∈R is any number m∈R such that:
a) at least half of the observations are less than or equal to m
AND
b) at least half of the observations are greater than or equal to m
-if the number of observations is odd, there is a unique median
-if the number of observations is even, the median can fall anywhere in the interval between the middle two values; we usually choose the midpoint
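The two conditions can be checked directly in R. The helper `is_median()` below is a hypothetical name introduced for illustration; `median()` is the base R function, which returns the midpoint of the two middle values when n is even.

```r
# Hypothetical helper: TRUE if m satisfies conditions a) and b)
is_median <- function(m, x) {
  mean(x <= m) >= 0.5 && mean(x >= m) >= 0.5
}

x_odd  <- c(7, 1, 3)      # n odd: unique median
x_even <- c(7, 1, 3, 5)   # n even: any m between 3 and 5 qualifies

median(x_odd)             # 3
median(x_even)            # 4, the midpoint of 3 and 5

is_median(3.2, x_even)    # TRUE, although it is not the midpoint
is_median(6, x_even)      # FALSE
```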

12
Q

Summarising Numerical Data

Measures of Spread

A
  • statistics which characterise the spread of the sample
    1) range
    2) sample variance
    3) sample standard deviation
    4) interquartile and semi-interquartile range
13
Q

Summarising Numerical Data

Range

A

-the range of a sample of numeric observations x1,…,xn∈R is the interval:
[min xi , max xi]
-i.e. the smallest interval which contains all the data
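In R, `range()` returns the two endpoints of this interval as a vector. A short sketch with made-up data:

```r
# Hypothetical sample
x <- c(3, 9, 1, 4)

r <- range(x)      # endpoints: c(min(x), max(x))
width <- diff(r)   # length of the interval, max minus min

print(r)
print(width)
```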

14
Q

Summarising Numerical Data

Sample Variance

A

-the sample variance of x1,…,xn∈R is given by:
sx² = 1/(n-1) Σ (xi - x̄)²

  • where x̄ is the sample mean
  • the sample variance is nearly the average squared distance between the observations and the sample mean; the only difference is that the denominator is n-1 instead of n
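A minimal sketch of the formula in R, compared with the built-in `var()`, which also uses the n-1 denominator. The data are made up for illustration.

```r
# Hypothetical sample with sample mean 5
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Sum of squared deviations from the sample mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)
var_builtin <- var(x)

print(c(var_by_hand, var_builtin))
```

Here the squared deviations are 9, 1, 1, 0 and 25, which sum to 36, so the sample variance is 36/4 = 9.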
15
Q

Summarising Numerical Data

Sample Standard Deviation

A
  • sample standard deviation is the square root of the sample variance
  • large values of sx indicate that the samples are spread out, while small values of sx indicate that the samples are concentrated around the sample mean
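To illustrate, the sketch below compares two made-up samples with the same sample mean but different spread, using the base R function `sd()`.

```r
# Two hypothetical samples, both with sample mean 5
a <- c(4, 5, 6)    # concentrated around the mean
b <- c(0, 5, 10)   # spread out

sd_a <- sd(a)      # square root of var(a)
sd_b <- sd(b)

print(c(sd_a, sd_b))
```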
16
Q

Summarising Numerical Data

α-quantiles

A
  • the idea of α-quantiles is to split the sample into two groups such that about αn observations are smaller than qα and about (1-α)n observations are larger than qα
  • a value of qα that leads to such a split is an α-quantile; depending on n, α and x, the α-quantile may or may not be unique
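Sample quantiles can be computed in R with `quantile()`. Because the α-quantile need not be unique, R offers several conventions through the `type` argument; the sketch below relies on the default, and other conventions may give slightly different values for other samples.

```r
# Hypothetical sample: the integers 1 to 9
x <- 1:9

# Sample quantiles for alpha = 0.25, 0.5 and 0.75 (default convention)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

print(q)
```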
17
Q

Summarising Numerical Data

first and third quartiles

A
  • using the definition of the α-quantile, qα:
  • the value q1/4 is called the first quartile
  • q3/4 is called the third quartile
18
Q

Summarising Numerical Data

interquartile and semi-interquartile range

A

-the difference q3/4-q1/4 is called the interquartile range
-half of this difference, (q3/4-q1/4)/2, is called the semi-interquartile range
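A short sketch in R, assuming the default quantile convention used by the base function `IQR()`; the data are made up.

```r
# Hypothetical sample: the integers 1 to 9
x <- 1:9

iqr <- IQR(x)         # third quartile minus first quartile
semi_iqr <- iqr / 2   # semi-interquartile range

print(c(iqr, semi_iqr))
```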

19
Q

Semi-Interquartile Range vs Sample Standard Deviation

A
  • the semi-interquartile range can be used as an alternative to the sample standard deviation
  • its definition is slightly more complicated but the semi-interquartile range is less affected by outliers than the sample standard deviation
  • i.e. the semi-interquartile range is a robust measure of the spread of a sample
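The effect of an outlier can be sketched in R by replacing one observation with an extreme value. The data are made up, and the quartiles use R's default convention.

```r
# Hypothetical sample, and the same sample with the largest value replaced by an outlier
x     <- 1:9
x_out <- c(1:8, 100)

# The semi-interquartile range is unchanged by the outlier
semi_iqr     <- IQR(x) / 2
semi_iqr_out <- IQR(x_out) / 2

# The sample standard deviation grows considerably
sd_x     <- sd(x)
sd_x_out <- sd(x_out)

print(c(semi_iqr, semi_iqr_out))
print(c(sd_x, sd_x_out))
```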
20
Q

Summarising Attribute Data

A
  • since the observations of attribute data do not consist of numbers, the mode is the only one of the summary statistics from the previous section which can be computed for attribute data
  • often the best way to summarise attribute data is to consider tables which show how often each of the possible values occurs
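Such a frequency table can be produced in R with `table()`. The attribute values below are made up for illustration.

```r
# Hypothetical attribute data: observed colours
colour <- c("red", "blue", "red", "green", "red", "blue")

counts <- table(colour)                        # how often each value occurs
rel_freq <- prop.table(counts)                 # relative frequencies
mode_value <- names(counts)[which.max(counts)] # most frequent value

print(counts)
print(mode_value)
```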
21
Q

Statistical Model

Definition

A

-a statistical model for a sample x1,…,xn consists of random variables X1,…,Xn chosen such that the data x1,…,xn ‘look like’ a random sample of X1,…,Xn

22
Q

Fitting a Model

A
  • one of the main concerns in statistics is to ‘fit a model’ to given data
  • i.e. to find a distribution for the random variables X1,…,Xn such that the data could plausibly be a random sample from the model
23
Q

Questions about the relation between data and models

A

1) what are the best parameter values to use in the model -> parameter estimation
2) which parameter values in the model are compatible with the data -> confidence intervals
3) could the data have been produced by a given model with given parameter values -> hypothesis tests
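As one possible illustration, and only under the assumption of a normal model for the data, the base R function `t.test()` returns an answer to each of the three questions for the mean parameter. The data and the hypothesised value 5 are made up.

```r
# Hypothetical measurements, assumed to come from a normal distribution
x <- c(4.8, 5.1, 5.3, 4.9, 5.4, 5.0)

# One-sample t-test of the hypothesis that the mean equals 5
fit <- t.test(x, mu = 5)

fit$estimate   # 1) parameter estimate: the sample mean
fit$conf.int   # 2) 95% confidence interval for the mean
fit$p.value    # 3) p-value of the hypothesis test
```

A large p-value means the data are compatible with the hypothesised mean; it does not show that the model is correct.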

24
Q

Models in R

r

A

-generates n random numbers from the distribution

25
Q

Models in R

d

A

-probability densities (probability weights in the discrete case)

26
Q

Models in R

p

A

-cumulative distribution functions

27
Q

Models in R

q

A

-quantiles (the inverse of the cumulative distribution function)

28
Q

Models in R

Distributions

A
binomial: binom
chi-squared: chisq
exponential: exp
gamma: gamma
normal: norm
Poisson: pois
uniform: unif
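The prefixes r, d, p and q are combined with one of these names to give a function, e.g. `rnorm`, `dnorm`, `pnorm` and `qnorm` for the normal distribution. A brief sketch, with parameter values chosen only for illustration:

```r
set.seed(1)                       # make the random draws reproducible

x <- rnorm(5, mean = 0, sd = 1)   # r: five random numbers from N(0, 1)
d <- dnorm(0)                     # d: density of N(0, 1) at 0, i.e. 1/sqrt(2*pi)
p <- pnorm(0)                     # p: P(X <= 0) = 0.5
q <- qnorm(0.5)                   # q: the 0.5-quantile, i.e. 0

# Discrete case: P(X = 2) for X ~ Binomial(4, 0.5), i.e. 6/16
w <- dbinom(2, size = 4, prob = 0.5)
```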
29
Q

Sampling Attribute Data

A

-to generate independent, random samples from a model for an attribute value, the command:
sample(values,n,replace=TRUE,prob=p)
-can be used
-values must be a vector of the possible values of the attribute, and p must be a vector of the same length as values giving the corresponding probabilities of each value
-if all possible values have the same probability, the argument prob=… can be omitted
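A minimal usage sketch of this command. The attribute values and probabilities are made up for illustration.

```r
set.seed(42)                          # make the random draws reproducible

# Hypothetical attribute: blood group, with made-up probabilities
values <- c("A", "B", "O", "AB")
p <- c(0.40, 0.10, 0.45, 0.05)        # same length as values, sums to 1

draws <- sample(values, 10, replace = TRUE, prob = p)

# Equal probabilities: prob can be omitted
draws_equal <- sample(values, 10, replace = TRUE)

table(draws)                          # summarise the simulated attribute data
```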