Analysing data Flashcards

Flashcards in Analysing data Deck (39)
1
Q

Greek symbols

A

population mean: µ
sample mean: x̄
population mean estimate: µ̂
population SD: σ

2
Q

normal distribution

A
  • bell-shaped curve
  • the peak is at its mean
  • mean, median, and mode have the same value
  • centring: changing the mean shifts the curve left/right
  • the SD determines the steepness of the curve
  • scaling: changing the SD
  • 68.2% of data within +/- 1 SD of the mean
  • 95.4% of data within +/- 2 SD of the mean
  • 99.7% of data within +/- 3 SD of the mean (see the R check below)
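A minimal base R check of the 68/95/99.7 rule with pnorm() (an added sketch, not from the deck):

pnorm(1) - pnorm(-1)   # proportion within +/- 1 SD, ~0.683
pnorm(2) - pnorm(-2)   # within +/- 2 SD, ~0.954
pnorm(3) - pnorm(-3)   # within +/- 3 SD, ~0.997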
3
Q

critical values

A

if the SD is known, we can calculate the critical value for any proportion of normally distributed data
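For example, qnorm() in base R gives the critical value for any proportion of a normal distribution (added sketch; the mean and SD values are illustrative):

qnorm(0.975)                       # ~1.96 on the standard normal scale
qnorm(0.975, mean = 100, sd = 15)  # ~129.4, cutting off the top 2.5% of an illustrative scale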

4
Q

sampling from distributions

A
  • collecting data on a variable involves randomly sampling from a distribution
  • the underlying distribution is assumed to be normal
  • some variables may come from other distributions: log-normal, Poisson, binomial (see the sketch below)
  • sample statistics differ from population parameters
  • the sampling distribution is centred around the population mean
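A short sketch of sampling from these distributions in base R (added illustration; sample sizes and parameters are arbitrary):

x_norm  <- rnorm(100, mean = 0, sd = 1)        # normal
x_lnorm <- rlnorm(100)                         # log-normal
x_pois  <- rpois(100, lambda = 3)              # Poisson
x_binom <- rbinom(100, size = 10, prob = 0.5)  # binomial
mean(x_norm)   # the sample statistic differs from the population mean (0)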
5
Q

standard error

A

the standard deviation of the sampling distribution
can be estimated from any sample
SE = SD / √N (see the sketch below)
gauges the accuracy of a parameter estimate in a sample
the smaller the SE, the more likely the parameter estimate is close to the population parameter
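A minimal sketch of the SE calculation in base R (the example data are assumed):

x  <- rnorm(50, mean = 100, sd = 15)  # an assumed example sample
se <- sd(x) / sqrt(length(x))         # SE = SD / sqrt(N)
se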

6
Q

central limit theorem

A
  • the sampling distribution of the mean is approximately normal, no matter the shape of the population distribution
  • as N gets larger, the sampling distribution of the sample mean tends towards a normal distribution
  • mean = µ, SD = σ / √N (see the simulation below)
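A quick simulation of the central limit theorem (added sketch): sample means drawn from a skewed distribution still form an approximately normal sampling distribution.

means <- replicate(10000, mean(rexp(30, rate = 1)))  # 10,000 samples of N = 30 from a skewed distribution
hist(means)   # roughly bell-shaped
mean(means)   # close to the population mean, 1
sd(means)     # close to SD / sqrt(N) = 1 / sqrt(30)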
7
Q

point estimates

A
  • single numbers that are best guesses about the corresponding population parameters
  • central tendency, measures of spread
  • relationships between variables can be expressed using point estimates
8
Q

what does SE of mean express?

A
  • uncertainty about the relationship between the sample mean and the population mean
  • the sample mean is the best estimate of the population mean; this holds for all point estimates
9
Q

interval estimates

A
  • communicates uncertainty around a point estimate
  • indicates how confident we can be that the estimate is representative of the population parameter

10
Q

confidence interval (CI)

A
  • use the SE and the sampling distribution to calculate a CI with a certain coverage
  • for a 95% CI, 95% of intervals constructed around sample estimates will contain the value of the population parameter
  • 95% of the sampling distribution lies within +/- 1.96 SE, so a 95% CI for the estimated population mean is mean +/- 1.96 × SE (see the sketch below)
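A sketch of the normal-approximation 95% CI in base R (example data assumed):

x  <- rnorm(100, mean = 50, sd = 10)   # an assumed example sample
se <- sd(x) / sqrt(length(x))
mean(x) + c(-1.96, 1.96) * se          # 95% CI: mean +/- 1.96 * SE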
11
Q

t-distribution

A
  • used when we don't know the sampling distribution exactly
  • symmetrical and centred around 0
  • shape changes based on degrees of freedom
  • 'fat tailed' when df = 1; identical to the normal distribution when df = infinite
  • as df increases, the tails get thinner
  • the critical value changes based on df (see the qt() sketch below)
  • df = N - 1 (we subtract 1 for the one estimated parameter, the mean)
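Critical values from qt() for a two-tailed 95% level (added sketch), showing the tails thinning towards the normal limit:

qt(0.975, df = 1)     # ~12.71, fat tails
qt(0.975, df = 10)    # ~2.23
qt(0.975, df = 100)   # ~1.98
qnorm(0.975)          # ~1.96, the normal-distribution limit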
12
Q

what you need to calculate confidence intervals

A

the estimated mean
the sample SD
N
the critical value from the t-distribution with df = N - 1

- the 95% CI around the estimated population mean is mean +/- t-critical × SE (see the sketch below)
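Putting the ingredients together in base R (added sketch; the sample is an assumed example):

x      <- rnorm(25, mean = 30, sd = 5)   # an assumed example sample
n      <- length(x)
se     <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)          # critical value for a 95% CI
mean(x) + c(-1, 1) * t_crit * se         # 95% CI: mean +/- t_crit * SE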

13
Q

CI’s are useful :

A
  • the width of the interval tells us how much we expect the mean of a different sample of the same size to vary from the one we got
  • there is an x% chance that any given x% CI contains the true population mean
  • can be calculated for any point estimate
14
Q

hypothesis

A
  • statement about something in terms of differences or relationships between things/people/groups
  • must be testable
  • about a single thing
15
Q

levels of hypotheses

A
  • conceptual: expressed in normal language on level of concepts/constructs
  • operational: restates conceptual hypothesis in terms of how constructs are measured in given study
  • statistical: translates operational hypothesis into language of mathematics
16
Q

operationalisation

A
  • process of defining variables in terms of how they are measured
  • e.g. intelligence defined as the total score on Raven's Progressive Matrices
17
Q

Statistical hypothesis

A
  • operational hypothesis in terms of language of maths
  • deals with specific values of population parameters
  • mean of population can be hypothesised to be of given value
  • can hypothesise a difference in means between two populations
18
Q

problems with samples that test hypothesis

A

samples may not be representative of the population
the larger the sample the better, as fluctuations become less important as N increases
means converge to the true value of the population mean as N increases
CIs get narrower as N increases (width shrinks in proportion to 1/√N)

19
Q

null hypothesis

A

states there is no difference

used to test for statistical significance

20
Q

distribution of test statistic under H0

A

even if the true difference in the population (delta) is zero, the sample difference D can be non-zero
assume D is normally distributed in the population with µ = 0 and σ = 1: the expected value of D under H0 is 0, yet more often than not D will not equal 0 in a sample

21
Q

what is a p-value

A
  • the probability of getting a test statistic at least as extreme as the one observed if the null hypothesis is true, i.e. how likely the data are if there is no difference/effect in the population
  • if the p-value is less than the chosen significance level, the result is called statistically significant
22
Q

retain or reject null

A

reject the null hypothesis when we judge our result to be unlikely under H0
retain H0 if we judge the result to be likely under it

23
Q

continuous data

A
  • matter of degree eg how much
  • score or measurement
  • makes sense to have mean value
24
Q

categorical data

A
  • matter of membership eg which group?
  • group or label
  • membership is binary
25
Q

for each statistical analysis we need:

A

data
test statistic
distribution of test statistic
probability of value of test statistic under null hypothesis

26
Q

correlation

A
  • quantifies the degree and direction of a numeric relationship
  • used with two or more continuous variables (or if one is categorical)
  • use the Pearson correlation coefficient
  • only use the word 'correlated' when reporting r as evidence
27
Q

what code in r is used to get pearsons correlation

A

data %>% select(variable1, variable2) %>% cor(method = "pearson")
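The pipe version above assumes dplyr is loaded for select() and %>%. For a p-value and confidence interval alongside r, base R's cor.test() can also be used (an addition, not part of the deck's code; column names are placeholders):

cor.test(data$variable1, data$variable2, method = "pearson")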

28
Q

what can you suggest when confidence intervals overlap

A

the two estimates may have the same population value

29
Q

chi squared test

A
  • quantifies the relationship between two or more categorical variables
  • compare observed counts to what we would expect under the null and calculate χ² to quantify the difference
  • only use χ² if the expected count is greater than 5 in each cell
  • the bigger the χ² value, the bigger the difference between our data and what we would expect (see the sketch below)
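A minimal chisq.test() sketch in base R (the variable names are placeholders):

tab <- table(data$group, data$category)   # contingency table of two categorical variables
chisq.test(tab)                           # X-squared, df and p-value
chisq.test(tab)$expected                  # check expected counts are greater than 5 in each cell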
30
Q

important note about chi squared

A

we only test how likely the data are if the null hypothesis is true; the test provides no direct evidence for the alternative

31
Q

using t distribution

A
  • t is the difference in sample means divided by the standard error of the difference in means
  • the larger the t, the bigger the difference between sample means relative to the error (see the t.test() sketch below)
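In R this comparison of two group means is typically done with t.test() (added sketch; variable names are placeholders):

t.test(outcome ~ group, data = data)   # t = difference in means / SE of the difference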
32
Q

t and r

A
  • the p-value for r comes from the t distribution
  • t can be converted into r
  • t quantifies the difference in means between two groups
  • r quantifies the degree and direction of the relationship between two variables
33
Q

what is a predictor

A

variable that may have relationship with outcome

34
Q

what is an outcome

A

variable we want to explain

outcome = model + error

35
Q

linear model

A
  • fits a linear model between an outcome variable and a predictor variable in the dataset
  • look at lm() %>% summary()
  • R² is the proportion of variance in the outcome explained by the predictor
  • adjusted R² estimates how much variance would be explained if the same model were applied to the population
  • R² and adjusted R² should be similar and large
36
Q

r code for linear model

A

lm(outcome ~ predictor, data = data)
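A slightly fuller sketch, assuming a data frame data with columns outcome and predictor:

model <- lm(outcome ~ predictor, data = data)
summary(model)   # b0 and b1 estimates, R-squared, adjusted R-squared, F statistic and p-value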

37
Q

equation of linear model

A

outcome = b0 + b1 × predictor1 + error

38
Q

f statistic

A

F = (what the model can explain) / (what it cannot explain)

  • the ratio of variance explained relative to variance unexplained
  • a ratio > 1 means the model explains more than it cannot
  • the associated p-value gives how likely it is to find an F statistic as large as the one observed if the null is true
39
Q

how to compare linear models

A
  • compare R² and the change in R²
  • compare the F statistics and their associated p-values
  • look at standardised versions of b1 (see the anova() sketch below)
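A sketch of comparing nested linear models with anova() (added example; variable names are placeholders):

m1 <- lm(outcome ~ predictor1, data = data)
m2 <- lm(outcome ~ predictor1 + predictor2, data = data)
anova(m1, m2)   # F test for the change in R-squared between the two models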