L7; Data pre-processing & Dealing with missing data Flashcards

1
Q

data pre-processing

A

Real-world datasets often contain noisy, missing and inconsistent data. This is usually caused by the data being drawn from multiple sources or by poor data collection techniques. Data pre-processing is generally regarded as taking up about 80% of an analysis's time.

It includes:
data transformation and reshaping,
calculating variables that are functions of existing variables,
aggregation,
dealing with missing data
2
Q

dplyr package (5)

A

filter, select, arrange, mutate, summarise
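
As a minimal sketch of what each of the five verbs does, the example below applies them to a small data frame. The `scores` data and its column names are invented for illustration; only the dplyr function names come from the card.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
scores <- data.frame(
  name  = c("Ann", "Bob", "Cat", "Dan"),
  group = c("a", "b", "a", "b"),
  score = c(70, 55, 90, 65)
)

kept     <- filter(scores, score > 60)               # keep rows that meet a condition
narrowed <- select(scores, name, score)              # keep only the named columns
sorted   <- arrange(scores, desc(score))             # sort rows, highest score first
extended <- mutate(scores, pct = score / 100)        # add a column computed from existing ones
overall  <- summarise(scores, mean_score = mean(score))  # collapse all rows into one summary row
```

Each verb takes the data frame as its first argument and returns a new data frame, leaving `scores` itself unchanged.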

3
Q

dplyr; filter

A

filter(data, condition): keeps only the rows of data for which the condition is TRUE.
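
A minimal sketch of a single-condition filter, assuming a made-up `pets` data frame (the data and column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
pets <- data.frame(
  name    = c("Rex", "Tom", "Kit"),
  species = c("dog", "cat", "cat"),
  age     = c(5, 2, 7)
)

# First argument: the data. Second argument: a logical condition on its columns.
old_pets <- filter(pets, age > 4)
```

Here `old_pets` holds only the rows whose `age` is greater than 4.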

4
Q

and

A

&

5
Q

or

A

|

6
Q

equal

A

==
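
The three operators on these cards (`&`, `|`, `==`) are typically combined inside a filter condition. A short sketch, again using a hypothetical `pets` data frame:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
pets <- data.frame(
  name    = c("Rex", "Tom", "Kit"),
  species = c("dog", "cat", "cat"),
  age     = c(5, 2, 7)
)

cats       <- filter(pets, species == "cat")            # equal
old_cats   <- filter(pets, species == "cat" & age > 4)  # and: both must hold
dog_or_old <- filter(pets, species == "dog" | age > 4)  # or: at least one must hold
```

Note that equality is tested with a double `==`; a single `=` is used for assignment and argument names rather than comparison.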

7
Q

then

A

%>%
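
The pipe passes the result on its left as the first argument of the function on its right, which is why it reads as "then". A minimal sketch on hypothetical `scores` data, with the equivalent nested call for comparison:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
scores <- data.frame(
  name  = c("Ann", "Bob", "Cat", "Dan"),
  score = c(70, 55, 90, 65)
)

# Read as: take scores, then filter, then mutate, then arrange
piped <- scores %>%
  filter(score > 60) %>%
  mutate(pct = score / 100) %>%
  arrange(desc(score))

# The same steps written as nested calls, which must be read inside-out
nested <- arrange(mutate(filter(scores, score > 60), pct = score / 100), desc(score))
```

Both forms should give the same data frame; the piped version simply lists the steps in the order they happen.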

8
Q

missing problem (3)

A

Missing data can usually be classified into:
  1. Missing completely at random (MCAR)
    if missingness does not depend on any values of the dataset, observed or unobserved.
  2. Missing at random (MAR)
    if missingness does not depend on the unobserved values of the dataset but does depend on the observed values.
  3. Not missing at random (NMAR)
    if missingness depends on the unobserved values of the dataset.
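
One way to make the three mechanisms concrete is to simulate them. The sketch below uses base R only; the variables (`age`, `income`), the probabilities and the thresholds are all invented assumptions, chosen just to make the difference visible.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

n      <- 1000
age    <- round(runif(n, 20, 60))                  # always observed
income <- 20000 + 500 * age + rnorm(n, sd = 5000)  # will have values removed

# MCAR: every income has the same 30% chance of being missing
income_mcar <- ifelse(runif(n) < 0.3, NA, income)

# MAR: the chance of being missing depends only on the observed age
p_mar      <- ifelse(age > 40, 0.5, 0.1)
income_mar <- ifelse(runif(n) < p_mar, NA, income)

# NMAR: the chance of being missing depends on the income value itself
p_nmar      <- ifelse(income > 45000, 0.6, 0.1)
income_nmar <- ifelse(runif(n) < p_nmar, NA, income)

# Compare the mean of the remaining values with the true mean
observed_means <- c(
  true = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar,  na.rm = TRUE),
  nmar = mean(income_nmar, na.rm = TRUE)
)
```

Under MCAR the mean of the remaining values would be expected to stay close to the true mean, whereas under NMAR it is pulled away because high incomes are removed more often.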
9
Q

the procedure for dealing with missing data

A
  1. Identify the missing data.
  2. Identify the cause of the missing data.
  3. Either:
    A: remove the rows containing the missing data (the naive approach); this is only appropriate if the missingness is not biased.
    B: replace (impute) the missing values with alternative values; there are a number of approaches.
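
Step 1 can be sketched in base R as below. The `survey` data frame is hypothetical, and the last line is only one possible starting point for step 2: it checks whether the missing rate of one variable differs across an observed variable, which may hint at the cause but does not establish it.

```r
# Hypothetical toy data, invented for illustration only
survey <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  age    = c(25, NA, 40, 31, 58),
  income = c(30000, 42000, NA, 38000, NA)
)

# Step 1: identify the missing data
missing_per_column <- colSums(is.na(survey))          # count of NAs in each column
incomplete_rows    <- which(!complete.cases(survey))  # rows with at least one NA

# Towards step 2: missing rate of income within each observed region
income_missing_by_region <- tapply(is.na(survey$income), survey$region, mean)
```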
10
Q

deletion (2)

A

Listwise deletion analyses only the data rows that have complete data for every column. Advantages: it is simple, and results are easy to compare across analyses. Limitations: the results could be biased, and the lower n reduces statistical power.

Pairwise deletion analyses the data rows where the variables of interest have data present. Advantage: it uses all the available information. Limitation: separate analyses cannot be compared, as the data/sample will be different for each.
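
The difference between the two can be sketched with base R's `cor()`, whose `use` argument supports both behaviours. The `survey` data below is hypothetical.

```r
# Hypothetical toy data, invented for illustration only
survey <- data.frame(
  age    = c(25, 32, 40, 31, 58),
  income = c(30, NA, 45, 38, 60),
  hours  = c(35, 40, NA, 38, 20)
)

# Listwise: drop every row that has any NA, then analyse what is left
listwise   <- na.omit(survey)
n_listwise <- nrow(listwise)
cor_listwise <- cor(survey, use = "complete.obs")

# Pairwise: each correlation uses all rows where that pair of variables is present
cor_pairwise <- cor(survey, use = "pairwise.complete.obs")

# Rows available for the age-income correlation under pairwise deletion
n_age_income <- sum(complete.cases(survey[, c("age", "income")]))
```

In this toy data the pairwise age-income correlation has one more row available than the listwise analysis, but each pairwise correlation may then rest on a different set of rows, which is the comparability problem noted above.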

11
Q

replacing missing data

A
  1. Simple Imputation
    Missing values are replaced with the mean, median or mode value. It is not stochastic and is very simple. Limitations: the results could be biased, standard errors are underestimated, and correlations among variables could be distorted.
  2. Multiple Imputation
    Missing values are estimated through repeated simulations. It is stochastic, and the variability is estimated more accurately. Limitations: the algorithms are more complex and would normally require complex coding.
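
A minimal sketch of simple imputation in base R, using hypothetical `income` values. It also shows one of the stated limitations: filling with the mean shrinks the spread of the variable. Mode imputation is left out because base R has no built-in function for the statistical mode, and multiple imputation is not sketched because it normally relies on a dedicated package.

```r
# Hypothetical values, invented for illustration only
income <- c(30, NA, 45, 38, NA, 60)

# Mean imputation: replace each NA with the mean of the observed values
mean_fill <- income
mean_fill[is.na(income)] <- mean(income, na.rm = TRUE)

# Median imputation: same idea with the median
median_fill <- income
median_fill[is.na(income)] <- median(income, na.rm = TRUE)

# Spread before and after mean imputation
sd_observed <- sd(income, na.rm = TRUE)
sd_imputed  <- sd(mean_fill)
```

Because every filled value sits exactly on the mean, `sd_imputed` comes out smaller than `sd_observed`, which is how simple imputation ends up underestimating standard errors.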