Stat - Exam #1 Flashcards Preview

Spring 2015 > Stat - Exam #1 > Flashcards

Flashcards in Stat - Exam #1 Deck (134)
1
Q

What is Statistics?

A
  • The science of COLLECTING, ORGANIZING and SUMMARIZING and ANALYZING information to draw conclusions;
  • Science of data
2
Q

What kind of data is used in Statistics?

A

Probabilistic data = data whose characteristics are unknown for any one observation, but known across many observations;
— characterized well in the long run, but unknown individually

3
Q

What are the techniques of gathering statistical data?

A
  1. Sampling;
  2. Descriptive Stats;
  3. Inferential Stats
4
Q

What is Sampling?

A

-Techniques used to collect info
— Major technique = Simple Random Sampling;
— COLLECTING techniques

5
Q

What are Descriptive Statistics?

A

-Techniques used to condense and describe sets of data;
— Major techniques = Frequency Tables, histograms, and summary numbers;
— ORGANIZING and SUMMARIZING techniques

6
Q

What are Inferential Statistics?

A

-Techniques used to systematically draw conclusions about a population from a set of sample data;
-Gather population information from a sample;
— Interpretation of data by generalizing info from a sample to apply to a population
— Major tools = Hypothesis testing and confidence intervals;
— ANALYZING techniques

7
Q

What are Statistical Methods?

A

-Combo of descriptive and inferential techniques (collect, organize, summarize, analyze)

8
Q

What is a Population?

A
  • The totality of elements in a well-defined group to be studied;
  • MUST be WELL-DEFINED by clearly stating what exact and specific elements (people, animals, etc) DO and DO NOT belong in the population
9
Q

What is a Sample?

A

A SUBSET of the population;

— Larger the sample size the better, but the METHOD is more important than the size

10
Q

What is an Individual?

A

ONE object from the POPULATION

11
Q

What is the goal of Sampling?

A

To collect…

  1. a measurable number of individuals that…
  2. represent the population
    * Measuring the sample gives info about the population
12
Q

What are the 4 sampling techniques?

A
  1. **Simple random = best;
  2. Stratified;
  3. Systematic;
  4. Cluster
13
Q

How can sampling be done?

A
  1. WITH replacement;
  2. WITHOUT replacement

14
Q

What are the non sampling errors?

A
  1. Coverage errors = incomplete population;
  2. Nonresponse errors = cannot measure selected element;
  3. Inaccurate response errors = poor records, lying;
  4. Measurement errors = ambiguous questions, crude tools
15
Q

What is Simple Random Sampling?

A

A method of choosing a sample such that each sample of the same size has the same chance of being chosen;
- Each individual has equal chance of being chosen

16
Q

What does Random sampling do?

A
  1. DOES remove SELECTION bias from the sample;
  2. Does NOT affect the natural variability of data;
  3. Does NOT guarantee a representative sample;
    * *ONLY way to allow inferences (informed guesses) about the population
17
Q

What is the method of Random Sampling?

A
  1. Assign every individual in a population a number;
  2. Select individuals to be in the sample by:
    — Random number table or
    — Random number generator
    (Data > Random Variates > Distribution)
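The two steps above can be sketched in Python instead of a random number table. This is only an illustration: the population of 20 numbered individuals, the seed, and the names `population`, `sample_without`, and `sample_with` are assumptions, not part of the deck.

```python
import random

# Step 1: assign every individual in a (hypothetical) population a number.
population = list(range(1, 21))

# Seeded generator so the sketch is repeatable; the seed value is arbitrary.
rng = random.Random(2015)

# Step 2: select individuals with a random number generator.
# Sampling WITHOUT replacement: no individual can be chosen twice.
sample_without = rng.sample(population, k=5)

# Sampling WITH replacement: the same individual may appear more than once.
sample_with = rng.choices(population, k=5)

print("without replacement:", sample_without)
print("with replacement:   ", sample_with)
```

Every sample of size 5 is equally likely under `rng.sample`, which is the defining property of simple random sampling.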
18
Q

What are the 3 classes of data?

A
  1. Constant = measurement gives only one possible value;
  2. Variable = repeated measure yields many possible values;
  3. Random Variable = randomly varying values (value determined by chance):
    * *Stats is concerned with data from random variables
19
Q

What are the types of data?

A
  1. Qualitative;

2. Quantitative = Discrete or Continuous

20
Q

What is Qualitative Data?

A

Data that can be classified by some mutually exclusive and exhaustive quality of individuals;
EX: Color, religion, gender

21
Q

What is Quantitative Data?

A

Data that are numerical and allow the use of arithmetic

22
Q

What is Discrete Data?

A

*Quantitative;
-Easily countable number of possible values;
EX: 1-10 (number of items)

23
Q

What is Continuous Data?

A

*Quantitative;
- Infinite number of possible values;
EX: Time, weight, length

24
Q

What is the method to identify TYPES OF DATA?

A
  1. Pick any TWO data points;
  2. Can they be ordered?
    —NO = QUALITATIVE:
    —YES…
  3. Countable number of values between?
    —NO = Continuous;
    —YES = Discrete
25
Q

What is a Census?

A
-Study that measures a characteristic of EVERY individual in a population;
— Observational;
— DOES NOT involve a sample;
— Measure the whole population
EX: US Census
26
Q

What is an Observational Study?

A

-Study that measures a characteristic in a sample WITHOUT controlling the units or treatment;
— Called an “Ex Post Study” (after the fact) because values have already been established;
— Determines ASSOCIATION, not cause;
EX: students' heights in a class

27
Q

What is Experimental Design?

A

-Study that measures a characteristic in a sample WHILE CONTROLLING the units or treatment;
EX: measure wt. gain of 3 emails on a week long high protein diet
1. Independent or
2. Dependent

28
Q

What is Independent Design?

A
  • Experimental;

- Where all experimental units are randomly chosen and assigned to treatments randomly

29
Q

What is Dependent Design?

A
  • Experimental;
  • One half of the experimental units are chosen randomly and the second half are chosen by matching characteristics;
  • “Matched-Pairs Design”
30
Q

When do you use an Experimental Design?

A

When you can BOTH:

  1. Control the individuals' characteristics and
  2. Control the treatment
    * If you CAN’T do both, use observational
31
Q

Why can’t observational studies determine causation?

A
  • Because of possible lurking variables;

- Lurking variables = NOT measured, but DO affect the results (EX: Snoring and a risk of heart attack)

32
Q

What does an experiment mean in statistics?

A
  • High level of control;

- Often takes more than one study to eliminate or control lurking variables

33
Q

What is an Experimental Unit?

A

An individual in a sample

34
Q

What is Treatment?

A

A condition of interest that is applied to the experimental unit

35
Q

What is the Response Variable?

A

A quantitative or qualitative variable that reflects the characteristic of interest

36
Q

What is a Double Blind study?

A

Neither the researcher nor the experimental unit knows whether, or what, treatment is being applied

37
Q

What is a Placebo?

A
  • A false treatment that has NO effect;

- Used to prevent experimental units from knowing whether they are being treated

38
Q

How do I describe a column of data?

A

Distribution of data gives SHAPE, LOCATION, and SPREAD;

-These are very useful in abstracting the info from the data

39
Q

What is the process of statistics?

A
  1. Ask question;
  2. Collect data = census, observational study, experimental design, or existing data;
  3. Organize and analyze = overview with tables/graphs; detailed using methods depending on type;
  4. Make a conclusion
40
Q

What is Raw Data?

A

Data NOT organized

41
Q

How is a variable (column of data) described?

A

-Condensed and described by its distribution;
-Distribution described by shape, location, and spread =
— Graphical methods determine shape;
— Numerical methods find location and spread

42
Q

How can Qualitative data be graphically summarized?

A
  • Frequency table and in a Graph (bar chart, pareto chart, and pie chart);
  • Frequency table organizes = shows what possible VALUES a variable can take and HOW OFTEN each value occurs;
  • A picture gives a better overview
43
Q

What is Frequency Table?

A

A table that lists all categories of data, with number of occurrences for each category;

  • 5 Columns =
    1. Category
    2. Frequency
    3. Relative Frequency
    4. Cumulative Frequency
    5. Cumulative Relative Frequency
44
Q

What is Category?

A

Lists the names of all categories in a column of data

45
Q

What is Frequency?

A

The number of observations in each category

46
Q

What is Relative Frequency?

A

The percent, or proportion, of data in each category;

Relative Frequency = (Frequency/Sum of all Frequencies)

47
Q

What is Cumulative Frequency?

A

The sum of the frequencies up through and including the category of interest;
-the number of observations less than or equal to the category value

48
Q

What is Cumulative Relative Frequency?

A

The sum of the relative frequencies up through and including the category of interest
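The five columns of a frequency table can be built in a few lines of Python. This is a sketch under assumptions: the eye-color data are made up, and the categories are listed alphabetically only to give the cumulative columns a fixed order.

```python
from collections import Counter

# Hypothetical qualitative data: eye colors of 10 individuals.
data = ["brown", "blue", "brown", "green", "brown",
        "blue", "brown", "green", "blue", "brown"]

counts = Counter(data)          # frequency of each category
n = sum(counts.values())        # sum of all frequencies

rows = []
cum_freq = 0
for category in sorted(counts):             # fixed (alphabetical) order
    freq = counts[category]
    cum_freq += freq                        # running total of frequencies
    rows.append((category,                  # 1. Category
                 freq,                      # 2. Frequency
                 freq / n,                  # 3. Relative Frequency
                 cum_freq,                  # 4. Cumulative Frequency
                 cum_freq / n))             # 5. Cumulative Relative Frequency

for row in rows:
    print("%-8s %3d %6.2f %3d %6.2f" % row)
```

The last row's cumulative frequency equals the number of observations, and its cumulative relative frequency equals 1, which is a quick check on the table.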

49
Q

What is a Bar Chart?

A

A graph of a set of data made with:

  1. Categories on horizontal axis
  2. Frequencies on vertical axis;
  3. Rectangle of equal width (bar) drawn for each category with the height equal to the category’s frequency (or relative frequency);
    - Bars do NOT touch;
    - Values are in the middle of the bars
50
Q

What is a Pareto Chart?

A

A bar chart whose bars are drawn in descending order of height

51
Q

What is a Pie Chart?

A

A circle divided into wedges, where each wedge represents a category and the size of the wedge represents the relative frequency of that category;
-Summarizes qualitative data

52
Q

How is Discrete Data summarized?

A
  • HISTOGRAM;
  • Values are used to create categories;
  • Histogram bars DO touch;
  • Values marked in the middle of the bars
53
Q

How do you graphically represent Continuous Data?

A
  • Too many categories since each number is its own category;
  • To condense, need to GROUP data into intervals and create new, smaller categories;
    1. Group into classes; make a frequency table; make a histogram;
    2. Or make a stem-and-leaf plot
54
Q

What is the method to group Continuous data into classes?

A
  1. Decide on the number of intervals (5-20).
  2. Find the width of each interval (divide range by number of intervals)
  3. Select a starting point for the first interval (usually the min)
  4. Make the remaining intervals equidistant from start = equal widths, adjacent, no overlap, and convenient endpoints
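The four grouping steps above can be sketched as follows. The data values, the choice of 5 classes, and rounding the width up to a whole number for "convenient endpoints" are all assumptions made for illustration.

```python
import math

# Hypothetical continuous data (for example, weights).
data = [12.1, 15.4, 18.9, 21.0, 22.7, 25.3, 27.8, 30.2, 33.6, 39.5]

num_classes = 5                                   # step 1: number of intervals
data_range = max(data) - min(data)
width = math.ceil(data_range / num_classes)       # step 2: width, rounded up
start = math.floor(min(data))                     # step 3: start at (about) the min

# Step 4: equal-width, adjacent, non-overlapping classes [lower, upper).
classes = [(start + i * width, start + (i + 1) * width)
           for i in range(num_classes)]

# Frequency of each class: lower limit included, upper limit excluded.
freqs = [sum(1 for x in data if lo <= x < hi) for lo, hi in classes]

for (lo, hi), f in zip(classes, freqs):
    print("%2d to <%2d : %d" % (lo, hi, f))
```

Treating each class as closed on the left and open on the right is what keeps adjacent classes from overlapping.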
55
Q

What is a Histogram for CONTINUOUS data?

A
  • Graph of a set of data made like a bar chart, but:
    1. Bars DO touch;
    2. The lower limit of each class is marked at the LEFT of each rectangle
56
Q

What is a Stem-and-Leaf Plot?

A

-Graph to summarize continuous data by dividing each data point into a stem part and a leaf part and listing these parts

57
Q

What is the method to make a Stem-and-Leaf Plot?

A
  1. Rank the data from low to high;
  2. Divide each point into stem and leaf;
    - Leaf = rightmost digit
    - Stem = digits to the left of leaf;
  3. Write stems vertically from low to high;
  4. Draw line to right of numbers;
  5. Write the leaf next to the corresponding stem
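A small text version of these steps, assuming hypothetical two-digit whole-number data so that the leaf is the ones digit and the stem is the tens digit:

```python
from collections import defaultdict

# Hypothetical two-digit data.
data = [23, 45, 27, 31, 38, 45, 52, 29, 34, 41]

plot = defaultdict(list)
for x in sorted(data):                 # step 1: rank from low to high
    stem, leaf = divmod(x, 10)         # step 2: leaf = rightmost digit
    plot[stem].append(leaf)

lines = []
for stem in range(min(plot), max(plot) + 1):    # step 3: stems low to high
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    lines.append("%d | %s" % (stem, leaves))    # steps 4-5: line, then leaves

print("\n".join(lines))
```

Stems with no data would still be printed with an empty leaf list, so gaps in the data stay visible.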
58
Q

How are distribution SHAPES described?

A
  • Symmetric
  • Skewed left
  • Skewed right
  • Uniform
  • Bimodal
59
Q

How are histograms analyzed?

A
  1. Overall shape (frequency curve);

2. Any deviation from general shape

60
Q

How are columns of data analyzed?

A
  1. Shape
  2. Location
  3. Spread
61
Q

What are Summary Numbers?

A

Numerical quantity used to describe data

62
Q

What are the Numerical Measure (stats) that represent/summarize characteristics?

A
  • Number of Observations = Size;
  • Mean, Median, Mode = Location;
  • Range, variance, standard deviation = Spread;
  • Correlation, least-squares regression = Association
63
Q

What is a Parameter?

A

-A summary number for a POPULATION;
— Constant as population does not change;
— Greek letters

64
Q

What is a Statistic?

A

-A summary number for a SAMPLE;
— Variable as the sample varies (changes)
— Roman letters

65
Q

What is a Resistant Statistic?

A
  • A stat that is NOT sensitive to extreme data values;

* *MEDIAN is more resistant than the MEAN — median won’t typically alter with an outlier

66
Q

What is Size?

A

The size of a set of data is the NUMBER of INDIVIDUALS (data points) in the set;
— Sample = n;
— Population = N

67
Q

What are summary numbers for Location?

A
  • Tell where the MIDDLE of the data is located on the real number line;
  • To find the middle, imagine a histogram;
  • Determine the middle of the histogram and find where the middle hits the number line;
  • *Condensing column of data to ONE number
68
Q

What are the numerical measures for location?

A
  • Mean, Median, and Mode;
  • Best measure of location depends on:
    1. TYPE of data;
    2. SHAPE of distribution
69
Q

What are the measures per data types for location?

A
  • Quantitative = Symmetric = MEAN (most info):
  • Quantitative = Skewed = MEDIAN (“ugly” graphs):
  • Qualitative = No shape = MODE
70
Q

What is Binomial Data?

A
  • Special case of qualitative data;
  • Only TWO values: 0 for failure and 1 for success;
  • Best measure is the PROPORTION of successes, which is the average of the data (x-bar)
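A quick check that the proportion of successes is the mean of 0/1 data; the eight observations below are made up for illustration.

```python
# Hypothetical binomial data: 1 = success, 0 = failure.
data = [1, 0, 1, 1, 0, 1, 0, 1]

successes = sum(data)              # adding the 1s counts the successes
x_bar = successes / len(data)      # mean of the data = proportion of successes

print(successes, x_bar)
```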
71
Q

What is the Mean?

A
  • Arithmetic average of all data points (balance point);
  • Sample = X-bar
  • Population = u;
  • Used for discrete and continuous data;
  • Advantages = easily understood; uses all points;
  • Disadvantages = affected by extreme data
72
Q

Why is the mean the most commonly used measure of location?

A
  • Algebraically easy to use;

- Statistically more stable in that it tends to vary less from sample to sample than other measures

73
Q

What is the method to find the MEAN?

A
  1. Rank the data points from lowest to highest (optional);
  2. Sum of all values;
  3. Divide by the number of data points
74
Q

What is the Median?

A
  • The numerical value that lies in the middle of a ranked set of data;
  • Sample = M
  • Population = M;
  • Used for discrete and continuous data;
  • Advantages = NOT affected by outliers (RESISTANT);
  • Disadvantages = uses information from only the position of data
75
Q

The median is the balance of what?

A
  • The balance point for the NUMBER of DATA POINTS;
  • Half below and half above;
  • Median will be ONLY the
    1. Value of a data point, or;
    2. Simple average between two adjacent data points
76
Q

What is the method to find the Median?

A
  1. RANK (must) the data points from low to high;
  2. TABLE, fill in Index, Position, Value;
    (Find the singular middle value of an odd numbered set, or find the average of the 2 middle in an even numbered set)
77
Q

What is the Mode?

A
  • The value that occurs MOST frequently in a set of data;
  • Sample = Mode;
  • Population = Mode;
  • Advantages = Easy to find;
  • Disadvantages = Not unique and uses info from only part of data
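The three location measures can be compared with Python's `statistics` module. The sample and the added extreme value of 100 are hypothetical; the point is only to show the mean moving with an outlier while the median barely changes.

```python
import statistics

data = [2, 3, 3, 4, 5, 7]              # hypothetical sample
with_outlier = data + [100]            # same sample plus one extreme value

mean = statistics.mean(data)           # sum of values / number of values
median = statistics.median(data)       # average of the two middle values
mode = statistics.mode(data)           # most frequent value

mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print("mean  ", mean, "->", round(mean_out, 2))
print("median", median, "->", median_out)
print("mode  ", mode)
```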
78
Q

What “completely” describes a column of numbers?

A
  • Summary Numbers;
  • Spread tells how WIDESPREAD the data are on the real number line;
  • WIDTH = indication of variability
79
Q

How does width indicate variability?

A
  • More variable data has a GREATER width;

- Less variable data has SMALLER width

80
Q

What are the 4 common measures of SPREAD?

A
  1. Range;
  2. Variance;
  3. Standard Deviation;
  4. Interquartile Range (IQR)
81
Q

What is the Range?

A

-The difference between the largest and smallest data value;
— Sample = R;
— Population = R;
-Advantages = Easy calc;
-Disadvantages = Not resistant; uses only the two most extreme data points (NOT all data points)

82
Q

What is the method to find the Range?

A
  1. Rank the data from low to high (MUST);
  2. Find the largest data value and the smallest data value;
  3. Take the difference: R = max - min
83
Q

What is the Interquartile Range (IQR)?

A

The difference between the 75th percentile and the 25th percentile;
— Sample = IQR;
— Population = IQR;
-Advantages = RESISTANT version of the range (removes outliers);
-Disadvantages = Only gives spread of the middle 50% of data;
EX: P(75) - P(25) = IQR

84
Q

What is the Variance?

A

*Most important summary number for spread in stats;
-The ‘average’ of the squared deviations of the data points from the mean;
-NOT the ‘true average’ because it is divided by the degrees of freedom and not by the number of data points;
— Sample = s^2;
— Population = sigma^2;
-Advantages = BEST estimator of spread;
-Disadvantages = Measured in DIFFERENT units than the mean (squared units)

85
Q

What is the method to find the Variance?

A
  1. Rank the data points;
  2. Use the Sum-of-Squares Table;
  3. Calculation: s^2 = [Sum of (X(i) - avg.)^2] / (n - 1)
    * Divide the Sum-of-Squares by the degrees of freedom
86
Q

What is the formula for Sum-of-Squares?

A

Summation of (X(i) - avg.)^2

87
Q

What are the Degrees of Freedom?

A

-Sample size (n) - 1;
-The mean was estimated in order to calculate the variance, so one degree of freedom is lost;
-Or: once the value of the mean is known, every data point but the last can take any value; the last data point must take one specific value to give the correct mean, so it is not free;
= Number of FREE data points

88
Q

What is the Standard Deviation?

A

-Square root of the variance;
-*Most COMMON summary for spread;
— Sample = s;
— Population = sigma;
-Gives the “average” deviation of the data from the mean; not a true average since it is divided by the degrees of freedom, not the number of data points

89
Q

What is the method of the Standard Deviation?

A
  1. Take the square-root of the variance: s = sq. root of s(^2)
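The sum-of-squares, variance, and standard deviation calculations can be traced step by step. The five-point sample is hypothetical; `statistics.variance` and `statistics.stdev` are included only as a cross-check, since they also divide by n - 1.

```python
import math
import statistics

data = [2, 4, 4, 6, 9]                 # hypothetical sample
n = len(data)
x_bar = sum(data) / n                  # sample mean

# Sum-of-Squares: squared deviations of each point from the mean, added up.
sum_of_squares = sum((x - x_bar) ** 2 for x in data)

df = n - 1                             # degrees of freedom
variance = sum_of_squares / df         # s^2
sd = math.sqrt(variance)               # s = square root of s^2

print(sum_of_squares, variance, round(sd, 4))
print(statistics.variance(data), round(statistics.stdev(data), 4))
```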
90
Q

What is the “kth” percentile?

A

-The number that divides a set of ranked data points into two sets:
1. the lower k%;
2. upper (100 - k)%;
— Sample = P(k);
— Population = P(k);
-Percentiles (quartiles) are like the place to slice a loaf of bread to divide it in two

91
Q

What is the method to find the “kth” percentile?

A
  1. RANK, from low to high (MUST);
  2. Table, fill in (Index, Position, Value)
    -index: i = (k/100) x n
    -position:
    — i is an integer = average the i-th and (i+1)-th data points;
    — i is a decimal = round up to the next larger position
  3. Value: from position in the ranked set of data
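The index/position rule above can be written as a short function. This is a sketch of the rule as stated on the card; other textbooks and software use different percentile conventions, and the function name `percentile` and the sample data are assumptions.

```python
def percentile(data, k):
    """k-th percentile by the index/position rule, for 0 < k < 100."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ranked = sorted(data)                      # step 1: rank low to high
    n = len(ranked)
    # Index i = (k/100) * n, split into whole part and remainder so that
    # "is i an integer?" is decided without floating-point rounding.
    whole, remainder = divmod(k * n, 100)
    whole = int(whole)
    if remainder == 0:
        # i is an integer: average the i-th and (i+1)-th points (1-based).
        return (ranked[whole - 1] + ranked[whole]) / 2
    # i is a decimal: take the next larger position (1-based whole + 1).
    return ranked[whole]

data = [3, 5, 7, 8, 9, 11, 13, 15]             # hypothetical sample, n = 8
print(percentile(data, 25), percentile(data, 40), percentile(data, 75))
```

With n = 8, the 25th percentile has index 2 (an integer), so it averages the 2nd and 3rd points; the 40th percentile has index 3.2, so it takes the 4th point.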
92
Q

What do the INDEX and POSITION show for the “kth” percentile?

A

-INDEX takes you to just below the desired percentile;

— POSITION takes you the rest of the way

93
Q

What are Quartiles?

A
  • Quartiles are the most common percentiles;

- Divide a set of ranked data points into four equal parts, each part of the set of data contains 25% of the data points

94
Q

What is the First Quartile?

A
  • The number such that 25% of the ranked data points are smaller, and 75% are greater;
  • Denoted: Q1
95
Q

What is the Third Quartile?

A
  • The number such that 75% of the ranked data points are smaller, 25% are greater;
  • Denoted: Q3
96
Q

What is the Second Quartile?

A

-Actually the MEDIAN (and called such)

97
Q

What is the main disadvantage for using RANGE for a summary number for spread?

A
  • It is NOT a RESISTANT stat and is strongly affected by extreme data points;
  • May not represent the bulk of the data, especially if the two extremes are considerable outliers;
  • Need to correct by measuring the range of only the middle 50% of the data points, so that the extremes have no influence = IQR
98
Q

What are Fences?

A
  • Check for extreme observations or outliers;
  • Do NOT automatically kick-out, but check-it out;
  • Use the fences to determine which observations are outliers;
  • OUTLIERS = Smaller than LOWER fence or larger than UPPER fence
99
Q

Formula for LOWER Fence?

A

Q1 - 1.5(IQR)

100
Q

Formula for UPPER Fence?

A

Q3 + 1.5(IQR)
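The two fence formulas can be applied to flag possible outliers. In this sketch the sample is hypothetical and the helper `percentile` is an assumed implementation of the deck's index/position rule for Q1 and Q3; other quartile conventions would give slightly different fences.

```python
def percentile(ranked, k):
    """k-th percentile of already-ranked data (index/position rule)."""
    whole, remainder = divmod(k * len(ranked), 100)
    whole = int(whole)
    if remainder == 0:
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[whole]

data = sorted([3, 5, 7, 8, 9, 11, 13, 40])    # hypothetical sample

q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag points outside the fences; these should be checked, not auto-removed.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```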

101
Q

What are the important RESISTANT measures?

A
  1. Median = resistant for location;

2. IQR = resistant for spread

102
Q

What 3 numbers give the most information about data?

A

(Q1, M, Q3);

-Only missing the tails of data (min and max)

103
Q

What is the Five-Number Summary?

A
  • A set of numbers consisting of the smallest data value (min), Q1, median (M), Q3, and the largest data value (max)
  • {Min, Q1, M, Q3, Max}
104
Q

What is a Boxplot?

A

-Picture of the 5-number summary

105
Q

How can the SHAPE of a column of data be seen from a boxplot?

A

(Shape - Median - Tails)

  1. Symmetric = center median = equal tails;
  2. Skew Right = median left = right tail longer;
  3. Skew Left = median right = left tail longer
106
Q

Properties of Normal Distribution

A
  • Probability = Area under curve;
  • z-Transformation: z = (x - u)/sigma, where x = value, u = population mean, sigma = population standard deviation;
  • Normal probability plot
107
Q

What are the data characteristics for Normal Distribution?

A

Data type = Continuous

Data Distribution = Normal

108
Q

What is Probability Density Function (PDF)?

A
  • Equation of a curve used to compute probabilities of a continuous, random variable, which satisfies 2 conditions:
    1. Area under ENTIRE curve must equal 1;
    2. Curve must be greater than, or equal to, zero at every point — CANNOT be negative
109
Q

What is the Normal Probability Density Function?

A
  • Equation (don’t have to integrate);
  • Describes a symmetric, bell-shaped curve;
  • Completely defined by the mean and variance (standard deviation)
110
Q

What defines the shape of normal curves?

A

-Defined by the equation, and the only difference will ever be the LOCATION or the SPREAD

111
Q

What are the properties of Normal Distribution?

A
  1. Symmetric about the mean (u) =
    — Mode, median, and mean are the same point;
    — Area under the curve to RIGHT of the mean (u) is equal to the area under the curve to the left of the mean (area=0.5);
  2. Curve approaches but never touches zero;
  3. Area under the curve is exactly 1 by definition
112
Q

What are the Two Symmetries?

A

-These properties lead to two symmetries of the normal curve =
1. If the area under the curve to the left of point -a is A;
2. Then the area under the curve to the right of:
Symmetry 1: Point -a is (1-A);
Symmetry 2: Point a is (A)

113
Q

What does the Area Under the Curve give?

A

-Area under the curve for an event gives the probability of the event happening, if the curve is a PDF.

114
Q

What are the types of Probability?

A
  1. PROPORTION of population described by the event;

2. PROBABILITY that a randomly selected individual from the population will be described by the event

115
Q

What is the Empirical Rule?

A
For any NORMAL Curve:
-Between pop.mean(u) +/- 1SD = 68% Area
-Between pop.mean(u) +/- 2SD = 95% Area
-Between pop.mean(u) +/- 3SD= 99.7% Area;
Also, for the quartiles:
-Between pop.mean(u) +/- 0.67SD = 50% Area
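These areas can be checked numerically against the standard normal curve using `statistics.NormalDist` (Python 3.8+). The helper name `area_within` is an assumption for illustration; the rule's percentages are rounded, so the computed areas agree only approximately.

```python
from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, SD 1

def area_within(k):
    """Area under the curve between mean - k SD and mean + k SD."""
    return z.cdf(k) - z.cdf(-k)

areas = {k: area_within(k) for k in (0.67, 1, 2, 3)}

for k, area in areas.items():
    print("within +/- %.2f SD: %.4f" % (k, area))
```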
116
Q

How will we find the area under a certain part of any normal curve?

A

-Convert any other normal curve into the STANDARD NORMAL CURVE and use one table

117
Q

What is Standardizing of a normal random variable?

A

-Means to convert a column of data from x-values (normal distribution) to z-scores (standard normal distribution) = Z-transformation
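A minimal sketch of the z-transformation applied to a column of x-values. The population mean of 100 and standard deviation of 15 are hypothetical values chosen so the z-scores come out as whole numbers.

```python
def z_score(x, mu, sigma):
    """z = (x - u) / sigma: how many SDs x lies from the population mean."""
    return (x - mu) / sigma

mu, sigma = 100, 15                    # hypothetical population parameters
xs = [70, 85, 100, 115, 130]           # column of x-values

zs = [z_score(x, mu, sigma) for x in xs]
print(zs)
```

A value equal to the mean always maps to z = 0, and values one SD above or below map to z = +1 or z = -1.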

118
Q

P (z < a)

A

Probability that a standard normal random variable is…

LESS than a

119
Q

P (a < z)

A

Probability that a standard normal random variable is…

GREATER than a

120
Q

P (a < z < b)

A

Probability that a standard normal random variable is…

BETWEEN a and b, EXCLUSIVE

121
Q

P (z </= a)

A

Probability that a standard normal random variable is…

LESS than, or equal to, a

122
Q

P (a </= z)

A

Probability that a standard normal random variable is…

GREATER than, or equal to, a

123
Q

P (a </= z </= b)

A

Probability that a standard normal random variable is…

BETWEEN a and b, INCLUSIVE

124
Q

Symmetry 1

A

Area LEFT of (-a) = Area RIGHT of (+a);

P (z </= -a) = P (z >/= a)

125
Q

Symmetry 2

A

Area LEFT of (-a) = (1 - Area LEFT of +a);

P (z </= -a) = 1 - P (z </= a)

126
Q

What is z-Sub-Alpha (z_alpha)?

A

The z-score such that the area under the standard normal curve to the RIGHT of z_alpha is alpha;
EX: area under the curve to the right of Z_0.05 = 0.05
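Instead of reading a z-table backwards, z_alpha can be computed with `statistics.NormalDist` (Python 3.8+). The function name `z_alpha` is an assumption; it uses the fact that an area of alpha to the right means an area of 1 - alpha to the left.

```python
from statistics import NormalDist

def z_alpha(alpha):
    """z-score with area alpha to its RIGHT under the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha)

z_05 = z_alpha(0.05)
right_area = 1 - NormalDist().cdf(z_05)    # should recover alpha

print(round(z_05, 3), round(right_area, 3))
```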

127
Q

What are the properties of the STANDARD NORMAL DISTRIBUTION?

A
  1. Perfectly symmetric, bell-shaped curve, with:
  2. Mean zero (u=0) and standard deviation (sigma = 1);
  3. Denoted: z
128
Q

What is the method to draw a Schematic (graphical) Normal Curve?

A
  1. Draw a stylized normal curve;
  2. Mark the positions of the z-score along the x-axis;
  3. Shade in the area of interest;
    - - Write z-value if known;
    - - Write the area if known
129
Q

What is along the algebraic side of the Schematic Normal Curve?

A
Top = Area (Z-table)
Middle = Z (Z-trans)
Bottom = X (x-value)
130
Q

What is a z-table?

A

A table for the standard normal curve that gives…

  1. z-scores in the margins, and
  2. Area under the curve to the left of the z-score in the body
    - - Note: ALWAYS look at the picture in the table;
    - - Note: ALWAYS draw a schematic normal curve of the situation
131
Q

What will the z-table be used to find?

A
  1. Area FROM a z-score, and

2. z-score FROM area

132
Q

What 3 situations utilize a z-table?

A
  1. Away (from mean) = use AREA from table;
  2. Across (mean) = Use 1-(z-table area);
  3. Between (two z-scores) = Use schematic curve
133
Q

What is the method to find the area from an x-value?

A
  1. Draw a schematic normal curve;
  2. Shade the area of interest;
  3. Mark the x-value known;
  4. Convert the x-value to a z-score;
  5. Get area from z-table
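Steps 4 and 5 can be sketched in Python, with `statistics.NormalDist().cdf` standing in for the z-table (it also returns the area to the LEFT of a z-score). The mean, standard deviation, and x-values are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()              # plays the role of the z-table
mu, sigma = 100, 15                    # hypothetical population parameters

# Area to the left / right of a single x-value.
x = 115
z = (x - mu) / sigma                   # step 4: convert x to a z-score
area_left = std_normal.cdf(z)          # step 5: area to the left of z
area_right = 1 - area_left             # complement gives the right tail

# Area between two x-values: difference of the two left areas.
x_lo, x_hi = 85, 115
area_between = (std_normal.cdf((x_hi - mu) / sigma)
                - std_normal.cdf((x_lo - mu) / sigma))

print(round(area_left, 4), round(area_right, 4), round(area_between, 4))
```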
134
Q

What is the method to find the x-value from the area?

A
  1. Draw a schematic normal curve;
  2. Shade the area of interest;
  3. Mark the known area;
  4. Get z-score from z-table;
  5. Convert z-score to x-value
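The reverse lookup can be sketched the same way, with `NormalDist().inv_cdf` standing in for reading the z-table from area to z-score. The parameters and the left-hand area of 0.90 are hypothetical; the last step rearranges z = (x - u)/sigma into x = u + z * sigma.

```python
from statistics import NormalDist

mu, sigma = 100, 15                    # hypothetical population parameters
area_left = 0.90                       # known area to the LEFT of the unknown x

z = NormalDist().inv_cdf(area_left)    # step 4: z-score from the area
x = mu + z * sigma                     # step 5: convert z-score to x-value

# Cross-check: the area to the left of x on the original curve.
check = NormalDist(mu, sigma).cdf(x)

print(round(z, 4), round(x, 2), round(check, 2))
```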