### Digest: Statistics

After reading Statistics, 4th Edition, I found myself with plenty of highlights from the book. As an exercise to help commit my highlights to memory, I’ll be documenting them here. This post is pretty much just for myself but I figured I’d share it online on the off chance it helps someone else out.

• “treatment group”: individuals given the drug that’s being tested
• “control”: individuals who are not treated
• “double blind”: neither the subjects nor the doctors who measure the responses should know who was in the treatment or control group
• the treatment and the control group should be as similar as possible
• “randomized control”: an experiment where an impartial chance procedure is used to assign subjects to treatment or control groups
• “placebo”: a neutral stand-in for the treatment, so that subjects believe they received real treatment when in fact they’re receiving nothing
• using a randomized double blind design reduces bias to a minimum
• badly designed studies may exaggerate the value of risky surgery
• using randomized control ensures the control group is like the treatment group
• this way comparisons are only made among patients who could have received the therapy
• “adherers”: the individuals who took their drugs as prescribed
• “summarize the Pellagra disease story”: “it was originally thought that the disease was carried by flies, as fly-infested homes were more prone to the disease
• it turned out that the cause was diet, and the poorer homes had this poorer diet”
• association != causation
• “controlling for the confounding factor”: attempting to adjust differences among subjects not related to treatment
• “Simpson’s paradox or reversal paradox”: a trend appears in several different groups of data but disappears or reverses when these groups are combined
• multiple things can confound together
• ex: gender and choice of selected college major
• doing a weighted average can control for a confounding factor
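To see the reversal and the weighted-average fix in numbers, here is a minimal Python sketch. The admissions counts are made up for illustration (they are not from the book), and the names `pooled_rate` and `weighted_rate` are my own.

```python
# Hypothetical admissions data: (applicants, admitted) per major and gender.
data = {
    "easy major": {"men": (800, 480), "women": (200, 130)},
    "hard major": {"men": (200, 40), "women": (800, 200)},
}


def pooled_rate(gender):
    """Admission rate when both majors are lumped together."""
    applicants = sum(data[major][gender][0] for major in data)
    admitted = sum(data[major][gender][1] for major in data)
    return admitted / applicants


def weighted_rate(gender):
    """Average of the per-major rates, weighted by total applicants to each major."""
    total = sum(sum(group[0] for group in data[major].values()) for major in data)
    rate = 0.0
    for major in data:
        major_size = sum(group[0] for group in data[major].values())
        applicants, admitted = data[major][gender]
        rate += (major_size / total) * (admitted / applicants)
    return rate


for gender in ("men", "women"):
    print(gender, "pooled:", pooled_rate(gender), "weighted:", weighted_rate(gender))
```

With these made-up counts, women are admitted at a higher rate within each major, yet the pooled rate favors men because more women apply to the harder major. Weighting each major's rate by its total number of applicants uses the same weights for both genders, which removes the effect of major choice.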
• hidden confounders are a major problem in observational studies
• “confounding”: a difference between the treatment group and the control group, other than the treatment being studied
• A third variable, associated with exposure and with disease
• if you only look at the average, diversity can be hidden
• “cross-sectional study”: different subjects are compared to each other at one point in time
• “longitudinal study”: subjects are followed over time, and compared with themselves at different points in time
• if a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal
• it is incorrect to assume that 50% of subjects are below the average and 50% are above
• “symmetric histogram”: average is about the same as median
• “long right-hand tail histogram”: average is bigger than median
• “long left-hand tail histogram”: average is smaller than median
• “root-mean-square”: an equation that you should read backwards
• first square all the entries, which gets rid of the signs
• then take the mean (average) of all the squares
• finally take the square root of the mean
• “variance”: given measurements, it is the average of the squared differences between each measurement and the mean
• “standard deviation (SD)”
• square root of the variance
• how far away numbers on a list are from their average
• most entries on the list will be somewhere around one SD away from the average
• very few will be more than two or three SDs away
• the equation is $\text{SD} = \text{r.m.s. deviation from average}$
• to compute SD first compute the deviation from the average for each entry
• the SD is the $r.m.s$ of these deviations
• the SD comes out in the same units as the data
• do not confuse the SD of a list with its $r.m.s$ size
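The r.m.s. and SD steps above can be sketched in a few lines of Python. The list `entries` is made up for illustration, and the helper names `rms` and `sd` are my own. Note that this follows the definition in these notes (divide by the number of entries), which matches `statistics.pstdev` rather than `statistics.stdev`.

```python
import math


def rms(values):
    """Root-mean-square, read backwards: square, take the mean, take the root."""
    squares = [v * v for v in values]
    mean_of_squares = sum(squares) / len(squares)
    return math.sqrt(mean_of_squares)


def sd(values):
    """SD = r.m.s. of the deviations from the average."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]
    return rms(deviations)


entries = [1, 3, 4, 5, 7]  # hypothetical list, average is 4

print("SD of the list:", sd(entries))
print("r.m.s. size of the list:", rms(entries))
```

For this list the deviations are -3, -1, 0, 1, 3, so the SD is 2, while the r.m.s. size of the entries themselves is the square root of 20. The two numbers only agree when the average is 0.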
• “standard units”:
• how many SDs a value is above or below the average
• ex: $\text{given mean = -41.67, SD = 54.87, measurement = -44} \text{ then } \frac{-44 + 41.67}{54.87} = -0.042 \text{ standard units}$
• some handy rules when applying changes to a list of measurements
• adding the same number to every entry on a list adds that constant to the average but the $SD$ does not change
• this is because the deviations from the average do not change, because the added constant just cancels
• multiplying every entry on a list by the same positive number multiplies the average and the $SD$ by that constant
• if the constant is negative, wipe out the sign before applying it to the $SD$
• these changes of scale do not change the standard units
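Here is a small Python check of the standard-units conversion and the change-of-scale rules above. The list and the constants (10, 3 and -3) are arbitrary choices for illustration, and the helper names are my own.

```python
import math


def average(values):
    return sum(values) / len(values)


def sd(values):
    """SD as the r.m.s. of the deviations from the average."""
    avg = average(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


def standard_units(values):
    """How many SDs each entry is above (+) or below (-) the average."""
    avg, spread = average(values), sd(values)
    return [(v - avg) / spread for v in values]


entries = [1, 3, 4, 5, 7]            # hypothetical list: average 4, SD 2
shifted = [v + 10 for v in entries]  # add a constant
scaled = [3 * v for v in entries]    # multiply by a positive constant
flipped = [-3 * v for v in entries]  # multiply by a negative constant

for name, values in [("entries", entries), ("shifted", shifted),
                     ("scaled", scaled), ("flipped", flipped)]:
    print(name, average(values), sd(values), standard_units(values))
```

Adding 10 moves the average to 14 and leaves the SD at 2. Multiplying by 3 gives an average of 12 and an SD of 6. Multiplying by -3 gives an average of -12 but still an SD of 6. The standard units are unchanged for `shifted` and `scaled`; for `flipped` they keep their size but switch sign, which is why the rule is stated for positive constants.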
• the $SD$ of a series of repeated measurements estimates the likely size of the chance error in a single measurement
• $\text{individual measurement = exact value + chance error}$
• “bias”: Affects all measurements the same way, pushing them in the same direction
• “chance error”: Changes from measurement to measurement, sometimes up and sometimes down
• “individual measurement”: $\text{exact value + bias + chance error}$
• if there is no bias in a measurement procedure, the long-run average of repeated measurements should give the exact value of the thing being measured, as the chance errors should cancel out
• if bias is present, the long-run average will itself be either too high or too low
• usually, bias cannot be detected just by looking at the measurements themselves
• instead, the measurements have to be compared to an external standard or to theoretical predictions
• “positive association”: when one variable tends to increase when another variable increases
• if there is a strong association between two variables, then knowing one helps a lot in predicting the other
• but when there is a weak association, information about one variable does not help much in guessing the other
• “independent variable”: A variable that is believed to influence our dependent variable
• “dependent variable”: What we are predicting
• should be dependent on other variables
• “correlation coefficient”: a measurement of linear association or clustering around a line
• abbreviated as $r$ for no good reason, perhaps due to the two r’s in “correlation”
• the relationship between two variables can be summarized by:
• the average of the x-values, the $SD$ of the x-values
• the average of the y-values, the $SD$ of the y-values
• the correlation coefficient $r$
• $r = 0.80$ does not mean that “80% of points are tightly clustered around a line”
• nor does it indicate twice as much linearity as $r=0.40$
• Chapters 10 and 11 will address this
• a correlation of $-0.90$ indicates the same degree of clustering as one of $+0.90$
• with the negative sign, the clustering is around a line which slopes down; with a positive sign, the line slopes up
• correlations are always between $-1$ and $1$, but can take any value in between
• a positive correlation means that the cloud slopes up: as one variable increases, so does the other
• a negative correlation means that the cloud slopes down; as one variable increases, the other decreases
• the points in a scatter diagram generally seem to cluster around the $SD$ line
• this line goes through the point of averages, and it goes through all the points which are an equal number of SDs away from the average for both variables
• “correlation coefficient equation”: $r = \text{average of } \left[ (x \text{ in standard units}) \times (y \text{ in standard units}) \right]$
• in other words, convert each variable to standard units
• the average of the products gives the correlation coefficient
• “changes of scale”: when you multiply all the values of one variable by the same positive number or add the same number
• $r$ is not affected by changes of scale
• the correlation coefficient is a pure number, without units and it is not affected by:
• interchanging the two variables
• adding the same number to all the values of one variable
• multiplying all the values of one variable by the same positive number
• $r$ measures clustering not in absolute terms but in relative terms, relative to the SDs
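The recipe for $r$ (convert to standard units, then average the products) can be sketched in Python as follows. The two short lists are illustrative data, and the helper names are my own. The last lines check the invariance properties listed above.

```python
import math


def average(values):
    return sum(values) / len(values)


def sd(values):
    avg = average(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


def standard_units(values):
    avg, spread = average(values), sd(values)
    return [(v - avg) / spread for v in values]


def correlation(xs, ys):
    """r = average of (x in standard units) * (y in standard units)."""
    products = [a * b for a, b in zip(standard_units(xs), standard_units(ys))]
    return average(products)


xs = [1, 3, 4, 5, 7]   # illustrative data: average 4, SD 2
ys = [5, 9, 7, 1, 13]  # illustrative data: average 7, SD 4

print("r:", correlation(xs, ys))
print("variables interchanged:", correlation(ys, xs))
print("x shifted by 10:", correlation([x + 10 for x in xs], ys))
print("x multiplied by 3:", correlation([3 * x for x in xs], ys))
```

For these lists the products of the standard units are 0.75, -0.25, 0, -0.75 and 2.25, which average to 0.40. Interchanging the variables, adding a constant, or multiplying by a positive constant gives the same value, up to floating-point rounding.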
• “vertical $SD$”: the $SD$ of the variable plotted on the y-axis
• “horizontal $SD$”: the $SD$ of the variable plotted on the x-axis
• if $r$ is close to $1$, then a typical point is only a small fraction of a vertical $SD$ above or below the $SD$ line
• if $r$ is close to $0$, then a typical point is above or below the line by an amount roughly comparable in size to the vertical $SD$
• the $r.m.s$ vertical distance to the $SD$ line equals $\sqrt{2(1 - |r|)} \times \text{the vertical SD}$
• $r$ can be misleading when there are outliers or when the association is non-linear
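As a numerical sanity check, the sketch below measures the r.m.s. vertical distance to the SD line directly and compares it with $\sqrt{2(1 - |r|)}$ times the vertical SD. The data are illustrative and the helper names are my own. I am assuming the SD line passes through the point of averages with slope (vertical SD) / (horizontal SD), taking the sign of $r$.

```python
import math


def average(values):
    return sum(values) / len(values)


def sd(values):
    avg = average(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    avg_x, avg_y, sd_x, sd_y = average(xs), average(ys), sd(xs), sd(ys)
    products = [((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                for x, y in zip(xs, ys)]
    return average(products)


def rms_distance_to_sd_line(xs, ys):
    """Measure the vertical distances to the SD line and take their r.m.s."""
    r = correlation(xs, ys)
    # Slope of the SD line: vertical SD over horizontal SD, with the sign of r.
    slope = math.copysign(sd(ys) / sd(xs), r)
    avg_x, avg_y = average(xs), average(ys)
    distances = [y - (avg_y + slope * (x - avg_x)) for x, y in zip(xs, ys)]
    return math.sqrt(average([d * d for d in distances]))


xs = [1, 3, 4, 5, 7]   # illustrative data
ys = [5, 9, 7, 1, 13]  # illustrative data, r = 0.40 with xs

r = correlation(xs, ys)
measured = rms_distance_to_sd_line(xs, ys)
from_formula = math.sqrt(2 * (1 - abs(r))) * sd(ys)
print(measured, from_formula)
```

Both numbers come out to about 4.38 for this data, a little more than one vertical SD (which is 4), as expected for a fairly weak correlation.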
• $r$ measures linear association, not association in general
• correlations based on rates or averages can be misleading, for example:
• you could use population data from 2005 and compute the correlation between income and education for men age 25-64 in the US, $r \approx 0.45$
• for each state and DC you can compute the average educational level and average income
• finally you can compute the correlation between the 51 pairs of averages, $r \approx 0.70$
• if you used this correlation for the states to estimate the correlation for the individuals you would be way off
• this is because there is a lot of spread around the averages
• replacing the states by their averages eliminates the spread and gives a misleading impression of tight clustering
• ecological correlations are based on rates or averages
• they are often used in political science and sociology
• they tend to overstate the strength of an association
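The overstatement can be reproduced with a tiny, deterministic Python sketch. The three "states" and their (education, income) pairs are entirely made up, in arbitrary units, and are not the population data mentioned above. The helper names are my own.

```python
import math


def average(values):
    return sum(values) / len(values)


def correlation(xs, ys):
    """r = average of the products of the standard units."""
    avg_x, avg_y = average(xs), average(ys)
    sd_x = math.sqrt(average([(x - avg_x) ** 2 for x in xs]))
    sd_y = math.sqrt(average([(y - avg_y) ** 2 for y in ys]))
    return average([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                    for x, y in zip(xs, ys)])


# Hypothetical individuals, grouped by state: (education, income) pairs.
groups = {
    "state A": [(0, 0), (0, 2), (2, 0), (2, 2)],
    "state B": [(2, 2), (2, 4), (4, 2), (4, 4)],
    "state C": [(4, 4), (4, 6), (6, 4), (6, 6)],
}

# Correlation across all individuals.
individuals = [pair for members in groups.values() for pair in members]
r_individuals = correlation([x for x, _ in individuals],
                            [y for _, y in individuals])

# Ecological correlation: replace each state by its pair of averages.
state_averages = [(average([x for x, _ in members]),
                   average([y for _, y in members]))
                  for members in groups.values()]
r_averages = correlation([x for x, _ in state_averages],
                         [y for _, y in state_averages])

print("individuals:", r_individuals)
print("state averages:", r_averages)
```

In this toy data there is no association at all within a state, so the individual-level correlation is moderate (8/11, about 0.73). The three state averages fall exactly on a line, so the ecological correlation is 1 up to rounding. Averaging threw away the spread inside each state.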
• for school children, shoe size is strongly correlated with reading skills
• however learning new words does not make the feet get bigger
• instead there is a third factor involved, age
• correlation measures association but association is not the same as causation
• additional notes I took on this topic from this Berkeley resource
• the $SD$ line goes through the “point of averages”
• “point of averages”: in a scatter plot, the point whose coordinates are the arithmetic means of the corresponding variables
• for example, if the variable X is plotted on the horizontal axis and the variable Y is plotted on the vertical axis, the point of averages has coordinates (mean of X, mean of Y)
• the regression line for $y$ on $x$ estimates the average value for $y$ corresponding to each value of $x$
• “regression method”: associated with each increase of one $SD$ in $x$ there is an increase of only $r$ $SDs$ in $y$, on average
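A minimal Python sketch of the regression method, next to the SD line for comparison. The data are illustrative, and the function names `regression_estimate` and `sd_line_value` are my own, not the book's.

```python
import math


def average(values):
    return sum(values) / len(values)


def sd(values):
    avg = average(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    avg_x, avg_y, sd_x, sd_y = average(xs), average(ys), sd(xs), sd(ys)
    return average([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                    for x, y in zip(xs, ys)])


def regression_estimate(x_new, xs, ys):
    """Each SD of increase in x goes with r SDs of increase in y, on average."""
    r = correlation(xs, ys)
    x_in_standard_units = (x_new - average(xs)) / sd(xs)
    return average(ys) + r * x_in_standard_units * sd(ys)


def sd_line_value(x_new, xs, ys):
    """Height of the SD line: one SD in x goes with one full SD in y."""
    r = correlation(xs, ys)
    x_in_standard_units = (x_new - average(xs)) / sd(xs)
    return average(ys) + math.copysign(1.0, r) * x_in_standard_units * sd(ys)


xs = [1, 3, 4, 5, 7]   # illustrative data: average 4, SD 2
ys = [5, 9, 7, 1, 13]  # illustrative data: average 7, SD 4, r = 0.40

x_new = 6  # one SD above the average of x
print("regression estimate:", regression_estimate(x_new, xs, ys))
print("SD line:", sd_line_value(x_new, xs, ys))
```

Here $x = 6$ is one SD above average, so the regression estimate for $y$ is only $0.40$ SDs above its average (about 8.6), while the SD line would put it a full SD above (11). Both lines pass through the point of averages, (4, 7).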