User:Dcljr/Statistics
Please do not edit this page. Comments should be placed on my talk page. Thanks.
I hate the fact that I can't edit the lead section without editing the entire article!
This page contains some ideas and lists related to probability and statistics. It is very incomplete. In most sections, no attempt has been made to make something presentable or useful (for example, there are lots of dead links).
I originally intended the material near the top of the page to eventually replace the current Statistics article, which I am not at all happy with. Some portions look more like a Wikibook (http://wikibooks.org/wiki/), though. Whatever. I'll continue to work on this page until portions of it become suitable for moving to other places in the Wikiverse...
See also my remarks in Talk:Statistics.
Preamble
Statistics is a broad mathematical discipline which studies ways to collect, summarize, and draw conclusions from data. It is applicable to a wide variety of academic fields from the physical and social sciences to the humanities, as well as to business, government and industry.
Once data is collected, either through a formal sampling procedure or some other, less formal method of observation, graphical and numerical summaries may be obtained using the techniques of descriptive statistics. The specific summary methods chosen depend on the method of data collection. The techniques of descriptive statistics can also be applied to census data, which is collected on entire populations.
If the data can be viewed as a sample (a subset of some population of interest), inferential statistics can be used to draw conclusions about the larger, mostly unobserved population. These inferences, which are usually based on ideas of randomness and uncertainty quantified through the use of probabilities, may take any of several forms:
- Answers to essentially yes/no questions (hypothesis testing)
- Estimates of numerical characteristics (estimation)
- Predictions of future observations (prediction)
- Descriptions of association (correlation)
- Modeling of relationships (regression)
The procedures by which such inferences are made are sometimes collectively known as applied statistics. In contrast, statistical theory (or, as an academic subject, mathematical statistics) is the subdiscipline of applied mathematics which uses probability theory and mathematical analysis to place statistical practice on a firm theoretical basis. (If applied statistics is what you do in statistics, statistical theory tells you why it works.)
In academic statistics courses, the word statistic (no final s) is usually defined as a numerical quantity calculated from a set of data. In this usage, statistics would be the plural form meaning a collection of such numerical quantities. See Statistic for further discussion.
Less formally, the word statistics (singular statistic) is often used in a way roughly synonymous with data or simply numbers, a common example being sports "statistics" published in newspapers. Usually these "statistics" are collected on entire populations and so represent census data. In the United States, the Bureau of Labor Statistics collects data on employment and general economic conditions; also, the Census Bureau publishes a large annual volume called the Statistical Abstract of the United States based on census data.
Etymology
The word statistics comes from the modern Latin phrase statisticum collegium (lecture about state affairs), which gave rise to the Italian word statista (statesman or politician — compare to status) and the German Statistik (originally the analysis of data about the state). It acquired the meaning of the collection and classification of data generally in the early nineteenth century. The collection of data about states and localities continues, largely through national and international statistical services.
Definitions
Some textbook definitions of statistics and related terms (italics added):
- Stephen Bernstein and Ruth Bernstein, Schaum's Outline of Elements of Statistics II: Inferential Statistics (1999)
- Statistics is the science that deals with the collection, analysis, and interpretation of numerical information.
- In descriptive statistics, techniques are provided for collecting, organizing, summarizing, describing, and representing numerical information.
- [Inferential statistics provides] techniques.... for making generalizations and decisions about the entire population from limited and uncertain sample information.
- Donald A. Berry, Statistics: A Bayesian Perspective (1996)
- Statistical inferences have two characteristics:
- Experimental or observational evidence is available or can be gathered.
- Conclusions are uncertain.
- John E. Freund, Mathematical Statistics, 2nd edition (1971)
- Statistics no longer consists merely of the collection of data and their representation in charts and tables — it is now considered to encompass not only the science of basing inferences on observed data, but the entire problem of making decisions in the face of uncertainty.
- Gouri K. Bhattacharyya and Richard A. Johnson, Statistical Concepts and Methods (1977)
- Statistics is a body of concepts and methods used to collect and interpret data concerning a particular area of investigation and to draw conclusions in situations where uncertainty and variation are present.
- E. L. Lehmann, Theory of Point Estimation (1983)
- Statistics is concerned with the collection of data and with their analysis and interpretation.
- William H. Beyer (editor), CRC Standard Probability and Statistics Tables and Formulae (1991)
- The pursuit of knowledge frequently involves data collection; and those responsible for the collection must appreciate the need for analyzing the data to recover and interpret the information therein. Today, statistics are being accepted as the universal language for the results of experimentation and research and the dissemination of information.
- Oscar Kempthorne, The Design and Analysis of Experiments, reprint edition (1973)
- Statistics enters [the scientific method] at two places:
- The taking of observations
- The comparison of the observations with the predictions from... theory.
- Marvin Lentner and Thomas Bishop, Experimental Design and Analysis (1986)
- The information obtained from planned experiments is used inductively. That is, generalizations are made about a population from information contained in a random sample of that particular population. ... [Such] inferences and decisions... are sometimes erroneous. Proper statistical analyses provide the tools for quantifying the chances of obtaining erroneous results.
- Robert L. Mason, Richard F. Gunst and James L. Hess, Statistical Design and Analysis of Experiments (1989)
- Statistics is the science of problem-solving in the presence of variability.
- Statistics is a scientific discipline devoted to the drawing of valid inferences from experimental or observational data.
- Stephen K. Campbell, Flaws and Fallacies in Statistical Thinking (1974)
- Statistics... is a set of methods for obtaining, organizing, summarizing, presenting, and analyzing numerical facts. Usually these numerical facts represent partial rather than complete knowledge about a situation, as is the case when a sample is used in lieu of a complete census.
Basic concepts
There are several philosophical approaches to statistics, most of which rely on a few basic concepts.
Population vs. sample
In statistics, a population is the set of all objects (people, etc.) that one wishes to make conclusions about. In order to do this, one usually selects a sample of objects: a subset of the population. By carefully examining the sample, one may make inferences about the larger population.
For example, if one wishes to determine the average height of adult women aged 20-29 in the U.S., it would be impractical to try to find all such women and ask or measure their heights. However, by taking a small but representative sample of such women, one may determine the average height of all young women quite closely. The matter of taking representative samples is the focus of sampling.
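The height example can be sketched in code. This is only an illustration under made-up assumptions: the population is synthetic, and the mean of 163 cm, the standard deviation of 7 cm, and the helper names `simulate_population` and `sample_mean` are all hypothetical, not real survey figures.

```python
import random
import statistics


def simulate_population(size, mean_cm=163.0, sd_cm=7.0, seed=0):
    """Return a synthetic population of heights in cm (parameters are invented)."""
    rng = random.Random(seed)
    return [rng.gauss(mean_cm, sd_cm) for _ in range(size)]


def sample_mean(population, n, seed=1):
    """Estimate the population mean from a simple random sample of size n."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement
    return statistics.fmean(sample)


population = simulate_population(100_000)
true_mean = statistics.fmean(population)   # normally unknown in practice
estimate = sample_mean(population, 500)    # what a study would actually observe

print(f"population mean: {true_mean:.2f} cm")
print(f"sample estimate: {estimate:.2f} cm (n=500)")
```

Only 500 of the 100,000 simulated heights are examined, yet the sample mean should land close to the population mean, which is the point of the paragraph above.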
Randomness, probability and uncertainty
The concept of randomness is difficult to define precisely. In general, any outcome of an action, or series of actions, which cannot be predicted beforehand may be described as being random. When statisticians use the word, they generally mean that while the exact outcome cannot be known beforehand, the set of all possible outcomes is, at least in theory, known. A simple example is the outcome of a coin toss: whether the coin will land heads up or tails up is (ideally) unknowable before the toss, but what is known is that the outcome will be one of these two possibilities and not, say, on edge (assuming, of course, the coin cannot stand upright on its edge). The set of all possible outcomes is usually called the sample space.
The probability of an event is also difficult to define precisely but is basically equivalent to the everyday idea of the likelihood or chance of the event happening. An event that can never happen has probability zero; an event that must happen has probability one. (Note that the reverse statements are not necessarily true; see the article on probability for details.) All other events have a probability between zero and one. The greater the probability, the more likely the event, and thus the less our uncertainty about whether it will happen; the smaller the probability, the greater our uncertainty.
There are two basic interpretations of probability used to assign or compute probabilities in statistics:
- Relative frequency interpretation: The probability of an event is the long-run relative frequency of occurrence of the event. That is, after a long series of trials, the probability of event A is taken to be:
- <math>\mbox{P}(A) = {\mbox{number of trials in which event } A \mbox{ happened} \over \mbox{total number of trials}}</math>
- To make this definition rigorous, the right-hand side of the equation should be preceded by the limit as the number of trials grows to infinity.
- Subjective interpretation: The probability of an event reflects our subjective assessment of the likelihood of the event happening. This idea can be made rigorous by considering, for example, how much one should be willing to pay for the chance to win a given amount of money if the event happens. For more information, see Bayesian probability.
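The relative frequency interpretation above can be illustrated with a simulated coin. This is a sketch, not a proof: the function name `relative_frequency`, the seed, and the assumed heads probability of 0.5 are all choices made for the example, and a pseudorandom generator stands in for real tosses.

```python
import random


def relative_frequency(num_trials, p_heads=0.5, seed=42):
    """Fraction of simulated tosses that came up heads.

    p_heads is the assumed "true" probability built into the simulation.
    """
    rng = random.Random(seed)
    heads = sum(1 for _ in range(num_trials) if rng.random() < p_heads)
    return heads / num_trials


# The ratio (trials where the event happened) / (total trials)
# tends to settle near p_heads as the number of trials grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

For small numbers of trials the ratio can be far from 0.5; it is only in the long run that it stabilizes, which is why the rigorous definition involves a limit.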
Note that the relative frequency interpretation does not require that a long series of trials actually be conducted. Typically, probability calculations are ultimately based upon perceived equally likely outcomes, as obtained, for example, when one tosses a so-called "fair" coin or rolls a "fair" die. Many frequentist statistical procedures are based on simple random samples, in which every possible sample of a given size is as likely as any other.
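To make the idea of a simple random sample concrete, here is a small sketch that lists every possible sample of size 2 from a four-element population and checks by simulation that each one turns up about equally often. The population labels, the sample size, and the helper name `draw_simple_random_sample` are hypothetical choices for illustration.

```python
import itertools
import random
from collections import Counter

population = ["a", "b", "c", "d"]
sample_size = 2

# Every possible sample of the given size (order ignored): 4 choose 2 = 6.
all_samples = list(itertools.combinations(population, sample_size))


def draw_simple_random_sample(rng):
    """Draw one sample without replacement, as a sorted tuple for counting."""
    return tuple(sorted(rng.sample(population, sample_size)))


rng = random.Random(0)
num_draws = 60_000
counts = Counter(draw_simple_random_sample(rng) for _ in range(num_draws))

# Each of the 6 possible samples should appear in roughly 1/6 of the draws.
for s in all_samples:
    print(s, round(counts[s] / num_draws, 4))
```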
Prior information and loss
Once a procedure has been chosen for assigning probabilities to events, the probabilistic nature of the phenomenon under consideration can be summarized in one or more probability distributions. The data collected is then viewed as having been generated, in a sense, according to the chosen probability distribution.
- (This doesn't even make sense to me... Needs improvement!)
Blah, blah...
Data collection
Sampling
- Main article: Sampling (statistics)
Experimental design
- Main article: Design of experiments
Data summary: descriptive statistics
- Main article: Descriptive statistics
Levels of measurement
- Main article: Level of measurement
- Qualitative (categorical)
- Nominal
- Ordinal
- Quantitative (numerical)
- Interval
- Ratio
Graphical summaries
- Main article: ?
Numerical summaries
- Main article: Summary statistics
Data interpretation: inferential statistics
- Main article: Statistical inference
Estimation
- Main article: Statistical estimation
Prediction
- Main article: Statistical prediction
Hypothesis testing
- Main article: Statistical hypothesis testing
Relationships and modeling
Correlation
- Main article: Correlation
Two quantities are said to be correlated if greater values of one tend to be associated with greater values of the other (positively correlated) or with lesser values of the other (negatively correlated). In the case of interval or ratio variables, this is often apparent in a scatterplot of the data: positive correlation is reflected in an overall increasing trend in the data points when viewed left to right on the graph; negative correlation appears as an overall decreasing trend. (See graphs...) In the case of ordinal variables...
The correlation between two variables is a number measuring the strength and usually the direction of this relationship. Most measures of correlation take on values from -1 to 1 or from 0 to 1. Zero correlation means that greater values of one variable are associated with neither higher nor lower values of the other, or possibly with both. (See graphs...) A correlation of 1 implies a perfect positive correlation, meaning that an increase in one variable is always associated with an increase in the other (and possibly always of the same size, depending on the correlation measure used). Finally, a correlation of -1 means that an increase in one variable is always associated with a decrease in the other.
Some measures of correlation include the following:
Name | Used to measure | Range of values |
---|---|---|
Pearson product-moment correlation coefficient | degree of linear association between interval or ratio variables | -1 to 1 |
Spearman's rho | ... | ... |
Kendall's tau | ... | ... |
Yule's Q | ... | ... |
... | ... | ... |
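As an illustration of the first row of the table, the following sketch computes the Pearson product-moment correlation coefficient directly from its definition. The function name `pearson_r` and the three small data sets are made up for the example; the third shows that a strong but non-linear (U-shaped) relationship can still give a coefficient of zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_increasing = pearson_r(xs, [2, 4, 6, 8, 10])   # points on an increasing line
r_decreasing = pearson_r(xs, [10, 8, 6, 4, 2])   # points on a decreasing line
r_u_shaped = pearson_r(xs, [4, 1, 0, 1, 4])      # symmetric U shape

print(r_increasing, r_decreasing, r_u_shaped)
```

In the U-shaped case, greater values of x are associated with both lower and higher values of y, so the linear measure reports no correlation even though the variables are clearly related.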
Regression
- Main article: Regression
Time series
- Main article: Time series
Data mining
- Main article: Data mining
Statistical practice and methods
- Data collection
- Data analysis
- Drawing conclusions
Statistics in other fields
- Biostatistics
- Business statistics
- Chemometrics
- Demography
- Economic statistics
- Engineering statistics
- Epidemiology
- Geostatistics
- Psychometrics
- Statistical physics
Subfields or specialties in statistics
- Mathematical statistics
- Reliability
- Survival analysis
- Quality control (or Quality assurance)
- Time series
- Categorical data analysis
- Multivariate statistics
- Large-sample theory
- Bayesian statistics (or Bayesian inference, Bayesian analysis)
- Regression analysis (or just Regression)
- Sampling theory (or just Sampling)
- Experimental design (or Design of experiments)
- Statistical computing (or Computational statistics; see also Scientific computing)
- Nonparametric statistics (Nonparametrics, Nonparametric inference, Nonparametric regression)
- Density estimation
- Simultaneous inference
- Linear inference
- Optimal inference
- Decision theory (Statistical decision theory)
- Experimental design and analysis (Experimental design, Design and analysis of experiments, Design of experiments)
- Linear models (Linear model)
- Multivariate analysis (Multivariate statistics)
- Data modeling
- Sequential analysis
- Spatial statistics
Probability:
Related areas of mathematics
Also: Statistical physics
Typical course in mathematical probability
Below are the topics typically (?) covered in a one-year course introducing the mathematical theory of probability to undergraduate students in mathematics and statistics. (Actually, this list contains much more material than is typically covered in one year.)
Topics of a more advanced nature are italicized, including those typically only covered in mathematical statistics or graduate-level probability theory courses (e.g., topics requiring measure theory). See also the #Typical course in mathematical statistics below.
- Interpretation of probability
- Random experiments
- Set theory
- Measure theory
- Properties of probability
- Counting methods
- Independent events
- Joint probability
- Marginal probability
- Conditional probability
- Famous problems in probability
- Random variable
- Probability distribution
- Probability function (pf)
- Support of a probability function
- Discrete random variable
- Continuous random variable
- Cumulative distribution function (cdf) (Note: Distribution function is now about physics -- df)
- Mixed probability distribution (i.e., discrete and continuous parts — name??)
- Distribution of a function of a random variable
- Expectation
- Joint distribution (Joint probability distribution)
- Joint probability mass function (Joint pmf)
- Joint probability density function (Joint pdf)
- Joint distribution function (Joint cdf)
- Marginal distribution (Marginal probability distribution, Marginal density, Marginal density function, Marginal probability density function, Marginal probability mass function, Marginal distribution function, Marginal probability distribution function)
- Independent random variables
- Conditional distribution (Conditional probability distribution, Conditional density, Conditional density function, Conditional probability density function, Conditional probability mass function, Conditional distribution function, Conditional probability distribution function)
- Bivariate distribution
- Multivariate distribution
- Distribution of a function of two or more random variables
- List of probability distributions (or Table of probability distributions)
- Discrete probability distributions
- Discrete uniform distribution (Discrete-uniform distribution?)
- Bernoulli distribution
- Binomial distribution
- Geometric distribution
- Negative binomial distribution (or Pascal negative binomial distribution, Negative-binomial distribution, Pascal negative-binomial distribution)
- Hypergeometric distribution
- Poisson distribution
- Zeta distribution (or Zipf distribution)
- Continuous probability distributions (see also Sampling distributions below)
- Uniform distribution (or Rectangular distribution)
- Beta distribution
- Exponential distribution
- Gamma distribution
- Normal distribution (or Gaussian distribution)
- Cauchy distribution
- Pareto distribution
- Logistic distribution
- Hyperbolic secant distribution (Hyperbolic-secant distribution)
- Slash distribution
- Mixture distributions (Hierarchical probability distribution?)
- Discrete probability distributions
order?
- Relationships among probability distributions (List or Table...)
- Sampling distributions
- Family of probability distributions (or Probability distribution family, Distribution family, etc.?)
- Simulation (Generating random numbers with a given distribution, Generating random observations, Generating random numbers, Generating observations from a probability distribution, Generating observations on a random variable, etc. — "pseudo-random" on all these, too)
- Pseudorandom numbers (see Pseudorandom, Pseudo-random number, Pseudo-random)
- Random number table (and Table of random digits — former is how to use, latter an actual table)
- Pseudorandom variables (Pseudo-random variable)
- And so on, and so forth...
Typical course in mathematical statistics
Would cover many of the topics from the #Typical course in mathematical probability outlined above, plus...
- And so on, and so forth...
Typical course in applied statistics
Less theoretical than the #Typical course in mathematical statistics outlined above. (Sometimes portions of the following form the basis of a second statistics course for mathematics majors — third in the sequence if probability is the first course).
- Statistical charts
- Frequency distribution (Relative..., Cumulative..., Grouped...)
- Stem-and-leaf display (Stem and leaf display, Stem-and-leaf diagram, Stem and leaf diagram, Stem and leaf)
- Contingency table
- Statistical plots (Statistical graphs)
- List of experimental designs
- Completely randomized design (CR design, CR)
- Randomized block design (RB design, RB)
- Randomized complete block design (RCB design, RCB)
- Latin square design (LS design, LS)
- Graeco-Latin square design
- Crossover design
- Repeated Latin square design (RLS design, RLS)
- Factorial design
- Knut Vik square design
- Hierarchically nested design
- Split-plot design (SP design, SP)
- Split-block design
- Split-split-plot design
- Quasifactorial design
- Lattice design
- Incomplete block design (IB design, IB)
- Fractional factorial design
- Fractional-replication design
- Half replicate design
- Half fraction of a factorial design
- Completely balanced lattice design
- Rectangular lattice design
- Triple rectangular lattice design
- Balanced incomplete block design (BIB design, BIB)
- Cyclic design
- Alpha-design ("α-design")
- Incomplete Latin square design
- Youden square design
- Partially balanced incomplete block design (PBIB design, PBIB)
- Repeated measures design
- And so on, and so forth...
Bayesian analysis
Hmm...
Terms from categorical data analysis
(By chapter: Agresti, 1990.)
- (none)
- contingency table, two-way table, two-way contingency table, cross-classification table, cross-tabulation, relative risk, odds ratio, concordant pair, discordant pair, gamma, Yule's Q, Goodman and Kruskal's tau, concentration coefficient, Kendall's tau-b, Somers' d, proportional prediction, proportional prediction rule, uncertainty coefficient, Gini concentration, entropy (variation measure), tetrachoric correlation, contingency coefficient, Pearson's contingency coefficient, log odds ratio, cumulative odds ratio, Goodman and Kruskal's lambda, observed frequency
- expected frequency, independent multinomial sampling, product multinomial sampling, overdispersion, chi-squared goodness-of-fit test, goodness-of-fit test, Pearson's chi-squared statistic, likelihood-ratio chi-squared statistic, partitioning chi-squared, Fisher's exact test, multiple hypergeometric distribution, Freeman-Halton p-value, phi-squared, power divergence statistic, minimum discrimination information statistic, Neyman modified chi-squared, Freeman-Tukey statistic, ...
Statistical software
List of statistical software or List of statistical software packages...
Commercial
- CART
- ECHIPS (EChips)
- Excel
- add-ins: Analyse-It, SigmaXL, statistiXL, WinSTAT, XLSTAT (XLSTAT)
- JMP
- Minitab
- NCSS
- nQuery
- PASS
- SAS
- S
- SPSS
- Stata
- STATISTICA (Statistica)
- StatXact, LogXact
- SUDAAN (Sudaan)
- SYSTAT (Systat)
Free versions of commercial software
- Gnumeric — not a clone of Excel, but implements many of the same functions (can it use Excel add-ins?)
- R — free version of S
- FIASCO or PSPP — free version of SPSS
Other free software
- BUGS — Bayesian inference Using Gibbs Sampling
- ESS — a GNU Emacs add-on
- ...
- see http://www.psychnet-uk.com/experimental_design/software_packages.htm
Licensing unknown
World Wide Web
- StatLib — large repository of statistical software and data sets
Online sources of data
See also
- List of statistical topics
- Pages that link to the "Statistics" article (http://en.wikipedia.org/w/wiki.phtml?title=Special:Whatlinkshere&target=Statistics)
External link
- StatLib (http://lib.stat.cmu.edu/)
References
- Agresti, Alan (1990). Categorical Data Analysis. NY: John Wiley & Sons. ISBN 0-471-85301-1.
- Casella, George & Berger, Roger L. (1990). Statistical Inference. Pacific Grove, CA: Wadsworth & Brooks/Cole. ISBN 0-534-11958-1.
- DeGroot, Morris (1986). Probability and Statistics (2nd ed.). Reading, Massachusetts: Addison-Wesley. ISBN 0-201-11366-X.
- Kempthorne, Oscar (1973). The Design and Analysis of Experiments. Malabar, FL: Robert E. Krieger Publishing Company. ISBN 0-88275-105-0. [Rpt.; orig. 1952, NY: John Wiley & Sons.]
- Kuehl, Robert O. (1994). Statistical Principles of Research Design and Analysis. Belmont, CA: Duxbury Press. ISBN 0-534-18804-4.
- Lentner, Marvin & Bishop, Thomas (1986). Experimental Design and Analysis. Blacksburg, VA: Valley Book Company. ISBN 0-9616255-0-3.
- Manoukian, Edward B. (1986). Modern Concepts and Theorems of Mathematical Statistics. NY: Springer-Verlag. ISBN 0-387-96186-0.
- Mason, Robert L.; Gunst, Richard F.; and Hess, James L. (1989). Statistical Design and Analysis of Experiments: With Applications to Engineering and Science. NY: John Wiley & Sons. ISBN 0-471-85364-X.
- Ross, Sheldon (1988). A First Course in Probability Theory (3rd ed.). NY: Macmillan. ISBN 0-02-403850-4.
And eventually...
- Berger, James O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). NY: Springer-Verlag. ISBN 0-387-96098-8. (Also, Berlin: ISBN 3-540-96098-8.)
- Berry, Donald A. (1996). Statistics: A Bayesian Perspective. Belmont, CA: Duxbury Press. ISBN 0-534-23472-0.
- Feller, William (1950). An Introduction to Probability Theory and Its Applications, Vol. 1. NY: John Wiley & Sons. ISBN unknown. (Current: 3rd ed., 1968, NY: John Wiley & Sons, ISBN 0-471-25708-7.)
- Feller, William (1971). An Introduction to Probability Theory and Its Applications, Vol. 2 (2nd ed.). NY: John Wiley & Sons. ISBN 0-471-25709-5.
- Lehmann, E. L. [Eric Leo] (1991). Theory of Point Estimation. Pacific Grove, CA: Wadsworth & Brooks/Cole. ISBN 0-534-15978-8. (Orig. 1983, NY: John Wiley & Sons.)
- Lehmann, E. L. [Eric Leo] (1994). Testing Statistical Hypotheses (2nd ed.). NY: Chapman & Hall. ISBN 0-412-05321-7. (Orig. 2nd ed., 1986, NY: John Wiley & Sons.)