Talk:Statistics

This page is for discussion of the article about statistics. Comments and questions about the special page about Wikipedia site statistics (number of pages, edits, etc.) should be directed to Wikipedia talk:Special pages.

Contents

1 Miscellaneous

2 Suggest update to US National Statistical Services to FedStats

3 Puzzled by definition

4 What is statistics?

5 My attempt at article lead section

6 Probability

7 Probabilities in Bayesian statistics

8 Help needed

9 Removed some external links

10 virtual reality

11 Statistical Software - removal of SigmaXL link

12 Questions and Suggestions

13 Origin

Miscellaneous

I was taught statistics starting with the definition "a statistic is a function of data" as the first sentence of the Part 1B Stats course at Cambridge. I think the definition was useful and so it should be included BozMo (talk). Done

~~On special:Statistics, what are 'junk pages'? They seem to equal total pages minus (non-talk comma pages + talk pages). How many of these are #REDIRECTs? --Damian Yerrick~~

~~Why is the Main Page article counter different than the one in Statistics? --Chuck Smith~~

It's been some number of years since I studied statistics, but the terms used throughout the article did ring some bells very quietly in the back of my mind. The singular exception was ANOVA, so I followed the link to seek an explanation: Analysis of variance. That was familiar! I was then surprised by the number of hits that Google gave me about ANOVA (197,000). Still, I believe that the full expression is far more meaningful than the acronym, and I don't think that we should be encouraging the use of cute but meaningless acronyms. Eclecticology, Thursday, May 2, 2002

The three topics of statistics -- experimental design, description/exploration and inference -- are excellently described. The ongoing discussion between data miners and modellers (eg. Statistical Modeling: The Two Cultures, Leo Breiman and discussants, Statistical Science 2001;16:199-231) might deserve some more attention. Johannes Hüsing

I wonder if we can improve on the phrase "uncertain observations"? It's not the observations that are uncertain; it's what they entail about the population from which they came, the uncertainty resulting from the random way in which the observations came from the population. Michael Hardy 20:00 17 Jul 2003 (UTC)

Well, unless you're talking about measurement error, in which case the observations are uncertain. Anyway, I agree that the article needs a major rewrite. Oh, I guess that's not what you said... - dcljr 00:15, 9 Aug 2004 (UTC)

Even with measurement error, it's not the observations that are uncertain. You know what number your measuring instrument gave you; what you're uncertain about is what it should have given you. Michael Hardy 01:09, 9 Aug 2004 (UTC)

Hmm. A subtle distinction, indeed. But whatever. As a statistician yourself, surely you can provide us with a better introductory paragraph than the current version.... (See also item "What is statistics?" below.) - dcljr 05:46, 10 Aug 2004 (UTC)

Suggest update to US National Statistical Services to FedStats

Under "National Statistical Services", it appears that for a particular country, that country's main national statistics site is listed, except for the United States. For the US, the American Statistical Association is listed, which is primarily a professional association for statisticians. I would suggest that the FedStats web site, http://www.fedstats.gov, be listed as the web link for the US. The FedStats web site is the US government's gateway portal to its underlying Federal statistical system, with links to more than 100 agencies with statistical information.

Puzzled by definition

Why is human knowledge part of the definition -- is it really necessary?CSTAR 03:26, 10 May 2004 (UTC)

I wouldn't call it a science either. — Miguel 06:28, 2004 May 10 (UTC)

Why not? cf Nelder JA (1999). From statistics to statistical science. The Statistician 48(2), 257-269. Johannes

What is statistics?

I don't like the introductory paragraph. I haven't come up with anything better, but here's a "definition of statistics" I used when I taught the subject to undergraduates:

[Statistics] is a logic and methodology for the measurement of uncertainty and for an examination of the consequences of that uncertainty in the planning and interpretation of experimentation or observation.

— Stephen M. Stigler, The History of Statistics (Belknap/Harvard, 1986)

Of course, I followed it with a lot of explanation...

I propose interested parties list their own preferred definition of statistics (serious ones, I mean) here and maybe we can come up with a consensus on the best one. (And then monkeys... well, nevermind.)

- dcljr 05:46, 10 Aug 2004 (UTC)

For me, statistics is a methodology for the collection, interpretation and presentation of information - I don't feel strongly about the words "methodology" or "information", but I don't like "uncertainty" in the primary definition. You can have statistics on the numbers of Olympic Gold Medal winners so far; they may be right or wrong, but I have yet to see anyone put error bands on them. To me "uncertainty" is part of the collection, interpretation and presentation in many cases, but not always a necessary part. --Henrygb 23:39, 12 Aug 2004 (UTC)

Your discomfort with the word uncertainty seems to stem from the difference between descriptive statistics (your definition) and inferential statistics ("mine"). (continued below)

Hmm. Or not. I just looked at your contributions, Henrygb. Anyway, I still say to do (or describe) meaningful statistics you have to have the idea of uncertainty or randomness in there somewhere. - dcljr 23:07, 31 Aug 2004 (UTC)

In descriptive stats, you usually just take the data as given; whether it's the whole population or just a sample, you can summarize it graphically and numerically in much the same ways. My background is mathematical statistics, so I usually don't even think of the descriptive side when I think statistics. It's my own bias. Anyway, we should try to address both aspects. - dcljr 22:55, 31 Aug 2004 (UTC)

I came to statistics through management science, the applied branch of operations research, and econometrics, an applied branch of mathematical statistics, with a big dose of John Tukey's pragmatism. I wound up with a perspective that some find unusual. For one thing, management science gave me a decision theoretical outlook. Part of that is reserving the word "uncertain" for situations that lack probability distributions. Data are raw materials; there's no information until you interpret descriptive or inferential statistics. I'm not sure what level to shoot for here, but here goes. I've done things like this with more example and less technical stuff but that takes more time or space, and I wanted to be brief.

Before you get to description, you have to know about the population the data represent (if any - most online polls, for example, represent no one except those who happened to participate). That includes some sampling theory. Then there's data entry and preparation, including quality checks, etc.

Assuming the data are numeric rather than categoric (counts of people belonging to various political parties, for example), the biggest challenge in description is to get people to pay attention to more than the median or mean. Box plots (aka box-and-whisker diagrams or plots) are critical for understanding data whose center is taken to be the median. The standard deviation is critical if you're assuming the normal distribution (I like to call it Gaussian but that's a small point) and using the mean, etc. Otherwise, you're trapped into the talking head focus on a single number that conveys very little useful information.

Once I get past description, statistics is about figuring out how much risk you are willing to take. Sometimes that's a guesstimate (choosing between pizza places in a town you've never visited before), sometimes it's as precise as you can make it (choosing the person who will perform open heart surgery on a loved one or yourself). In formal inference, that value is alpha and the decision about whether to reject the applicable null hypothesis comes down to whether the estimated risk that rejecting the null is a Type-I error (the p-value) is larger or smaller than the risk you are willing to take. If p>alpha, there is too much risk of a Type-I error to reject the null given your ex-ante choice of alpha. If alpha>=p, the risk of a Type I error is small enough (according to your ex-ante choice) to reject the null.
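The comparison described in the last two sentences above can be written out as a short sketch. It only illustrates the rule "reject when alpha >= p" for an alpha chosen in advance; the function name reject_null and the example p-values are made up for illustration and do not come from any particular test.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Return True when the null hypothesis is rejected, i.e. when alpha >= p."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return p_value <= alpha


if __name__ == "__main__":
    alpha = 0.05  # hypothetical ex-ante risk tolerance
    for p in (0.003, 0.05, 0.20):  # hypothetical p-values
        decision = "reject" if reject_null(p, alpha) else "do not reject"
        print(f"p = {p:.3f}, alpha = {alpha:.2f}: {decision} the null")
```

The point of fixing alpha before looking at the data is that the decision then depends only on which side of alpha the p-value falls.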

A single paragraph along those lines might be something like:

"Statistics is the art and science of seeking to understand a population and predict its future by collecting and using data that represent the population. Data collection includes sampling, data entry, and checking. Using data in statistics has two parts. Descriptive statistics includes estimates of most likely data values, their variation, and graphs. Inferential statistics looks for associations and causal relationships between variables that help to explain observed and predict future values."

That doesn't say anything about data mining, an approach that was taboo in my econometric youth. I haven't kept up with the subject, though, so I'm in no position to say anything about it here. If it's an outgrowth of resampling theory, for example, I'd be sympathetic even though that probably puts me outside mainstream econometrics, but I don't know enough to comment one way or another. --George Brower

Ah, now this paragraph (George's above) is, I think, mainly coming from a practical perspective of statistics as a set of procedures and "best practices" (i.e., what I would call applied statistics). (No offense, oversimplifying your viewpoint like that...) I come at statistics from a more theoretical standpoint (much to the chagrin of my students), emphasizing why those practices work and (ultimately, like in grad school) how to assess their efficacy and develop new and better ones. But my perspective is probably more suited to the mathematical statistics article (part of the reason I created it in the first place — in time I hope it will grow into something "useful").

I accept that this article should remain almost entirely "applied". At the very least we should allude to the following in the first paragraph:

data collection (sampling, etc.)
data summary (descriptive stats)
data interpretation (inference, relationship)

A more detailed outline, which might be the basis of constructing the opening paragraphs (i.e., preferably above the table of contents):

basics
- population
- sample
- randomness (uncertainty) and probability (frequentist/subjectivist viewpoints should probably be alluded to but not explained in any detail)
focus
- applied statistics (description, inference, modeling)
- theoretical (math stat)
data collection
- sampling
- experimental design
data summary: descriptive statistics
- graphical
- numerical
data interpretation: inferential statistics
- estimation
- prediction
- hypothesis testing
relationships and modeling
- correlation
- regression/ANOVA
- time series
- data mining? (I don't know much about it either!)

Obviously, and not surprisingly given my previous admissions, this reads like a course syllabus. But it does stress what you can actually do with statistics. If we could somehow pack all that information (if only obliquely, and certainly not necessarily in that order) into the opening paragraphs without hopelessly confusing everyone, that would be great!

Subsequent sections can flesh out what it all means and point to "main articles" about each topic for more detail. (Still, obviously I'm envisioning a much lengthier article!)

I think we should also mention above the table of contents the use of "statistics" or "stats" as a synonym for "data" and why that's not quite right.

These are my thoughts at the moment, anyway...

- dcljr 22:55, 31 Aug 2004 (UTC)

My attempt at article lead section

I just discovered the term lead section for what I've been variously calling preamble, intro[duction], introductory paragraphs, and stuff above the table of contents. <g>

Anyway, I'm sure some people thought it would be impossible to include all that stuff (see my previous comment) in the lead, but here's my attempt. I got almost everything in there.

Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from sample data. It is applicable to a wide variety of academic disciplines from the physical and social sciences to the humanities, as well as to business, government and industry.

Once data is collected, either through a formal sampling procedure or by recording responses to treatments in an experimental setting (cf experimental design), or by repeatedly observing a process over time (time series), graphical and numerical summaries may be obtained using descriptive statistics.

Randomness and uncertainty in the observations are modeled using probability in order ultimately to draw inferences about the larger population. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).

The framework described above is sometimes referred to as applied statistics. In contrast, mathematical statistics (or simply statistical theory) is the subdiscipline of applied mathematics which uses probability theory and analysis to place statistical practice on a firm theoretical basis.

The word statistics (or stats) is also used colloquially to refer to data collected on an entire population rather than a subset of it. Formally, however, statistics is almost always based on samples. In fact, the word statistic (singular) may be defined as a quantity calculated from sample observations.

I found that I just couldn't find a good way to stick in the frequentist/subjectivist thing. My concern about that was mainly to point out the difference between "classical" and "bayesian" approaches. Perhaps another short "non-sequitur" paragraph could deal with that. Also, I didn't say anything about ANOVA (which is closely related to hypothesis testing, regression and experimental design, so I didn't feel too bad about not mentioning it by name) or data mining (maybe just doesn't belong in the lead). Oh, and not all the links lead to useful articles at this point. (continued below)

I think there is no need to mention the frequentist/subjectivist split in an article on statistics. As far as "best practices" go, you can use whatever philosophy you like, or none at all, to come up with good statistical practice. In mathematical statistics, everyone must agree that, as mathematical theorems, frequentist and bayesian theorems are all "true". Finally, for a while I have held the opinion that frequentism as a philosophy of probability stems from the erroneous identification of the definition of probability on the one hand, and the measurement of a probability on the other hand. Whatever the meaning one ascribes to the word "probability", there is essentially only one way to determine it empirically, and that is to observe a large random sample and make inferences about it using statistics. — Miguel 07:53, 2 Sep 2004 (UTC)

But the two probability interpretations do lead to (almost) completely different approaches to inference. It probably should be mentioned somewhere, just not in the lead. BTW, despite being educated almost entirely from the frequentist perspective, I'm always a little uncomfortable when relative frequency is presented in textbooks as the "definition" of probability. (IOW, I agree with you.) - dcljr 19:31, 2 Sep 2004 (UTC)

Comments? Suggestions? (...I ask with much trepidation) - dcljr 20:49, 1 Sep 2004 (UTC)

Well it's better than what's there now. The reference to human knowledge in the first sentence of the current article is weird (I can't decide whether it's redundant or just wrong). Your additions will be the object of further modifications, but I suggest you blow away the current lead section.CSTAR 23:44, 1 Sep 2004 (UTC)

Okay, I'll leave it here for a few days so others can comment. If there are no strong objections, I'll move it to the article. - dcljr 19:31, 2 Sep 2004 (UTC)

Be bold in updating pages — Miguel 17:33, 3 Sep 2004 (UTC)

In my opinion: I am happy with your first paragraph except for the word "sample"; the rest of your paragraphs should be in the contents; statistics is not "formally" about samples; nor is your distinction between mathematical statistics and applied statistics particularly clear. --Henrygb 01:04, 5 Sep 2004 (UTC)

Is this a Bayesian/frequentist (/decision theory) thing? As I recall, all the classes I've taken and all (?) the textbooks I've seen talk about the subject in terms of samples — both applied and theoretical approaches. I guess I still don't understand what alternative you're proposing. (If not "uncertainty", if not "samples", then what?? Hmm... Are you the person who added the note about decision theory in the opening paragraph?) And when you say "formally", how formal are we talking? "Let X₁, X₂, ..., X_n be a random sample" formal? "Let X be a random vector with covariance matrix T" formal? "Let X be absolutely continuous with respect to Lebesgue measure μ" formal? Anyway, as I've already mentioned, I don't think this should be an article about statistical theory. Speaking of which, that's what I mean by mathematical statistics: the theory as opposed to the applications (applied = what you do with statistics; theory = why it works). I'm not sure how I could make that paragraph more clear. Suggestions? - dcljr 18:41, 7 Sep 2004 (UTC)

No. I mean things both like "the population of the United Kingdom is about 59.5 million", and like "the difference between the mean and the median is less than or equal to one standard deviation", neither of which have anything to do with samples, but are about data. Statistics covers both of these, as well as sampling. --Henrygb 00:44, 11 Sep 2004 (UTC)

I'm responding to Henrygb's last comment above (at 00:44, 11 Sep 2004), but the indentation is getting a bit extreme, so it's back to the left margin... Okay. Your examples actually wouldn't (necessarily) be covered by the term "statistics" in my book (especially in an article that's trying to explain what statistics is, as opposed to other, similar disciplines/practices):

"the population of the United Kingdom is about 59.5 million"

This figure is a "statistic" only in the colloquial sense of the word. It's presumably based on a census. That's not statistics (as in, "I have a degree in statistics"). In fact, you may be familiar with the controversy over using statistical methods in the U.S. census (see the Census article). It's not allowed under most people's interpretation of the relevant clause in the Constitution. (This only serves to illustrate the difference in the concepts; I'm not saying it's an airtight argument.) One could argue that graphical and numerical summaries of populations fall under the term "descriptive statistics", but no one objects to the use of those techniques to interpret census data. My point is, when the word "statistics" is used by statisticians (or by someone teaching the subject, etc.) it almost always means "inferential statistics", which uses information about a sample to infer something about a larger population. Of course, confusing the whole issue is the use of the word "statistics" by governments to refer to census data and summaries thereof (e.g., "Statistical Abstract of the United States" or the "Bureau of Labor Statistics"). The difference here is akin to the difference between the colloquial use of the term geography to refer to the "lay of the land" of an area, and the academic subject of geography, which studies many other things. In any case, the issue(s) you raise (and I've discussed) here should certainly not be ignored, but should be dealt with directly in the article.

"the difference between the mean and the median is less than or equal to one standard deviation"

That statement can be made in probability; you don't need statistics at all for that one. Certainly statistics relies heavily on probability, but they are different fields (just as engineering and physics are very different fields, even though the former relies heavily on the concepts and methods of the latter). This is why a great many Wikipedia articles start out, "In probability and statistics..." and not just "In statistics...." I don't want to offend you, Henrygb, but may I ask what your academic background is, especially as it relates to statistics? As you can see above, at first I thought your objections were based on a philosophical difference among statisticians (Bayesians, etc.), then I thought maybe you were objecting at a deep mathematical/theoretical level. I'd like to know what exactly you're basing your views on. - dcljr 05:17, 13 Sep 2004 (UTC)
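For reference, the quoted inequality can be derived in purely probabilistic terms. The following is a sketch, assuming X is a random variable with finite variance, mean μ, a median m and standard deviation σ (these symbols are introduced here for illustration and are not from the discussion above):

```latex
\begin{align*}
|\mu - m| &= \bigl|\operatorname{E}(X - m)\bigr| \\
          &\le \operatorname{E}|X - m|
             && \text{(Jensen's inequality for } |\cdot| \text{)} \\
          &\le \operatorname{E}|X - \mu|
             && \text{(a median minimizes } c \mapsto \operatorname{E}|X - c| \text{)} \\
          &\le \sqrt{\operatorname{E}\bigl[(X - \mu)^2\bigr]} = \sigma
             && \text{(Jensen's inequality for } t \mapsto t^2 \text{)}
\end{align*}
```

The same chain applies to a finite data set treated as an empirical distribution, provided the standard deviation is computed with divisor n rather than n − 1.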

A strange request, but I'll play. I have a mathematics degree from the University of Cambridge having concentrated on what was called "applicable mathematics" (i.e. numerical analysis, probability, statistics, mathematical economics, coding theory etc.). I am now a member of the (British) Government Statistical Service. Your turn.
I am saying statistics is about data and its handling, presentation and use for drawing inferences, and that the use of samples is only one part of that. What you describe as the "colloquial sense of the word" (which presumably also refers to topics like baseball statistics) is not only the origin of statistics but one of its major contemporary meanings. While random variables and distributions in probability have descriptive statistics, so too do data sets which are not random. Indeed I would suggest that what you think of as statistics is much more probability based than the broader concept I am considering. Look at the list of statistical topics and my guess is that the majority of the articles do not mention sampling. --Henrygb 00:13, 14 Sep 2004 (UTC)

So... when you're doing inference and not using sampling, then you must be using either Bayesian analysis or some decision-theoretic approach, right? Not classical inference (t-test, ANOVA...). Anyway, nevermind. I give up. If others want to weigh in on this subject, please do. Henrygb, at my User page you can see both my statistics credentials (User:dcljr) and my (latest) revised lead section (User:dcljr/Statistics#Preamble — ~~I know you won't agree with one sentence in there~~). I haven't done anything to the article yet because I'd like to flesh out a little more of the main article text to complement the extensive lead section I'm proposing. Then others can have at it. ~~- dcljr 06:15, 21 Sep 2004 (UTC)~~ I removed the offending statement from my lead section draft in my last edit. - dcljr 06:36, 21 Sep 2004 (UTC)

Probability

I can't make heads or tails from this paragraph:

However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10-4 and a probability of 10-9, despite the very practical difference between them. If you expect to cross the road about 105 or 106 times in your life, then reducing your risk of being run over per road crossing to 10-9 will make you safe for your whole life, while a risk per road crossing of 10-4 will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.

What is meant by 10-4 or 10-9? Is that meant to be scientific notation (ten to the -4th and 10 to the -9th)?

The example makes little sense either. Why 105 or 106 road crossings and not 100, say. And I don't think reducing the risk to 10-9 means it will make you safe for your whole life, rather than that it will be very unlikely that you will be run over.

Unfortunately, the only statistics I learnt was in high school, so I'm not certain how to improve this article myself.

--Martin Wisse 06:51, 2 Nov 2004 (UTC)

You are ignoring (or not seeing) the superscripts. 10⁻⁴ does indeed mean 10 to the power of −4, i.e. 0.0001 or a 1 in 10000 chance. --Henrygb 21:06, 29 Nov 2004 (UTC)
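With the superscripts restored, the arithmetic behind the road-crossing example can be checked with a short sketch. It assumes each crossing is an independent event with the same per-crossing risk, which the quoted paragraph implies but does not state; the function name prob_at_least_one is made up for illustration.

```python
import math


def prob_at_least_one(p_per_event: float, n_events: int) -> float:
    """P(at least one accident in n independent events) = 1 - (1 - p)^n."""
    # log1p and expm1 keep precision when p is very small.
    return -math.expm1(n_events * math.log1p(-p_per_event))


if __name__ == "__main__":
    for p in (1e-4, 1e-9):          # per-crossing risks from the quoted paragraph
        for n in (10**5, 10**6):    # lifetime numbers of crossings from the quoted paragraph
            print(f"p = {p:g}, n = {n:d}: lifetime risk = {prob_at_least_one(p, n):.6f}")
```

Under these assumptions a per-crossing risk of 10⁻⁴ gives a lifetime risk above 0.9999 for 10⁵ crossings, while 10⁻⁹ gives a lifetime risk of roughly 0.001 even for 10⁶ crossings. So "safe for your whole life" in the quoted paragraph means "very unlikely to be run over", not "certain not to be".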

Could someone who feels half way competent to do so put some pointers to the philosophical foundations of probability and statistics. Statistical reasoning has always fascinated and amazed me with some breathtaking inferences, and it would be nice to know if there is a way into this stuff. --Publunch 18:08, 22 Dec 2004 (UTC)

Probabilities in Bayesian statistics

The following puzzles me

Use of prior probabilities of 0 (or 1) causes problems in Bayesian statistics, since the posterior distribution is then forced to be 0 (or 1) as well. In other words, the data is not taken into account at all! As Lindley puts it, if a coherent Bayesian attaches a prior probability of zero to the hypothesis that the Moon is made of green cheese, then even whole armies of astronauts coming back bearing green cheese cannot convince him. Lindley advocates (…)

I haven't read Lindley's book, but I am a statistician and Bayesian statistics is my area, and I have no idea what the above is supposed to mean. As it stands it is just nonsense to me.

To keep it simple, let's assume a linear model and a normal (Gaussian) distribution. In this case, the posterior distribution is a weighted average of the prior distribution and the distribution of the observations. Before any observations are gathered, the posterior distribution is identical to the prior distribution. As more and more observations arrive, the posterior distribution will converge to the distribution of the observations. Infinitely many observations would result in a posterior distribution identical to the distribution of the observations, with no weight on the prior distribution at all. No matter what the prior distribution is, it will count less and less as more observations are taken into account. In particular, if we use a degenerate prior distribution with infinite variance (and zero density everywhere), the Bayesian approach gives the same result as a “frequentist” approach. The reason is that the prior distribution has zero density and contributes nothing in the weighted average of the prior distribution and the distribution of the observations, giving a posterior distribution always identical to the distribution of the observations. Anyway, I have no idea why probabilities of 0 or 1 should cause trouble in Bayesian statistics. –Peter J. Acklam 22:36, 18 Jan 2005 (UTC)

I think you've misunderstood. The statement that if the prior probability of a proposition is 0 or 1, then so is the posterior, is correct; it's trivial mathematics. You're being really vague about your proposed model. You wrote:

let's assume a linear model and a normal (Gaussian) distribution. In this case, the posterior distribution is

Posterior distribution of what?? Often one talks about a N(μ, σ²) distribution of some quantity to be observed--call that X, and one speaks of prior and posterior distributions of μ (or of μ and σ, but let's keep it simple, and while we're at it assume σ = 1). That's the conditional distribution of X given μ. OK, simple case: the prior says that μ = 1 or 2, each with probability 1/2. Now keep repeating the experiment. The observations of i.i.d. copies of X are conditionally independent given μ. If μ is really equal to 1, then the posterior will, with probability 1, converge to a probability distribution that assigns probability 1 to μ = 1. The posterior distribution will not "converge to the distribution of the observations", since those will be normally distributed! Michael Hardy 02:50, 19 Jan 2005 (UTC)
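The two-point example above, and the original claim about priors of 0 or 1, can both be checked with a short simulation. This is only a sketch: it assumes X is N(μ, 1) with μ restricted to a finite set of candidate values, and the function name posterior, the random seed and the sample size of 200 are arbitrary choices made for illustration.

```python
import math
import random


def posterior(prior: dict[float, float], data: list[float]) -> dict[float, float]:
    """Posterior over a finite set of candidate means mu for N(mu, 1) data."""
    log_weights = {}
    for mu, p in prior.items():
        if p == 0.0:
            continue  # prior mass 0 stays 0 whatever the data say
        # Gaussian log-likelihood up to a constant shared by every candidate.
        log_lik = sum(-0.5 * (x - mu) ** 2 for x in data)
        log_weights[mu] = math.log(p) + log_lik
    top = max(log_weights.values())  # subtract the maximum to avoid underflow
    unnorm = {mu: math.exp(v - top) for mu, v in log_weights.items()}
    total = sum(unnorm.values())
    return {mu: unnorm.get(mu, 0.0) / total for mu in prior}


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(1.0, 1.0) for _ in range(200)]  # simulated with true mu = 1

    # Prior 1/2 on each of mu = 1 and mu = 2: mass should pile up on mu = 1.
    print(posterior({1.0: 0.5, 2.0: 0.5}, data))

    # Prior 0 on mu = 1: the posterior cannot move there, even though mu = 1 is true.
    print(posterior({1.0: 0.0, 2.0: 1.0}, data))
```

The second call is the "green cheese" situation in miniature: the data strongly favour μ = 1, but a hypothesis given prior probability 0 keeps posterior probability 0.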

a weighted average of the prior distribution and the distribution of the observations. Before any observations are gathered, the posterior distribution is identical to the prior distribution. As more and more observations arrive, the posterior distribution will converge to the distribution of the observations.

I wrote that way too late yesterday. What I had in mind was a case where you are estimating μ or σ². The posterior distribution does not, as you point out, converge to the distribution of the observations, but to a distribution based on the information in the observations. Anyway, never mind. –Peter J. Acklam 08:21, 19 Jan 2005 (UTC)

Help needed

Hi there. Could somebody take a look at the trend article? Is the statistical term trend correct? If so, it needs expansion. Thanks. Oleg Alexandrov | talk 03:46, 24 Jan 2005 (UTC)

Removed some external links

I have massaged the External links section a bit and removed the following entries (others could stand to be culled IMHO, but I didn't do so):

http://www.thenakedscientists.com/HTML/Columnists/robstanforthcolumn2.htm The Probability of Co-incidence
Dedicated website (in Italian) (http://www.statistica.it)

While the first may be an interesting article, it's not really directly relevant to statistics (it would belong at Probability, if anywhere); and I moved (2) to the statistics article at the Italian Wikipedia. - dcljr 23:11, 27 Jan 2005 (UTC)

virtual reality

Probability: What is the meaning of: In reality there is virtually nothing...? In throwing a die, the event "the die has been thrown" has probability exactly 1. What is meant, I assume, is that one can never be absolutely sure of any future event. But then this "event" itself is absolutely sure!?130.89.219.54 17:18, 31 Jan 2005 (UTC)

Statistical Software - removal of SigmaXL link

You seem to have a problem with links to commercial sites, but in an inconsistent manner. Why then don't you remove STATA's link? The rules of Wikipedia do not forbid links to commercial sites.

STATA (http://www.statsoft.com/textbook/stathome.html) seems to actually have helpful information on their page, and allows you to try certain things, while the SigmaXL Excel Add-in (http://www.sigmaxl.com) only tells you to "Download a 30-Day trial". This, together with your repeated insistence, makes me think that you are looking for free advertising. If so, Wikipedia is not the place to go. Oleg Alexandrov 16:33, 6 Mar 2005 (UTC)

Someone who does not know the difference between stata.com and statsoft.com should not be editing the Statistical software page. As for my insistence, our product is a significant contribution to the market for powerful, easy to use and inexpensive statistical software. Therefore it deserves a place of mention alongside products like Minitab. I will remove the url, but request that you keep the name up.

This is meant to be a list of statistical packages in common use. I have previously heard of all the packages listed, except SigmaXL, StatPro and MacAnova. Some quick Google tests give 372,000 hits for Minitab, 363,000 hits for GNU Octave, over 6 million for Stata, and over 8 million for R (actually for statistical R). This compares to less than 500 for the StatPro add-in, about 700 for MacAnova, and 33,000 for SigmaXL. I have therefore removed StatPro and MacAnova from the list, and am tempted to remove SigmaXL unless someone can give evidence that it is as commonly used as some of the other remaining packages. -- Avenue 12:41, 21 Mar 2005 (UTC)

The 8 million Google references for Statistical R go down to 41,000 if you enter "R language" or 38,000 for "R project" statistical.

A Google test for R is inevitably going to be subjective, and I admit that Statistical R will include some false hits, but I think those two search phrases are somewhat unnatural. There are 2.86 million results for R Statistical Software, and the first false hit was number 43 in the list, so I believe the true number of references to R would be measured in hundreds of thousands at least. -- Avenue 15:38, 31 Mar 2005 (UTC)

That SigmaXL was put in by an employee of that company could in itself be a good enough reason to remove it. That employee probably meant well, but we would not want Wikipedia to be a medium for free advertising. Oleg Alexandrov 13:00, 21 Mar 2005 (UTC)

I disagree; I believe their contribution should be judged on its merits. But the fact that, as an employee, they may have an interest in promoting their company's product means some skepticism is probably called for. -- Avenue 15:38, 31 Mar 2005 (UTC)

You are right: just because it was put in by an employee does not mean it should be deleted automatically. It is entirely up to you whether to keep that link; I know nothing about statistics. I am just weary of people abusing the external link section. Oleg Alexandrov 16:08, 31 Mar 2005 (UTC)

No evidence that SigmaXL is in relatively common use has been provided. I also note that it only has the fourth highest Google pagerank of the add-ins listed here [1] (http://directory.google.com/Top/Science/Math/Statistics/Software/Excel_Add-In/). I will therefore delete it from the list. However I will also add Google's link. -- Avenue 13:32, 3 Apr 2005 (UTC)

Questions and Suggestions

I'm neither a statistician nor mathematician so bear with me through these comments.

First sentence.

Is "statistics" a science or is "statistical theory" the science and "statistics" the term for the information gathered? Is "human knowledge" compared to "inhuman knowledge" or "non-human knowledge"? Should the separate article "data" be merged with "statistics" and a redirect left at the "data" heading? The separate "information" article within Wikipedia is significant and easily stands alone but the "data" article seems subsidiary to "statistics".

The first sentence would be more clear to me, a layman, if it read as follows: "Statistics are the information (i.e. knowledge) created by the application of mathematics to data."

Rest of first paragraph:

"The branch of mathematics used is statistical theory. Within statistical theory, randomness and uncertainty are modelled by probability theory. Because one aim of statistics is to produce the "best" information from available data, some authors consider statistics a branch of decision theory. Statistical practice includes the planning, summarizing, and interpreting of observations, allowing for variability and uncertainty."

I think the separate articles "data" and "probability theory" should be merged with "statistics".

And I think there needs to be a discussion of the "statistical failure" in the exit polls during the 2004 U.S. Presidential elections to explain -- in layman's terms -- the importance of data accumulation, how errors arise, how the probability of error mathematically decreases as the sample size grows, etc.

Someone (either Mark Twain or Benjamin Disraeli) once said: "There are three kinds of lies: lies, damned lies, and statistics." I think there should be a discussion of "false statistics", information produced to prove a point rather than producing "correct" information.

Johnwhunt 18:49, 27 Mar 2005 (UTC)
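On the point above that error shrinks as the sample grows: a minimal sketch, assuming a simple random sample and the usual normal approximation for a proportion. The function name margin_of_error and the sample sizes are made up for illustration and are not taken from any exit poll. It covers random sampling error only; it says nothing about bias, such as some voters being less willing to answer, which a larger sample does not reduce.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling and the normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Hypothetical sample sizes; p = 0.5 is the worst case for a proportion.
    for n in (100, 400, 1600, 10000):
        print(n, round(margin_of_error(0.5, n), 4))
```

Because the error scales with one over the square root of the sample size, quadrupling the sample only halves the margin of error.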

Statistics is a science; for example, there are Statistics Departments in many universities. These teach the science (and hopefully some of the art) of statistics, including statistical theory and applications. But I agree that our article should probably also mention the more concrete meaning, i.e. statistics = the plural of statistic.

"Human" does seem redundant. I'll delete it and see if anyone complains.

Data has different meanings in statistics and in computer science, with the latter usage becoming more widespread over time. I think the Data article is needed to distinguish them.

Probability theory is a distinct area from statistics or even mathematical statistics, and deserves its own article.

The Misuse of statistics article discusses misleading statistics, and is listed in the "See also" section here.

There is a separate article on problems related to the 2004 exit polls: 2004 U.S. presidential election controversy, exit polls.

-- Avenue 01:38, 28 Mar 2005 (UTC)

Origin

The Origin section could stand to be made more consistent and accurate (esp. by cross-referencing with other Wikis and other sources). ~ Dpr 05:46, 11 Jun 2005 (UTC)
