180 lines
15 KiB
Plaintext
180 lines
15 KiB
Plaintext
|
|
Episode: 1615
|
||
|
|
Title: HPR1615: 39 - LibreOffice Calc - Inferential Statistics Functions
|
||
|
|
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1615/hpr1615.mp3
|
||
|
|
Transcribed: 2025-10-18 05:52:14
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
This episode of HBR is brought to you by AnanasThost.com.
|
||
|
|
Get 15% discount on all shared hosting with the offer code HBR15.
|
||
|
|
That's HBR15.
|
||
|
|
Better web hosting that's honest and fair at AnanasThost.com.
|
||
|
|
Hello, this is Ahuka, welcoming you to Hacker Public Radio.
|
||
|
|
And another exciting episode in our ongoing series on Libra Office Calc.
|
||
|
|
And in this particular episode, what I want to do is add to what we did last time.
|
||
|
|
Last time we looked at descriptive statistical functions.
|
||
|
|
This time I want to look on inferential statistical functions.
|
||
|
|
And that should allow us to cover everything I want to do about statistics functions that are in Calc.
|
||
|
|
Now, inferential statistics is what you do to say that something is likely,
|
||
|
|
or that it is not due to chance, or things of that sort.
|
||
|
|
It goes beyond simply describing what is in the numbers and let you say something about what
|
||
|
|
the numbers in a sample might mean for the population that generated the sample.
|
||
|
|
There are several types of inferential statistics that I want to address in this tutorial,
|
||
|
|
beginning with the idea of a confidence interval.
|
||
|
|
This is a pretty simple thing, but useful.
|
||
|
|
The idea is that in some population you are interested in, you draw a sample,
|
||
|
|
and use descriptive statistics to measure the sample. A common example is polling.
|
||
|
|
The entire population of a country could be very expensive to measure,
|
||
|
|
but a sample of, say, a thousand is affordable and gives you some insight into the whole population.
|
||
|
|
A confidence interval takes the sample measurement and gives a range of numbers based on that,
|
||
|
|
which will most likely contain the true population measurement.
|
||
|
|
To calculate this, you need three numbers, two of which we have already talked about in
|
||
|
|
descriptive statistics. The first is the sample measurement, and for example, let's use the average.
|
||
|
|
The second is the sample size, which in our sample data was 30 observations.
|
||
|
|
The third parameter is the alpha, and alpha is the probability that you've got the sample
|
||
|
|
purely by chance from a population that is significantly different from what you say it is.
|
||
|
|
It is, in short, the probability of being wrong in the inference you are drawing.
|
||
|
|
Now, you could choose any value you want for alpha, but you should know that there are some
|
||
|
|
industry standard values, and the most common one is 0.05 or 5 percent.
|
||
|
|
This means that your inference will be wrong one time in 20.
|
||
|
|
This is the most common value used in political polls, for example, which means that one
|
||
|
|
poll in 20 will be simply wrong. Everyone in the industry knows this, and now you do too.
|
||
|
|
To begin, select the confidence function in the function wizard.
|
||
|
|
Enter the value of 0.05 for the alpha, whatever you got for the sample standard deviation in your
|
||
|
|
numbers using the STDEV and 30 for the N or number of observations. It is important that you
|
||
|
|
use the sample standard deviation to get an accurate number, which is why they use the STDEV
|
||
|
|
function as the name on this field. Click OK, and you will get a number. In my case, I got 10.849.
|
||
|
|
Now, to turn this into a confidence interval, I take my average, which was 49.078,
|
||
|
|
and find the lower limit by subtracting 10.849 at the upper limit by adding 10.849.
|
||
|
|
In my numbers, I got the result that I am 95 percent confident that the true population mean
|
||
|
|
from which the sample was drawn is between 38.228 and 59.927.
|
||
|
|
Now, that's a fairly broad range, but our sample size was only 30. If our sample size was a
|
||
|
|
thousand, it would be a lot narrower. Now, the next major area of inferential statistics I want to
|
||
|
|
address concerns statistical tests. We won't be able to give a full description of the theory
|
||
|
|
behind all of these tests, which is more properly addressed as part of a course in statistics,
|
||
|
|
and as I said last time, there are colleges and universities that offer them, including
|
||
|
|
free online courses. So if you're really interested in that sort of thing, there are ways to
|
||
|
|
start acquiring that knowledge. But in general, statistical testing is about formulating a claim
|
||
|
|
and then doing a calculation to see if it is likely to be true. Again, the likelihood of truth
|
||
|
|
is something you can choose. The industry standard is usually 95 percent, but it is often expressed
|
||
|
|
in a slightly different way as significant at the point 0.05 level. This is just yet another
|
||
|
|
way of saying that one time in 20 year results will be wrong. So here are some common tests.
|
||
|
|
F test. For this use the FTEST function, this compares two samples and asks what is the
|
||
|
|
probability that the variance is the same in the populations from which they were drawn.
|
||
|
|
For this I created two columns of sample data in a way similar to how I created one column earlier
|
||
|
|
for the descriptive statistics. Since both use the random function to get numbers between 0 and 100,
|
||
|
|
it's not surprising that I got a 93 percent probability that they come from populations with the same
|
||
|
|
variance. After all, I made up the data the same way. T test. This compares the means of two samples
|
||
|
|
and asks that they are in fact the same in the population. This takes several parameters.
|
||
|
|
You need to have the two data distributions in the background. Calc is calculating both the
|
||
|
|
mean and the variance for each. You need to have the mode one tailed or two tailed and the type.
|
||
|
|
One, you would put in a one prepared a two for two samples equal variance or a three for two
|
||
|
|
samples unequal variance. Next test is the chi-squared CHI-T-E-S-T as the name of the function.
|
||
|
|
This is a test of whether two variables are independent of each other. For example,
|
||
|
|
does knowing the political party a person votes for allow you to predict which church they belong to?
|
||
|
|
If the answer is no then they are independent. This means you need to have data where each of the
|
||
|
|
variables is measured for every individual in the sample and that we know the measurements for each
|
||
|
|
individual. If you have this kind of data just plug in the two data groups into the function.
|
||
|
|
Z test. This is one of the most common tests used in statistics and one of the first introduced
|
||
|
|
to students of the subject. This test compares the sample to a known population and asks if the
|
||
|
|
sample is similar. For example, we might know that in general a country of 30 million people
|
||
|
|
has an average lifespan of 72 years with a standard deviation of 10 years. We get a sample of
|
||
|
|
30 people from a particular city in that country and get an average of 65 years. Is it likely that
|
||
|
|
this represents a significant difference or is it just random chance? This is the kind of thing you
|
||
|
|
can test with a Z test. Now, the next thing I want to look at is something called regression.
|
||
|
|
This is a technique that is used to estimate a relationship between two variables.
|
||
|
|
The most common way is to look for a straight line that best fits the data, though there are
|
||
|
|
also techniques that use curves and Calc has that built in as well. Essentially what a linear regression
|
||
|
|
does is to minimize the errors in the estimation. In the real world, if there is a relationship between
|
||
|
|
two variables, it is rarely a pure relationship because other things are going on.
|
||
|
|
And that means that the relationship you come up with in your model does not completely explain
|
||
|
|
what is going on. You pick one variable which you think is the cause, which is usually called either
|
||
|
|
the independent variable or the explanatory variable. And the other variable is the effect,
|
||
|
|
and it is called the dependent variable. The relationship you come up with takes the independent
|
||
|
|
variable as an input and comes up with a predicted value for the dependent variable.
|
||
|
|
The difference between the predicted value and what is actually measured is the error which
|
||
|
|
this technique seeks to reduce. Again, you can get a lot more technical about this stuff,
|
||
|
|
but that is best left to a course in statistics.
|
||
|
|
An example of what we're talking about is the project I was given as a class assignment in my
|
||
|
|
first year of graduate school in economics. We collected data on rents in the N-Arbor area and on
|
||
|
|
the distance of each apartment from the central campus. The economic model we had in mind
|
||
|
|
postulated that the closer the apartment was to the central campus, the higher the rent.
|
||
|
|
And we estimated a relationship using a linear regression technique.
|
||
|
|
Not surprisingly, we found a fairly strong relationship. But equally important is that there
|
||
|
|
are obviously other factors that affect the rent, such as the amenities, being near public
|
||
|
|
transportation, and so on. The example that I'm doing in my sample spreadsheet here,
|
||
|
|
I'm using a very simple economic model. I gathered data on the growth rate of GDP of the
|
||
|
|
United States by year, from 1933 to 2013. And the federal government budget deficit is a
|
||
|
|
percent of GDP for the same years. I'm going to postulate a relationship between them,
|
||
|
|
but which variable is independent and which is dependent here. It could go either way,
|
||
|
|
depending on how you look at it. But let's say that how fast or not the economy grows will
|
||
|
|
determine the level of the deficit. This is not a course in economics. I just need an example
|
||
|
|
to illustrate the technique. The first thing I usually do is get a scatter plot of the data to
|
||
|
|
see if the data fits my ideas at all. I did it for this, and I got a chart. And there is a web
|
||
|
|
site, a web page in the show notes where you can go and look at this. And also, as I said, the
|
||
|
|
spreadsheet that I created as an example for this lesson is also downloadable. All of this is in
|
||
|
|
the show notes. So what a scatter plot is is it takes all of these observations, plots them on
|
||
|
|
a graph with one variable on the horizontal axis, the other on the vertical axis. The vertical one
|
||
|
|
is the dependent variable. The horizontal one is the independent variable. You can sort of look at
|
||
|
|
that cloud of data and sort of see, yeah, is there a line that goes through it? And in the one
|
||
|
|
that I did, it sort of looks like, yeah, there's a line that goes up from left to right.
|
||
|
|
Well, what else can I do? I can go back to this chart, the scatter plot that I did,
|
||
|
|
and if I click on it to select it, I should see the eight handles. Okay, those are little squares
|
||
|
|
that are on the four corners and in the middle of the four sides. And those are the ones that you could
|
||
|
|
click on and drag to change the size of the chart. So I call those the handles. Now, once I've selected
|
||
|
|
it, and I see the eight handles, if I right click, I can select edit. And then if I right click on
|
||
|
|
the dots in my chart, I get a properties window. And in the properties window, at the very top, it says
|
||
|
|
format data series. Then under that, it says insert data labels. And under that, insert trendline.
|
||
|
|
And the trendline is the thing that I'm interested in here. Now, note that that is going to bring
|
||
|
|
up a window. If I select trendline, it then brings up a window that has properties. And I can
|
||
|
|
select the type of trendline. And it's called regression type. And you have four options, linear,
|
||
|
|
logarithmic, exponential, and power. Now, the one I want here is linear because that I said is what
|
||
|
|
I was going for. But I mentioned earlier that there are other kinds of regression that use more
|
||
|
|
complicated curves and what have you. And I said, count to do it. Now, you can see looking at it,
|
||
|
|
you've got three options right here. So I'm going to select linear. If I click on that, then I get
|
||
|
|
my same scatter plot. But now there's a straight line drawn through the middle of it that fits the data.
|
||
|
|
And then off to the side, I see a little thing that says linear regression line, a little label legend.
|
||
|
|
Now, if we go back to where we put in the trendline, take a look at that box. So that at the top,
|
||
|
|
it was the type tab and then regression type linear logarithmic, exponential power. Underneath that,
|
||
|
|
there were, there's a section called equation. And two check boxes. One check box is show the equation.
|
||
|
|
And the other is show the coefficient of determination, which is the R squared. If we check those,
|
||
|
|
what do we get? Well, now we go back and we have the scatter plot with the regression line drawn
|
||
|
|
through it. And sort of superimposed on top of the scatter plot, we see the regression equation
|
||
|
|
and the R squared. That's not really a terribly good place for it, in my view. It's really hard to read.
|
||
|
|
But remember in our discussion of charts, we talked about objects contained within objects.
|
||
|
|
Well, go into edit mode. And you can select these numbers as a unit and drag them to the side,
|
||
|
|
where you've got white space and it's a perfectly good place for them.
|
||
|
|
So when I did it with my sample, what I got, the equation could be interpreted as saying that the deficit
|
||
|
|
as a percent of GDP is equal to 15.299 times the GDP growth rate plus a constant of 1.123.
|
||
|
|
And the R squared is something called the coefficient of determination, which is a statistical measure
|
||
|
|
of how well the model fits the actual data. What is considered a good fit can vary according to
|
||
|
|
the circumstances, but again, this is not a course in either statistics or economics.
|
||
|
|
Of course, you can get these numbers from other functions as well. The regression equation
|
||
|
|
really has two parts, the slope and the intercept. You can get each of those individually
|
||
|
|
by calculating with functions called slope and intercept, surprise. And then you also have
|
||
|
|
the R squared, which is in a function called RSQ. So if you wanted to, you could individually
|
||
|
|
calculate each one of these, but it's a lot more convenient to do it this way, I think.
|
||
|
|
So over the last two tutorials, we've looked at some fairly common statistical functions
|
||
|
|
and showed how they can be used. I will when final time say that the objective is not to
|
||
|
|
provide an understanding of statistics, for that you would need to take a course on the subject.
|
||
|
|
All we're trying to do here is demonstrate that many of the statistical measures you might
|
||
|
|
want to use can be calculated within LibreOfficeCalk. I remember when I had to purchase a statistical
|
||
|
|
package for a lot of money to get these capabilities, so getting them in an open source program
|
||
|
|
is really great. The other thing I would emphasize is that the use of these functions is pretty
|
||
|
|
standardized. Step one is to make sure you have the data and you begin by checking the required
|
||
|
|
arguments of the function. If you have these data on hand, you insert the function from the function
|
||
|
|
wizard, point to the cells that contain your data and the function will do the work. It really is
|
||
|
|
that simple. So I do have a spreadsheet that has all of this stuff on it. If you want to see
|
||
|
|
the work that I did, you can download the spreadsheet, take a look at the link in the show notes.
|
||
|
|
But for now, this is Ahuka signing off for Hacker Public Radio and reminding you as always to
|
||
|
|
please support free software. Bye bye.
|
||
|
|
You've been listening to Hacker Public Radio at Hacker Public Radio. We are a community podcast
|
||
|
|
network that releases shows every weekday Monday through Friday. Today's show, like all our shows,
|
||
|
|
was contributed by an HBR listener like yourself. If you ever thought of recording a podcast,
|
||
|
|
then click on our contribute link to find out how easy it really is. Hacker Public Radio was
|
||
|
|
founded by the digital dog pound and the infonomicum computer club and is part of the binary
|
||
|
|
revolution at binrev.com. If you have comments on today's show, please email the host directly,
|
||
|
|
leave a comment on the website or record a follow-up episode yourself, unless otherwise
|
||
|
|
stated. Today's show is released on the create of comments, attribution, share a like,
|
||
|
|
3.0 license.
|