hpr-knowledge-base/hpr_transcripts/hpr1615.txt

Episode: 1615
Title: HPR1615: 39 - LibreOffice Calc - Inferential Statistics Functions
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1615/hpr1615.mp3
Transcribed: 2025-10-18 05:52:14

---

This episode of HBR is brought to you by AnanasThost.com.
Get 15% discount on all shared hosting with the offer code HBR15.
That's HBR15.
Better web hosting that's honest and fair at AnanasThost.com.
Hello, this is Ahuka, welcoming you to Hacker Public Radio.
And another exciting episode in our ongoing series on Libra Office Calc.
And in this particular episode, what I want to do is add to what we did last time.
Last time we looked at descriptive statistical functions.
This time I want to look on inferential statistical functions.
And that should allow us to cover everything I want to do about statistics functions that are in Calc.
Now, inferential statistics is what you do to say that something is likely,
or that it is not due to chance, or things of that sort.
It goes beyond simply describing what is in the numbers and let you say something about what
the numbers in a sample might mean for the population that generated the sample.
There are several types of inferential statistics that I want to address in this tutorial,
beginning with the idea of a confidence interval.
This is a pretty simple thing, but useful.
The idea is that in some population you are interested in, you draw a sample,
and use descriptive statistics to measure the sample. A common example is polling.
The entire population of a country could be very expensive to measure,
but a sample of, say, a thousand is affordable and gives you some insight into the whole population.
A confidence interval takes the sample measurement and gives a range of numbers based on that,
which will most likely contain the true population measurement.
To calculate this, you need three numbers, two of which we have already talked about in
descriptive statistics. The first is the sample measurement, and for example, let's use the average.
The second is the sample size, which in our sample data was 30 observations.
The third parameter is the alpha, and alpha is the probability that you've got the sample
purely by chance from a population that is significantly different from what you say it is.
It is, in short, the probability of being wrong in the inference you are drawing.
Now, you could choose any value you want for alpha, but you should know that there are some
industry standard values, and the most common one is 0.05 or 5 percent.
This means that your inference will be wrong one time in 20.
This is the most common value used in political polls, for example, which means that one
poll in 20 will be simply wrong. Everyone in the industry knows this, and now you do too.
To begin, select the confidence function in the function wizard.
Enter the value of 0.05 for the alpha, whatever you got for the sample standard deviation in your
numbers using the STDEV and 30 for the N or number of observations. It is important that you
use the sample standard deviation to get an accurate number, which is why they use the STDEV
function as the name on this field. Click OK, and you will get a number. In my case, I got 10.849.
Now, to turn this into a confidence interval, I take my average, which was 49.078,
and find the lower limit by subtracting 10.849 at the upper limit by adding 10.849.
In my numbers, I got the result that I am 95 percent confident that the true population mean
from which the sample was drawn is between 38.228 and 59.927.
Now, that's a fairly broad range, but our sample size was only 30. If our sample size was a
thousand, it would be a lot narrower. Now, the next major area of inferential statistics I want to
address concerns statistical tests. We won't be able to give a full description of the theory
behind all of these tests, which is more properly addressed as part of a course in statistics,
and as I said last time, there are colleges and universities that offer them, including
free online courses. So if you're really interested in that sort of thing, there are ways to
start acquiring that knowledge. But in general, statistical testing is about formulating a claim
and then doing a calculation to see if it is likely to be true. Again, the likelihood of truth
is something you can choose. The industry standard is usually 95 percent, but it is often expressed
in a slightly different way as significant at the point 0.05 level. This is just yet another
way of saying that one time in 20 year results will be wrong. So here are some common tests.
F test. For this use the FTEST function, this compares two samples and asks what is the
probability that the variance is the same in the populations from which they were drawn.
For this I created two columns of sample data in a way similar to how I created one column earlier
for the descriptive statistics. Since both use the random function to get numbers between 0 and 100,
it's not surprising that I got a 93 percent probability that they come from populations with the same
variance. After all, I made up the data the same way. T test. This compares the means of two samples
and asks that they are in fact the same in the population. This takes several parameters.
You need to have the two data distributions in the background. Calc is calculating both the
mean and the variance for each. You need to have the mode one tailed or two tailed and the type.
One, you would put in a one prepared a two for two samples equal variance or a three for two
samples unequal variance. Next test is the chi-squared CHI-T-E-S-T as the name of the function.
This is a test of whether two variables are independent of each other. For example,
does knowing the political party a person votes for allow you to predict which church they belong to?
If the answer is no then they are independent. This means you need to have data where each of the
variables is measured for every individual in the sample and that we know the measurements for each
individual. If you have this kind of data just plug in the two data groups into the function.
Z test. This is one of the most common tests used in statistics and one of the first introduced
to students of the subject. This test compares the sample to a known population and asks if the
sample is similar. For example, we might know that in general a country of 30 million people
has an average lifespan of 72 years with a standard deviation of 10 years. We get a sample of
30 people from a particular city in that country and get an average of 65 years. Is it likely that
this represents a significant difference or is it just random chance? This is the kind of thing you
can test with a Z test. Now, the next thing I want to look at is something called regression.
This is a technique that is used to estimate a relationship between two variables.
The most common way is to look for a straight line that best fits the data, though there are
also techniques that use curves and Calc has that built in as well. Essentially what a linear regression
does is to minimize the errors in the estimation. In the real world, if there is a relationship between
two variables, it is rarely a pure relationship because other things are going on.
And that means that the relationship you come up with in your model does not completely explain
what is going on. You pick one variable which you think is the cause, which is usually called either
the independent variable or the explanatory variable. And the other variable is the effect,
and it is called the dependent variable. The relationship you come up with takes the independent
variable as an input and comes up with a predicted value for the dependent variable.
The difference between the predicted value and what is actually measured is the error which
this technique seeks to reduce. Again, you can get a lot more technical about this stuff,
but that is best left to a course in statistics.
An example of what we're talking about is the project I was given as a class assignment in my
first year of graduate school in economics. We collected data on rents in the N-Arbor area and on
the distance of each apartment from the central campus. The economic model we had in mind
postulated that the closer the apartment was to the central campus, the higher the rent.
And we estimated a relationship using a linear regression technique.
Not surprisingly, we found a fairly strong relationship. But equally important is that there
are obviously other factors that affect the rent, such as the amenities, being near public
transportation, and so on. The example that I'm doing in my sample spreadsheet here,
I'm using a very simple economic model. I gathered data on the growth rate of GDP of the
United States by year, from 1933 to 2013. And the federal government budget deficit is a
percent of GDP for the same years. I'm going to postulate a relationship between them,
but which variable is independent and which is dependent here. It could go either way,
depending on how you look at it. But let's say that how fast or not the economy grows will
determine the level of the deficit. This is not a course in economics. I just need an example
to illustrate the technique. The first thing I usually do is get a scatter plot of the data to
see if the data fits my ideas at all. I did it for this, and I got a chart. And there is a web
site, a web page in the show notes where you can go and look at this. And also, as I said, the
spreadsheet that I created as an example for this lesson is also downloadable. All of this is in
the show notes. So what a scatter plot is is it takes all of these observations, plots them on
a graph with one variable on the horizontal axis, the other on the vertical axis. The vertical one
is the dependent variable. The horizontal one is the independent variable. You can sort of look at
that cloud of data and sort of see, yeah, is there a line that goes through it? And in the one
that I did, it sort of looks like, yeah, there's a line that goes up from left to right.
Well, what else can I do? I can go back to this chart, the scatter plot that I did,
and if I click on it to select it, I should see the eight handles. Okay, those are little squares
that are on the four corners and in the middle of the four sides. And those are the ones that you could
click on and drag to change the size of the chart. So I call those the handles. Now, once I've selected
it, and I see the eight handles, if I right click, I can select edit. And then if I right click on
the dots in my chart, I get a properties window. And in the properties window, at the very top, it says
format data series. Then under that, it says insert data labels. And under that, insert trendline.
And the trendline is the thing that I'm interested in here. Now, note that that is going to bring
up a window. If I select trendline, it then brings up a window that has properties. And I can
select the type of trendline. And it's called regression type. And you have four options, linear,
logarithmic, exponential, and power. Now, the one I want here is linear because that I said is what
I was going for. But I mentioned earlier that there are other kinds of regression that use more
complicated curves and what have you. And I said, count to do it. Now, you can see looking at it,
you've got three options right here. So I'm going to select linear. If I click on that, then I get
my same scatter plot. But now there's a straight line drawn through the middle of it that fits the data.
And then off to the side, I see a little thing that says linear regression line, a little label legend.
Now, if we go back to where we put in the trendline, take a look at that box. So that at the top,
it was the type tab and then regression type linear logarithmic, exponential power. Underneath that,
there were, there's a section called equation. And two check boxes. One check box is show the equation.
And the other is show the coefficient of determination, which is the R squared. If we check those,
what do we get? Well, now we go back and we have the scatter plot with the regression line drawn
through it. And sort of superimposed on top of the scatter plot, we see the regression equation
and the R squared. That's not really a terribly good place for it, in my view. It's really hard to read.
But remember in our discussion of charts, we talked about objects contained within objects.
Well, go into edit mode. And you can select these numbers as a unit and drag them to the side,
where you've got white space and it's a perfectly good place for them.
So when I did it with my sample, what I got, the equation could be interpreted as saying that the deficit
as a percent of GDP is equal to 15.299 times the GDP growth rate plus a constant of 1.123.
And the R squared is something called the coefficient of determination, which is a statistical measure
of how well the model fits the actual data. What is considered a good fit can vary according to
the circumstances, but again, this is not a course in either statistics or economics.
Of course, you can get these numbers from other functions as well. The regression equation
really has two parts, the slope and the intercept. You can get each of those individually
by calculating with functions called slope and intercept, surprise. And then you also have
the R squared, which is in a function called RSQ. So if you wanted to, you could individually
calculate each one of these, but it's a lot more convenient to do it this way, I think.
So over the last two tutorials, we've looked at some fairly common statistical functions
and showed how they can be used. I will when final time say that the objective is not to
provide an understanding of statistics, for that you would need to take a course on the subject.
All we're trying to do here is demonstrate that many of the statistical measures you might
want to use can be calculated within LibreOfficeCalk. I remember when I had to purchase a statistical
package for a lot of money to get these capabilities, so getting them in an open source program
is really great. The other thing I would emphasize is that the use of these functions is pretty
standardized. Step one is to make sure you have the data and you begin by checking the required
arguments of the function. If you have these data on hand, you insert the function from the function
wizard, point to the cells that contain your data and the function will do the work. It really is
that simple. So I do have a spreadsheet that has all of this stuff on it. If you want to see
the work that I did, you can download the spreadsheet, take a look at the link in the show notes.
But for now, this is Ahuka signing off for Hacker Public Radio and reminding you as always to
please support free software. Bye bye.
You've been listening to Hacker Public Radio at Hacker Public Radio. We are a community podcast
network that releases shows every weekday Monday through Friday. Today's show, like all our shows,
was contributed by an HBR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was
founded by the digital dog pound and the infonomicum computer club and is part of the binary
revolution at binrev.com. If you have comments on today's show, please email the host directly,
leave a comment on the website or record a follow-up episode yourself, unless otherwise
stated. Today's show is released on the create of comments, attribution, share a like,
3.0 license.