Episode: 1605
Title: HPR1605: 38 - LibreOffice Calc - simple Descriptive Statistics
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1605/hpr1605.mp3
Transcribed: 2025-10-18 05:43:56

---

This episode of HBR is brought to you by AnanasThost.com.
Get 15% discount on all shared hosting with the offer code HBR15.
That's HBR15.
Better web hosting that's Aniston Fair at AnanasThost.com.
Hello, this is Ahuka, welcoming you to Hacker Public Radio.
And another in our series of tutorials, I guess you would call it, on Libra Office Calc.
And what I want to talk about today is the functions that are available to deal with simple descriptive statistics.
We started our look at functions by looking at all of the financial functions, very important area,
but I think there's a lot of good stuff here as well.
So, what are we talking about?
In statistics, there are generally speaking two types of analysis,
broken down between descriptive and inferential statistics.
Difference has to do with what claims you are making about the data.
If you are simply stating something about the data, for instance,
there were more men than women in the sample, that is descriptive.
If you are taking a look at an entire population and measuring things like the standard deviation,
that would be descriptive.
But if you're making a claim that something is not likely to occur by chance, for instance,
or that something is statistically significant, both of those statements are essentially the same thing.
Then you are in the realm of inferential statistics.
If you take a sample and do a measurement in the sample and then say, based on that,
this is what I think the population is like that is inferential.
So, Calc has functions to do both kinds of analysis, and this tutorial is going to focus on some
of the common descriptive statistics in Calc and how they are used.
Calc offers many statistical functions, of course, that you might want to make use of.
These let you get some analytics on data that you have.
But you need to have some data to start with.
And to do that, I'm going to make use of the random function to make up some numbers.
In cell A1 and B1, I set up a header by clicking merge and center,
making the font aerial 12, bold, and giving a colored background through the formatting cells option.
Then I select cells C1 and D1, merge and center to the same formatting.
So, for column A and B, I give the column named data, and for the combined C1, D1, I enter statistics.
In cell A2, I use the RAND function, found in the mathematics category, which gives me a random
number between 0 and 1. I then multiply by 100 to get numbers a little bigger, and I click and drag
through the column to get 30 numbers. When you start working, you will encounter an interesting
problem, which is that the random numbers keep changing when you change other cells.
What is happening here is that the formulas are recalculating every time the sheet recalculates.
But we can use a trick to get around this. Highlight all of the numbers,
all of the cells, and collect copy. Then with the same area highlighted,
paste on to itself by right-clicking, and selecting paste only, and then selecting number.
This takes the result of the function, and turns it into a RAND number, which replaces the
function at each cell. Now you have your data locked in, and you're ready to do some statistics.
First thing we'll look at, measures of central tendency.
In statistics, we distinguish between several different measures of central tendency.
Essentially, this is an attempt to answer the question, what does the most representative member
of this group look like? There is more than one answer depending on the data.
First is the question, does the data represent a qualitative or quantitative variable?
Yes, that question does keep coming up.
For quantitative data, there are several likely answers. One is the average,
also referred to in statistics as the mean, and another is the median, which one do you use?
Basically, it comes down to how symmetric the distribution is. When it's symmetric,
the two measures will be very close. When it is skewed, they will diverge a lot.
For instance, if you have a group of 10 people in a room and ask what the typical person's wealth is,
well, if all 10 people are reasonably normal, average people, you could just use the average and
get a good answer. But if one of those people does bill gates, you're going to get an extremely high
and unrepresented number. In that case, you should use the median, which divides the sample into
two groups and asks, where is the boundary between the top 50 percent and the bottom 50 percent?
So for average, go to the function wizard, select statistical as your category,
and average as the function. Click next. The window for putting in arguments opens with a space
for each number. You could enter each number one at a time, one per field, but that is not optimal.
Instead, click on the field for the first number, then click on cell A2. You will see that the field
now has that cell address. But now, hold down the shift key and click on cell A31, which selects
the whole column of numbers. Now the field will read equals average open parent A2 colon A31
close parent. That gives you the whole range of numbers, and when you click OK, you will get the
average of this group of numbers. In this case, we generated data using the Rand function,
multiplied by 100, which should mean random number between 0 and 100. So you should not be
surprised if your answer comes reasonably close to 50. Mine came out to 49, but your number could
be slightly different. Now the average, also called the arithmetic mean, is calculated by adding
up the measurements and dividing by the number of measurements. The geometric mean is calculated
by multiplying the numbers together, and then taking the n-th root, where n is the number of
measurements. The function is the Geo mean function, and you use it just like the average function.
In other words, select the data, paste it in, yada yada. Harmonic mean, fairly complicated to
describe, but it's used in scientific applications. The three types of mean arithmetic geometric and
harmonic are called the Pythagorean means. The harmonic mean is used, for example, in evaluating
computer algorithms. It is called the Har mean function, and again usage is exactly the same.
The rule of thumb with this is that of the three measures, the arithmetic mean is always the
largest, the geometric mean is in the middle, and the harmonic mean is the lowest.
Now, median. As above, go to the statistical functions, but this time select median.
As above, select the range for the first field. Now this may be a little farther from the middle,
depending on your numbers. Mine was 42, again random numbers, but the median is what divides
the group into two equals, you know, an upper and a lower 50%. Now, mode. This is what we get with
qualitative data, and mode is the most common. All right, so for example, let's say you had a sample
of people where you recorded their hair color as black, blonde, or red-headed, and wanted to know
what was the most representative. Now, in a case like this, it makes no sense to use something
like average. There's no way of doing that. Mode looks for the most common, so, you know,
which of those groups had the most people. Now, what you need to watch out for here is that
Calc works with numerical data, so what happens is you need to encode the data
with something like a number one means black, a number two means blonde, and a number three means red-head.
Do that, and you can put a range of data in the mode function and get your answer. Now,
because you've coded it, you could take this and put it into an arithmetic mean function,
and it would calculate something, but the answer would be totally meaningless. Do not do this.
All right, so yes, you need to code the data with numbers, but understand and use the right one,
and if you understand that all qualitative data needs something like mode, and you can't use
mean, that'll keep you on the path of virtue and righteousness. Now, the other thing that we
typically use in descriptive statistics are what we call measures of dispersion, and that tells you
how much variation there is in a group of numbers. Two different groups of numbers could have a
very similar average or median, and yet be very different when you look at the degree of variation.
For example, the number four and the number of six, if you average it together, you get five.
Well, you could also average the number zero and the number ten and get five,
and it's very clear that those are two different groups of numbers. So to address this,
we need to look at a few related measures of dispersion. But to use these functions, we need to first
discuss the difference between a population and a sample. A population means you have the entire
group measured. Well, a sample means you have some fraction of the group measured, and you want to
use that to make a claim about the population. For instance, in my past life, I worked for a company
that did political polling, and so I could say I expect, you know, 100 million voters in the United
States to vote in a presidential election, but when I do a poll, I might talk to one or two thousand
of them. All right, so the sample was one thousand, two thousand, something on that order.
The population was the one hundred million. Now this matters. I'm not going to go into it in
huge detail. There are reasons why these statistical measures are slightly different.
So just know which one you're dealing with, okay, so that you pick the right thing.
So, variance. If you want a short description of variance, it is the average of the squared
deviations from the mean. You have four possible functions here. Two of them are somewhat
specialized. I'll skip over them for now. So the ones I will talk about are VAR and VARP.
Now, VAR function. This assumes that you have a sample and will produce a slightly larger number
on that account because of the assumed sampling error. Go to the function wizard to statistical,
select the VAR function. Click next, then in the first of the number fields. Click the place
you're insertion mark, then click on cell A2. Hold down the shift key and click on cell A31.
This will put the range A2, colon A31 into your function. Click okay and get your number.
Now, the other function, VARP and the P is for population. The procedure is absolutely identical.
It'll just produce a slightly smaller number because with the population, obviously you don't
have any kind of sampling error. Then standard deviation. This is strongly related to variance
since it's the square root of variance. Again, you have four possible functions.
I will recover the two common ones. There's one for sample, one for a population.
So, the standard deviation can be considered in some sense a measurement of the average deviation
of each measurement from the mean. So, STDEV is standard deviation and this is the function to use
if you're measuring the standard deviation of a sample. Then there's STDEVP, which is the function
to measure the standard deviation for a population. The procedure for all of these is absolutely
identical. You go to the function wizard, you find the function, you click on the cells to
select them and then that gets inserted into the function box and then you click okay and you get
out again. Okay, other measures. A few other measures that you might want to look at for descriptive
statistics. I just want to bring in a couple here. Min and max, minimum, maximum. So, exactly the same
usage that we talked about, how to get them. Go to the function wizard, find the function and
click on the appropriate cells and insert the cell information and click okay and you get your number.
Now, what are the lessons learned from this? There's a few of them. First, a very useful trick is to
use the paste-only number trick to convert the contents of a cell or range of cells into the
resulting numbers. Now, recall that earlier in our series, we emphasize the difference between
the contents of a cell, frequently a formula or cell address if you are a skilled builder of
spreadsheets and the visible results, which are generally a number. If you ever want to get
just the number and lose the underlying formulas, this is how it's done.
Second, lesson learned. Use some of these functions. It helps to have a little background in the
theory, such as why samples and populations are treated differently or why there are three
different ways of measuring the mean. Now, going any further here is beyond the scope of these
tutorials. It is left as an exercise for you, the listener. But, you know, if you really want
no more about this, take a course in statistics. I'm sure there are lots of colleges around that
would be happy to give you one. In fact, actually, these days with these, multiple massively online
open courses, you can probably get one for free on the web. Descriptive statistics simply describe
what we see in a group of numbers. There is a branch of statistics that goes further and it is
called inferential statistics. In the next tutorial, we'll look at a few of the more common
inferential statistics functions. Now, one big takeaway from this lesson is that all of these
functions are used in similar ways. The procedures for using a function are very standardized,
so the key is not figuring out the mechanics. It is understanding which function to use and why
that is the correct function. The single most common error I see is people using the wrong function
because they don't understand why they should be using one function rather than another.
And in closing, let me just say that I have built a spreadsheet that has
examples of the things that we've been talking about. You are welcome to download it and take a
look at it. You can take a look at the functions and see exactly how they were entered and what
they look like. All of that is in the show notes. This is Ahuka, as always, signing off by reminding
you to support free software. Bye-bye.
You've been listening to Heccupublic Radio at HeccupublicRadio.org. We are a community podcast
network that releases shows every weekday Monday through Friday. Today's show, like all our shows,
was contributed by an HBR listener like yourself. If you ever thought of recording a podcast
and click on our contributing to find out how easy it really is. Heccupublic Radio was found
by the digital dog pound and the infonomican computer club and is part of the binary revolution
at binwreff.com. If you have comments on today's show, please email the host directly, leave a comment
on the website or record a follow-up episode yourself. Unless otherwise status, today's show is
released on the creative comments, attribution, share a light, 3.0 license.