Episode: 1605
Title: HPR1605: 38 - LibreOffice Calc - simple Descriptive Statistics
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1605/hpr1605.mp3
Transcribed: 2025-10-18 05:43:56
---

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello, this is Ahuka, welcoming you to Hacker Public Radio, and another in our series of tutorials, I guess you would call it, on LibreOffice Calc. What I want to talk about today is the functions that are available to deal with simple descriptive statistics. We started our look at functions by looking at all of the financial functions, a very important area, but I think there's a lot of good stuff here as well.

So, what are we talking about? In statistics, there are generally speaking two types of analysis, broken down between descriptive and inferential statistics. The difference has to do with what claims you are making about the data. If you are simply stating something about the data, for instance, there were more men than women in the sample, that is descriptive. If you are taking a look at an entire population and measuring things like the standard deviation, that would be descriptive. But if you're making a claim that something is not likely to occur by chance, or that something is statistically significant (both of those statements are essentially the same thing), then you are in the realm of inferential statistics. If you take a sample and do a measurement in the sample and then say, based on that, this is what I think the population is like, that is inferential.

So, Calc has functions to do both kinds of analysis, and this tutorial is going to focus on some of the common descriptive statistics in Calc and how they are used. Calc offers many statistical functions, of course, that you might want to make use of.
These let you get some analytics on data that you have. But you need to have some data to start with. And to do that, I'm going to make use of the random function to make up some numbers.

In cells A1 and B1, I set up a header by clicking Merge and Center, making the font Arial 12, bold, and giving it a colored background through the Format Cells option. Then I select cells C1 and D1, and merge and center with the same formatting. So, for columns A and B, I give the column the name Data, and for the combined C1 and D1, I enter Statistics. In cell A2, I use the RAND function, found in the Mathematical category, which gives me a random number between 0 and 1. I then multiply by 100 to get numbers a little bigger, and I click and drag down the column to get 30 numbers.

When you start working, you will encounter an interesting problem, which is that the random numbers keep changing when you change other cells. What is happening here is that the formulas are recalculating every time the sheet recalculates. But we can use a trick to get around this. Highlight all of the cells with the numbers, and select Copy. Then, with the same area highlighted, paste it onto itself by right-clicking, selecting Paste Only, and then selecting Number. This takes the result of the function and turns it into a raw number, which replaces the function in each cell. Now you have your data locked in, and you're ready to do some statistics.

The first thing we'll look at is measures of central tendency. In statistics, we distinguish between several different measures of central tendency. Essentially, this is an attempt to answer the question, what does the most representative member of this group look like? There is more than one answer, depending on the data. First is the question, does the data represent a qualitative or a quantitative variable? Yes, that question does keep coming up. For quantitative data, there are several likely answers.
One is the average, also referred to in statistics as the mean, and another is the median. Which one do you use? Basically, it comes down to how symmetric the distribution is. When it's symmetric, the two measures will be very close. When it is skewed, they will diverge a lot. For instance, if you have a group of 10 people in a room and ask what the typical person's wealth is, well, if all 10 people are reasonably normal, average people, you could just use the average and get a good answer. But if one of those people is Bill Gates, you're going to get an extremely high and unrepresentative number. In that case, you should use the median, which divides the sample into two groups and asks, where is the boundary between the top 50 percent and the bottom 50 percent?

So for average, go to the Function Wizard, select Statistical as your category, and AVERAGE as the function. Click Next. The window for putting in arguments opens with a space for each number. You could enter each number one at a time, one per field, but that is not optimal. Instead, click on the field for the first number, then click on cell A2. You will see that the field now has that cell address. But now, hold down the Shift key and click on cell A31, which selects the whole column of numbers. Now the formula will read =AVERAGE(A2:A31). That gives you the whole range of numbers, and when you click OK, you will get the average of this group of numbers. In this case, we generated data using the RAND function, multiplied by 100, which should mean random numbers between 0 and 100. So you should not be surprised if your answer comes reasonably close to 50. Mine came out to 49, but your number could be slightly different.

Now the average, also called the arithmetic mean, is calculated by adding up the measurements and dividing by the number of measurements.
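
The room-of-ten-people example can be tried outside the spreadsheet. What follows is a minimal Python sketch, not something from the episode: the wealth figures are invented for illustration, and `statistics.mean` and `statistics.median` from the Python standard library stand in for Calc's AVERAGE and MEDIAN.

```python
from statistics import mean, median

# Hypothetical wealth figures (in dollars) for ten ordinary people.
typical_room = [40_000, 55_000, 60_000, 62_000, 70_000,
                75_000, 80_000, 90_000, 95_000, 110_000]

# The same room, with the wealthiest person swapped for a hypothetical
# billionaire worth 100 billion dollars.
skewed_room = typical_room[:-1] + [100_000_000_000]

for label, room in (("typical", typical_room), ("skewed", skewed_room)):
    # mean ~ AVERAGE in Calc, median ~ MEDIAN in Calc
    print(label, "mean:", mean(room), "median:", median(room))
```

In the typical room the mean and the median land close together. In the skewed room the single extreme value drags the mean into the billions, while the median does not move at all, because the boundary between the lower five and upper five people is unchanged.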
The geometric mean is calculated by multiplying the numbers together, and then taking the n-th root, where n is the number of measurements. The function is the GEOMEAN function, and you use it just like the AVERAGE function. In other words, select the data, paste it in, yada yada.

The harmonic mean is fairly complicated to describe, but it's used in scientific applications. The three types of mean, arithmetic, geometric, and harmonic, are called the Pythagorean means. The harmonic mean is used, for example, in evaluating computer algorithms. It is the HARMEAN function, and again usage is exactly the same. The rule of thumb with this is that of the three measures, the arithmetic mean is always the largest, the geometric mean is in the middle, and the harmonic mean is the lowest.

Now, median. As above, go to the statistical functions, but this time select MEDIAN. As above, select the range for the first field. Now this may be a little farther from the middle, depending on your numbers. Mine was 42, again with random numbers, but the median is what divides the group into two equal halves, you know, an upper and a lower 50%.

Now, mode. This is what we use with qualitative data, and the mode is the most common value. All right, so for example, let's say you had a sample of people where you recorded their hair color as black, blonde, or red-headed, and wanted to know what was the most representative. Now, in a case like this, it makes no sense to use something like average. There's no way of doing that. Mode looks for the most common, so, you know, which of those groups had the most people. Now, what you need to watch out for here is that Calc works with numerical data, so you need to encode the data with something like: the number one means black, the number two means blonde, and the number three means red-head. Do that, and you can put a range of data in the MODE function and get your answer.
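
As a rough illustration of the three Pythagorean means and of the coded-category approach to the mode, here is a short Python sketch. It is not from the episode: the sample values and the hair-color list are made up, and the standard-library functions `geometric_mean`, `harmonic_mean`, and `mode` are used as stand-ins for Calc's GEOMEAN, HARMEAN, and MODE. Note that the ordering of the three means holds for positive numbers, and they coincide only when every value is the same.

```python
from statistics import mean, geometric_mean, harmonic_mean, mode

# Made-up positive measurements, standing in for the spreadsheet column.
data = [12.0, 35.5, 48.2, 51.0, 77.3, 90.1]

am = mean(data)            # arithmetic mean: sum divided by n
gm = geometric_mean(data)  # geometric mean: n-th root of the product
hm = harmonic_mean(data)   # harmonic mean: n divided by the sum of reciprocals
print("arithmetic:", am, "geometric:", gm, "harmonic:", hm)

# Qualitative data coded as numbers, as described for the MODE function.
# The coding scheme is arbitrary; the numbers are labels, not quantities.
HAIR_CODES = {1: "black", 2: "blonde", 3: "red"}
hair = [1, 2, 1, 3, 1, 2, 2, 1]  # hypothetical sample of eight people

most_common = mode(hair)
print("most common hair color:", HAIR_CODES[most_common])
```

Taking `mean(hair)` would also return a number, but since the codes are only labels, that number would say nothing about hair color.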
Now, because you've coded it, you could take this and put it into an arithmetic mean function, and it would calculate something, but the answer would be totally meaningless. Do not do this. All right, so yes, you need to code the data with numbers, but understand and use the right function. If you understand that all qualitative data needs something like mode, and that you can't use mean, that'll keep you on the path of virtue and righteousness.

Now, the other thing that we typically use in descriptive statistics is what we call measures of dispersion, which tell you how much variation there is in a group of numbers. Two different groups of numbers could have a very similar average or median, and yet be very different when you look at the degree of variation. For example, the number four and the number six, if you average them together, you get five. Well, you could also average the number zero and the number ten and get five, and it's very clear that those are two different groups of numbers. So to address this, we need to look at a few related measures of dispersion.

But to use these functions, we need to first discuss the difference between a population and a sample. A population means you have the entire group measured, while a sample means you have some fraction of the group measured, and you want to use that to make a claim about the population. For instance, in my past life, I worked for a company that did political polling, and so I could say I expect, you know, 100 million voters in the United States to vote in a presidential election, but when I do a poll, I might talk to one or two thousand of them. All right, so the sample was one thousand, two thousand, something on that order. The population was the one hundred million. Now this matters. I'm not going to go into it in huge detail. There are reasons why these statistical measures are slightly different. So just know which one you're dealing with, okay, so that you pick the right thing.

So, variance.
If you want a short description of variance, it is the average of the squared deviations from the mean. You have four possible functions here. Two of them are somewhat specialized, and I'll skip over them for now. So the ones I will talk about are VAR and VARP.

Now, the VAR function. This assumes that you have a sample and will produce a slightly larger number on that account because of the assumed sampling error. Go to the Function Wizard, to Statistical, and select the VAR function. Click Next, then click in the first of the number fields to place your insertion mark, then click on cell A2. Hold down the Shift key and click on cell A31. This will put the range A2:A31 into your function. Click OK and get your number. Now, the other function is VARP, and the P is for population. The procedure is absolutely identical. It'll just produce a slightly smaller number, because with the population, obviously, you don't have any kind of sampling error.

Then standard deviation. This is strongly related to variance, since it's the square root of the variance. Again, you have four possible functions. I will cover the two common ones. There's one for a sample and one for a population. So, the standard deviation can be considered in some sense a measurement of the average deviation of each measurement from the mean. STDEV is standard deviation, and this is the function to use if you're measuring the standard deviation of a sample. Then there's STDEVP, which is the function to measure the standard deviation for a population. The procedure for all of these is absolutely identical. You go to the Function Wizard, you find the function, you click on the cells to select them, and that gets inserted into the function box. Then you click OK and you get out again.

Okay, other measures. There are a few other measures that you might want to look at for descriptive statistics. I just want to bring in a couple here. MIN and MAX, minimum and maximum.
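
The sample-versus-population distinction, together with the minimum and maximum, can be sketched in Python using the two small groups mentioned earlier (4 and 6, and 0 and 10). This is an illustrative sketch rather than anything from the episode: the `describe` helper is a hypothetical name, and the standard-library functions `variance`, `pvariance`, `stdev`, and `pstdev` are assumed here to correspond to Calc's VAR, VARP, STDEV, and STDEVP, in that the sample versions divide by n - 1 and the population versions divide by n.

```python
from statistics import variance, pvariance, stdev, pstdev

def describe(values):
    """Return dispersion measures, keyed by the Calc function each one mirrors."""
    return {
        "VAR": variance(values),     # sample variance: divides by n - 1
        "VARP": pvariance(values),   # population variance: divides by n
        "STDEV": stdev(values),      # square root of the sample variance
        "STDEVP": pstdev(values),    # square root of the population variance
        "MIN": min(values),
        "MAX": max(values),
    }

# Both groups average to 5, but they are spread out very differently.
narrow = [4, 6]
wide = [0, 10]

for name, values in (("narrow", narrow), ("wide", wide)):
    print(name, describe(values))
```

Both groups have the same mean, yet every dispersion measure is larger for the second group, and in each group the sample figure is larger than the population figure.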
So, it's exactly the same usage that we talked about for how to get them. Go to the Function Wizard, find the function, click on the appropriate cells to insert the cell information, click OK, and you get your number.

Now, what are the lessons learned from this? There are a few of them. First, a very useful trick is to use the Paste Only, Number trick to convert the contents of a cell or range of cells into the resulting numbers. Now, recall that earlier in our series, we emphasized the difference between the contents of a cell, frequently a formula or cell address if you are a skilled builder of spreadsheets, and the visible results, which are generally a number. If you ever want to get just the number and lose the underlying formulas, this is how it's done.

Second lesson learned: to use some of these functions, it helps to have a little background in the theory, such as why samples and populations are treated differently, or why there are three different ways of measuring the mean. Now, going any further here is beyond the scope of these tutorials. It is left as an exercise for you, the listener. But, you know, if you really want to know more about this, take a course in statistics. I'm sure there are lots of colleges around that would be happy to give you one. In fact, these days, with these massive open online courses, you can probably get one for free on the web.

Descriptive statistics simply describe what we see in a group of numbers. There is a branch of statistics that goes further, and it is called inferential statistics. In the next tutorial, we'll look at a few of the more common inferential statistics functions.

Now, one big takeaway from this lesson is that all of these functions are used in similar ways. The procedures for using a function are very standardized, so the key is not figuring out the mechanics. It is understanding which function to use and why that is the correct function.
The single most common error I see is people using the wrong function because they don't understand why they should be using one function rather than another.

And in closing, let me just say that I have built a spreadsheet that has examples of the things that we've been talking about. You are welcome to download it and take a look at it. You can take a look at the functions and see exactly how they were entered and what they look like. All of that is in the show notes. This is Ahuka, as always, signing off by reminding you to support free software. Bye-bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons, Attribution, ShareAlike, 3.0 license.