Files

161 lines
9.1 KiB
Plaintext
Raw Permalink Normal View History

Episode: 1635
Title: HPR1635: 41 - LibreOffice Calc - Data Manipulation 1: Sorting and AutoFilter
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1635/hpr1635.mp3
Transcribed: 2025-10-18 06:08:48
---
This episode of HBR is brought to you by AnanasThost.com.
Get 15% discount on all shared hosting with the offer code HBR15.
That's HBR15.
Better web hosting that's honest and fair at AnanasThost.com.
Hello, this is Ahuka, welcoming you to Hacker Public Radio.
Another exciting episode in our ongoing series on Libra Office Calc.
What I want to turn to now is the idea of how to do data manipulation in Calc.
I'm going to start with some of the simplest things that you can do sorting and auto filter.
Now Calc is not a database.
It can be used for some data analysis and manipulation.
When I worked for the finance department of a hospital, it was very common for the financial
analysts to get a data dump from a centralized system as a CSV file loaded up in a spreadsheet
and then slice and dice the data to get the answers they wanted.
Now, it is not anywhere near what you can do with a good relational database and a structured
query, but you can do some quick and dirty analysis here.
To illustrate this, I went looking for a data set and I found one from Vanderbilt University
Department of Biostatistics.
Now, in case you were not aware, there's a lot of data sets you can download on the internet,
including in many cases, government data, and it's all freely available and you are
welcome to download it and use it as always, you know, do check to see if there's any particular
license involved.
But in this case, I thought it was fun to get some biostatistics, so I grabbed a meningitis
data set, partly just to get something in the medical area into our discussion.
Now as it happens, it was larger than I needed, so I discarded a lot of the data to get
something a little more manageable.
I hid most of the variables and kept these.
Case number.
It's just an index.
It starts at one and it goes up.
Year.
Looks like a two-digit year in which the patient was observed and some of them are blank.
Month.
Looks like the month within the year when the patient was observed and some of them are
blank.
Age.
The age of the patient when they were observed and some are blank.
Case of patient.
Black versus white.
Some blank.
Sex of patient.
Male versus female.
Some are blank.
There were 581 records in this data set, so I kept those and just those six variables.
So what are the things that I want to do with this little bit of a data set?
Well, sorting is one of them.
You can put them in some type of order.
That's the simplest thing you can do with any data set.
You can sort the data and put it in order by a column.
So let's say you wanted to sort by the year when the patient was observed.
So you click on the B column, which has that variable, and go to the data menu, sort,
and you get a window that opens up.
And this is interesting because it gives you a couple of different options.
So the window that comes up says sort range.
The cells next to the current selection also contain data.
Do you want to extend the sort range?
And it says to range of cell A1 to F582, in other words, the entire database.
Or sort the currently selected range, B1 to B, and then it goes to 1,048,000, so basically
I've selected the entire column.
Then there's three buttons.
Extend selection is the first button, and that's selected by default.
Extend selection and cancel.
So what is this all about?
If you had clicked current selection, column B would be sorted all by itself, but the other
columns would not be sorted.
This used to be the default behavior of spreadsheets.
But we've learned that people do attempt these kinds of manipulations on data sets.
So now it is smart enough to check.
If you select extend selection, the first button, the one that's already highlighted by default,
it will sort based on what is in column B, but will sort the data set from cell A1 to
cell F582.
Now, in this case, that's what we want because each row is an entire record here.
So we want to keep each record intact.
So extend selection is what we want.
That brings up another window with sort criteria, and you can have several sort keys.
So I just had year as being sort key one, and I've got a little radio button there that
says I've selected ascending, so it'll start with the earliest year and go up.
So, if I had chosen month for the second sort key, the data would first be sorted by year,
then for all the data points with the same year, they would be sorted by month.
That's not bad.
It's useful to know that you can do this multiple sort kind of thing, and it handles it in a sensible
way.
Next thing I want to talk about here is something called auto-filter.
Now filters are an important tool for working with large data sets.
While they don't replace a relational database with structured queries, they do give you
some useful features.
There are three kinds of filters, all accessed through the data menu in Calc.
We will discuss auto-filter in this tutorial, then move on to the others in subsequent tutorials.
Go to the data menu, to filters, and then to auto-filter, and click to turn on.
You will see a drop-down appear on every column in the open sheet.
These drop-downs will give you an entry for each unique value in the column.
For a column like case number, this feature is useless since each row has a unique case
number.
But if we wanted to only look at cases from 1978, we could go to the year column, click
the drop-down, remove the check mark from all in the bottom of the drop-down, which deselects
everything, then put a check mark in 78 to select only that, and the result is to display
only those rows which have 78 in the year column.
If you look at the row numbers on the left, every time one or more rows was skipped, there
is a heavier line separating the row numbers.
Also for each column that has a filter applied, there will be a small black square in the lower
right corner of the drop-down arrow box.
This is helpful if you need to quickly see where the filters are applied, such as when
you need to remove one.
Now note, you do need to be careful in working with a filtered dataset.
If you forget the filtering you have applied, you may get a wrong idea about your data.
Be sure to always check your filters first when working with a dataset to make sure you
have not filtered out some of the cases you want to see.
Now, you can combine filters as well.
Leaving the filter on the year column, I can next go to the sex column and select female
in a similar manner.
This reduces the dataset even further to only those cases which occurred in 1978 and involved
a female.
If I wanted to go even further, I could add a race qualification and filter for black.
This would give me all cases in 1978 which involved a black female.
Now, you can remove an auto filter easily, just click the drop-down and put a checkmark
in all at the bottom that will remove that particular filter.
Other filters you might want to use are shown at the top of the drop-down window.
The first one is top 10.
This returns the rows with the 10 highest values in the dataset.
If there are no other filters and you have at least 10 rows, you should get 10 rows based
on this column.
But if you have other filters, you may get less.
For example, when I first filtered on race to select white and then selected top 10 for
age, I got back six records.
That this means that of the 10 highest ages in my dataset, only six were of white patients.
Second option you might see there is empty.
This selects only those rows that are empty for the column you are filtering and then
not empty as you might guess.
Selects only those rows that are not empty for the column you are filtering.
Now suppose you wanted to select cases in 1978 involving females who are over the age
of 40.
You could hit the drop-down for the age column, remove the checkmark in all, then put checkmarks
in for every value greater than 40.
This would work.
Then the result is six cases.
But there is a better way and that involves using what are called standard filters, which
is what we are going to get to next.
Now the sample spreadsheet that I created with the data and the simple filtering things
I have done is available.
Link is in the show notes and you can get it from my website.
So please feel free to download and check that out.
And that means that this is Ahuka signing off for Hacker Public Radio and reminding everyone
to support FreeSoftware.
Bye-bye.
You've been listening to Hacker Public Radio at Hacker Public Radio.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast and click on our contributing to find out
how easy it really is.
Hacker Public Radio was founded by the digital dog pound and the Infonomicon Computer Club
and is part of the binary revolution at binwreff.com.
If you have comments on today's show, please email the host
directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released on the Creative Commons'