hpr-knowledge-base/hpr_transcripts/hpr0005.txt

Episode: 5
Title: HPR0005: Database 101 Part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0005/hpr0005.mp3
Transcribed: 2025-10-07 10:12:22

---

Hello everybody, this is Spankdog and this is Hacker Public Radio.
On today's episode we're going to start a new series, a new in-depth series on databases.
We're going to start off with some very basic understanding of what databases are, some
basic terminology, and with each subsequent episode we are going to build on those fundamentals
and go into more detail as the year progresses.
So we're going to start off today talking about some very, very basic terminology because
it is very important that you understand some of the basic terms and exactly what a database
is.
It is, in concept, very simple, but there are some details
that a lot of people may not know or understand about databases as we
know them today.
First thing we really should define when we talk about databases is, well, the first word
or the first part of the word database is data.
So what exactly is data?
And it may sound like a silly question, but there is a common misconception: people
throw the word data around very loosely when they actually mean information,
and they are actually two different terms altogether.
Data and information are not necessarily the same thing, not usually the same thing.
Data, if you want to go by a textbook definition of data, data is that which is extracted from
a compilation of data in response to a specific need.
All right, well that's a little, okay, you can think about that for a second if you want
to.
My favorite definition is to say that data is, it's a collection of facts from which conclusions
may be drawn.
These are like those minuscule or insignificant little events, tiny details that you store,
like in the case of computers, for example, log file details, Apache logs, any kind of
log file details, the time stamps that are in there, any observations, anything that's
stored that's just this minuscule insignificant data that by itself doesn't really have a whole
lot of value.
That's what data is.
So if you go out, you can do a little research and look up data and information.
Be careful.
If you look up data, you're going to get a lot of Star Trek references, Data played
by Brent Spiner, but I digress.
So like here's an example of data.
Let's say I were to sit down at, I don't know, a mall or something with
a pen and paper and I logged details of every person that walked in, such as their
height, their gender, what kind of clothes they were wearing, what color their hair was,
things like that.
This is data, little bits of information that in and of themselves, okay, so what?
A guy with black hair that's five foot eight walked into the mall, that's not really that
big of, it's not really that useful information.
Unless you're looking for that particular guy, but I digress.
Now to make that leap from data, which is insignificant, unapplied material, we come
to information and again, people throw these two together, but they are two different things.
Information is really applied data.
Information is the result of processing, manipulating and organizing data in a way that adds to the
knowledge of the person receiving it, and that's a quote that I think is pretty
on the money.
It's basically, well, I kind of said it earlier, it's application of data, useful extracts.
For example, let's use what I just said earlier, I'm standing at the mall logging people
that walk in and out of the mall and their information on it, well, that may not be all
that useful individually, but let's say that I was doing some sort of market research,
that information could be useful to somebody who was, I don't know, maybe selling clothes
and wanted to know what the average height of most people is, you know, census-type
material.
When you actually analyze all the data and come up with averages, average heights, what
total percentage is male versus female, maybe you'd be surprised to find
out that 75% of people that come to the mall are males age 21 to 31, I don't know.
You would not know that unless you actually sit down and gather data and then analyze
said data.
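
To make the mall example concrete, here is a minimal Python sketch of turning raw observations into information. The field names (height_cm, gender, hair) and the four sample records are made up for illustration and are not from the episode; the point is only that the averages and percentages exist once the individual records are analyzed together.

```python
from statistics import mean

# Hypothetical observations logged at the mall entrance (raw data).
# Each record on its own says very little.
observations = [
    {"height_cm": 173, "gender": "male", "hair": "black"},
    {"height_cm": 165, "gender": "female", "hair": "brown"},
    {"height_cm": 180, "gender": "male", "hair": "blond"},
    {"height_cm": 158, "gender": "female", "hair": "black"},
]


def summarize(rows):
    """Turn individual observations into aggregate information."""
    total = len(rows)
    males = sum(1 for row in rows if row["gender"] == "male")
    return {
        "count": total,
        "average_height_cm": mean(row["height_cm"] for row in rows),
        "percent_male": 100 * males / total,
    }


summary = summarize(observations)
print(summary)
```

The summary dictionary is the "applied data": nothing in it can be read off any single record.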
To come back to something a little bit closer to home, probably for a lot of our listeners,
let's go back to Apache logs.
If you are looking through your Apache logs, you might find you're getting a lot of new
hits from a particular website, you know, if you see one hit in your log, it's no big
deal, but you notice a pattern or a certain percentage increase of something that people
are finding on your site, that becomes useful information and that's the difference between
the two terms.
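
The Apache log case can be sketched the same way. The snippet below assumes the logs use Apache's combined log format, where the referrer is the quoted field after the status code and response size; the sample lines, addresses and URLs are invented for illustration. It counts how many hits each referring page sent, which is the kind of pattern a single log line cannot show.

```python
import re
from collections import Counter

# Invented sample lines, assumed to be in Apache's combined log format.
SAMPLE_LOG = """\
203.0.113.5 - - [10/Oct/2007:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/links.html" "Mozilla/5.0"
203.0.113.9 - - [10/Oct/2007:13:56:01 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/links.html" "Mozilla/5.0"
198.51.100.7 - - [10/Oct/2007:13:57:12 -0700] "GET /about.html HTTP/1.1" 200 1024 "http://example.org/blog" "Mozilla/5.0"
198.51.100.8 - - [10/Oct/2007:13:58:40 -0700] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
203.0.113.20 - - [10/Oct/2007:13:59:05 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/links.html" "Mozilla/5.0"
"""

# Quoted request, status code, size, then the quoted referrer (captured).
REFERRER = re.compile(r'"[^"]*" \d{3} \S+ "([^"]*)"')


def count_referrers(log_text):
    """Count hits per referring page, skipping lines with no referrer ("-")."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REFERRER.search(line)
        if match and match.group(1) != "-":
            counts[match.group(1)] += 1
    return counts


referrers = count_referrers(SAMPLE_LOG)
print(referrers.most_common())
```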
So applied data is what I think is the best way to talk about information.
So now we've gotten that out of the way, the next question, of course, is where do you
store data?
Well, in a database, that's what we're talking about here.
So database is another term that can be thrown around very loosely because fundamentally
a database is a very simple thing.
A database is a very simple generic term that describes a collection of data.
That's it.
Collection of data, data again being those tiny little bits of material that you gather
over time that are logged, that are observed, whatever the case may be.
It can be a spreadsheet, a CSV (comma-separated value) file, even a text file, a
Word document.
It doesn't really matter.
You can have a Word document that has all of your favorite recipes in it or something
like that.
That's a database of recipes.
It could be a spreadsheet of your CD collection or DVD collection or something like that.
That is a database that is a collection of data that's compiled and stored in one place.
That is the most simple example of a database.
But that's not really the way most people use the word database.
When you think of databases, especially in large scale applications or websites or things
like that, it's not quite that simple.
To run any kind of application or even web applications, even whether it be a forum, content
management system, anywhere up to, I don't know, the DMV or the IRS are running huge databases.
They're not storing them in text files.
They're not storing them in Excel spreadsheets because there's limits on those things.
When it comes to programming, it's difficult to read and write to those files because there's
no organization.
You have a text file.
It's literally line after line after line of information.
If I have a text file with 10 lines of data, let's say I have 10 people coming
in and out of the mall and I logged their height and weight and level of attractiveness or
whatever the case may be.
Yeah, there's 10 records there.
I can look at that with my eyes.
I can parse through that data with my eyes and I may be able to pull out information
such as, hey, everybody that came in was less than six feet tall or more than six
feet tall.
It's easy and you can do it in your head.
But what happens when that text file or that list goes from 10 people to 100 people?
You still may be able to glance at it and notice some patterns, but it makes it a little
bit harder.
What about when that 100 jumps to 1,000 or 100,000 or millions?
And when you're talking about Apache logs and all the hits, you're talking about millions
of records on any decent size website.
When you talk about the Internal Revenue Service and government databases, you're talking
about millions upon billions of records of data.
So you've got these huge collections of data, but if you were to put all of those into
a text file, and let's go back again to my text file mall example, I log 10 people
coming into the mall and you tell me, okay, well, tell me who was the tallest person.
I can look at it with my eyes.
I can pick out, okay, I see the heights, that guy's the tallest.
This woman was the tallest, whatever the case may be.
If I had 1,000 people on that list and you asked me to do the same thing, well, that's
going to take me a little bit more time, isn't it?
I'm going to have to go through page by page.
I'm going to have to point to the screen and go, okay, right now this guy is six foot
one, and let me go on, there's nobody, oh, here's somebody six foot three, that's the tallest,
now I have to keep going and looking further, and by the time I've
looked through a thousand, it's taken a long time to get the information out of that data.
So you can imagine when you get into millions and you ask the question, who is the tallest
person, what is the average weight, things like that, it's not something you can do in your
head and it's a little bit trickier, and obviously that's where computers come in, they
can be very helpful with that.
Even there, there are also limitations: when you start talking about millions of records
of data, you have to have an efficient way to read that data.
I can have that text file for example, or a comma separated value file, and write a program
that will go through and find the highest or the tallest person based on the height that
I've recorded, the data that I have on people's heights.
Well, if I write a very simple program to read and write from a text file, which
is basic programming in any language, one of the things you learn in any basic programming
class, you'll realize that it's going to have to parse one record at a time. Starting
at the top, it's going to keep going through.
You can write maybe some algorithms to help it out, but your data has to be sorted and
there's a lot of other factors, but trying to find that proverbial needle in a haystack,
even with a computer program, is not efficient because you have to keep reading and keep reading
and store stuff and information, store data in working storage variables and in memory,
and then keep looking through the rest of the data, and you have to look at all one million
records, even though the second one, ironically, may have had the highest height or the information
that you want to use.
You still have to read all the rest of it, which is not the most efficient way to do that.
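
A minimal Python sketch of that "very simple program" is below, assuming a CSV file with hypothetical name and height_cm columns (the sample rows are invented). It keeps the best record seen so far in a variable and counts how many rows it had to read. The tallest person is deliberately the second record, yet the loop still reads every row, because nothing in an unsorted file says the rest can be skipped.

```python
import csv
import io

# Invented sample data standing in for the text file of mall visitors.
SAMPLE_CSV = """\
name,height_cm
Alice,165
Bob,191
Carol,172
Dave,180
"""


def tallest(csv_file):
    """Scan every record once, remembering the tallest person seen so far."""
    reader = csv.DictReader(csv_file)
    best = None      # working variable: (name, height) of the current tallest
    rows_read = 0
    for row in reader:
        rows_read += 1
        height = int(row["height_cm"])
        if best is None or height > best[1]:
            best = (row["name"], height)
    return best, rows_read


best, rows_read = tallest(io.StringIO(SAMPLE_CSV))
print(best, rows_read)
```

With four rows the cost is invisible; with a million rows the same loop reads a million records for every question asked.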
Well, this is where something called a relational database, or actually,
let's just say a database management system, comes into play.
A database management system helps organize all of that data to make collecting that information
from that data simpler and easier.
An example might be, let's see, maybe you wrote a backup software, backup system that
backs up your hard drive and writes it as a file name and automates the whole thing
and dates it and everything.
Something that would maintain a list of that data and that you could easily look up, okay,
here's the data, I want to go back to this backup file.
Earlier I mentioned having a CD collection. Anybody can put that into a spreadsheet of some kind,
but there are also applications out there that are custom designed to store a lot more information
about your CD collection, and you can look stuff up more quickly and easily because they have
something besides a text file behind them. They actually have database engines, database management
systems, to help you read and write that data. There are many different theories by which these
databases can operate and different methods of storing and accessing the data. The most common
type of database is what I just referred to a minute ago, and that is called an RDBMS, or
relational database management system. When most people say database these days, this is what
they're referring to.
I understand what I said earlier, a database is a very simple collection of data, fundamentally
that's all it is, but when people use the term database now and they say, oh, it's all
in the database, it's stored in the database blah, blah, blah, blah, blah, they're usually
talking about a relational database management system or some sort of database management
system.
Some examples of relational database management systems are Oracle, probably the biggest
one right now, and Microsoft SQL Server.
These are two of the big commercial products. DB2 is another one. Also included in
that are open source and other freely available databases like MySQL, Postgres (PostgreSQL),
and too many more to go into, but any time you hear somebody refer to a database,
they're usually referring to one of those.
Now what a relational database management system will do is basically take all of your
information, and we'll get into more detail on some of this in future episodes of the
series, but suffice it to say a relational database management system gives
you a lot of tools and a very powerful engine to store all of the data.
Again, we're using very simple examples, a list of people walking in and out of the
mall, but what if someone else in another state altogether has a bunch of information that
they've stored, and then you buy a database from another company and you want to merge
all that together and do some analysis to see if there's any useful information
in all that data that's been collected?
A relational database management system is a powerful program for maintaining that database
and will allow you to go in there and run queries. You've heard the word query before:
you're querying the database, or asking the database, literally, is what it means.
SQL is a programming language used to interface with databases and help pull back
information in a timely and efficient manner.
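
As a rough illustration of "asking the database", the sketch below uses SQLite through Python's built-in sqlite3 module. SQLite is not one of the systems named in the episode; it is swapped in here only because it needs no server. The visitors table, its columns and its rows are invented. The questions from the mall example become short SQL statements, and the engine decides how to fetch the answer.

```python
import sqlite3

# In-memory SQLite database (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visitors (name TEXT, height_cm INTEGER, gender TEXT)"
)
conn.executemany(
    "INSERT INTO visitors VALUES (?, ?, ?)",
    [
        ("Alice", 165, "female"),
        ("Bob", 191, "male"),
        ("Carol", 172, "female"),
        ("Dave", 180, "male"),
    ],
)

# "Who is the tallest person?"
tallest = conn.execute(
    "SELECT name, height_cm FROM visitors ORDER BY height_cm DESC LIMIT 1"
).fetchone()

# "What is the average height?"
average_height = conn.execute(
    "SELECT AVG(height_cm) FROM visitors"
).fetchone()[0]

print(tallest, average_height)
conn.close()
```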
Instead of, let's go back to what I said earlier about having a million records and you
ask me to find the highest height out of all of those.
Well, manually it would be tough to do.
If I wrote a generic little C program, command line or something like that, to find me the
highest one, it's going to have to read every single record of data, and if the second
record had the highest height, it still has to read all of the others,
and it's not efficient.
A database management system has a lot of functionality built in that will make it much
faster to read the same information because it's stored in a different format and it's
easier to read and access that data.
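
One example of that built-in functionality is an index. The sketch below again uses SQLite via Python's sqlite3 module as a stand-in for the systems named in the episode, with an invented visitors table and an invented index name, idx_height. It creates an index on the height column and then asks SQLite for its query plan. The exact wording of the plan differs between SQLite versions, so the code only prints it rather than relying on a particular format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors (name TEXT, height_cm INTEGER)")

# 1,000 invented rows with heights cycling from 150 to 199 cm.
conn.executemany(
    "INSERT INTO visitors VALUES (?, ?)",
    [(f"person{i}", 150 + i % 50) for i in range(1000)],
)

# The index keeps the heights in sorted order alongside the table.
conn.execute("CREATE INDEX idx_height ON visitors (height_cm)")

query = "SELECT MAX(height_cm) FROM visitors"

# Ask SQLite how it intends to run the query; the last column of each
# plan row is a human-readable description.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
max_height = conn.execute(query).fetchone()[0]

print(plan)
print(max_height)
conn.close()
```

With the index in place the plan is expected to mention idx_height, meaning the engine can look at the end of a sorted structure instead of reading all 1,000 rows.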
So that's probably a good place to stop with this episode.
We're going to go into more detail about how those things are stored, talk about some
concepts like indexes and foreign keys in general and some different ways of accessing
databases and probably some examples along the way.
But I think that's a good stopping point for today and hope that brought a lot of people
up to speed and cleared up a few misconceptions about database terminology because it's important
to understand those basics and those fundamentals, because a lot of people will use a database
and they don't realize why.
Don't blindly buy Oracle for an application you're using or force it because maybe you
learned Oracle in college or maybe you learned MySQL because of some open source app.
You really may not need it.
Sometimes it's perfectly fine to read and write from a text file or a comma separated
value file or an XML file.
Sometimes you don't need a big database engine.
Sometimes you may be using a text file when you should be using a big database engine
or some sort of database engine because it will make your program more efficient and
faster.
So understanding all that and keeping it in mind will help you make decisions in future
projects of whether you need a database, what type you may need, what size and if it's
really going to be worth your while to do so.
So tune in for future episodes in this mini-series.
You can always find those on hackerpublicradio.org and if you have any questions you can find the
contact information on the site and I look forward to seeing you guys in a future episode.
Thank you for listening to Hacker Public Radio. HPR is sponsored by caro.net, so head on over to
C-A-R-O dot N-E-T for all your personal needs.