Files

390 lines
24 KiB
Plaintext
Raw Permalink Normal View History

Episode: 14
Title: HPR0014: Databases 101 Part 2
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0014/hpr0014.mp3
Transcribed: 2025-10-07 10:19:26
---
In the next video...
It's the time
Another companionship
Hello world and welcome to After Public Radio.
I am Stank Doug and on today's episode we are going to continue our in-depth series, our
many series on databases, talking about from the ground up how they work, the fundamentals
and basics and working our way up to some more advanced topics.
This is episode two in the series so if you are just joining us I highly suggest you
go back and find episode one.
You can find that at hackerpublicradio.org and you can check to see that and a lot of
other great episodes on great topics but today we are going to continue with some relational
database management system topics and the first episode we did talk about databases in
general and touched on the idea of relational databases but we didn't get too much in
the terminology.
I wanted to take a break and let the first part of that kind of sink in and make sure we
are all on the same page and today we are going to talk about specifically our DBMS systems
or relational database management systems which is the most common and most prevalent type
of database management system out there.
The relational database management system is based on something called a relational model.
Obviously that is where the name comes from but what is a relational model?
One of the things we are going to do in this series is try to break down a lot of this
terminology.
I do think sometimes it is silly and overkill to make up all these puzzle words and terminology
to try to explain something that really can be much simpler.
What they mean by a relational database management system and a relational model is simply that
your data inside of this management system has relationships between each other.
Now if you have just a text file as we talked about in the first episode of this series,
if you just have a text file you don't really need a database system.
If you just have a list of names and phone numbers or something very basic but as soon
you add any level of complexity in there that you think that you are going to need something
to maintain large volumes and numbers of data or a large amount of data, that is when
you are going to want some sort of database management system and again relational database
management system being the most common.
The easiest way probably to visualize how data is stored in a relational database management
system which we are going to refer to as RDBMS from now on, envision a spreadsheet.
Any data stored in a relational or an RDBMS is stored in a two-dimensional array, aka
a spreadsheet.
Now, not a spreadsheet like Microsoft Office Excel or OpenOffice.org or any of those
type things.
It is easy to visualize that way but it is not an application.
It is a two-dimensional array.
That is to say that there are a certain number of rows and a certain number of columns and
they make up something called a table.
A table is the primary way that data is stored in an RDBMS so you can have a table that
is a list of people's name and phone number.
For example, again, you probably wouldn't use it if you are using a simple example like
that but if you are using something a little bit more complex like say you had an online
website that was taking orders.
Well, if you think about that, you need a lot of information to take orders.
You need people's name and address.
You need a lot of their billing information.
You need a catalog with what you are selling and the prices of those items, maybe descriptions,
pictures, all kinds of data.
If we go back to what we talked about in episode one, you could theoretically put all that
in one big line in a text file somewhere but as soon as you reach any quantity of records,
it is going to get confusing.
There is going to be a lot of repetition.
If somebody comes in and places five orders, they are going to have five records with the
same information all of it.
It is a lot of wasted space, inefficient accessing of the data.
That is when an RDBMS is going to come in.
What you do in an RDBMS is break down your data.
There is something called normal form, normalization and what you are going to do in the normalization
process is you are going to sit down and analyze the data that you have and you are going
to break it down into a series of tables or spreadsheets if you want to think about that
anyway.
You can even sit down and I do this all the time for a lot of the projects that we work
on in Binrib, sit down even on a napkin or a piece of paper or there are plenty of tools,
gooey tools that you can use to do this on a computer.
Sit down and kind of design and think about all the different data that you are going
to have and going to be tracking.
You have to do a lot of planning and preparation, it is not something you kind of do off the
cup.
You actually have to sit down and do some analysis and decide what data you are going to
need in your particular system and then how can you break that data up logically and
that process is called normalization.
Normalization in layman's terms is to break down your data so that there is no repetition
that data all works together and cooperates and is reliable and makes it easy to maintain
and read.
Now there are different types of normalization, different levels I should say of normalization
that goes up to fifth and sometimes people say six normal form.
We are not going to emphasize that now we might come back to that later on in another episode
but for now just think that you want to get your data spread out in such a way that let's
go back to the orders example that I used earlier.
You don't want one big text file with all of that data stored on there.
What you want to do is break that up into smaller logical segments so that it is easier
to maintain.
In one table or one text file or however you is easier for you to envision it, you want
to break that data into one entity.
You do not want to put all that in one entity.
You want to break that up so that it is in several different tables or entities logically
broken down.
For example, if you are running an online store, one bit of data that you are going to
have to have is your product line and information about it.
You are going to have a product name, maybe the size, the color, the price, things like
that.
Anything that has to do with that product, you probably want to separate that off into
its own database or excuse me into its own table inside of your database.
That way as you add new products to it, they only get added to this one table.
It is easily updated.
There is no confusion.
It is all in one place and one logical format.
If you ever wanted to do a report on a list of all the items that you have for sale, you
could do that very easily by just putting the data out of this table.
That data really is somewhat irrelevant of the other parts of your system.
Another part of your system of course would be the names of people who place orders.
Your customer list would probably be a separate entity because the customers are independent
of the products.
You probably would have a separate table when someone comes to your website and decides
to order something, they are going to have to register and give you their name, address,
phone number possibly, maybe credit card number could go in there.
Again, I am just making this up as we go along.
You would have to do an analysis on your own system to decide what best fits for you.
Then you might have a third one, a third table for the actual order of some sales.
This table one and table two is a list of products and a list of customers respectively,
but how do those two go together?
This customer may order items one, five, and seven, but this next customer may order
items one, two, and four, and how do you make those things work together?
Well you probably would have a third table when a customer places an order that ties those
two things together.
You see that these tables while independent in nature by the nature of their data, customer
table, product table, you can see that those are independent by the logical breakdown of
their data.
You see that they actually do have relationships and there is where that word comes back
into play.
They do have relationships between those tables and that is where you are also going to sit
down in your analysis and decide how you are going to store that data.
I guess to put it in a question form to make you think about it, what would you do or
how would you find out if customer number four, what items did he order?
Well it is not quite so simple as pulling only the ones where that customer's name matches
the order.
He may have multiple orders.
You got to be careful that someone else in there may not have ordered the same thing
that you are pulling the right data.
So all those relationships are important in the design and layout of your database.
I know this is kind of tricky to visualize via audio only on a radio show, but I am
trying to break this down into the most simplistic terms that I can.
So what you should be envisioning now in front of you are three different spreadsheets in
the example I gave you.
A three different tables is the proper term, table of customers, table of products, and
then a table, third table that ties those together called orders.
So when a customer comes to the site, the site is going to pull from the products table
and give them a list of products as they page through them.
You can imagine Amazon and all the products that they have is pulling out of those products
tables and displaying them on the screen.
If you register, it's going to create a record for you in the customer's table.
And if you ever place an order, it's going to insert a record then into your orders table
or their orders table.
Now, that's very simplistic because a whole lot going on besides that, but you can see
that process.
And when you write that order in there, it's going to have your name, or probably some
sort of numeric reference to your name to make it a little bit quicker.
For ease of thought here, it's going to have your name, and then it's going to have the
product name that you bought.
Now it doesn't have to have the price, or the color, or the size, or any of that necessarily,
depending on how you lay it out of size as an option, you might have to put that in there.
But actually, by putting in just the name of the customer and the name of the product,
it's tying the data from those two together and it can read those tables to pull the rest
of the data.
So it knows what color you ordered.
It knows what size you ordered because it pulls it from that table.
And those are the relationships in a relational database management system.
So I've used the word table.
Let's break down and analyze what exactly is a table.
I described as a spreadsheet earlier.
So if you've used Excel or any kind of spreadsheet application, you know across the top, you
have several columns.
And each one of those columns, I think by default, most spreadsheet applications will
say A, B, C, D, E, F, et cetera, et cetera.
Well, you have all the columns at the top and on the left, you have rows.
And that's just simply how many rows of data that you have.
So you may have a list of five customers.
If you want to visualize, imagine across the top, what are you going to put in column A?
Customer name, for example.
And column B might be street address.
Column three might be state four, city five, zip code, et cetera, et cetera.
So each one of those columns has a different type of data in it.
And imagine it, or even if you want to while we're listening to this type then, so you
can visualize it.
But you see that each one of those columns has a different type of data in it.
You don't want to mix in that.
You don't want to on your next row.
Oh, well, I'll put the address in column A this time.
And the name over in column C, logically, who does that when they put in a spreadsheet?
You have it all nice and neat, so it's readable and easy to understand the data that's
in there.
Well, the same holds true when you're putting data into a database.
So as a matter of fact, that's one of the benefits of a database to keep everything
nice and organized.
So you'll have a customer table.
And each one of those columns that we refer to in a spreadsheet, the proper name and
database terminology for that, those are your field names.
You're going to call a unique field, for example, customer name, or C name, or whatever
you want to call it.
Just plain old name, whatever you call it.
As long as it's unique, you can call those fields, whatever you want.
You can have as many fields as you deem necessary.
Again, the relational database management systems have, there is some theoretical limit, but
it's so ridiculous that nobody probably listening to this show is ever going to run into a situation
where they don't have enough room or the database management system doesn't allow them to
have enough fields.
So you're going to create a bunch of fields and you're going to define those field data
types.
For example, if the field you decide to call it customer name, or customer or underscore
name, or something like that, you're going to define it as a data type of what kind of
data is alphanumeric, which is, and I'm not going to go into the terms, but there's different
ways that you define that.
You say, okay, well, this is going to be an alphanumeric field, and it's going to be a
length of, I don't know, 50, or however long you think the longest name you might run into.
Always error on the side of caution, and there are some database tricks and data types
that you can use that are variable length.
We'll come back to that in a later episode, but for now, just understand you're going
to define the data type.
One of the other fields we mentioned is price, for example, well, price is a numeric
data type.
You're never going to put in a BC for a price, it doesn't make sense.
So you probably define that field as a numeric type.
So you put all that in there, just like you would in the spreadsheet, but you're going
to put these into a table table, you're going to define all those data types and the names
of them, and then every time you put a record into that data, a record is equivalent to saying
a row in your spreadsheet visualization.
So you're thinking of a spreadsheet, and let's say you're manually taking orders over
the phone and in a spreadsheet, and in Microsoft Excel, you type in all that information every
time someone takes an order, what a nightmare that could be.
You type in one row of information about somebody that is a record in database terms.
That is one row across filling in all, or most of the fields that you define at the top
of the name, the address, the phone number, et cetera.
And certain information, of course, is required.
You can't process someone's order if they don't give you a credit card number, for example,
or if they don't give you a shipping address.
So all these things factor in there as well.
We're going to come back to some database design, probably in a later episode as well.
But right now, you should have, visually, in your mind, several spreadsheets, aka tables,
and we're going to, from now on, get in the habit of calling them tables, because that
is the proper name.
You can visually think a spreadsheet, but there are two dimensional arrays called tables.
The tables have all of your fields defined in them, which again is across the top for
a visualization aspect.
And then every time you have a complete row of data in there, a rows are referred to as
records.
So this is all your data stored in there.
So let's imagine you've been a business for a month, and you have a database filled
with at least those three tables I mentioned, a customer's table, a product's table, and
an order's table.
So as you add new products to your store, all you have to do is add that product into
your product's table.
As new customers come in, it's going to add a new row to your customer table, and it's
going to continually grow, and this is where relational database management system shine
is the more data you put in there, they can still read and write that data very efficiently.
So that's one of the good benefits of our GVMS.
The other thing we probably want to talk about on this show is we mentioned these tables
and having relationships together, well, I'm going to leave, this is kind of a very deep
topic, but it's critical, it is the most important thing in understanding databases and relational
databases is the concept of keys.
We talked about these tables having relationships between each other, but how we kind of glossed
over that and intentionally so, well, that database has to relate to each other somehow,
and it has to be in design in such a way that you can look data up between the two of them.
If the data does not have relations between them, if they do not have relationships between
them, then again, you might not need a database management system or you might not need these
all together in the same database itself.
The whole point is to have relationships between that data.
So again, we go back to the products table and a customer's table, well, how do you tie
those two together?
Well each time you insert a table into a database, you want to, you don't have to, but you
want to, to take advantage and properly use databases, you want to define some unique
identifier for every table.
And that's what normalization that we refer to at the beginning, normalization, you don't
want to repeat any data anywhere.
You want to try to make every piece of data only once in the database because ASSafe space
be, you don't want to have to update multiple locations anytime something changes.
What if the color of one of your products changes?
If you've got it just in your products table, there's only one place that you can change
it.
But if it's also over here in the order table or in some other descriptions tables, you've
got this independent and not tied back to that, you can be updating data in a bunch of
different places and you don't have data integrity anymore, you don't have your data maintained,
you may forget to change it in a bunch of other places.
So what you want to do in these tables is logically break them down in such a way where you
define something called a primary key.
And a primary key on a, on let's say our customer database, for example, it could be a name
and we're going to go with name for right now, a name, some unique way to look up information
in that table.
In that table, if I've got a table with 500 people in there, I only want to find one particular
customer.
How do I do that?
Well, in the last, in the first episode, we talked about if you had a text file and all
that data in there and you have to go searching through it to find one unique name, one person,
it's going to be difficult.
You don't have to parse through that entire thing and all this extraneous data that you really
don't need.
Well, in an RDBMS, the data is stored in such a way that you can search much quicker and
much more efficiently, you define a primary key, which is indexed so that when you go looking
for something, it's going to go right to that data infinitely so, so much faster than
writing directly to a text file or some other format.
So if you only wanted to look up, I don't know, David Letterman, just see what he's ordered
from your site.
I don't know why that name popped into my head, but it did.
So you go in there and you search for that name.
It's a unique ID and the example that we're using here.
And you can say, okay, there he is.
Here's his address, phone number, so you went straight to him.
Believe it or not, that's actually a very bad idea.
But if your actual user that you were looking up was John Smith or some common name, you
could have multiple John Smiths with different addresses, they're completely different people.
So when you define your primary key, you have to be very careful, you have to come up with
some unique identifier to be able to tell that.
Back to your products table, for example, you might not be able to use a product name
as your primary key.
But if the name product name was antivirus, well, there's several different antivirus
software packages.
How are you going to tell one from another?
You're going to have to come up with either a longer, more unique name every time, or
more than likely what's going to happen and what most places do is they'll use a numeric
system, whether it's auto numbers, starting at one, to uniquely identify every product
in there.
Maybe assign some sort of code number, maybe something that's stored in a third-party
database, so AV1 for the first antivirus product, AV2 for the second antivirus product.
Over in the customer's table, think of some unique identifier that you could use to uniquely
identify a person.
The most common one that we all know is probably social security number, and there's a lot
of controversy over using that number.
It is used very frequently as unique identifier because theoretically, it's unique to every
person in the United States.
They all have a social security number, but that may not be safe to use, first of all.
If you're dealing with international, you might not have a social security number, so that
becomes a pitfall.
Really, what I'm getting at here is part of the analysis that you're going to do when
you lay your tables out is to come up with something unique that you can identify and
individual row.
You have to be able to do that in our DVMS.
You have to somehow come up with a way where you can find one row, one record of data in
there.
I think that's probably a good place to leave you thinking until we come back to another
episode and we'll follow up on this, but how would you break something down?
How would you uniquely identify a person?
That's probably the question I'm going to leave you with with this episode.
How would you uniquely identify with a person in any scenario or in any database?
How do you think and use it in reference to the application?
How does the United States government uniquely identify a person's social security number?
One of the first things that jumps to mind.
How does your state uniquely identify you in their database?
They could use the social security number, but technically, they should not.
What do you think they might use?
What about your school or your college or university or even high school?
How do you think they uniquely identify you?
They again could use the social security number, which is universally unique in the United
States, again, theoretically, but they shouldn't be using that either.
So what might they use?
What's different that could be unique for them to identify one individual person?
So that's a good question to leave you with this time.
I'll do a quick recap on some of the important terms and things that we brought up this episode.
We talked about RGBMS, relational database management system, and how they have relationships.
The key word in that is relationships.
Your data has to have relationships with the other data in your database.
Otherwise, there's no point in having it or no point in using it.
If you've just got one list of data, then there's no need in using an RGBMS.
But if you do decide that you're going that direction and you are going to need that,
then you're going to break your data down into logical clusters, remove repetition, don't
use reuse data.
Make sure the data exists only in one place as best you can, and this process is called
normalization.
And there are different levels of normalization that we did not go into that we'll hold
for another episode.
And the design of your database logically, think ahead of how you're going to uniquely
identify all of that data.
Sometimes you break down all of your data, but you're still going to have to add a little
bit more data to uniquely identify it.
In the example we talked about tonight, we said products.
Anti-virus may not be a good enough example.
Even, honestly, even Norton antivirus might not be good enough because there's different
versions of it.
So how would you uniquely identify it?
Would it be NAV1 for Norton antivirus version 1, NAV2 for version 2, et cetera, et cetera?
That's the kind of stuff you need to think about when you're designing your databases.
So I'm going to plant that in your head for this week and let you all think about it.
The question you should be having your head and think about when we start the next episode
in this many series will be how to uniquely identify a person, just as an exercise to
get you thinking for the next episode.
So thank you for your time this week, and we will see you again on an upcoming episode
of Hacker Public Radio.
Thank you for listening to Hacker Public Radio, HPR, sponsored by carrow.net, so head on
over to C-A-R-O-DOT-E-T for all your hopes for new Hacker Public Radio.