Episode: 2270
Title: HPR2270: Managing tags on HPR episodes - 3
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2270/hpr2270.mp3
Transcribed: 2025-10-19 00:41:19

---

This is HPR episode 2,270 entitled Managing tags on HPR episodes - 3.
It is hosted by Dave Morris and is about 32 minutes long and carries an explicit flag.
The summary is: Looking for the best way to store and manage tags in the HPR database - part 3.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.

Hello everybody. Welcome to Hacker Public Radio. My name is Dave Morris and today I'm
finishing a mini-series I started a few weeks ago on the subject of managing tags on HPR
episodes. This is the third episode of that mini-series, so you'll probably be relieved to know
that it is the last one. I don't know. I haven't had much in the way of feedback. I've had a few
comments about help with database design and so forth, which is great. In the first show we looked
at the whole subject of tags and why we need them with HPR shows, and we looked at the
way that we're currently storing them and what's good and bad about that. Mostly bad. The major
drawbacks are with searching. It's the sort of design you might come up with
without really fully getting the idea of what databases are for and what they can do.
Hopefully this series will help to fill in some of those gaps, as much as I can do that anyway,
not being a database expert. In the second show we looked at how we could make a simple tags table
and query it and thereby get a better overall effect. In particular we defined some of the things we
wanted to be able to do with tags, partly based around one of droops's comments back in an episode
he did about the way to improve HPR. In this last show we are going to look at a more, I've
written here, rigorous, efficient, normalized solution. So I think this is the way that somebody
trained in database design would do things, and I've worked up to this over a couple of other
shows really just to air the whole subject of how one designs this type of thing for people
who are maybe not that experienced in this thing. I'm not that experienced but I have come
across this sort of thing in the past and implemented it. Now one of the drawbacks with the method
that we looked at last time was that our tag table had multiple instances of the same tag,
and it also doesn't conform to the accepted database design recommendations. So with those two
things, well, you could get away with it, but there's not really a good reason to do so as far as
I can see anyway. Although I did cite some instances of people who had implemented such things
in various services on the web, I don't think it's really the best way of doing things. In particular
the design in the last show doesn't really reflect the relationship between the HPR episodes
and the tags they're associated with, and this relationship is what's referred to as many-to-many.
And what this means is that a given episode may have many tags, not very surprisingly,
and a given tag may be associated with many episodes. Now Mike Ray covered this subject very well
in episode 1569 called Many-to-many data relationship howto. It covers the subject really,
really well, but I'm not sure that the audience was quite ready to hear such a thing.
I think we could definitely do with more database related shows for the sake of people who might
be wanting to get more into databases. Certainly I would recommend that you listen to Mike Ray's show
if you haven't done so already or if you've forgotten what he said. I've certainly listened and
read his wonderful notes since then because it does give a very good overview of how you would do
this type of thing. So in the many-to-many design, one copy of each tag would be held in a tags table
and there would be a second table which is linking or cross referencing the tags and episodes in
our particular case. I mean many-to-many is used in many contexts, but in the one that we're looking
at here we're talking about episodes and tags. So I drew a very simplistic diagram using
OpenOffice, LibreOffice Draw, which has its limitations for this sort of thing, but hopefully it gets
the message across. I've shown an example where we've got an episodes table, and we're imagining that
there's a show 1234, which I've just shown in the little box with ID 1234 in it.
Then that is associated with the tag banana, obviously a show about bananas, and the way it's done
is that there's a joining table that contains a record with an episode reference, ID 1234
in the episodes table, and it's got a tag reference, so imagine there's a tags
table where the tags have an index number and we have that number in the joining table. Then it shows
a picture of the tags table with tag ID 456. I wasn't very imaginative with these numbers,
and it's associated with the tag banana. So if you like looking at pictures of things
like this, you might find that useful. So what I did in order to demonstrate this, and just to prove
to myself that this was a viable thing to do and that I fully understood it, and this was some time
ago I wrote this, I used the comma separated list that we already have in the episodes table,
which is called eps for some reason, to save typing presumably, and I used the contents to
populate new tables that I created. I did the population with a Perl script, and I've included
the Perl script with this show really just for reference, for completeness. I will look at it
in brief a bit later on, but as part of this particular episode I've also included the
SQL definition of the tables, and it's a file that you can download if you're interested.
I call it normalized tags 2. I've also listed it in the long notes, and just in
brief, it contains a new table of tags, which I called tags2. That's because I already had a
table called tags for the other method; you would call it something else if you were doing this
for real. Remember this is experimentation. It contains single instances of tags. I did make it
hold the mixed case and lower case versions of the tag, but since I did that I discovered that in
MySQL and MariaDB it's easier just to store one, because you have to go
to some lengths to make it check the case of words and stuff. That wouldn't be the case with other
database systems, but it is with this. There's a joining table here which I called
eps_tags2_xref. The convention is, and this is something that Mike Ray mentioned as
well, that since it's joining two tables, it's good to give it a name that references the two
tables and then follow it with xref, meaning that it's a cross referencing table, so it's easy when
you come to look at a database to work out what the thing is from its name. This table contains
just two columns. One is called eps_id and the other one is called tags2_id. So eps_id is the key
of one of the rows in the eps table, so in other words it's an episode number because that's the key
to this table. tags2_id is the equivalent, a reference to a row in the tags2 table. One of the
key things that you need to do when creating such a cross reference or joining table is to
give it a unique index, an index being a structure which stores pointers to the rows in the table
effectively. In this particular case it combines the two columns to make the key and it's
defined as a unique index. That means that you cannot have a case where the same episode and the
same tag is repeated. It doesn't make any sense to do that, but since databases can control these
things it's a good idea to enforce it this way. So you can
imagine in this table there will be multiple instances of the episode number, because a given
episode is entirely likely to have multiple tags, and for each instance of that episode number
there will be a different tag number referencing into the tags2 table. I also created a
unique index on the tags2 table which just ensures that it's impossible to add the same tag twice
into that table, because otherwise the database would let you. Doing this means that you can't.
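
The table layout described above can be sketched roughly as follows. This is only an illustration, not the SQL file shipped with the show: it uses Python's built-in sqlite3 module in place of MariaDB so that it runs anywhere, and the column types, the index names and the helper functions `build` and `try_insert` are assumptions made for the example.

```python
import sqlite3

# SQLite stands in for MariaDB. Table names follow the episode; column
# types, index names and the sample rows are assumptions for illustration.
SCHEMA = """
CREATE TABLE eps (
    id    INTEGER PRIMARY KEY,   -- episode number
    title TEXT NOT NULL
);
CREATE TABLE tags2 (
    id  INTEGER PRIMARY KEY,
    tag TEXT NOT NULL
);
-- only one copy of each tag may be stored
CREATE UNIQUE INDEX tags2_tag ON tags2 (tag);
CREATE TABLE eps_tags2_xref (
    eps_id   INTEGER NOT NULL,   -- points at eps.id
    tags2_id INTEGER NOT NULL    -- points at tags2.id
);
-- the same episode/tag pair cannot be stored twice
CREATE UNIQUE INDEX eps_tags2_pair ON eps_tags2_xref (eps_id, tags2_id);
"""

def build():
    """Create the three tables in memory with the banana example loaded."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO eps VALUES (1234, 'A show about bananas')")
    conn.execute("INSERT INTO tags2 VALUES (456, 'banana')")
    conn.execute("INSERT INTO eps_tags2_xref VALUES (1234, 456)")
    conn.commit()
    return conn

def try_insert(conn, sql, params):
    """Return True if the insert succeeds, False if a unique index rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

if __name__ == "__main__":
    conn = build()
    # Both of these repeat existing data, so both are refused.
    print(try_insert(conn, "INSERT INTO tags2 (tag) VALUES (?)", ("banana",)))
    print(try_insert(conn, "INSERT INTO eps_tags2_xref VALUES (?, ?)", (1234, 456)))
```

Note that nothing in this sketch stops a cross reference row from pointing at an episode or tag that does not exist; the unique indexes only prevent duplicates.
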
Then there's a further index I've created, which is probably not necessary, which I called all_tags,
and it indexes the tags2_id column of the cross referencing table. That's because when managing
the tags, if a tag gets deleted for whatever reason, maybe it was misspelled and you want to change it
and effectively have to delete the old and replace it with the new, then if deletion
is done external to the database the index helps to speed things up. Well, I'll
come on to that in a little bit more detail later on. Now one of the things that Mike mentioned in
his show was the use of so-called foreign keys. Now MariaDB or MySQL and all the various other
databases are so-called relational databases, which means that the relationship between tables,
or entities if you want to call them that, can be defined by various components of the definition.
This one needs lots of further tutorial episodes to fully grasp, but bear with me.
So the thing called referential integrity is managed through foreign keys, but by default
MySQL before and MariaDB now don't support foreign keys, so if you create a table you don't get the
ability to create foreign keys on it. I think that's true, and at one point MySQL, the predecessor
of MariaDB, couldn't actually do foreign key relationships at all. That's the reason I never used it
in my work, because at the time I wanted to be able to use foreign keys.
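
For anyone who has not met foreign keys before, here is a small sketch of what referential integrity looks like in practice. It is an assumption-laden illustration rather than anything from the HPR database: SQLite (via Python's sqlite3) stands in for MariaDB, the `ON DELETE CASCADE` clauses are one possible choice, and the helpers `build`, `link`, `delete_tag` and `xref_count` are invented for the example. As it happens SQLite also leaves foreign key enforcement switched off unless you ask for it.

```python
import sqlite3

def build():
    """In-memory database with foreign keys declared on the joining table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE eps   (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE tags2 (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
        CREATE TABLE eps_tags2_xref (
            eps_id   INTEGER NOT NULL REFERENCES eps (id)   ON DELETE CASCADE,
            tags2_id INTEGER NOT NULL REFERENCES tags2 (id) ON DELETE CASCADE,
            UNIQUE (eps_id, tags2_id)
        );
        -- invented sample data
        INSERT INTO eps VALUES (1234, 'A show about bananas');
        INSERT INTO tags2 VALUES (456, 'banana');
        INSERT INTO eps_tags2_xref VALUES (1234, 456);
    """)
    return conn

def link(conn, eps_id, tags2_id):
    """Cross-reference an episode and a tag; False if the database refuses."""
    try:
        with conn:
            conn.execute("INSERT INTO eps_tags2_xref VALUES (?, ?)",
                         (eps_id, tags2_id))
        return True
    except sqlite3.IntegrityError:
        return False

def delete_tag(conn, tag):
    """Delete a tag; the cascade removes its cross reference rows too."""
    with conn:
        conn.execute("DELETE FROM tags2 WHERE tag = ?", (tag,))

def xref_count(conn):
    return conn.execute("SELECT COUNT(*) FROM eps_tags2_xref").fetchone()[0]

if __name__ == "__main__":
    conn = build()
    print(link(conn, 999, 456))   # episode 999 does not exist, so this is refused
    delete_tag(conn, "banana")
    print(xref_count(conn))       # the cross reference went with the tag
```

In this sketch the cascade runs from the tags table to the joining table: removing a tag removes the rows that point at it, so no dangling references are left behind.
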
There are different types of table you can create and I've noted down here that by default you
get one which doesn't support these features. You have to ask for it explicitly. At the moment we
don't have any of these types of table that can do this stuff in the HPR database so I've not
implemented this in my example, though it's something we should be doing. So a foreign key then
is a way of showing relationships between database tables or the data within them, making
one table dependent on another effectively. So the field eps_id in the cross reference table is an
episode ID number in the eps table and it should only contain episode ID numbers which match
episode numbers in the eps table. You can't just add 999 in there because it's saying I am referencing
an existing row in the eps table. If you define it as a foreign key
then the database itself will say no, you can't do this because this doesn't exist in the other
table. So the tags2_id field in the same table would also be a foreign key pointing into the
tags2 table and again the database would constrain what can be placed there. One of the other things
that foreign keys can do is ensure that if you delete something, like you
delete the final reference to a tag, then you can get the database to manage all of the
deletion. So the sort of scenario would be if you had a tag banana referred to by an episode,
then if you deleted that tag, in the sense that you said this was a mistake and it's no longer to be
associated, you want to delete the tag from the tag table and you also want to delete the cross
reference. So the maintenance of integrity is a feature called cascading deletion, which means
that you can't delete a cross reference entry without the tag also being deleted. You don't end up
with tags sitting in the tag table which don't belong anywhere. Orphaned I think is the term, certainly
the term I am inclined to use, and Mike similarly I noticed. So I'm not sure I explained that very
well, but Mike did a much better job of it and so I can refer you back to episode 1569 if you
want to understand this more. So I've added in here a note about the Perl script that I wrote which
I called refresh_tags_2. It's quite a complicated script so I'm not
going to go into details about it. It's not really hugely relevant here but I thought I would
make it available in case anybody wanted to read it and understand it. Just to explain
it in three quick paragraphs: it scans the eps table in the database, collecting all of the tags stored
in CSV form, and it stores them away with the episode number they belong to. It also collects
the tags stored in the tags2 table, if there are any, and stores them, again associated with an
episode number. So it's made two tables of information, episode number and the tags that
are associated with it, from the two tables, remembering that this depends on the CSV list being
available. It can then compare the two sets of tags and look for differences, so if a new tag has
appeared in the CSV list it can add it to the tags2 table, and if for some reason it's no longer
there then it can delete it, and it manages the joining table along with the tags2 table to
achieve this. I've noted that the script performs actions that the database itself would carry out
if we used foreign keys and so forth, the deletion of stuff in particular, and
again in my notes I've made a reference to the fact that we really need to implement the full
database capabilities in order to get this stuff. So moving on to the advantages and disadvantages
of this method. Well, first of all this is the most efficient way of storing the tags. We only
store one instance of everything. It's obviously vastly preferable to the comma separated
variable method which we looked at in episode one, and it's also preferable to the method in the last
episode because the same tag is stored only once, so if you want to make a spelling correction to a tag
you don't have to go and make it several times, for example. And if we have the full relational database
capabilities, foreign keys, cascading deletion and so forth, we can use the database capabilities
to help manage this type of structure. It's what databases are designed to do. The disadvantage, and
I've only listed one really: this is the best in terms of database design, but the concept of how
it works and the way in which you manage it have become more complicated as a
consequence. But I don't believe that that's really an issue, especially since you can write scripts
or similar; you can even write so-called stored procedures in the database to help you
manage this sort of stuff. So I don't think that should be a criterion for rejecting it. So as in
the previous episodes let's look at the ways in which this method could be used for searching. Now
I've put quite a lot of detail in the notes here but I won't talk about them in great detail.
I'll leave them for you to examine if you're interested. The first one is how would you find
all the shows that have a given tag, and I'm using the example we used originally, which was to look
for shows with the tag community. The one note I've made here is that since I created the notes for
the last show I discovered that the method I use to generate show notes, which is a templating
system, has the capability of making database queries within it. So I've actually done that within
my notes, and my notes themselves are actually querying my copy of the HPR database, and because I can
do that I can also generate HTML tables to show the results. So I've done that and hopefully it's a
bit clearer to read. I think it is anyway. Previously I was listing things which were too wide
for the page so you had to scroll sideways to see them. HTML tables wrap in sensible ways,
so I think it is preferable, and you can also color things and that sort of stuff. So for the query that is
an example of how you would do this particular type of query, to get the shows with
the tag community, you have to examine the eps table and the cross reference table and the tags2
table in order to do this, and you need to make sure that for every row you get out of these
various tables you are making comparisons, to say that you want rows where the episode ID number
matches the episode ID number in the cross reference table, and similarly the tag ID number in
the cross reference table matches the tag ID number in the tags table. So you're doing a sort of
set operation where the sets are overlapping between these three tables, and then when you've
done that you then say I want only the rows from this combination which have a tag of community.
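
The three-table match just described might look something like the sketch below. It is not the query from the show notes: SQLite (through Python's sqlite3) stands in for MariaDB, the sample rows are invented rather than real HPR episodes, and `build` and `shows_with_tag` are hypothetical helper names.

```python
import sqlite3

def build():
    """In-memory copy of the three tables with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE eps   (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE tags2 (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
        CREATE TABLE eps_tags2_xref (eps_id INTEGER, tags2_id INTEGER,
                                     UNIQUE (eps_id, tags2_id));
        -- invented sample data, not real HPR episodes
        INSERT INTO eps VALUES (1, 'Show one'), (2, 'Show two'), (3, 'Show three');
        INSERT INTO tags2 VALUES (10, 'community'), (11, 'hpr'), (12, 'banana');
        INSERT INTO eps_tags2_xref VALUES (1, 10), (1, 11), (2, 11), (3, 12);
    """)
    return conn

def shows_with_tag(conn, tag):
    """Return (id, title) for every episode carrying exactly this tag."""
    sql = """
        SELECT e.id, e.title
        FROM eps e
        JOIN eps_tags2_xref x ON x.eps_id = e.id     -- episode matches xref row
        JOIN tags2 t          ON t.id = x.tags2_id   -- xref row matches tag
        WHERE t.tag = ?                              -- keep only the wanted tag
        ORDER BY e.id
    """
    return conn.execute(sql, (tag,)).fetchall()

if __name__ == "__main__":
    for row in shows_with_tag(build(), "community"):
        print(row)
```

The two `JOIN ... ON` conditions are the comparisons between ID numbers, and the `WHERE` clause is the final step of keeping only rows whose tag is the one asked for.
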
And you get back a list which is actually not the same as last time, because since then
the world's moved on and we've got more shows in the database, and one of them, I think it
is, has got community as a tag. It's mine actually, but at the time of recording this it hasn't come
to the top of the queue. But whatever, by the time you hear this it probably will have. So if you want
to find shows with a combination of tags, again it's the same example as before: we want to find shows
which contain the tag community or the tag hpr, and there's an example query of how you would do
this and the result as an HTML table. Again we get more back because other shows have come into
the database. The previous two shows in this mini-series appear, and that's
because they both contain hpr as a tag. If we want to find shows which have community
and hpr as tags then there's a difference in the way the query is done, which I have shown
without a huge lot of explanation, but you can see the result that comes out of it. I don't think it's
appropriate to drill down too deeply into this one. Now in the last episode I did a thing where I
followed up droops's suggestion, which was that if we select a show,
he was imagining that if we're looking at a particular show, listening to a particular show,
it would be good to know which other shows share the same tag, tags I should say, and I wrote a little
bit of SQL to do this which I stored in a file and I shared the file with you. This time I haven't
done the same thing. I've done the equivalent query but I haven't made it into an
SQL script. There are two queries. One is simply to get back the tags related to a given show.
I'm using show 2071 as the target, so we're assuming we're listening to show 2071, and that was
MrX's show. I didn't actually note what it was in this one, because I did in the last episode, but
I don't know what it was actually. I think, oh, it was about his portable amateur radio device.
Anyway the tags are amateur radio, electronics and open source, and then the second query is
one which scans the database for all shows which have any of those tags, and then it's actually
listed them with the tag that it found. So there's a batch of amateur radio, a batch of electronics
and a batch of open source shows that came back from the query, and again it's similar to what was
done last time, just different queries basically. Maybe more shows came back. I didn't actually make a
note of how many we got back. I don't think it's all that important. Now in the last episode we
looked at using regular expressions, because there is that capability within MariaDB, to find
partial tags, because all we've been doing so far is looking at whole tags. But if we want to look
at a partial tag then regular expressions are what we need to use, and I've got an example query
which is using the regular expression capability, and it's using the word boundary expression which I
mentioned in the last episode so I won't go into detail. I've gone into a tiny bit more detail
on how this query works. It's scanning four tables: it's scanning the eps table, it's scanning the
host table because I thought it would be useful to get back the name of the author of the show, and
it's scanning the joining table and the tags table, tags2, and the joining is done to make subsets of
the tables in the way I mentioned earlier on. Then the regular expression part looks for the word ham,
so it's got a word boundary before and after it, so it's looking only for the word ham as a distinct
word, as a component of a tag or indeed the whole tag if there is one,
but not as part of a word, so Birmingham would not come back from this query. It uses a group by
which makes sure that you get only one answer if the query matches the same episode twice or more.
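
A rough sketch of this kind of word-boundary search follows. Several things here are assumptions made so the example can run on its own: SQLite replaces MariaDB, its `REGEXP` operator is backed by a small Python function using `\b` as the word boundary (MariaDB's own regular expression syntax is not shown), the `hosts` table and its `hostid` and `host` columns are guessed names, and all the rows are invented.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backs SQLite's REGEXP operator; 'X REGEXP Y' calls regexp(Y, X)."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value, re.IGNORECASE) else 0

def build():
    """In-memory database with four tables and invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, regexp)
    conn.executescript("""
        CREATE TABLE hosts (hostid INTEGER PRIMARY KEY, host TEXT);
        CREATE TABLE eps   (id INTEGER PRIMARY KEY, title TEXT, hostid INTEGER);
        CREATE TABLE tags2 (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
        CREATE TABLE eps_tags2_xref (eps_id INTEGER, tags2_id INTEGER,
                                     UNIQUE (eps_id, tags2_id));
        INSERT INTO hosts VALUES (1, 'Host A'), (2, 'Host B');
        INSERT INTO eps VALUES (1, 'Radio show', 1), (2, 'City show', 2),
                               (3, 'Another radio show', 2);
        INSERT INTO tags2 VALUES (10, 'ham radio'), (11, 'ham'),
                                 (12, 'Birmingham'), (13, 'hamradio');
        INSERT INTO eps_tags2_xref VALUES (1, 10), (1, 11), (2, 12), (3, 13);
    """)
    return conn

def shows_matching_word(conn, word):
    """Episodes having a tag that contains `word` as a distinct word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    sql = """
        SELECT e.id, e.title, h.host
        FROM eps e
        JOIN hosts h          ON h.hostid = e.hostid
        JOIN eps_tags2_xref x ON x.eps_id = e.id
        JOIN tags2 t          ON t.id = x.tags2_id
        WHERE t.tag REGEXP ?
        GROUP BY e.id          -- one row per episode even if two tags match
        ORDER BY e.id
    """
    return conn.execute(sql, (pattern,)).fetchall()

if __name__ == "__main__":
    print(shows_matching_word(build(), "ham"))
```

In the sample data episode 1 has two matching tags but comes back once because of the `GROUP BY`, while the tags Birmingham and hamradio do not match because ham is not a separate word in either.
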
So we get back a list of shows and their dates and the titles and the host and then the tags that are
associated with them, and they all contain the word ham. So it's things like ham radio, ham as a tag
actually in one case. Ham radio seems to be the most common one, though hamradio without a space
is not one of the matches. Looking through, yeah, amateur radio, ham, ham space radio is
the commonest one, but ham just as a tag is also coming back. So there's a fair
degree of sophistication that can be achieved by using this technique. Now I thought it would be
useful just to finish off with a technique that's available within the database. Most databases offer
this. This is a thing called a view. You can use it to hide away the complexity of some of the queries,
and it's certainly the case that the queries we're using this time round are more complex than
the ones we used in the last episode. So I've created a view, and a view I should say is
a piece of SQL which is a means of storing away a select query, one of those ones where
you're asking for particular rows to be returned out of the tables. So it's a way of
storing away such a select query, and then you can use the name of that view to query again
as if it's a table all of its own, but behind the scenes the view query
is being executed and the results of that, or a subset of those results,
are then being returned. It's a sort of nested query type of thing. I've included the query that I came
up with as a file if you want to look at it and it's also listed in the notes. I created a view
called eht_view, which is just a way of signifying that it relates to the eps table, the host table
and the tags table, tags2 I should say. It queries these in a similar way to the way
we've done in previous queries, and once it's there and stored away it can then be executed.
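
To make the idea of a view concrete, here is a simplified sketch. It is not the view definition distributed with the show, which is more elaborate: SQLite stands in for MariaDB, the `hosts` table with its `hostid` and `host` columns is a guess, the rows are invented, the `REGEXP` operator is supplied by a Python function, and `build` and `find_in_view` are hypothetical helpers.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backs SQLite's REGEXP operator with Python's re module."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value, re.IGNORECASE) else 0

def build():
    """In-memory database with a view hiding the four-table join."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, regexp)
    conn.executescript("""
        CREATE TABLE hosts (hostid INTEGER PRIMARY KEY, host TEXT);
        CREATE TABLE eps   (id INTEGER PRIMARY KEY, title TEXT, hostid INTEGER);
        CREATE TABLE tags2 (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
        CREATE TABLE eps_tags2_xref (eps_id INTEGER, tags2_id INTEGER,
                                     UNIQUE (eps_id, tags2_id));
        -- invented sample data
        INSERT INTO hosts VALUES (1, 'Host A'), (2, 'Host B');
        INSERT INTO eps VALUES (1, 'Soldering basics', 1), (2, 'Kit build', 2),
                               (3, 'Workshop tour', 2);
        INSERT INTO tags2 VALUES (10, 'solder'), (11, 'multicore solder'),
                                 (12, 'electronics'), (13, 'soldering');
        INSERT INTO eps_tags2_xref VALUES (1, 10), (2, 11), (2, 12), (3, 13);

        -- the stored select: one row per episode/tag combination
        CREATE VIEW eht_view AS
            SELECT e.id AS id, e.title AS title, h.host AS host, t.tag AS tag
            FROM eps e
            JOIN hosts h          ON h.hostid = e.hostid
            JOIN eps_tags2_xref x ON x.eps_id = e.id
            JOIN tags2 t          ON t.id = x.tags2_id;
    """)
    return conn

def find_in_view(conn, word):
    """Query the view as if it were a table, matching `word` as a whole word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    sql = """
        SELECT id, title, host
        FROM eht_view
        WHERE tag REGEXP ?
        GROUP BY id
        ORDER BY id
    """
    return conn.execute(sql, (pattern,)).fetchall()

if __name__ == "__main__":
    print(find_in_view(build(), "solder"))
```

The caller only needs to know the view's name and columns; the joins are written once, inside the view, and run behind the scenes each time it is queried.
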
I'll maybe explain a teeny bit more about it in a moment. Yeah, this isn't really intended to be a
database tutorial as such, more a discussion of methodologies. So I'm using a query
which is just listed in the notes here which uses the view, so it's doing a select from
eht_view and then it's saying where tag regular expression, regexp, and then it's looking
for the word solder, S-O-L-D-E-R, with word boundaries before and after it, and then grouping the
result by the ID number that comes back. It gets back three shows which contain a tag which is
either solder or contains solder in the tag. So the first one is show 941, which has got the tag.
The second one is 1037, MrX, where he was giving a tutorial on soldering, and it contains
the tag solder. And the third one, he didn't use the tag solder, or whoever added these tags,
I actually can't remember who created them; it doesn't contain that tag but it contains multicore
solder as a tag. So this thing fished the word solder out of multicore solder because it's a separate
word and returned it, which I think is the way you'd want it to do. Now the view is a select
which will do a query; it gets back all of the instances of episodes and hosts and tags.
You then have to sort of subset that when you call it, which is what the example shows. But within
the view there is what's called a sub-select. I think I did an equivalent
in the last episode, and the sub-select is using a function in MySQL called GROUP_CONCAT,
which looks for all instances in a table and concatenates them together with a comma. So
the tag list that you see in the result is concatenated from the tags table, all of the tags
which relate to that particular show. Again it's probably not a thing you would do for real,
but it was a demonstration more than anything else. OK, so we're now at the conclusion and my
conclusion is that the HPR database needs a tag mechanism very much. We've looked at the
present tag storage system in this mini-series, and it's not a good way to do things. We've looked at
a somewhat better way of doing it, but I've concluded that it has some drawbacks.
This third episode shows a better way of doing things in a relational database, in a way
that represents the true relationship between the episodes and tags, and that relationship is
a many-to-many relationship. So you could say it's taken me three shows to get to a conclusion
that Mike Ray drew in one, but I felt it was worth working through this in order to explain why,
and why not use some of the other solutions. So although it's going to require some
work, it's strongly recommended that we implement a tag scheme in the HPR database in the way that's
been discussed in this show, and that we also enable the foreign key capabilities of MariaDB
for the reasons I've mentioned along the way today, and that at the same time we look at doing
similar upgrades to enable many-to-many relationships of hosts and episodes. We don't have
that when it comes to hosts. I think that is as important if not more important than the tags thing.
There's another one, which is that there's a many-to-many relationship between episodes and series,
in that an episode could be a member of more than one series. I don't think it's as critical
as the other ones, mind you, and I'm prepared to be disagreed with on that one, but it's definitely
something you should look at if nothing else. OK, so as before I've got a little
plea at the end saying please include tags in your shows, and if you have a moment to add
more tags to the shows that are missing tags, let's say, then that would be very much appreciated.
But other than that I've finished. Breathe a sigh of relief. OK, thanks everybody. Bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community
podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our
shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded
by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution
at binrev.com. If you have comments on today's show, please email the host directly, leave a comment
on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is
released under a Creative Commons Attribution-ShareAlike 3.0 license.