Episode: 2260
Title: HPR2260: Managing tags on HPR episodes - 2
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2260/hpr2260.mp3
Transcribed: 2025-10-19 00:30:32

---

This is HPR episode 2260 entitled Managing tags on HPR episodes - 2.
It is hosted by Dave Morris and is about 25 minutes long and carries an explicit flag.
The summary is: looking for the best way to store and manage tags in the HPR database, part 2.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello, welcome to Hacker Public Radio, my name is Dave Morris.
Today I'm going to do the second in a brief group of shows.
I'm not calling it a series because it's not a series yet.
But we probably do need a database series.
But anyway, this is a second show talking about Managing Tags.
We want tags in the database.
We want to be able to search on them and we need to design them properly.
So in the first show of this group we looked at that stuff,
why we needed them and so forth and then we looked at how we're storing them at the moment.
But we're not actually using them and I talked about the advantages and disadvantages of doing
the tag storing in the way that we do and how usable it will be.
Now there are long notes for this because it's quite a detailed
subject, and I put a lot of detail in here.
In fact, it's explicit detail.
It's not unclean, but it's an explicit use of English, I don't know.
Anyway, I did put a disclaimer in this one as well as the previous one saying,
I'm not really a trained database guy.
Just messed around with them for a fair number of years.
So I sort of feel I know my way around them.
But I don't think I could stand up against somebody who's really been trained in these things.
So if you listen to this,
wondering what the hell I'm talking about and seeing the flaws in what I'm saying,
please pipe up and say so. You won't necessarily end up being committed to rebuilding this thing.
But some advice would always be useful.
So I went for a second approach to storing tags and accessing them and managing them.
And I thought that since tags have been used in all sorts of places on the web for a long time,
there were going to be plenty of suggestions for ways in which you could do this.
So I did various Google searches and came across one solution,
which uses a single table in the database containing tags.
I put a link in the notes and it's the solution marked scuttle solution.
I'm not quite sure what that means, but anyway, have a look at it if you're interested.
So the way I've implemented this, and I've got a test database which is a copy of the HPR
live database, I created a table called tags and it's got three columns in it.
It's got an ID.
The ID is a reference to the show number in the EPS table, the one that holds the episodes,
and it's got the tag itself, which is in whatever case it originally arrived in.
And then it's got lctag, which is the lowercase form of the tag.
Now, I designed this originally because I'd not been that used to using MySQL
and MariaDB. Postgres is what I've used most.
And I wasn't aware that MySQL and MariaDB don't do case sensitive matches by default.
So I don't actually need the two forms of the tag, but I'm not changing it just for this.
We're talking about design here, so if it ever does get turned into reality,
there'll be lots of scope for changing it and improving it.
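As a rough illustration of the three-column table just described, here is a minimal sketch in Python. It makes several assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL/MariaDB, the column names id, tag and lctag follow the description above, and the sample row is made up. SQLite compares text case-sensitively by default, much as Postgres does, so it shows the situation the lowercase column was designed for.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the HPR test database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eps (
    id    INTEGER PRIMARY KEY,  -- show number
    title TEXT,
    tags  TEXT                  -- the original CSV list of tags
);
CREATE TABLE tags (
    id    INTEGER NOT NULL REFERENCES eps(id),  -- show number
    tag   TEXT NOT NULL,                        -- tag in its original case
    lctag TEXT NOT NULL                         -- lowercase form of the tag
);
""")

# Made-up sample data, not real HPR rows.
conn.execute("INSERT INTO eps VALUES (1, 'Made-up show', 'dd,filesystem,Grep')")
conn.execute("INSERT INTO tags VALUES (1, 'Grep', lower('Grep'))")

# With a case-sensitive comparison the original-case column misses 'grep',
# while the lowercase column finds it.
rows_exact = conn.execute("SELECT id FROM tags WHERE tag = 'grep'").fetchall()
rows_lower = conn.execute("SELECT id FROM tags WHERE lctag = 'grep'").fetchall()
print(rows_exact, rows_lower)
```

In MySQL and MariaDB with their default case-insensitive matching, both queries would find the row, which is why the second column turns out not to be needed there.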
So having made this table, how to fill it.
So I thought, well, in the database already, there are all of the tags that have been
submitted into the episodes, either by the hosts or by further processing after the event.
They're all there as CSV, comma-separated values, in the tags field of the EPS table.
So I went for method one in setting stuff up, managing it.
And it was to use the database capabilities themselves.
There's a thing within most databases to allow you to write bits of code in the database,
which can be run on the data there.
These are referred to as stored procedures or stored functions.
So I wrote something which was, this is not an original piece of code,
it's derived from something I saw on Stack Overflow.
And when you run it, it reads the tags in the EPS table throughout the entire database
and builds up the tags table from what it finds.
I made a note here: this is fairly advanced database trickery, magic even.
And I'm not a great expert on this, I just sort of found some advice and hacked it around a bit and used it.
So if you're not interested in this, then skip over it.
But I just really mention it because I did this.
It's not the ideal way I reckon.
But I've put the details of what I do in with these notes.
There's a file called Normalize underscore tags underscore one dot SQL,
which contains the SQL statements to define the stored procedures and the table
and to build indexes and that type of thing.
Then I've got another thing which I call Refresh underscore tags dot SQL,
which needs to be run and every time it's run, it empties the tags table and rebuilds it.
Very much a sledgehammer approach.
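The empty-and-rebuild idea can be sketched roughly as follows. This is not the stored procedure from the notes: it is a Python stand-in using SQLite, the function name rebuild_tags is hypothetical, the data is made up, and the tags are split naively on commas, so a quoted tag containing a comma would be mangled.

```python
import sqlite3

def rebuild_tags(conn):
    """Empty the tags table and refill it from eps.tags (the sledgehammer)."""
    conn.execute("DELETE FROM tags")
    for show_id, csv_tags in conn.execute("SELECT id, tags FROM eps").fetchall():
        # Naive split on commas: this does not understand quoted CSV fields.
        for raw in (csv_tags or "").split(","):
            tag = raw.strip()
            if tag:
                conn.execute(
                    "INSERT INTO tags (id, tag, lctag) VALUES (?, ?, ?)",
                    (show_id, tag, tag.lower()),
                )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eps (id INTEGER PRIMARY KEY, tags TEXT);
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
INSERT INTO eps VALUES (1, 'Community, HPR'), (2, 'grep'), (3, NULL);
""")

rebuild_tags(conn)
rebuild_tags(conn)  # a second run throws everything away and starts again
result = conn.execute(
    "SELECT id, tag, lctag FROM tags ORDER BY id, lctag"
).fetchall()
print(result)
```

Every run deletes and reinserts every row, even when nothing in the EPS table has changed.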
Second method, so I really wasn't happy with this, even though it's quite fun to do,
try and get my head around what it actually means and how to do it.
The second method was to write a Perl script.
It could be any language you wished if you wanted to do this sort of thing.
I made mine in Perl, because that's the language I know best.
And it's called Refresh underscore tags.
And what it does is it actually edits the tags in place,
using the same technique, reading through the EPS table's tags and then looking at the tags table.
I've included this with the show and there's a brief explanation here as to how it actually works.
I'm not going to go into detail.
This is a moderately long bit of Perl.
The basis of it is that it uses a Perl module which understands CSV
and it uses that to parse the CSV data.
And I did this because CSV is more tricky than you might think.
So if the CSV data is in good order then it will parse it;
if it finds anything faulty, it'll report it and ignore it.
But that would give us an opportunity to go and fix it.
But I've put the further details in the notes.
I won't read them here because I suspect that it's not going to be amazingly interesting to people.
But it's using a fairly simple algorithm to hunt through all of the stored CSV tags
and looks to see if there are any changes in there that need to be propagated.
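To give a flavour of the edit-in-place approach, here is a rough sketch in Python rather than Perl. It is not the script from the notes: the function names parse_tags and refresh_tags are hypothetical, Python's csv module stands in for the Perl CSV module, SQLite stands in for MariaDB, and the data is made up. The idea is to parse each CSV list properly, skip and report faulty ones, and only insert or delete the rows that differ.

```python
import csv
import sqlite3

def parse_tags(csv_text):
    """Parse one CSV tag list; return a set of tags, or None if it is faulty."""
    try:
        rows = list(csv.reader([csv_text], skipinitialspace=True, strict=True))
    except csv.Error:
        return None
    return {t.strip() for t in rows[0] if t.strip()} if rows else set()

def refresh_tags(conn):
    """Bring tags into step with eps.tags, touching only the differences."""
    changes = 0
    for show_id, csv_text in conn.execute("SELECT id, tags FROM eps").fetchall():
        wanted = parse_tags(csv_text or "")
        if wanted is None:
            print(f"show {show_id}: faulty CSV, skipped")
            continue
        have = {r[0] for r in conn.execute(
            "SELECT tag FROM tags WHERE id = ?", (show_id,))}
        for tag in wanted - have:   # tags in eps but not yet in the tags table
            conn.execute("INSERT INTO tags VALUES (?, ?, ?)",
                         (show_id, tag, tag.lower()))
        for tag in have - wanted:   # tags no longer present in eps
            conn.execute("DELETE FROM tags WHERE id = ? AND tag = ?",
                         (show_id, tag))
        changes += len(wanted ^ have)
    conn.commit()
    return changes

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eps (id INTEGER PRIMARY KEY, tags TEXT);
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
INSERT INTO tags VALUES (1, 'old', 'old');
""")
conn.execute("INSERT INTO eps VALUES (1, ?)", ('Community, "news, reviews"',))
conn.execute("INSERT INTO eps VALUES (2, ?)", ('linux,"shell"scripting',))

first = refresh_tags(conn)   # two inserts and one delete for show 1
second = refresh_tags(conn)  # nothing left to change
final = conn.execute("SELECT id, tag FROM tags ORDER BY tag").fetchall()
print(first, second, final)
```

The quoted tag containing a comma survives intact, the badly quoted list on show 2 is reported and skipped, and a second run makes no changes at all.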
Now the tags table has also got indexes.
Indexes are a feature within a database whereby you run a process
which looks through a table, a particular column in a table,
field in the table, both mean the same thing.
And it makes indexes which are effectively pointers into the table.
And you can set it up so that as you add and remove things in a table,
the indexes are kept in step.
So building an index alongside it is very useful.
And it means that when the database engine comes to try and look for something for you,
if there's an index it will use that,
rather than sequentially hunting through every row in the table or tables.
So doing this speeds things up.
So I have several indexes associated here,
which again I won't go into a huge lot of detail.
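A small sketch of what building and using an index looks like, again with SQLite standing in for MariaDB. The index names tags_id and tags_lctag are hypothetical and the rows are made up. SQLite's EXPLAIN QUERY PLAN reports how it intends to run a query, so we can see whether the index is picked up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
CREATE INDEX tags_id    ON tags (id);     -- hypothetical index names
CREATE INDEX tags_lctag ON tags (lctag);
INSERT INTO tags VALUES (1, 'grep', 'grep'), (2, 'Grep', 'grep'), (3, 'dd', 'dd');
""")

# Ask the database how it would run the lookup; the last column of each
# plan row is a human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM tags WHERE lctag = 'grep'"
).fetchall()
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

The exact wording of the plan varies between SQLite versions, but it should mention the tags_lctag index rather than a scan of the whole table. The indexes are kept in step automatically as rows are added and removed.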
So the tags table contains repeated instances of a tag,
one for each matching episode number.
And I've got a little query here just to demonstrate
which selects the ID number and the tag from the tags table
where the tag is equal to the string grep.
And it returns two instances of the word grep,
which are used as tags on two shows 2040 and 2072.
You'd expect there to be more grep instances as tags in the database.
But I think that's largely due to the fact that only about 50% of the shows
have been tagged.
So let's look at the advantages and disadvantages of using this approach.
And we'll go on to look at how you would actually use it.
So I think that this gives a much more reliable and efficient solution
to the problem of storing and finding tags,
much more so than the comma-separated values thing.
The fact there's a separate table allows you to build indexes
and that optimizes access and so forth as I've mentioned.
You can't do this, not without a lot of work,
against a comma-separated string type approach.
The disadvantages are really around the way I've implemented the management of it.
Using the stored procedure stuff, you have to blitz the whole table
and then rebuild it, which is not good.
When you use the Perl script, then it's much lighter weight
and runs very quickly.
And the first stored procedure method parses the CSV data
using SQL expressions, which is not good
because it doesn't really understand
properly formatted CSV data.
So the Perl script does a much better job.
So those are really disadvantages that can easily be avoided
by not using the stored procedure stuff.
The one thing that database designers would not like about this,
I think, would be that it's not normalized.
Normalization is a process where effectively you don't store duplicate values.
As much as possible, you avoid storing duplicate values.
And this solution is not normalized.
We'll come on to this in the next episode.
Yes, I'm sorry, there is another episode.
So in terms of searching, we can now do much more sophisticated searches
partly because the work of parsing the CSV
extracting the tags has been done
and the results stored in the tags table.
So when we looked at the method of searching in the first show,
every search required the tag list per episode to be looked at
and individual tags picked out using
rather arcane methods with regular expressions and so forth.
But in this case, everything has been done
by building the table in the first place.
And as I said before, the Perl script that does it
understands the CSV format and simply stores the right stuff away.
So if we want to be able to do what droops was suggesting,
which is examining the tags on a given show,
then find all the other shows that share the same tags we can do so.
And I'll come on to a method of doing that in a moment.
But I'm just going to skip through a bunch of queries
doing various tag searches.
So if you want to find all the shows with the given tag,
I'm using pretty much the same examples
as I did in the last episode just to compare things.
You're not going to see a huge difference in speed
admittedly because this database is quite small.
But I think you'll find that it's a better organized thing.
So it does produce better results potentially.
So in the last episode, we looked in the tags field, as I said,
now we're going to use the tags table.
And we're going to find all the shows associated
with the tag community, then we can report them.
The query is a bit more complicated.
Well, I guess it's actually not that much more complicated.
In database terms, it's pretty simple.
The previous example doing the same sort of thing
got very hairy when it came to selecting stuff out of CSV lists.
I'm not going to dig too deeply into this, I think.
Because, as I said before, this is not really a database tutorial.
But essentially what it's doing is it's using the ability of SQL
to examine multiple tables at once.
So it's looking in the EPS table and the tags table.
And it's looking in the tags table for all of the tags
which equal the string community.
So I'm looking for precisely that.
No other, not community news or anything like that.
And the way that the query is organized
means that we get back the details from the EPS table.
So I've got the ID number, the show number,
the date of the show, the title.
I've truncated the title using the substring function
just to the first 30 characters just to make it easier to see
on this page.
And it shows the tags from the EPS table,
the CSV stuff.
And it shows all instances where there's community.
I only added those tags because you can check
that these really are the rows
which contain the community string.
There are nine rows come back.
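A query of this general shape might look like the following sketch. It is not the query from the notes: SQLite stands in for MariaDB, the column names are assumptions, and the three shows are made up. It joins the EPS table to the tags table, matches the tag community exactly, and truncates the title to 30 characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eps (id INTEGER PRIMARY KEY, date TEXT, title TEXT, tags TEXT);
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
INSERT INTO eps VALUES
  (1, '2016-01-04', 'A made-up show about the community and its many doings', 'Community,HPR'),
  (2, '2016-02-01', 'Made-up Community News', 'Community News'),
  (3, '2016-03-07', 'Another made-up show', 'community');
INSERT INTO tags VALUES
  (1, 'Community', 'community'), (1, 'HPR', 'hpr'),
  (2, 'Community News', 'community news'),
  (3, 'community', 'community');
""")

# Look in tags for the exact tag, and report details from eps for each match.
query = """
SELECT e.id, e.date, substr(e.title, 1, 30) AS title, e.tags
FROM eps AS e
JOIN tags AS t ON t.id = e.id
WHERE t.lctag = 'community'
ORDER BY e.id
"""
community_rows = conn.execute(query).fetchall()
for row in community_rows:
    print(row)
```

Show 2, tagged Community News, is not returned, because the match is on the whole tag and not on part of it.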
I've got a little sort of breakout box in the notes
which tries to explain the query a little bit more.
But I won't read that out.
I've said a good proportion of what's in there anyway,
without meaning to.
But if you really want to find out more
then you can do so by reading that, I think.
Just for interest, I used the SQLeo tool
that Ken Fallon mentioned in episode 1965.
And I used it to examine the database
and to demonstrate the table.
So effectively it's doing the same query
as the one I just mentioned.
But it's in a more graphical form.
I thought that might be useful.
It's actually a great tool.
I'm going to get more into using that myself.
In Ken's example, his table had little lines joining them
showing the relationship between them.
It's these relationships that make this sort of database
be referred to as a relational database.
But there aren't any in this database.
There are no explicit relationships here.
They're not there because we don't have
these capabilities within this database.
I think it's probably due to the age of the database design.
Within MySQL and MariaDB you need to go to some length
to set up these relationships,
which are just part of the database system.
Even in things like SQLite you can do this
without any great trouble.
Anyway, that's another thing that
I think we need to sort out, but
let's deal with one thing at a time.
So my next example was showing combinations of tags.
So if we wanted to find a combination of the tags community
and/or HPR.
I did both of those examples, "and HPR" and "or HPR".
You can search for them in the CSV tags.
But using the tags table this is easier
and there's another query here which I have not
broken out and explained.
Not really sure that I should; I think we need
to have a series on how databases work or something.
But the essence of it is that it's doing a similar sort of query
and it's saying return all of the rows
in the EPS table where it matches tags
which are either community or HPR.
So it's an OR type example.
It uses the thing called group by
which is a way of deduplicating the result.
We get back 14 rows.
Then I did the AND version of the same thing,
which simply does the same thing
but it's asking for only those cases
where a given row has got both matches in it.
So that returns five rows.
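The OR and AND searches could be sketched like this. This is an illustration under assumptions, not the queries from the notes: SQLite stands in for MariaDB, the data is made up, and the AND version uses a HAVING clause that counts distinct matching tags, which is one common way of doing it and may differ from what the notes use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eps (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
INSERT INTO eps VALUES (1, 'Made-up A'), (2, 'Made-up B'),
                       (3, 'Made-up C'), (4, 'Made-up D');
INSERT INTO tags VALUES
  (1, 'community', 'community'), (1, 'HPR', 'hpr'),
  (2, 'Community', 'community'),
  (3, 'hpr', 'hpr'),
  (4, 'linux', 'linux');
""")

# OR: a show qualifies if it has either tag; GROUP BY removes the duplicate
# row that a show carrying both tags would otherwise produce.
or_query = """
SELECT e.id FROM eps AS e
JOIN tags AS t ON t.id = e.id
WHERE t.lctag IN ('community', 'hpr')
GROUP BY e.id
ORDER BY e.id
"""

# AND: same search, but keep only shows where both tags were matched.
and_query = """
SELECT e.id FROM eps AS e
JOIN tags AS t ON t.id = e.id
WHERE t.lctag IN ('community', 'hpr')
GROUP BY e.id
HAVING count(DISTINCT t.lctag) = 2
ORDER BY e.id
"""

either = [r[0] for r in conn.execute(or_query)]
both = [r[0] for r in conn.execute(and_query)]
print(either, both)
```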
There's a sort of brief explanation of how that works.
But again, it's like reading out knitting patterns
or something, I think, this.
I don't hesitate to read out regular expressions
but Structured Query Language
would be going a little bit over the top, I don't know.
So let's get on to the case where we're doing
what droops was suggesting,
where, having been given a show number,
we go looking for the tags that that show contains
and return the other shows that relate to it.
I might put that better in the notes.
It's a more complex query.
And I've done it by building a SQL file
called Find Show Sharing Tags
with some underscores in there.
It's included in the notes here.
I've also displayed it in the notes
so you can have a look at it.
There are two queries in the file.
The first one simply reports, you wouldn't use this for real.
This is just for demonstration.
The first one just reports what tags are on a given show.
So you have to set a variable; MySQL has the concept of variables.
So you set a variable, in this case called @show
(variables begin with an at sign), to a number.
And then the query searches the tags table
for the instances where that show number
is the ID number in the tags table.
Then there's a more complex query, the most complex so far.
So I'm not going to explain it.
But the essence of it is that it is searching the EPS table.
It's actually doing a bit more than that.
It's doing searches of three tables simultaneously.
And it's doing it with each of the tags
that came out of the target show.
And it's returning data such as the actual tag
that it's looking for, just to make it easier to understand.
Then the show number, the date, the host, host name.
And I just added that in just to demonstrate
that it could be done really.
And it contains the title as well.
So I've got a demonstration of this actually running.
To run this within the database system
you first need to set the @show variable.
I've set it to 2071.
And one thing you can do in MariaDB and MySQL
is you start a line with backslash dot (\.),
which means run the SQL that you find in this file.
And you follow it with the name of the file.
So I simply invoked this file.
So it's a script, it's an SQL script effectively.
So it returns for 2071.
It returns three tags, amateur radio, electronics,
and open source.
This particular show is Undocumented features
of Baofeng UV-5R Radio by MrX.
So the table that comes back is long.
It's got 30 rows in it.
And there's the shows that match amateur radio.
Shows that match electronics.
The shows that match open source.
That's 30 shows.
These have been made unique as well.
So you're not getting the same show showing up twice.
I looked to check that I'm not talking nonsense.
Yeah, that was the intention anyway.
Then I did the same on 2072,
which is a sigflup show called
that awesome time I deleted my home directory.
The tags on that show are dd, filesystem and grep.
I only get one row back in this case, which is for grep.
Remember I mentioned there were two instances
of grep in the table.
But this thing deliberately avoids repeating
the show that we're looking for.
So there's only one comes out.
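The find-shows-sharing-tags idea could be sketched as below. This is not the SQL file from the notes: SQLite stands in for MariaDB, a named parameter plays the part of the @show variable, the hosts table and its columns are assumptions, and all the rows are made up. The first query lists the tags on the target show; the second finds other shows carrying any of those tags, leaving out the target show itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (hostid INTEGER PRIMARY KEY, host TEXT);
CREATE TABLE eps (id INTEGER PRIMARY KEY, date TEXT, title TEXT, hostid INTEGER);
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
INSERT INTO hosts VALUES (10, 'host-a'), (11, 'host-b');
INSERT INTO eps VALUES
  (1, '2016-07-01', 'Made-up target show', 10),
  (2, '2016-05-01', 'Made-up show that shares a tag', 11),
  (3, '2016-06-01', 'Made-up unrelated show', 11);
INSERT INTO tags VALUES
  (1, 'dd', 'dd'), (1, 'grep', 'grep'),
  (2, 'grep', 'grep'),
  (3, 'linux', 'linux');
""")

# Query 1: which tags are on the target show?
show_tags_sql = "SELECT tag FROM tags WHERE id = :show ORDER BY lctag"

# Query 2: for each of those tags (t1), find the same tag on other shows (t2)
# and pull in episode and host details for them.
sharing_sql = """
SELECT DISTINCT t2.tag, e.id, e.date, h.host, e.title
FROM tags AS t1
JOIN tags  AS t2 ON t2.lctag = t1.lctag AND t2.id <> t1.id
JOIN eps   AS e  ON e.id = t2.id
JOIN hosts AS h  ON h.hostid = e.hostid
WHERE t1.id = :show
ORDER BY t2.lctag, e.id
"""

show = {"show": 1}
own_tags = [r[0] for r in conn.execute(show_tags_sql, show)]
sharing = conn.execute(sharing_sql, show).fetchall()
print(own_tags)
print(sharing)
```

In this sketch a show that shared two tags with the target would appear once per shared tag.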
So as I've said here, hint:
there's only one show coming back.
This demonstrates the shortage of good tags
in the database at the moment.
Need I say more? Smiley face.
So that's a hint to anybody who's made it this far.
We could really do with some help adding tags.
It is possible to do regular expression-y type things
with this as well, because sometimes you might want to be
doing something more than matching a tag verbatim.
This is not solving the same problem,
but it's just a general thought.
This is, as I've said, just for fun.
I was experimenting with other types of query,
and I came up with one that looks
for a partial tag using a regular expression.
So this tag, the tag being searched for
is anything containing a word ending in working.
And it uses MySQL and MariaDB's regular expression
word boundary operator.
Which is a strange thing, which I won't try and read out.
It's in the notes, and it's just again
to prove the sort of things that can be done with it.
It returns the show number, the date, the host, the title.
Then it shows the tags which are stored in the EPS table,
the CSV one.
And then it's using a feature of the database.
It also concatenates together all of the tags
that were found in the tags table.
They come out in sorted order because we
sort them in the table.
They come out sorted from the table.
I won't go into details.
I'm using some quite advanced bits and pieces.
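For the curious, a partial-tag search of this kind might be sketched as follows. This is an approximation under several assumptions: SQLite stands in for MariaDB, SQLite has no built-in REGEXP so a small Python function is registered to provide it, Python's \b word boundary replaces the MySQL/MariaDB operator (which is not reproduced here), and the data is made up. The tags found for each show are joined together with group_concat.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Back SQLite's 'value REGEXP pattern' operator with Python's re module."""
    return int(value is not None and re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.executescript("""
CREATE TABLE eps (id INTEGER PRIMARY KEY, tags TEXT);
CREATE TABLE tags (id INTEGER, tag TEXT, lctag TEXT);
INSERT INTO eps VALUES
  (1, 'networking,linux'), (2, 'woodworking tools'), (3, 'workings');
INSERT INTO tags VALUES
  (1, 'networking', 'networking'), (1, 'linux', 'linux'),
  (2, 'woodworking tools', 'woodworking tools'),
  (3, 'workings', 'workings');
""")

# Find shows with a tag containing a word that ends in "working", then list
# the CSV tags from eps alongside all of that show's rows from tags.
query = """
SELECT e.id, e.tags, group_concat(t.tag, ', ') AS found
FROM eps AS e
JOIN tags AS t ON t.id = e.id
WHERE e.id IN (SELECT id FROM tags WHERE lctag REGEXP ?)
GROUP BY e.id
ORDER BY e.id
"""
rows = conn.execute(query, (r"working\b",)).fetchall()
for row in rows:
    print(row)
```

Show 3 is left out because in workings the word does not end at working. The order of tags inside the concatenated string is not guaranteed in this sketch.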
So conclusion then.
Probably don't need to say that I prefer
this solution to the string of comma-separated values.
The downside of my original solution using stored procedures
and SQL has already been mentioned.
It's far too heavyweight.
On the other hand, the Perl script
it only finds the differences and makes changes.
It might do nothing at all, whereas the other one
would tear down the table and rebuild it regardless.
And it's just not a very good solution.
Probably there are ways that it could be done more efficiently,
but I'm not keen on it, to be honest.
Both of these approaches depend on the fact
that the EPS table contains a tags field.
And it's using them, the tags that are stored there
to build the table.
So I'm not sure that that would be the best way
of doing things in the final solution, the final way
in which we solve this.
Ideally, if this database
was being redesigned from scratch,
you would never have stored the tags in the episodes.
You would have simply had a separate tags table.
So there's a lot to be said for doing that,
but it does require quite a lot more infrastructure
to manage it at the point at which tags are being entered
as a new show is being added.
However, it wouldn't be that difficult.
These are trivial things that people
implementing databases do all the time.
I've included the epilogue that I included in the last one
saying, please include tags when you're uploading.
It just makes life so much easier if you have added them
because you know best what's good to highlight in your show.
And if you have a bit of spare time
and feel that you could help to add missing tags,
then that would be amazingly helpful.
Even if you can just put aside an hour or something
every year, if we had enough people,
just any contribution would be much appreciated.
What you need to do is to go to
the page on the HPR website
which explains how to check a show
and prepare the tags for it and then send them in as an email.
And then we will add them to the database
with a script or two.
So it would be most appreciated if you could do that.
And we will get further on with the tag implementation project.
So I'm going to stop there.
I hope that wasn't too hard going.
I hope you managed to get something out of it.
And as I said before, I very much appreciate any feedback
you can give me on the subject.
Okay then, bye-bye.
Buh-bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network
that releases shows every weekday Monday through Friday.
Today's show, like all our shows,
was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast,
then click on our contribute link
to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound
and the Infonomicon Computer Club,
and it's part of the Binary Revolution at binrev.com.
If you have comments on today's show,
please email the host directly,
leave a comment on the website
or record a follow-up episode yourself.
Unless otherwise stated,
today's show is released under a Creative Commons
Attribution-ShareAlike 3.0 license.