Initial commit: HPR Knowledge Base MCP Server

- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 10:54:13 +00:00
commit 7c8efd2228
4494 changed files with 1705541 additions and 0 deletions
--- a/hpr_transcripts/hpr2211.txt
+++ b/hpr_transcripts/hpr2211.txt
@@ -0,0 +1,472 @@
+Episode: 2211
+Title: HPR2211: My podcast workflow
+Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2211/hpr2211.mp3
+Transcribed: 2025-10-18 15:47:38
+
+---
+
+This is HPR episode 2211 entitled My Podcast Workflow.
+It is hosted by Dave Morris and is about 26 minutes long and carries an explicit flag.
+The summary is how I download, manage, listen to and delete podcasts.
+This episode of HPR is brought to you by AnHonestHost.com.
+Get 15% discount on all shared hosting with the offer code HPR15.
+That's HPR15.
+Better web hosting that's honest and fair at AnHonestHost.com.
+Hello everybody. Welcome to Hacker Public Radio. My name is Dave Morris.
+Today I've got an episode which I've entitled My Podcast Workflow.
+Probably like most people who are listening to this,
+I've been listening to podcasts for quite some time.
+In my case I started in 2005 and that was when I bought my first MP3 player.
+And over that time I've used various podcast downloaders, or podcatchers as people call them.
+A lot of them have existed and I've tried several of them.
+But now I use a script based on Bashpodder which I've rewritten and built up to meet my needs.
+I also use a database to hold details of the feeds that I subscribe to.
+And it also holds what episodes have been downloaded and what's on a player to be listened to and what can be deleted and all of that sort of thing.
+I've written scripts in Bash, Perl and Python to manage all of this.
+So I'm going to be describing some details of the workflow.
+But I'm not going to go into specific details about scripts and details of methods and so forth.
+I was prompted actually to put this show together in 2016.
+And I'd heard a show produced by folky, show number 1992,
+"How I'm handling my podcast subscriptions and listening".
+And this was April of 2016.
+And I thought it was a really interesting episode and I thought I must try and write something along similar sort of lines to that.
+I think it's always interesting to hear how other people do this sort of thing.
+So I thought a contribution might be a good thing.
+But I'm embarrassed to say that I started this in April 2016.
+And somehow it's been lurking in the background ever since.
+And this is January 2017 that I'm recording this.
+So it's been waiting a while.
+I thought it would be interesting to describe what a podcast feed actually is.
+I'm sure most people have used them.
+That's how you're listening to this,
+most likely anyway.
+It's defined by an XML file and there are two main formats which are called RSS and Atom.
+And I've linked to details about these.
+I won't go into details myself to what they mean and what they come from.
+If you're interested you can find out lots of information.
+Basically they both consist of a list of structured items where an item is a distinct thing.
+And each item can contain a link to a multimedia file or so-called enclosure.
+And it's the enclosure that makes it a podcast.
+There are other sorts of RSS and Atom feeds which are not podcast feeds.
+And it's the enclosure that makes it one.
+There's a Wikipedia article I've linked to talking about podcasts which you might find interesting if you want to dig deeper.
+So the way in which a feed's intended to be used is that when something new has been released,
+a new podcast has been released, a new episode in the case of HPR,
+the feed is updated to reflect the change.
+And then podcatchers are monitoring.
+So they probably, you might be running something that looks every hour or once a day
+or something like that.
+And it will go and look at a given feed to see if there's anything new.
+And it usually does that by scanning through all of the enclosures,
+all of the items in the list and checking to see whether it's already downloaded things.
+If it finds something new then it will download it and there are all sorts of complications
+as to how many podcasts it'll download at a time and so on and so forth.
+But the point of it is that the podcatcher keeps a record of what it's already downloaded.
+Now in the early days of podcasts and podcatchers,
+just saving the URL of the enclosure was enough because that was pretty much a unique item
+and unique thing that identified that podcast.
+Now it's not so much the case.
+But I think it was always designed that there was a unique identifier associated with each enclosure
+and RSS and Atom certainly contain them.
+So this acts as a label which can be stored to say I've seen this one already
+and thereby avoid duplicate downloads.
+So looking at my workflow, I'm using a rewritten version of the venerable Bashpodder
+which was a Bash script written by Linc Fessenden of The Linux Link Tech Show.
+People say he's the Linc of the Link in the title; I don't think so.
+Anyway, he wrote this rather elegant piece of bash to do this job.
+But it has its limitations.
+I rewrote this.
+He based his around using the XSLT capabilities.
+There's a parser which you can use, whose name I've certainly forgotten.
+XSLT pars.
+I can't remember its name but I'm sure you'll find it.
+Maybe I'll make sure I put it in the notes.
+But what this does, it's a method of parsing an XML file.
+And he had written a thing called parse_enclosure.xsl
+which is used to parse the enclosures out of an RSS feed.
+But since he did that, a lot of other types of feeds have popped up
+which include Atom, and he hadn't catered for that.
+So I modified it to include Atom.
+I also added another one which I call parse_id.xsl
+which is quite capable of parsing out the ID strings from a feed.
+And that's the thing I just mentioned about a unique tag per episode.
+I've included both of these along with my notes.
+And I should say, I always forget to say this.
+I'm sure you've worked out yourself that there are long notes
+that I'm currently reading effectively.
+But they're there for you to refer to if you find it interesting.
+One of the drawbacks of Bashpodder, and also my version of it,
+is that it can't deal with feeds where the enclosure URL
+doesn't show the actual download.
+So in the early days then the enclosure simply consisted of a URL
+pointing to the audio file itself.
+So an MP3 or an Ogg or whatever it was.
+In latter times there are lots of intermediaries
+that serve up the audio for podcasts.
+In many cases the URL that's in the enclosure doesn't actually
+point directly to the audio.
+So if you download it with something like wget or curl,
+then you get the end result.
+But if you're trying to work out things like where the file is,
+what file is going to be generated as a consequence,
+it's very hard to do.
+I haven't quite got a solution for this.
+These things are popping up more and more.
+I don't have a complete solution to this yet.
+Charles in NJ did a show, 1935, called Quick Bashpodder Fix
+where he talks about something which is similar,
+possibly the same as this problem.
+Anyway, back to what I do: I run this modified Bashpodder
+on one of my Raspberry Pis once a day and it runs during the night.
+I originally did this because I had a slow ADSL connection.
+It's got faster now and I also had a download limit.
+And what I found was that if I ran the downloads during the day,
+it collided with what my kids were doing when they were doing stuff.
+But that's really not relevant anymore,
+because both my kids have gone away to uni or whatever.
+But I still do the same thing,
+downloads at two in the morning, UK time.
+It doesn't really matter.
+It downloads to a directory on the Pi.
+I've got a disk attached to it.
+And I export that directory with NFS,
+so I can see it from other systems in the house.
+So let's talk about the database briefly.
+I use a database to hold the feed details
+and also details of what I've downloaded.
+And the reason I did this originally was
+I'm interested in databases and want to learn how to use them.
+I chose Postgres, PostgreSQL
+is the way it's written,
+because it's very feature rich and powerful.
+And at the time I first started using it, it was vastly more powerful than MySQL.
+It still is quite a lot more so, but MySQL has caught up a bit.
+And I was using Postgres at work around the time
+I started doing this in 2005 or so.
+So it was useful to have a home project as well.
+I want to be able to generate all sorts of reports
+from the database and to perform actions based on its content.
+So the way I've set it up is that the database runs on my workstation,
+which is a thing I turn off at night,
+rather than running on the server.
+That's maybe a decision I want to review in due course.
+The design, as I've said in my notes, is sort of bolted on.
+It's a bolted on database.
+You know, it's not integrated properly.
+The Bashpodder clone downloads podcasts every day
+and stores them in a directory.
+It does it based on the date.
+So every day you get a new directory containing today's downloads.
+Then the original model was that a playlist would be generated for each day.
+I don't do that anymore.
+So what I do is I use a thing that scans what's been downloaded
+and it puts data into the database.
+I've said that really if you were going to do something like this,
+it would be better to have the database and podcatcher fully integrated.
+I didn't do this because I started off with the original Bashpodder
+and added the database on as an add-on, as a bolt-on.
+But it would be wiser to do it that way.
+So I have a thing that runs every morning that looks at the night's downloads,
+as I said, and it updates the database.
+And I want to eventually integrate the two.
+In the database, I have a bunch of tables.
+There's more than I've listed here and I'm not going into detail.
+There's a feed table that contains all the feeds, like the title of the feed and its URL.
+I also added a classification element to it.
+So I like to group my feeds into the classes like science or documentary.
+So I can work with them separately.
+There's a table of episodes which contains the information about each episode
+that it's got from the feed.
+It contains the title of the episode, the URL of the media.
+It points to where the downloaded episode is on disk.
+And it links to the feed, obviously.
+There's a group table which contains a definition of all the groups
+that I mentioned, like comedy, music or whatever.
+These are just things I've classified.
+There's a table of players.
+And I've got a fair number of them, and I even did a show about this in 2014.
+I bought one or two more since then, actually.
+So I index all my players out of the database.
+I keep playlists in the database.
+And these are also stored on the players.
+But I'll get onto that in a minute.
+So I wanted to speak briefly about audio tags.
+Many podcasters, people generating the audio,
+they do a great job of adding metadata for their episodes.
+It's really important to do that.
+HPR goes to a lot of trouble to make sure it's got good metadata.
+And it was one of the criteria in the podcast awards
+that we were nominated for last year.
+If you don't have metadata in your episodes,
+then you tend to be downgraded as a consequence.
+Anyway, all of the players I use use Rockbox.
+And they can display metadata tags as I deem appropriate.
+So it's good to be able to see what's playing now
+and what's coming up next, which you can configure.
+And I also like to check out tags when I'm managing my episodes.
+So I can display more information on my workstation,
+for example.
+The episode I'm currently listening to has got quite long notes associated
+with it. I can display them because they're in the tags.
+However, a lot of podcast episodes these days have quite poor,
+or even nonexistent tags.
+There are quite a few feeds that I've subscribed to recently
+which don't have tags at all, which I find very, very strange.
+So when I saw this and saw tags I was not happy with,
+I wanted to write something which would improve them.
+I know there are plenty of tools out there to do that,
+but I felt I wanted something that I could build into scripts.
+So it needed to have a command line interface
+and most of these things tend to be GUI-based.
+I wrote something called fix_tags,
+which has actually been used to manage tags on HPR episodes for quite some time.
+It runs on the HPR server.
+It's available on GitHub, I've put a link to it.
+It's written in Perl, and it has some issues,
+because the modules it uses are sort of obsolescent,
+which is quite surprising, but Perl is gradually falling into a state of disrepair
+due to waning interest, unfortunately.
+I also wrote another tool based around the concept in fix_tags,
+which I could run daily to manage tags.
+This thing is called Tag Manager,
+and it works on the principle of scanning through all of the podcast episodes
+that are on disk, and it applies tag rules to them.
+So there are rules like, if there is no title tag,
+add one from the title field of the item in the feed.
+So the idea here was that some people don't bother to put a title in their metadata,
+and that bothers me a lot.
+I don't want to be seeing blank audio files popping up on my player.
+So because the feed itself needs to contain a title per enclosure,
+or at least per item, I store that away in the database
+and I store away a few other fields as well.
+And I can write rules that say, like I just said,
+if there's no title tag,
+go and look in the database where there will be a title field from the feed
+and put that in instead.
+So I came up with a rules format to do this,
+based around a well-known format of configuration file.
+There's a Perl module, which is called Config::General, which I'm using,
+which uses a format similar to what you find in an Apache configuration file.
+It's fine, but it's got quite a lot of limitations.
+So the rules I came up with tend to be rather ugly,
+because I'm trying to build a lot more into it than the format really caters for.
+I put an example of how I deal with a particular feed,
+the BBC Elements podcast, which is very good.
+It's finished now, but I think you can still download the episodes.
+It talks about all the elements in the periodic table,
+which sort of stuff I love.
+Anyway, I put the rules in there and I just do things like,
+insert a title, if there isn't one,
+if there's no comment, use the description out of the feed.
+And I also fiddle with the title to add the name "Elements" to the front of it.
+So it's quite complex, and I won't go into details of it.
+It uses Perl regular expressions to do its stuff,
+and it works fine.
+But it's ugly.
+And I'd like to rewrite it in due course.
+I'd like to come up with my own language, rules language,
+config file format, whatever you like to call it.
+But that's a project for later.
+Anyway, I write episodes to players, surprise, surprise.
+Now, I'm old school, right?
+Don't listen to very many episodes on my smartphone.
+I do have a smartphone.
+I've currently got a OnePlus One, which to me is huge.
+And I don't really want to be lugging that around to listen to stuff.
+I do occasionally do that.
+I certainly listen to podcasts in my car
+by connecting the Bluetooth to my car stereo.
+So, yeah.
+So that's independent of this,
+and I use AntennaPod on my phone to download stuff.
+However, mostly I'm listening to stuff on MP3 players.
+And I've written stuff to copy episodes to a given player.
+So, the way I work is I load up a player,
+listen to everything on it,
+and then refill it when it's finished.
+I usually write the podcast episodes in groups,
+so I might load a particular player with groups
+like business and comedy and documentary, etc, etc.
+and then listen to them in sequence.
+As episodes are written to a player,
+their status is updated in the database,
+so it's marked that they're on the given player.
+And there's a playlist which is also written to the player.
+Rockbox can work from a predefined playlist,
+so you can upload a playlist to it,
+which is in M3U format.
+You just need to be careful about the paths
+to the individual file.
+And so I upload that,
+and that's how I just tell Rockbox
+to use that playlist and off it goes.
+The way I delete stuff is that I run a script on my workstation
+whenever a new episode comes up in the list.
+I mark that episode as being listened to
+through a script that marks it in the database.
+And then when I've finished it,
+I simply run a script again to go through the list of episodes
+which are being listened to,
+because I might have several players on the go at once.
+I'm not listening to them all simultaneously,
+but when one needs charging,
+then I'd switch over to another one.
+And the deletion script looks for things in the being listened to state
+and says which of these can I delete?
+So I just say,
+I listen to that, delete it, delete it, and so on.
+So I make sure that they're actually deleted from my disk,
+a disk on the Raspberry Pi, actually,
+as soon as they have been listened to.
+I never bother to delete them off the player.
+I simply overwrite them when I next load the player up.
+So there's a bunch of other tools that I've developed
+for generating reports and so on and dealing with issues.
+And as I've sort of mentioned,
+there's a feed viewer,
+with which I can check details of a feed,
+or of a group of feeds, or of a list of feeds, or whatever.
+And it can also summarize all of the downloaded episodes
+belonging to a feed,
+and it generates reports in a variety of formats.
+I used it to generate the notes for two HPR shows,
+which I referred to here,
+shows I did on the podcast feeds I'm listening to.
+And I was able to generate HTML at the back end of this thing.
+So as always, I tend to over-engineer it.
+But that's what amuses me.
+I've got a tool for subscribing to a new feed,
+not too surprisingly,
+and that's the point at which I assign it to a group,
+and then I decide,
+because the feed that I'm newly subscribing to
+will already have a whole bunch of episodes in it.
+I can at that point say,
+I want to get the latest five or ten,
+or something, or none at all.
+Just wait for new ones to come out.
+I can do that at the point at which I subscribe.
+And obviously I've got the reverse tool,
+which allows me to cancel the subscription.
+I store the feed details in an archive,
+and I add notes to that,
+as I'm deleting them to say,
+why I deleted it: this podcast is boring,
+or whatever it is.
+And that way I can always look back and say,
+oh, I did actually listen to that,
+and I hated it.
+This is why.
+I've also been known to re-subscribe to a feed
+that I've forgotten
+I'd listened to.
+And during the subscription,
+I get a prompt that says,
+hey, you've listened to this before
+and you didn't like it because of this.
+So it acts as the memory I don't always have.
+So let's get to the conclusions then.
+I have been doing this for quite a long time.
+I seem to have actually started building this stuff in 2011,
+although I started listening in 2005.
+And I kept a journal of what I was doing,
+which I tend to do with projects.
+That's a file of formatted text.
+And I noticed, as I was preparing this,
+it's got more than 8,000 lines of notes in there
+about what I've been doing.
+So it goes to prove
+I've been doing quite a lot of work on this over the years.
+So what's good about it?
+Well, it's mine.
+It's originally inspired by Bashpodder,
+but the current script is a complete rewrite.
+It works.
+It does all I want it to do.
+Now it doesn't need much effort to run and maintain.
+Along the way I've learned tons of stuff,
+I understand XML and XSLT better.
+I understand RSS and Atom feeds better.
+I know a lot more about Bash scripting, still learning.
+And by the way, quite a lot of that I learned
+in trying to hack all this stuff together,
+like using Bash to interface to a database, for example,
+which is a loony thing to do.
+Anyway, all of that I've used to make shows.
+That's it.
+Things I've discovered,
+weird things to do with Bash,
+I've done HPR shows about.
+I've learned quite a lot more about Postgres
+and databases in general.
+And I understand quite a lot more about audio tags
+and the taglib library that I used to work on tags in Perl.
+And a little bit in Python.
+So the scheme I'm using does have quite a lot of good ideas
+about how to deal with podcasts.
+I think they're good ideas anyway.
+That is, podcast feeds and episodes,
+though in many cases they're not very well implemented.
+So what's bad then?
+It's clunky, badly designed.
+It's the result of hacks on top of hacks.
+It's really an alpha version of what it should be,
+what I wanted it to be.
+And it's one of those cases where you think,
+well okay, I've learned some stuff.
+And now I'm going to throw it away and start again.
+I'm just reluctant to do that or have been.
+It's not sufficiently resilient to issues with feeds
+and the bad practices you find in feeds.
+For example, BBC have this weird habit of releasing an episode.
+I think they automate it actually.
+Then they re-release it a few days later.
+And it's often I think because somebody has checked
+what was generated by the automation,
+and found that it's truncated a bit
+or it's left some junk at the start or something.
+And they edit it and then they re-release it.
+But what they seem to do is they seem to release it
+as if it's a brand new episode.
+So they don't keep the same ID, which is what you should do.
+They re-release it with a different URL,
+which is fair enough, but in such a way that you get a duplicate.
+Now other podcatchers deal with this better than mine does.
+I think because they use additional information
+like hashtags that they have generated themselves,
+not hashtags, hashes, MD5 hashes or the like, that give a better way of identifying episodes.
+Another thing that's bad is it's not easy to extend.
+The current business of obscuring podcasts
+behind strange URLs that you then have to dig down through
+to find the actual name, that has thrown everything for a loop.
+Whereas a lot of other podcatchers have dealt with this
+through better design, I think.
+And the last point is it's completely incapable of being shared.
+I'd have liked to have offered this to the world at large,
+but in its current incarnation it's absolutely not something
+anybody else would want.
+It's very much an alpha thing and it's hugely hacky.
+And, you know, idiosyncratic and strange.
+So nobody else would want it as it stands at the moment.
+So it's a mixed thing.
+Anyway, I thought I'd share some of the details of it.
+If you want to know more then ask me, but I won't be...
+I don't plan to do any more about this, because as I say,
+it's too weird and idiosyncratic.
+Okay, that's it then. Bye now.
+You've been listening to Hacker Public Radio at HackerPublicRadio.org.
+We are a community podcast network that releases shows
+every weekday Monday through Friday.
+Today's show, like all our shows, was contributed by an HPR listener
+like yourself.
+If you ever thought of recording a podcast, then click on our
+contribute link to find out how easy it really is.
+Hacker Public Radio was founded by the Digital Dog Pound
+and the Infonomicon Computer Club,
+and is part of the Binary Revolution at binrev.com.
+If you have comments on today's show,
+please email the host directly, leave a comment on the website
+or record a follow-up episode yourself.
+Unless otherwise stated, today's show is released
+under Creative Commons Attribution,
+ShareAlike 3.0 license.