Episode: 2211 Title: HPR2211: My podcast workflow Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2211/hpr2211.mp3 Transcribed: 2025-10-18 15:47:38 --- This is HPR episode 2211 entitled My Podcast Workflow. It is hosted by Dave Morris and is about 26 minutes long and carries an explicit flag. The summary is how I download, manage, listen to and delete podcasts. This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com. Hello everybody. Welcome to Hacker Public Radio. My name is Dave Morris. Today I've got an episode which I've entitled My Podcast Workflow. Probably like most people who are listening to this, I've been listening to podcasts for quite some time. In my case I started in 2005, and that was when I bought my first MP3 player. And over that time I've used various podcast downloaders, or podcatchers as people call them. A lot of them have existed and I've tried several of them. But now I use a script based on Bashpodder which I've rewritten and built up to meet my needs. I also use a database to hold details of the feeds that I subscribe to. And it also holds what episodes have been downloaded, what's on a player to be listened to, what can be deleted and all of that sort of thing. I've written scripts in Bash, Perl and Python to manage all of this. So I'm going to be describing some details of the workflow. But I'm not going to go into specific details about scripts and details of methods and so forth. I was prompted actually to put this show together in 2016. I'd heard a show produced by folky, show number 1992, How I'm handling my podcast subscriptions and listening. And this was April of 2016. I thought it was a really interesting episode and I thought I must try and write something along similar lines to that.
I think it's always interesting to hear how other people do this sort of thing. So I thought a contribution might be a good thing. But I'm embarrassed to say that I started this in April 2016, and somehow it's been lurking in the background ever since. And this is January 2017 that I'm recording this. So it's been waiting a while. I thought it would be interesting to describe what a podcast feed actually is. I'm sure most people have used them. That's how you're listening to this, most likely anyway. It's defined by an XML file and there are two main formats, which are called RSS and Atom. I've linked to details about these. I won't go into details myself as to what they mean and where they come from. If you're interested you can find out lots of information. Basically they both consist of a list of structured items, where an item is a distinct thing. And each item can contain a link to a multimedia file, a so-called enclosure. And it's the enclosure that makes it a podcast. There are other sorts of RSS and Atom feeds which are not podcast feeds, and it's the enclosure that makes it one. There's a Wikipedia article I've linked to talking about podcasts, which you might find interesting if you want to dig deeper. So the way in which a feed's intended to be used is that when something new has been released, a new podcast, a new episode in the case of HPR, the feed is updated to reflect the change. And then podcatchers are monitoring. So you might be running something that looks every hour or once a day or something like that. And it will go and look at a given feed to see if there's anything new. It usually does that by scanning through all of the enclosures, all of the items in the list, and checking to see whether it's already downloaded them. If it finds something new then it will download it, and there are all sorts of complications as to how many podcasts it'll download at a time and so on and so forth.
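To make the idea of an enclosure concrete, here is a minimal sketch in Python of pulling enclosure URLs out of both feed formats. It is an illustration only, not the script described in this episode: the function name `find_enclosures` and the sample feeds are made up, and it uses the standard-library XML parser rather than XSLT.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace; RSS 2.0 elements have none.
ATOM = "{http://www.w3.org/2005/Atom}"

def find_enclosures(feed_xml):
    """Return the enclosure URLs found in an RSS 2.0 or Atom feed (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    urls = []
    # RSS 2.0: <item><enclosure url="..."/></item>
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url")
        if url:
            urls.append(url)
    # Atom: <entry><link rel="enclosure" href="..."/></entry>
    for link in root.iter(ATOM + "link"):
        if link.get("rel") == "enclosure" and link.get("href"):
            urls.append(link.get("href"))
    return urls

# Made-up sample feeds; the second RSS item has no enclosure, so it is not an episode.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>Episode 1</title>
<enclosure url="http://example.org/ep1.mp3" length="123" type="audio/mpeg"/>
</item>
<item><title>Text-only post</title></item>
</channel></rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title>
<entry><title>Episode 2</title>
<link rel="enclosure" href="http://example.org/ep2.ogg" type="audio/ogg"/>
</entry></feed>"""

if __name__ == "__main__":
    print(find_enclosures(SAMPLE_RSS))
    print(find_enclosures(SAMPLE_ATOM))
```

Items without an enclosure are simply skipped, which is the distinction drawn above between a podcast feed and any other RSS or Atom feed.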
But the point of it is that the podcatcher keeps a record of what it's already downloaded. Now in the early days of podcasts and podcatchers, just saving the URL of the enclosure was enough, because that was pretty much a unique thing that identified that podcast. Now it's not so much the case. But I think it was always designed that there was a unique identifier associated with each enclosure, and RSS and Atom certainly contain them. So this acts as a label which can be stored to say I've seen this one already, and thereby avoid duplicate downloads. So looking at my workflow, I'm using a rewritten version of the venerable Bashpodder, which was a Bash script written by Linc Fessenden of The Linux Link Tech Show. People say he's the Linc of the Link in the title; I don't think so. Anyway, he wrote this rather elegant piece of Bash to do this job. But it has its limitations. I rewrote this. He based his around using XSLT capabilities. There's a parser which you can use, whose name I've certainly forgotten. An XSLT parser. I can't remember its name but I'm sure you'll find it. I'll make sure I put it in the notes. But what this does, it's a method of parsing an XML file. And he had written a thing called parse_enclosure.xsl, which is used to parse the enclosures out of an RSS feed. But since he did that, a lot of other types of feeds have popped up, which include Atom, and he hadn't catered for that. So I modified it to include Atom. I also added another one which I call parse_id.xsl, which is capable of parsing out the ID strings from a feed. And that's the thing I just mentioned about a unique tag per episode. I've included both of these along with my notes. And I should say, I always forget to say this, I'm sure you've worked out yourself that there are long notes that I'm currently reading, effectively. But they're there for you to refer to if you find it interesting.
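The idea of storing a per-episode identifier to avoid duplicate downloads can be sketched as follows. This is a hedged illustration in Python, not the parse_id.xsl stylesheet mentioned above: the names `item_ids` and `new_episodes` are hypothetical, the seen-ID store is just an in-memory set, and falling back to the enclosure URL when a feed has no identifier is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def item_ids(feed_xml):
    """Return (episode_id, enclosure_url) pairs for every item that has an enclosure."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):            # RSS 2.0 items
        enclosure = item.find("enclosure")
        url = enclosure.get("url") if enclosure is not None else None
        if not url:
            continue
        # Assumption: use the enclosure URL when the feed gives no <guid>.
        guid = item.findtext("guid")
        pairs.append(((guid or url).strip(), url))
    for entry in root.iter(ATOM + "entry"):   # Atom entries
        for link in entry.findall(ATOM + "link"):
            if link.get("rel") == "enclosure" and link.get("href"):
                url = link.get("href")
                entry_id = entry.findtext(ATOM + "id")
                pairs.append(((entry_id or url).strip(), url))
    return pairs

def new_episodes(feed_xml, seen_ids):
    """Return URLs not yet seen, recording their IDs in seen_ids as a side effect."""
    to_download = []
    for episode_id, url in item_ids(feed_xml):
        if episode_id not in seen_ids:
            seen_ids.add(episode_id)
            to_download.append(url)
    return to_download

# Made-up feed, then the same episode republished at a new URL with the same guid.
FEED = """<rss version="2.0"><channel><item><guid>ep-1</guid>
<enclosure url="http://example.org/ep1.mp3" type="audio/mpeg"/></item></channel></rss>"""
FEED_UPDATED = FEED.replace("ep1.mp3", "ep1-fixed.mp3")

if __name__ == "__main__":
    seen = set()
    print(new_episodes(FEED, seen))          # first poll: one new episode
    print(new_episodes(FEED_UPDATED, seen))  # same guid, so nothing new
```

Keying on the identifier rather than the URL means a changed URL alone does not trigger a second download, provided the publisher keeps the identifier stable.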
One of the drawbacks of Bashpodder, and also my version of it, is that it can't deal with feeds where the enclosure URL doesn't show the actual download. In the early days the enclosure simply consisted of a URL pointing to the audio file itself, so an MP3 or an Ogg or whatever it was. In latter times there are lots of intermediaries that serve up the audio for podcasts, and in many cases the URL that's in the enclosure doesn't actually point directly to the audio. So if you download it with something like wget or curl, then you get the end result. But if you're trying to work out things like where the file is, or what file is going to be generated as a consequence, it's very hard to do. I haven't quite got a complete solution for this, and these things are popping up more and more. Charles in NJ did a show, 1935, called Quick Bashpodder Fix, where he talks about something which is similar to, possibly the same as, this problem. Anyway, back to what I do. I run this modified Bashpodder on one of my Raspberry Pis once a day, and it runs during the night. I originally did this because I had a slow ADSL connection (it's got faster now) and I also had a download limit. And what I found was that if I ran the downloads during the day, it collided with what my kids were doing. But that's really not relevant anymore, because both my kids have gone away to uni or whatever. But I still do the same thing, downloads at two in the morning, UK time. It doesn't really matter. It downloads to a directory on the Pi. I've got a disk attached to it, and I export that directory with NFS, so I can see it from other systems in the house. So let's talk about the database briefly. I use a database to hold the feed details and also details of what I've downloaded. And the reason I did this originally was that I'm interested in databases and wanted to learn how to use them. I chose Postgres, PostgreSQL.
That's the way it's written. I chose it because it's very feature rich and powerful. At the time I first started using it, it was vastly more powerful than MySQL. It still is quite a lot more so, but MySQL has caught up a bit. And I was using Postgres at work around the time I started doing this, in 2005 or so. So it was useful to have a home project as well. I wanted to be able to generate all sorts of reports from the database and to perform actions based on its content. So the way I've set it up is that the database runs on my workstation, which is a thing I turn off at night, rather than running on the server. That's maybe a decision I want to review in due course. The design, as I've said in my notes, is sort of bolted on. It's a bolted-on database. You know, it's not integrated properly. The Bashpodder clone downloads podcasts every day and stores them in a directory. It does it based on the date, so every day you get a new directory containing today's downloads. The original model was that a playlist would be generated for each day. I don't do that anymore. So what I do is I use a thing that scans what's been downloaded and puts data into the database. I've said that really, if you were going to do something like this, it would be better to have the database and the podcatcher fully integrated. I didn't do this because I started off with the original Bashpodder and added the database on as an add-on, as a bolt-on. But it would be wiser to do it that way. So I have a thing that runs every morning that looks at the night's downloads, as I said, and it updates the database. And I want to eventually integrate the two. In the database, I have a bunch of tables. There are more than I've listed here and I'm not going into detail. There's a feed table that contains all the feeds, with things like the title of the feed and its URL. I also added a classification element to it. So I like to group my feeds into classes like science or documentary.
That way I can work with them separately. There's a table of episodes which contains the information about each episode that it's got from the feed. It contains the title of the episode and the URL of the media. It points to where the downloaded episode is on disk. And it links to the feed, obviously. There's a group table which contains a definition of all the groups that I mentioned, like comedy, music or whatever. These are just things I've classified. There's a table of players. I've got a fair number of them, and I even did a show about this in 2014. I've bought one or two more since then, actually. So I index all my players in the database. I keep playlists in the database, and these are also stored on the players. But I'll get onto that in a minute. So I wanted to speak briefly about audio tags. Many podcasters, the people generating the audio, do a great job of adding metadata for their episodes. It's really important to do that. HPR goes to a lot of trouble to make sure it's got good metadata. And it was one of the criteria in the Podcast Awards that we were nominated for last year. If you don't have metadata in your episodes, then you tend to be downgraded as a consequence. Anyway, all of the players I use run Rockbox, and they can display metadata tags as I deem appropriate. So it's good to be able to see what's playing now and what's coming up next, which you can configure. And I also like to check out tags when I'm managing my episodes, so I can display more information on my workstation, for example. The episode I'm currently listening to has got quite long notes associated with it. I can display them because they're in the tags. However, a lot of podcast episodes these days have quite poor, or even nonexistent, tags. There are quite a few feeds I've subscribed to recently which don't have tags at all, which I find very, very strange.
So when I saw this, and saw tags that I was not happy with, I wanted to write something which would improve them. I know there are plenty of tools out there to do that, but I felt I wanted something that I could build into scripts. So it needed to have a command line interface, and most of these things tend to be GUI-based. I wrote something called fix_tags, which has actually been used to manage tags on HPR episodes for quite some time. It runs on the HPR server. It's available on GitHub; I've put a link to it. It's written in Perl, and it has some issues, because the modules it uses are sort of obsolescent, which is quite surprising, but Perl is gradually falling into a state of disrepair due to waning interest, unfortunately. I also wrote another tool based around the concept in fix_tags, which I could run daily to manage tags. This thing is called tag_manager, and it works on the principle of scanning through all of the podcast episodes that are on disk and applying tag rules to them. So there are rules like: if there is no title tag, add one from the title field of the item in the feed. The idea here was that some people don't bother to put a title in their metadata, and that bothers me a lot. I don't want to be seeing blank audio files popping up on my player. So because the feed itself needs to contain a title per enclosure, or at least per item, I store that away in the database, and I store away a few other fields as well. And I can write rules that say, like I just said, if there's no title tag, go and look in the database, where there will be a title field from the feed, and put that in instead. So I came up with a rules format to do this, based around a well-known format of configuration file. There's a Perl module called Config::General, which I'm using, which uses a format similar to what you find in an Apache configuration file. It's fine, but it's got quite a lot of limitations.
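The rule "if there is no title tag, take the title stored from the feed" can be sketched like this. It is an illustration in Python with plain dictionaries standing in for the audio tags and the database row; it does not reproduce the Perl tool or its Config::General rule syntax, and the names `apply_rules` and `RULES` are hypothetical.

```python
def apply_rules(tags, feed_fields, rules):
    """Return a copy of tags with empty or missing tags filled from feed_fields.

    tags        -- tag name -> value, as read from the audio file
    feed_fields -- field name -> value, as stored from the feed item
    rules       -- list of (tag_name, feed_field) fallbacks, applied in order
    """
    fixed = dict(tags)
    for tag_name, feed_field in rules:
        current = fixed.get(tag_name) or ""
        # Only fill in a tag that is absent or blank; never overwrite a real value.
        if not current.strip() and feed_fields.get(feed_field):
            fixed[tag_name] = feed_fields[feed_field]
    return fixed

# Hypothetical rule set: missing title comes from the feed title,
# missing comment comes from the feed description.
RULES = [("title", "title"), ("comment", "description")]

if __name__ == "__main__":
    file_tags = {"title": "", "artist": "Example Broadcaster"}
    from_feed = {"title": "Hydrogen", "description": "All about hydrogen"}
    print(apply_rules(file_tags, from_feed, RULES))
```

Keeping the rules as data rather than code is what allows a different rule set per feed, which is the role the configuration file plays in the description above.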
So the rules I came up with tend to be rather ugly, because I'm trying to build a lot more into it than the format really caters for. I've put in an example of how I deal with a particular feed, the BBC Elements podcast, which is very good. It's finished now, but I think you can still download the episodes. It talks about all the elements in the periodic table, which is the sort of stuff I love. Anyway, I put the rules in there and I just do things like insert a title if there isn't one, and if there's no comment, use the description out of the feed. And I also fiddle with the title to add the name Elements to the front of it. So it's quite complex, and I won't go into details of it. It uses Perl regular expressions to do its stuff, and it works fine. But it's ugly. And I'd like to rewrite it in due course. I'd like to come up with my own rules language, config file format, whatever you like to call it. But that's a project for later. Anyway, I write episodes to players, surprise, surprise. Now, I'm old school, right? I don't listen to very many episodes on my smartphone. I do have a smartphone. I've currently got a OnePlus One, which to me is huge, and I don't really want to be lugging that around to listen to stuff. I do occasionally do that. I certainly listen to podcasts in my car by connecting it over Bluetooth to my car stereo. So that's independent of this, and I use AntennaPod on my phone to download stuff. However, mostly I'm listening to stuff on MP3 players. And I've written stuff to copy episodes to a given player. So the way I work is I load up a player, listen to everything on it, and then refill it when it's finished. I usually write the podcast episodes in groups, so I might load a particular player with groups like business and comedy and documentary, et cetera, and then listen to them in sequence. As episodes are written to a player, their status is updated in the database, so it's marked that they're on the given player.
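Loading a player by group and recording that in the database might look roughly like the sketch below. Everything here is an assumption made for illustration: SQLite (from the Python standard library) stands in for PostgreSQL, and the table names, column names and status values are invented, not taken from the real schema.

```python
import sqlite3

# Hypothetical, much-reduced schema: feeds carry a group, episodes carry a status.
SCHEMA = """
CREATE TABLE feeds (id INTEGER PRIMARY KEY, title TEXT, url TEXT, grp TEXT);
CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE episodes (
    id INTEGER PRIMARY KEY,
    feed_id INTEGER REFERENCES feeds(id),
    title TEXT,
    path TEXT,
    status TEXT DEFAULT 'downloaded',
    player_id INTEGER REFERENCES players(id)
);
"""

def load_player(db, player_id, groups):
    """Mark downloaded episodes in the given groups as on the player; return their paths."""
    placeholders = ",".join("?" for _ in groups)
    rows = db.execute(
        "SELECT e.id, e.path FROM episodes e JOIN feeds f ON f.id = e.feed_id "
        f"WHERE e.status = 'downloaded' AND f.grp IN ({placeholders}) "
        "ORDER BY f.grp, e.id",
        list(groups),
    ).fetchall()
    db.executemany(
        "UPDATE episodes SET status = 'on_player', player_id = ? WHERE id = ?",
        [(player_id, episode_id) for episode_id, _ in rows],
    )
    db.commit()
    return [path for _, path in rows]

def demo_db():
    """Build an in-memory database with made-up rows for the example."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO feeds VALUES (1, 'Elements', 'http://example.org/a.xml', 'science')")
    db.execute("INSERT INTO feeds VALUES (2, 'Laughs', 'http://example.org/b.xml', 'comedy')")
    db.execute("INSERT INTO players VALUES (1, 'player-1')")
    db.execute("INSERT INTO episodes (id, feed_id, title, path) VALUES (1, 1, 'Hydrogen', '/pods/hydrogen.mp3')")
    db.execute("INSERT INTO episodes (id, feed_id, title, path) VALUES (2, 2, 'Jokes', '/pods/jokes.mp3')")
    db.commit()
    return db

if __name__ == "__main__":
    db = demo_db()
    print(load_player(db, 1, ["science"]))
    print(db.execute("SELECT id, status, player_id FROM episodes").fetchall())
```

The returned list of paths is what a copy step would then transfer to the player; only the episodes that matched the requested groups change status.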
And there's a playlist which is also written to the player. Rockbox can work from a predefined playlist, so you can upload a playlist to it, which is in M3U format. You just need to be careful about the paths to the individual files. So I upload that, and I just tell Rockbox to use that playlist and off it goes. The way I delete stuff is that I run a script on my workstation whenever a new episode comes up in the list. I mark that episode as being listened to, through a script that marks it in the database. And then when I've finished it, I simply run a script again to go through the list of episodes which are being listened to, because I might have several players on the go at once. I'm not listening to them all simultaneously, but when one needs charging, then I'd switch over to another one. And the deletion script looks for things in the being-listened-to state and asks which of these can I delete. So I just say, I've listened to that, delete it, delete it, and so on. So I make sure that they're actually deleted from my disk, a disk on the Raspberry Pi actually, as soon as they have been listened to. I've never bothered to delete them off the player. I simply overwrite them when I next load the player up. So there's a bunch of other tools that I've developed for generating reports and so on and dealing with issues. And as I've sort of mentioned, there's a feed viewer, with which I can check details of a feed, or of a group of feeds, or of a list of feeds, or whatever. And it can also summarize all of the downloaded episodes belonging to a feed, and it generates reports in a variety of formats. I used it to generate the notes for two HPR shows, which I've referred to here, shows I did on the podcast feeds I'm listening to. And I was able to generate HTML at the back end of this thing. So as always, I tend to over-engineer it. But that's what amuses me.
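On the point made earlier about being careful with paths in the M3U playlist: a plain M3U file is just one file path per line, and the paths have to make sense on the player rather than on the machine that holds the downloads. The following is a minimal sketch of that path rewriting in Python; the function name `make_m3u` and the directory names are assumptions for illustration, not the real layout.

```python
from pathlib import PurePosixPath

def make_m3u(local_paths, local_root, player_root):
    """Return plain M3U text with each path moved from local_root to player_root."""
    lines = []
    for path in local_paths:
        # Strip the local download directory, keep the dated subdirectory and file name.
        relative = PurePosixPath(path).relative_to(local_root)
        lines.append(str(PurePosixPath(player_root) / relative))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical locations: /pods on the download machine, /podcasts on the player.
    episodes = ["/pods/2017-01-10/ep1.mp3", "/pods/2017-01-11/ep2.ogg"]
    print(make_m3u(episodes, "/pods", "/podcasts"), end="")
```

`relative_to` raises `ValueError` for a file outside the download directory, which is a reasonable failure mode here since such a file would not have been copied to the player.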
I've got a tool for subscribing to a new feed, not too surprisingly, and that's the point at which I assign it to a group. And then, because the feed that I'm newly subscribing to will already have a whole bunch of episodes in it, I can at that point say I want to get the latest five or ten or something, or none at all and just wait for new ones to come out. I can do that at the point at which I subscribe. And obviously I've got the reverse tool, which allows me to cancel the subscription. I store the feed details in an archive, and I add notes to that as I'm deleting them to say why: I deleted this because the podcast is boring, or whatever it is. And that way I can always look back and say, oh, I did actually listen to that, and I hated it, and this is why. I've also been known to re-subscribe to a feed that I'd forgotten I'd listened to. And during the subscription, I get a prompt that says, hey, you've listened to this before and you didn't like it because of this. So it acts as the memory I don't always have. So let's get to the conclusions then. I have been doing this for quite a long time. I seem to have actually started building this stuff in 2011, although I started listening in 2005. And I kept a journal of what I was doing, which I tend to do with projects. That's a file of formatted text. And I noticed, as I was preparing this, it's got more than 8,000 lines of notes in there about what I've been doing. So it goes to prove I've been doing quite a lot of work on this over the years. So what's good about it? Well, it's mine. It was originally inspired by Bashpodder, but the current script is a complete rewrite. It works. It does all I want it to do. Now it doesn't need much effort to run and maintain. Along the way I've learned tons of stuff. I understand XML and XSLT better. I understand RSS and Atom feeds better. I know a lot more about Bash scripting, and I'm still learning.
And by the way, quite a lot of that I learned in trying to hack all this stuff together, like using Bash to interface to a database, for example, which is a loony thing to do. Anyway, all of that I've used to make shows: things I've discovered, weird things to do with Bash, that I've done HPR shows about. I've learned quite a lot more about Postgres and databases in general. And I understand quite a lot more about audio tags and the TagLib library that I use to work on tags in Perl, and a little bit in Python. So the scheme I'm using does have quite a lot of good ideas about how to deal with podcast feeds and episodes (I think they're good ideas, anyway), though in many cases they're not very well implemented. So what's bad then? It's clunky, badly designed. It's the result of hacks on top of hacks. It's really an alpha version of what it should be, what I wanted it to be. And it's one of those cases where you think, well okay, I've learned some stuff, and now I'm going to throw it away and start again. I'm just reluctant to do that, or have been. It's not sufficiently resilient to issues with feeds and the bad practices you find in feeds. For example, the BBC have this weird habit of releasing an episode (I think they automate it, actually) and then re-releasing it a few days later. And it's often, I think, because somebody has checked what was generated by the automation and found that it's truncated a bit, or it's left some junk at the start or something. And they edit it and then they re-release it. But what they seem to do is release it as if it's a brand new episode. So they don't keep the same ID, which is what you should do. They re-release it with a different URL, which is fair enough, but in such a way that you get a duplicate. Now other podcatchers deal with this better than mine does.
I think that's because they use additional information like hashes that they have generated themselves (not hashtags, hashes: MD5 hashes or the like) that give a better way of identifying episodes. Another thing that's bad is that it's not easy to extend. The current business of obscuring podcasts behind strange URLs, that you then have to dig down through to find the actual name, has thrown everything for a loop, whereas a lot of other podcatchers have dealt with this through better design, I think. And the last point is it's completely incapable of being shared. I'd have liked to have offered this to the world at large, but in its current incarnation it's absolutely not something anybody else would want. It's very much an alpha thing and it's hugely hacky, and, you know, idiosyncratic and strange. So nobody else would want it as it stands at the moment. So it's a mixed thing. Anyway, I thought I'd share some of the details of it. If you want to know more then ask me, but I don't plan to do any more about this, because as I say, it's too weird and idiosyncratic. Okay, that's it then. Bye now. You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the Binary Revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.