<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="generator" content="pandoc">
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
  <meta name="author" content="Dave Morriss">
  <title>My podcast workflow (HPR Show 2211)</title>
  <style type="text/css">code{white-space: pre;}</style>
  <!--[if lt IE 9]>
    <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
  <![endif]-->
  <link rel="stylesheet" href="http://hackerpublicradio.org/css/hpr.css">
</head>

<body id="home">
<div id="container" class="shadow">
<header>
<h1 class="title">My podcast workflow (HPR Show 2211)</h1>
<h2 class="author">Dave Morriss</h2>
<hr/>
</header>

<main id="maincontent">
<article>
<header>
<h1>Table of Contents</h1>
<nav id="TOC">
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#podcast-feeds">Podcast Feeds</a></li>
<li><a href="#workflow">Workflow</a><ul>
<li><a href="#bashpodder">Bashpodder</a></li>
<li><a href="#database">Database</a></li>
<li><a href="#audio-tags">Audio tags</a></li>
<li><a href="#writing-episodes-to-a-player">Writing episodes to a player</a></li>
<li><a href="#deleting-what-ive-listened-to">Deleting what I’ve listened to</a></li>
<li><a href="#other-tools">Other tools</a></li>
</ul></li>
<li><a href="#conclusions">Conclusions</a><ul>
<li><a href="#whats-good-about-this-scheme">What’s good about this scheme?</a></li>
<li><a href="#whats-bad">What’s bad?</a></li>
</ul></li>
<li><a href="#links">Links</a></li>
</ul>
</nav>
</header>
<h2 id="overview">Overview</h2>
<p>I have been listening to podcasts for many years. I started in 2005, when I bought my first MP3 player.</p>
<p>Various <a href="https://en.wikipedia.org/wiki/List_of_podcatchers" title="List of podcatchers">podcast downloaders</a> (or <em>podcatchers</em>) have existed over this time, some of which I have tried. Now I use a script based on <a href="http://lincgeek.org/bashpodder/" title="Bashpodder">Bashpodder</a>, which I have built to meet my needs. I also use a database to hold details of the feeds I subscribe to, what episodes have been downloaded, what is on a player to be listened to and what can be deleted. I have written many scripts (in Bash, Perl and Python) to manage all of this, and I will be describing the overall workflow in this episode without going into too much detail.</p>
<p>I was prompted to put together this show by <a href="http://hackerpublicradio.org/correspondents.php?hostid=309" title="folky">folky’s</a> HPR episode <a href="http://hackerpublicradio.org/eps/hpr1992" title="How I&#39;m handling my podcast-subscriptions and -listening">1992 “<em>How I’m handling my podcast-subscriptions and -listening</em>”</a> released on 2016-03-22. Thanks to him for a very interesting episode.</p>
<p><small><strong>Note:</strong> I’m embarrassed to say that I started this episode in April 2016 and somehow forgot all about it until January 2017!</small></p>
<h2 id="podcast-feeds">Podcast Feeds</h2>
<p>A <a href="https://en.wikipedia.org/wiki/Podcast" title="Podcast">podcast</a> feed is defined by an XML file, using one of two main formats. These formats are called <a href="https://en.wikipedia.org/wiki/RSS" title="RSS">RSS</a> and <a href="https://en.wikipedia.org/wiki/Atom_(standard)" title="Atom">Atom</a>. Both formats basically consist of a list of structured items each of which can contain a link to a multimedia file or “enclosure”. It’s the enclosure that makes it a podcast as opposed to other sorts of feeds - see the <a href="https://en.wikipedia.org/wiki/Podcast" title="Podcast">Wikipedia article on the subject</a>.</p>
<p>The way in which the feed is intended to be used is that when new material is released on the site, the feed is updated to reflect the change. Then <em>podcatchers</em> can monitor the feed for changes and take action when an update is detected. The relevant action with a podcast feed is that the enclosures in the feed are downloaded, and the podcatcher maintains a local list of what has already been downloaded.</p>
<p>The structure of an RSS or Atom feed allows a unique identifier to be associated with each enclosure, and this is intended to act as a label for that enclosure to make it easier to avoid duplicates.</p>
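<p>As a rough illustration of the structure just described, here is a minimal Python sketch that pulls the identifier and enclosure URL out of a feed. It is not the XSLT used in this workflow; the function names, the sample feed and the fallback of using the URL when no identifier is present are all assumptions made for the example.</p>

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def list_enclosures(root):
    """Return (identifier, enclosure URL) pairs for an RSS or Atom feed."""
    found = []
    # RSS 2.0: each item may carry a guid and an enclosure element
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            url = enclosure.get("url")
            # Assumption: fall back to the URL when the feed has no guid
            found.append((item.findtext("guid", default=url), url))
    # Atom: each entry has an id, and enclosures are link elements
    for entry in root.iter(ATOM + "entry"):
        for link in entry.iter(ATOM + "link"):
            if link.get("rel") == "enclosure":
                url = link.get("href")
                found.append((entry.findtext(ATOM + "id", default=url), url))
    return found


def sample_rss():
    """Build a tiny made-up RSS feed containing one episode."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = "Example episode"
    ET.SubElement(item, "guid").text = "example-guid-0001"
    ET.SubElement(item, "enclosure", url="http://example.org/ep1.mp3",
                  type="audio/mpeg")
    return rss


print(list_enclosures(sample_rss()))
```

<p>A real feed would be loaded with <code>ET.parse</code> rather than built in code; the sample is constructed here only to keep the sketch self-contained.</p>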
<h2 id="workflow">Workflow</h2>
<h3 id="bashpodder">Bashpodder</h3>
<p>I use a rewritten version of <a href="http://lincgeek.org/bashpodder/" title="Bashpodder">Bashpodder</a> to download my podcasts. I have modified the original design in two main ways:</p>
<ol type="1">
<li>I enhanced the XSLT file (<a href="hpr2211_parse_enclosure.xsl" title="parse_enclosure.xsl"><code>parse_enclosure.xsl</code></a>) used for parsing the feed (using <a href="http://xmlsoft.org/XSLT/xsltproc.html" title="xsltproc"><code>xsltproc</code></a><a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>) so that it can handle feeds using Atom as well as RSS. The original only handled RSS.</li>
<li>I made it keep a file of ID strings from the feeds to help determine which episodes have already been downloaded. The original only kept the episode URLs, which was fine at the time but is not enough in these days of idiosyncratic feeds. My XSLT file for extracting the IDs is called <a href="hpr2211_parse_id.xsl" title="parse_id.xsl"><code>parse_id.xsl</code></a>.</li>
</ol>
<p>My Bashpodder clone cannot deal with feeds where the enclosure URL does not show the actual download URL. I am working on a solution to this but haven’t got a good one yet. <a href="http://hackerpublicradio.org/correspondents.php?hostid=229" title="Charles in NJ">Charles in NJ</a> mentions a fix for a similar (or maybe the same) problem in his show <a href="http://hackerpublicradio.org/eps/hpr1935" title="Quick Bashpodder Fix">1935 “<em>Quick Bashpodder Fix</em>”</a>.</p>
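<p>The idea of keeping a file of ID strings can be sketched in a few lines of Python. This is only an illustration under assumptions: the real script is written in Bash, and the function name, the one-ID-per-line file layout and the sample IDs are invented here. A real podcatcher would also want to record an ID only after the download has succeeded.</p>

```python
from pathlib import Path
import tempfile


def new_episodes(candidates, id_log):
    """Return the URLs not seen before and record their IDs in id_log.

    candidates is a list of (identifier, url) pairs taken from a feed;
    id_log is the path of a text file holding one identifier per line.
    """
    log = Path(id_log)
    seen = set(log.read_text().splitlines()) if log.exists() else set()
    wanted = []
    with log.open("a") as handle:
        for ident, url in candidates:
            if ident in seen:
                continue  # already downloaded, even if the URL has changed
            wanted.append(url)
            seen.add(ident)
            handle.write(ident + "\n")
    return wanted


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "id.log"
    feed = [("guid-1", "http://example.org/a.mp3")]
    print(new_episodes(feed, log_file))
    # The same ID offered under a different URL is treated as a duplicate
    moved = [("guid-1", "http://cdn.example.org/a.mp3")]
    print(new_episodes(moved, log_file))
```

<p>Comparing IDs rather than URLs is what lets the second call return nothing, even though the episode URL has changed.</p>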
<p>I run this script on one of my Raspberry Pis once a day during the night. This was originally done because I had a slow ADSL connection which was being quite heavily used by my kids during the day. The Pi in question places the downloads in a directory which I export with NFS and mount on other machines.</p>
<h3 id="database">Database</h3>
<p>As I have already said, I use a database to hold the details of my feeds and downloads. This came about for several reasons:</p>
<ul>
<li>I’m interested in databases and wanted to learn how to use them.</li>
<li>I chose <em>PostgreSQL</em> because it is very feature-rich and flexible, and at the time I was using it at work.</li>
<li>I wanted to be able to generate all sorts of reports and perform all kinds of actions based on the contents of the database.</li>
</ul>
<p>The database runs on my workstation rather than on the server.</p>
<p>As far as design is concerned, I “<em>bolted on</em>” the database to the existing Bashpodder model where podcasts are downloaded and stored in a directory according to the date. Playlists were generated by the original Bashpodder for each day’s episodes, and I have continued to do this until fairly recently.</p>
<p>Really, if using a database in this way, it would be better to integrate the podcatcher with it. However, I didn’t do this because of the way it evolved.</p>
<p>As a result I have scripts which I run each morning whose job it is to look at the night’s downloads and update the database with their details. The long-term plan is to write a whole new system from scratch which integrates everything, but I don’t see this happening for a while.</p>
<p>In my database I have the following main tables:</p>
<dl>
<dt><code>feeds</code></dt>
<dd><p>Contains the feed details like its title and URL. It also classifies each feed into a group like <em>science</em> or <em>documentary</em>.</p>
</dd>
<dt><code>episodes</code></dt>
<dd><p>Contains the items within the feeds with information like the title, the URL of the media, where the downloaded episode is and the feed the episode belongs to.</p>
</dd>
<dt><code>groups</code></dt>
<dd><p>This table contains the groups I have defined, like <em>comedy</em> and <em>music</em>. This is just my personal classification.</p>
</dd>
<dt><code>players</code></dt>
<dd><p>The database has a list of all the players I own. I did <a href="http://hackerpublicradio.org/eps/hpr1656" title="My audio player collection">a show about this</a> in 2014.</p>
</dd>
<dt><code>playlists</code></dt>
<dd><p>I make my own playlists for each player, and these are stored in the database (and on the player).</p>
</dd>
</dl>
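<p>To make the relationships between these tables more concrete, here is a hedged sketch of how they might be laid out. It uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL so that it runs without a server, and every column name, the <code>status</code> field and the sample rows are assumptions; the real schema is not shown in these notes.</p>

```python
import sqlite3

SCHEMA = """
CREATE TABLE "groups" (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE feeds (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    url      TEXT NOT NULL UNIQUE,
    group_id INTEGER REFERENCES "groups"(id)
);
CREATE TABLE episodes (
    id        INTEGER PRIMARY KEY,
    feed_id   INTEGER NOT NULL REFERENCES feeds(id),
    title     TEXT,
    media_url TEXT,
    path      TEXT,
    status    TEXT DEFAULT 'downloaded'
);
CREATE TABLE players (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE playlists (
    player_id  INTEGER NOT NULL REFERENCES players(id),
    episode_id INTEGER NOT NULL REFERENCES episodes(id),
    position   INTEGER NOT NULL,
    PRIMARY KEY (player_id, position)
);
"""


def episodes_per_group(conn):
    """Report how many episodes are held for each group."""
    query = """
        SELECT g.name, COUNT(e.id)
        FROM "groups" g
        JOIN feeds f ON f.group_id = g.id
        JOIN episodes e ON e.feed_id = f.id
        GROUP BY g.name
        ORDER BY g.name
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute('INSERT INTO "groups" (id, name) VALUES (?, ?)', (1, "science"))
conn.execute(
    "INSERT INTO feeds (id, title, url, group_id) VALUES (?, ?, ?, ?)",
    (1, "Example feed", "http://example.org/feed.xml", 1),
)
conn.executemany(
    "INSERT INTO episodes (feed_id, title) VALUES (?, ?)",
    [(1, "Episode one"), (1, "Episode two")],
)
print(episodes_per_group(conn))
```

<p>The small report at the end is the kind of query a database makes easy: it counts the stored episodes in each group by joining the three tables.</p>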
<h3 id="audio-tags">Audio tags</h3>
<p>Many podcasters generate excellent metadata for their episodes. All of the players I use on a regular basis run Rockbox, and it can display the metadata <em>tags</em> which helps me to work out what I’m listening to and what’s coming next. I also like to look at tags when I’m dealing with podcast episodes on my workstation, so I reckon having good quality metadata is important.</p>
<p>Because a number of podcast episodes have poor or even non-existent tags I wanted to write tools to improve them. I originally wrote a tool called <a href="https://github.com/davmo/fix_tags" title="The fix_tags script on GitHub"><code>fix_tags</code></a>, which has been used on the HPR server for several years, and is available on GitHub. I also wrote a tag management tool for daily use.</p>
<p>The daily tool is called <code>tag_manager</code> and it scans all of the podcast episodes I currently have on disk and applies tag rules to them. Rules are things like: “if there is no title tag, add one from the title field of the item in the feed”. I also do things like add a prefix to the title in some cases, such as adding ‘HPR’ to all HPR episodes so it’s easier to identify them individually in a list.</p>
<p>The rules are written in a format which is really ugly, but it works. I have plans to develop my own rule “language” at some point.</p>
<p>Here’s the rule for the BBC “<em>Elements</em>” podcast:</p>
<pre class="cfg"><code>&lt;rule &quot;Elements&quot;&gt;
    genre = $default_genre
    year = &quot;\&quot;.(defined(\$ep_year) ? \$ep_year : \$fileyear).\&quot;&quot;
    album = &quot;Elements&quot;
    comment = &quot;\&quot;.clean_string(\$comment).\&quot;&quot;
    # If no title, use the enclosure title
    &lt;regex &quot;^\s*$&quot;&gt;
        match = title
        title = &quot;\$ep_title&quot;
    &lt;/regex&gt;
    # If no comment, use the enclosure description
    &lt;regex &quot;^\s*$&quot;&gt;
        match = comment
        comment = &quot;\$ep_description&quot;
    &lt;/regex&gt;
    # Add &#39;Elements:&#39; to the front of the title if it&#39;s not there
    &lt;regex &quot;^(?!Elements: )(\S.+)$&quot;&gt;
        match = title
        title = &quot;Elements: \$1&quot;
    &lt;/regex&gt;
&lt;/rule&gt;</code></pre>
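<p>For readers who find the rule format hard to follow, the three regex blocks above can be paraphrased in Python. This is a sketch of what the rule appears to do, not the code inside <code>tag_manager</code>; the function name and the use of a plain dictionary for the tags are assumptions, and the <code>genre</code>, <code>year</code> and comment-cleaning lines are left out.</p>

```python
import re


def apply_elements_rules(tags, ep_title, ep_description):
    """Apply the three regex rules from the Elements example to a tag dict."""
    fixed = dict(tags)
    fixed["album"] = "Elements"
    # If no title, use the title from the feed item
    if re.search(r"^\s*$", fixed.get("title", "")):
        fixed["title"] = ep_title
    # If no comment, use the description from the feed item
    if re.search(r"^\s*$", fixed.get("comment", "")):
        fixed["comment"] = ep_description
    # Add 'Elements: ' to the front of the title unless it is already there
    fixed["title"] = re.sub(r"^(?!Elements: )(\S.+)$", r"Elements: \1",
                            fixed["title"])
    return fixed


print(apply_elements_rules({"title": "", "comment": "Chemistry"},
                           "Iron", "A programme about iron"))
```

<p>The negative lookahead <code>(?!Elements: )</code> is what makes the last rule safe to run repeatedly: a title that already has the prefix no longer matches, so it is not prefixed twice.</p>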
<h3 id="writing-episodes-to-a-player">Writing episodes to a player</h3>
<p>I use tools I have written to copy podcast episodes to whichever player I want to use. Normally I listen to everything on a given player then refill it after re-charging it. I usually write podcast episodes in groups, so I might load a particular player with groups like <em>business</em>, <em>comedy</em>, <em>documentary</em>, <em>environment</em>, and <em>history</em>.</p>
<p>As episodes are written their status is updated in the database and a playlist is created. The playlist is held in the database but is also written to a file on the player. Rockbox has the ability to work from pre-defined playlist files, and this is the way I organise my listening on a given player.</p>
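<p>Writing the playlist file itself is the simplest part. The sketch below assumes a plain M3U-style file with one path per line, which Rockbox can read; the function name, the file name and the paths on the player are hypothetical, and the real tools also update the database at this point.</p>

```python
from pathlib import Path
import tempfile


def write_playlist(playlist_path, episode_paths):
    """Write one episode path per line, in listening order."""
    lines = [str(path) for path in episode_paths]
    text = "".join(line + "\n" for line in lines)
    Path(playlist_path).write_text(text, encoding="utf-8")
    return len(lines)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "science.m3u"
    count = write_playlist(target, ["/podcasts/science/ep1.mp3",
                                    "/podcasts/science/ep2.ogg"])
    print(count, "episodes listed")
    print(target.read_text(encoding="utf-8"), end="")
```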
<h3 id="deleting-what-ive-listened-to">Deleting what I’ve listened to</h3>
<p>As I listen to an episode I run a script on my workstation to mark that particular episode as “being listened to”, and when I have finished a given episode I run another script to delete it. The deletion script simply looks for episodes in the “being listened to” state and asks which of these to delete.</p>
<p>This way I make sure that episodes are deleted as soon as possible after listening to them. I never explicitly delete episodes from the players; I simply over-write them when I next load a particular player.</p>
<h3 id="other-tools">Other tools</h3>
<p>A lot of other tools have been developed for viewing the status of the system, fixing problems and so forth. Some of the key tools are:</p>
<ul>
<li>A feed viewer: it summarises the feed and any downloaded episodes. It can generate reports in a variety of formats. I used it to generate the notes for two HPR shows (<a href="http://hackerpublicradio.org/eps/hpr1516" title="01 The podcasts I listen to">1516</a>, <a href="http://hackerpublicradio.org/eps/hpr1518" title="02 The podcasts I listen to">1518</a>) I did on the podcast feeds I’m subscribed to.</li>
<li>A tool for subscribing to a new feed; this is the point at which the feed is assigned to a group and where it is decided which episodes are to be initially downloaded.</li>
<li>A tool for cancelling a subscription: such feeds are held in an archive with notes about why they were cancelled - for the sake of posterity. Also, I have been known to re-subscribe to a feed I have cancelled. The subscribing script checks for it in the archive and asks if I really want to do this and why I said I wanted to cancel last time!</li>
</ul>
<h2 id="conclusions">Conclusions</h2>
<p>I have been fiddling about with this way of doing things for a long time. I seem to have started in 2011 and since that time have kept a journal associated with the project. This currently contains over 8000 lines of notes about what I have been doing, problems, solutions, etc.</p>
<h3 id="whats-good-about-this-scheme">What’s good about this scheme?</h3>
<ul>
<li>It’s pretty much all mine! I was inspired originally by Bashpodder, but the current script is a complete rewrite.</li>
<li>It works, and does pretty much all I want it to do and now needs very little effort to run and maintain.</li>
<li>Along the way I have learned <strong>tons</strong> of stuff. For example:
<ul>
<li>I understand XML and XSLT better</li>
<li>I understand RSS and Atom feeds better</li>
<li>I know a lot more about Bash scripting, though I’m still learning!</li>
<li>I have learned a fair bit more about PostgreSQL and databases in general</li>
<li>I understand a fair bit more about audio tags and the TagLib library that I use to manipulate them (both in Perl and Python)</li>
</ul></li>
<li>It does have what I think are a lot of good ideas about how to deal with podcast feeds and episodes, though these are often implemented badly in my scripts.</li>
</ul>
<h3 id="whats-bad">What’s bad?</h3>
<ul>
<li>It’s clunky and badly designed. It’s the result of hacks layered on hacks. It’s really an alpha version of what I want to implement and should be junked and completely rewritten.</li>
<li>It is not sufficiently resilient to feed issues and bad practices by feed owners. For example, the BBC have this strange habit of releasing an episode then re-releasing it a while later for reasons unknown. They make it difficult to recognise the re-release for what it is, so I sometimes get duplicates. Other podcatchers deal with this situation better than my system does.</li>
<li>It’s not easy to extend. For example, there is a current trend of “hiding” podcast episodes behind strange URLs, which have to be interrogated through layers of redirection to find the <strong>actual</strong> name of the file containing the episode. Adding an algorithm to handle this is quite challenging because of the design.</li>
<li>It’s completely incapable of being shared. I’d have liked to offer my efforts to the world, but in its current incarnation it’s absolutely not something anyone else would want.</li>
</ul>
<h2 id="links">Links</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/List_of_podcatchers">Wikipedia article entitled “<em>List of podcatchers</em>”</a></li>
<li><a href="http://hackerpublicradio.org/eps/hpr1992">HPR episode 1992 from <em>folky</em></a> entitled “<em>How I’m handling my podcast-subscriptions and -listening</em>”</li>
<li><a href="http://hackerpublicradio.org/eps/hpr1935">HPR episode 1935 from <em>Charles in NJ</em></a> entitled “<em>Quick Bashpodder Fix</em>”</li>
<li><a href="https://en.wikipedia.org/wiki/Podcast">Wikipedia article on the <em>Podcast</em></a></li>
<li><a href="https://en.wikipedia.org/wiki/RSS">Wikipedia article on <em>RSS</em></a></li>
<li><a href="https://en.wikipedia.org/wiki/Atom_(standard)">Wikipedia article on the <em>Atom</em> standard</a></li>
<li><a href="http://lincgeek.org/bashpodder/">Bashpodder</a></li>
<li><a href="http://hackerpublicradio.org/eps/hpr1656">HPR episode 1656</a> entitled “My audio player collection”</li>
<li>The <a href="https://github.com/davmo/fix_tags">fix_tags</a> script on GitHub</li>
<li>Information about <a href="http://xmlsoft.org/XSLT/xsltproc.html">xsltproc</a></li>
<li>Resources:
<ul>
<li>My version of <a href="hpr2211_parse_enclosure.xsl">parse_enclosure.xsl</a></li>
<li>My ID parser <a href="hpr2211_parse_id.xsl">parse_id.xsl</a></li>
<li><a href="hpr2211_components.html">List of all components</a> of the system described in this episode</li>
</ul></li>
</ul>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>I had forgotten the name of the parsing tool <code>xsltproc</code> when recording the audio, so added it in the notes.<a href="#fnref1">↩</a></p></li>
</ol>
</section>
</article>
</main>
</div>
</body>
</html>