Episode: 1362
Title: HPR1362: Fixing a bad RSS feed
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1362/hpr1362.mp3
Transcribed: 2025-10-18 00:14:13

---
Hello, this is Dave Morris. Today I want to tell you about some experience I've had with RSS feeds that are a little bit bent. This hinges around the fact that I have written my own podcast management system, which I felt I needed because none of the available systems did exactly what I wanted. An old story, I know. My system is based on BashPodder from Linc Fessenden of TLLTS. BashPodder is a single Bash script which reads a file of feeds (RSS or Atom feed URLs, that is) and parses them using XSLT to get the enclosure URLs. It filters these against a history file, which contains a list of previously downloaded enclosures, and then it downloads any that are new. This method of doing things is pretty tolerant of badly formed feeds, because XSLT is pretty tolerant and is just pulling out a particular element of the RSS.
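
(BashPodder itself is a Bash script driving xsltproc; purely as an illustration of that fetch-filter-download cycle, here is a rough sketch in Perl. The file names are made up, and using XML::RSS in place of XSLT is a substitution for the sake of the example.)

    #!/usr/bin/perl
    # Sketch of a BashPodder-style fetch-filter-download cycle (not the real script)
    use strict;
    use warnings;
    use LWP::Simple qw(get getstore);
    use XML::RSS;

    my $feed_list = 'podcast.conf';     # hypothetical list of feed URLs, one per line
    my $history   = 'podcast.history';  # hypothetical list of already-downloaded URLs

    # Load the history file into a hash for quick "seen it before?" lookups
    my %seen;
    if (open(my $hfh, '<', $history)) {
        chomp(my @lines = <$hfh>);
        @seen{@lines} = ();
        close($hfh);
    }

    open(my $ffh, '<', $feed_list) or die "Can't open $feed_list: $!";
    open(my $out, '>>', $history)  or die "Can't open $history: $!";
    while (my $url = <$ffh>) {
        chomp $url;
        my $content = get($url) or next;         # skip feeds that fail to download
        my $rss = XML::RSS->new;
        eval { $rss->parse($content) } or next;  # skip feeds that fail to parse
        for my $item (@{ $rss->{items} }) {
            my $enc = $item->{enclosure} or next;
            next if exists $seen{ $enc->{url} };     # already downloaded
            (my $file = $enc->{url}) =~ s{^.*/}{};   # crude basename
            getstore($enc->{url}, $file);            # download the new episode
            print {$out} "$enc->{url}\n";            # record it in the history
            $seen{ $enc->{url} } = ();
        }
    }
    close($ffh);
    close($out);
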
So I built a thing on top of this which uses a Postgres database to keep a lot of information about the feeds I'm subscribed to and about what's been downloaded. I use this to keep track of what I've listened to, what's on which player (I've got several players) and that sort of thing. I use it to delete episodes as I listen to them, I generate reports, and all manner of weird stuff. An example of the weirdness is that it tells me that I'm currently subscribed to 84 feeds (I do listen to a lot of podcasts) and that I've got 98 episodes to listen to, which adds up to 3 days, 10 hours, 46 minutes and 35 seconds of listening time. So that should keep me quite busy.

The database contains the results of parsing the feeds in quite a lot of detail, and this parsing phase is intolerant of badly formed feeds. In the past year or so I've noticed a number of badly structured RSS feeds (Atom's not a problem; it's RSS that's the issue), and these cause the script that does the parsing, which uses Perl and a particular parsing module, to fail.

Now, having heard this far, you might wonder why I parse my subscribed feeds twice. Well, it's one of those systems that's just grown over the years. It's a bit reminiscent of some of the Wallace and Gromit films, I always think; some hare-brained system like that. Anyway, it's a bit idiosyncratic, but more specifically it's because I run BashPodder (my version of it) overnight on my server, which is running 24/7, and then I do the post-processing part on my workstation, where the database lives. Yeah, I know, and I'm going to rewrite this stuff.
So let me tell you a bit about a couple of the feeds that have given me trouble. The first one is the Mintcast feed. I've been listening to the Mintcast for a while, and during the time that I've been a subscriber they've had a number of problems with their feed; two real problems. One was that they were suffering from duplication of episodes within the feed, sometimes with different URLs, which is quite puzzling; this causes BashPodder to download the different versions. They also suffered from a mixing of MP3 and Ogg episodes in the same feed: even though they offered two different feeds, they were getting crosstalk between the two. Well, I emailed the Mintcast guys, asking what was going on and offering any help I could give (which wasn't any, in fact). They told me that the problem was a WordPress bug which, as far as I'm aware, has not yet been completely resolved. They have fixed the mixed audio issue, but not the duplication; the feed still suffers from this problem. What I've done is to take a copy of the feed and edit it right down to its bare bones, and it's available to view in the show notes of this episode, in case you'd like to see what the structure of an RSS feed looks like if you've never seen one before.
The other feed that gives me trouble (I've had others, but most of them have been fixed; this is the one that still remains) is the feed belonging to another podcast called The Pod Delusion. They have an extra feed for conference recordings and other things. What they've done with the structure of this is to put together episodes which contain multiple enclosures, because an episode may contain several recordings from a conference. It's a perfectly logical thing to do: here's conference X, here are talks A, B and C. But RSS doesn't allow it. It's illegal. Well, maybe not illegal; it's something that many parsers don't know what to do with, so it's not advisable. I wrote to the people at The Pod Delusion pointing out this issue, and they're going to look into it. I must say it doesn't hit all of the podcatchers I've messed around with, but it certainly hits the BashPodder one, and also my parser, the database end of it, and gPodder doesn't like it very much either.
Let's talk about solutions, then. Because the feed owners have not managed to resolve these things, I've come up with local workarounds. The general principle I've used is to write small Perl scripts to do feed rationalisation before the feeds are handed to my podcatcher. I've got a script per feed at the moment, because the two problems are different in many ways and it was just easier to write two scripts rather than one generic one. The scripts run on my server, from a cron job which is set up to run just before the main podcatcher runs. Each script reads the feed in question and writes a corrected version to a place that's visible to the Apache web server running on my server; the podcatcher configuration file then points to this corrected feed on the server, rather than the original, and when the database back end runs, it looks at the same place. So what I'm working with is a fixed feed.
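
(As an illustration, the crontab entries for this arrangement might look like the following; the times, paths and script names are made up, not taken from the real setup.)

    # Hypothetical crontab: rewrite the broken feeds, then run the podcatcher
    50 3 * * *  /home/dave/bin/mintcast_fix.pl
    50 3 * * *  /home/dave/bin/poddelusion_fix.pl
    0  4 * * *  /home/dave/bin/bashpodder.sh
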
So now I want to walk you through the two scripts that I've written to fix this. The first one relates to the Mintcast problem and is called Mintcast Fix. It's a Perl script which manipulates the feed and saves the result, as I've said. The full script, if you're interested in it, is available on Gitorious, and there's a link to it in my show notes. There's also a link to an HTML version of the script, so that you can view it as I tell you about it, should you be so interested. So I'm going to try to explain what the script actually does.
It uses two Perl modules called LWP::Simple and XML::RSS. LWP::Simple is a generic module for shifting HTTP data, and in fact it handles multiple protocols. I forget what LWP stands for now; Web Protocol something, I can't remember. Anyway, it's the thing that actually pulls down the contents of the feed, from the Mintcast site in this case. The second module parses the RSS data, which is an XML variant. The script contains a reference to the Mintcast Ogg feed, which is stored in a variable called $url, and the name of the file to which the modified version will be written, which is called $feedfile. The next thing it does is to create an XML::RSS object, which will be used to parse the feed. The URL is downloaded using LWP::Simple's get method, and if this fails the script aborts. The feed is then parsed with the XML::RSS parse method.
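
(The real script is on Gitorious; as a minimal sketch of those opening steps, with a made-up feed URL and output path, and with the multiple-enclosure option named as I read the XML::RSS documentation, it might look like this.)

    #!/usr/bin/perl
    # Sketch of the opening steps of the Mintcast fix (illustrative, not the real script)
    use strict;
    use warnings;
    use LWP::Simple;
    use XML::RSS;

    my $url      = 'http://example.org/mintcast/ogg/feed';   # placeholder feed URL
    my $feedfile = '/var/www/feeds/mintcast_fixed.xml';      # assumed Apache-visible path

    my $rss = XML::RSS->new;

    # Download the feed; get() returns undef on failure
    my $content = get($url)
        or die "Failed to download $url\n";

    # Parse it, telling XML::RSS to tolerate multiple <enclosure> elements per item
    $rss->parse($content, { allow_multiple => ['enclosure'] });
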
I'm using this module because it's one of the few Perl modules that's able to handle RSS with multiple enclosures. I think there is another one, but in the main these parsers don't like multiple enclosures, because it's not the intended structure; there's nothing in the RSS standard that prevents it, but it's not the way it was intended to be used.
And this is why people dislike RSS, I suspect: it's not very well defined. Now, an RSS feed (if you look at the Mintcast feed that I mentioned earlier) contains a nested structure. It starts off with an rss container; inside that is a channel container, which defines the parameters of the feed as a whole; and inside that there are multiple items, and each item can contain an enclosure. So in the case of a podcast there will be an enclosure per item; other sorts of feeds won't necessarily have them.
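
(For reference, a feed stripped down to that bare nesting, with placeholder titles and URLs, looks like this.)

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>Example Podcast</title>
        <link>http://example.org/</link>
        <description>A placeholder feed showing the nesting</description>
        <item>
          <title>Episode 1</title>
          <enclosure url="http://example.org/ep1.ogg" length="12345678" type="audio/ogg"/>
        </item>
      </channel>
    </rss>
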
So what we're doing in the script is to iterate through all of the items, which are presented as an array by the module, and process each one. In this particular script, because for some reason or other the WordPress system doesn't generate an author entry, the script goes and pokes one into each item. The main thing being done, though, is to reduce multiple enclosures down to a single enclosure, and we're working on the assumption that the multiple enclosures are all the same, which, when you look at the example, you'll see is the case. So all the script actually does is to go to an item, iterate through its array of enclosures, and at the first one it finds with a type of audio/ogg, save that one into the structure and stop the iteration. The reason it was done this way is that it was written for the case where you might have MP3s embedded in there as well; otherwise you could just take the first one off the array, in fact. So what's actually happening here is that the structure created by parsing is being edited in place. Once that's been done for every item in the feed, the feed is simply written out to the feed file. That's really all there is, though the Perl might look a little more complicated than that.
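
(A condensed sketch of that in-place edit follows; the author value and the exact shape XML::RSS gives the enclosure entry are my guesses, not lines from the real script.)

    # Walk the parsed items, forcing each one down to a single Ogg enclosure
    foreach my $item (@{ $rss->{items} }) {
        # WordPress omits the author, so poke one in (the name is a placeholder)
        $item->{author} //= 'mintcast';

        my $enc = $item->{enclosure};
        next unless ref($enc) eq 'ARRAY';   # already a single enclosure; nothing to do

        # Keep the first audio/ogg enclosure and drop the rest
        foreach my $e (@$enc) {
            if ($e->{type} eq 'audio/ogg') {
                $item->{enclosure} = $e;    # edit the parsed structure in place
                last;                       # stop the iteration at the first match
            }
        }
    }

    # Write the corrected feed out to the file that Apache serves
    $rss->save($feedfile);
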
The way in which this is being done is not ideal, I guess. The way the XML::RSS module presents the data it parses is just to give you access to a structure, a Perl structure, which mirrors the RSS structure. So you've got things like an array of items, each of which can contain an array of enclosures; it's a complex nested structure. A lot of modules wouldn't work this way; they would offer accessors, which would give you more elegant and simpler access to it. I won't go through it just now, but in the show notes there's an attempt to explain a bit about the way that Perl represents arrays of arrays of arrays within structures like this. It's not a pleasant thing to look at, I have to admit, but I doubt whether there are many languages which would be able to do this in a more elegant way; I think it's more down to the author of the module.
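
(To give a flavour of what those nested dereferences look like, again with guessed hash keys:)

    # A hash of arrays of hashes: drilling into the parsed feed
    my $first_item = $rss->{items}[0];                  # first item (hashref)
    my $first_url  = $first_item->{enclosure}[0]{url};  # first enclosure's URL
    my $its_type   = $first_item->{enclosure}[0]{type}; # e.g. 'audio/ogg'
    print "First enclosure: $first_url ($its_type)\n";
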
Anyway, that's pretty much all there is to say about the Mintcast one.
If we look now at the Pod Delusion case, it uses the same method and the same modules, and a lot of what it does is similar to the Mintcast one. The difference is that in this particular case we're generating two XML::RSS objects. Because the transformation from one to the other is more complex, we need to parse the incoming feed into one structure and then generate a new structure to write out again; so there are two objects created, an RSS in and an RSS out. Again we download the feed using LWP::Simple, and then we parse it as before, with the proviso that we allow multiple enclosures; we've still got a multiple-enclosure issue here.

Now, there's a variable here called $channel, which is a reference to the channel element of the feed. Remember, I mentioned this in relation to the Mintcast feed: it's the outermost layer of the feed structure, which holds the attributes that define the feed as a whole. It's just things like the title, the link and the description, some of which are mandatory (those three are mandatory, in fact). So what we do is to copy the title, the link, the pubDate and the description from the input channel to the output one. Then there's another loop, as before, which walks through the items of the incoming feed. But the difference here is that for every enclosure found in the input feed, a new item is generated in the output feed. The new item we're creating is initialised with the attributes taken from the input item and has the enclosure added to it. So the result is that every enclosure, when we've got multiple enclosures, ends up in its own item, which may be a duplicate of the previous item; that happens whenever the input item contains multiple enclosures. The final step is to write the newly created RSS out to the required file, as before, and then the script's done. So what we've done is to split up the enclosures, making multiple items to contain them. You wouldn't actually want to publish such a feed, but it's fine for my purposes.
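
(Here is a condensed sketch of that two-object, item-splitting approach; the feed URL, the output path and some hash keys are illustrative guesses, not the real script.)

    #!/usr/bin/perl
    # Sketch of the Pod Delusion fix: one output item per input enclosure
    use strict;
    use warnings;
    use LWP::Simple;
    use XML::RSS;

    my $url      = 'http://example.org/poddelusion/extra/feed';  # placeholder URL
    my $feedfile = '/var/www/feeds/poddelusion_fixed.xml';       # assumed path

    my $rss_in  = XML::RSS->new;
    my $rss_out = XML::RSS->new(version => '2.0');

    my $content = get($url) or die "Failed to download $url\n";
    $rss_in->parse($content, { allow_multiple => ['enclosure'] });

    # Copy the mandatory channel elements (and pubDate) from input to output
    my $channel = $rss_in->{channel};
    $rss_out->channel(
        title       => $channel->{title},
        link        => $channel->{link},
        description => $channel->{description},
        pubDate     => $channel->{pubDate},
    );

    # For every enclosure in every input item, generate a new output item
    foreach my $item (@{ $rss_in->{items} }) {
        my $enc = $item->{enclosure};
        my @enclosures = ref($enc) eq 'ARRAY' ? @$enc : ($enc);
        foreach my $e (grep { defined } @enclosures) {
            $rss_out->add_item(
                title       => $item->{title},
                link        => $item->{link},
                description => $item->{description},
                enclosure   => $e,      # each enclosure gets its own item
            );
        }
    }

    # Write the rebuilt feed to the file that Apache serves
    $rss_out->save($feedfile);
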
So this works fine; it's all working as I would have wished. It shows that parsing an RSS feed into a database, in the way that I've been doing things, is a lot more difficult than you'd expect; it's a lot more error-prone. And as I said earlier, this (in my impression, anyway) is down to the fact that RSS is not a well-defined standard: people interpret the standard in different ways and produce stuff that the various packages don't like. So I've finished off the show notes with a pointer to a nice tutorial on RSS, what it's all about and what the various fields are, in case you want to dig into that more deeply. And there are various links to the scripts in a readable, HTML form for you to look at. So that's it.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever consider recording a podcast, then visit our website to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are proudly sponsored by Lunar Pages: from shared hosting to custom private clouds, go to lunarpages.com for all your hosting needs. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.