rss feeds contain invalid xml #301

Open
opened 2025-11-11 15:16:45 +00:00 by ken_fallon · 4 comments
Owner

✗ The feed "https://hackerpublicradio.org/hpr_mp3.rss" is not valid xml.
✗ The feed "https://hackerpublicradio.org/hpr_ogg.rss" is not valid xml.
✗ The feed "https://hackerpublicradio.org/hpr_spx.rss" is not valid xml.
✗ The feed "https://hackerpublicradio.org/hpr_total_mp3.rss" is not valid xml.
✗ The feed "https://hackerpublicradio.org/hpr_total_ogg.rss" is not valid xml.
✗ The feed "https://hackerpublicradio.org/hpr_total_spx.rss" is not valid xml.

$ wget --no-verbose https://hackerpublicradio.org/hpr_mp3.rss --output-document=- | xmllint --format -
-:1814: parser error : Entity 'Atilde' not defined
    <title>HPR4499: Greg Farough and Zo&Atilde;&laquo; Kooyman of the FSF interv
                                               ^
-:1814: parser error : Entity 'laquo' not defined
    <title>HPR4499: Greg Farough and Zo&Atilde;&laquo; Kooyman of the FSF interv

Owner

XML
XML specifies five predefined entities needed to support every printable ASCII character: &amp;, &lt;, &gt;, &apos;, and &quot;. The trailing semicolon is mandatory in XML (and XHTML) for these five entities (even if HTML or SGML allows omitting it for some of them, according to their DTD).

OK, we will need a filter that finds any named character entities other than the five above and converts them to their numeric equivalents. See this Wikipedia article: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
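A minimal sketch of such a filter, in Python for illustration only (the actual hpr_generator code is not shown in this thread, and the function name `named_to_numeric` is hypothetical). It assumes the HTML5 entity table shipped with Python's standard library is an acceptable lookup source, leaves the five XML entities alone, and leaves unknown names untouched so they can be inspected by hand:

```python
import re
from html.entities import html5

# The five entities that XML predefines; these must pass through unchanged.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches a named reference such as &laquo; (a name followed by a semicolon).
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace HTML-only named entities with numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        # Keys in html5 include the trailing semicolon, e.g. "laquo;".
        chars = html5.get(name + ";")
        if chars is None:
            # Unknown name: leave it as-is rather than guess.
            return match.group(0)
        # Some HTML5 entities expand to more than one code point.
        return "".join(f"&#{ord(ch)};" for ch in chars)

    return NAMED_ENTITY.sub(replace, text)


print(named_to_numeric("Zo&Atilde;&laquo; Kooyman &amp; friends"))
```

This only makes the markup parseable by an XML parser; it does not change which characters the references stand for.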

Owner

Here is a list of named character entities recognized by HTML5: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references

Owner

Where do we want to resolve this issue?

  1. Clean the data when it is submitted, so that it goes into the database in a form that XML parsers can use.
  2. Clean the data when the hpr_generator generates the XML.
  3. Both.
Author
Owner

A classic case of reporting "there is a problem" without providing any useful information. Thank you for your patience in this matter ;-)

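I added the tool I use to check this: hpr-check-feeds (https://repo.anhonesthost.net/HPR/hpr-tools/src/branch/main/workflow/hpr-check-feeds.bash).

It checks the dynamic RSS feeds (currently in use) as well as the hpr_generator ones.

As an example, the two feeds below should be functionally identical.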

✗ The feed "https://hackerpublicradio.org/hpr_total_mp3.rss" is not valid xml.
🗸 The feed "https://hackerpublicradio.org/hpr_total_rss.php" is valid xml.

<title>HPR4499: Greg Farough and Zoë Kooyman of the FSF interview Librephone lead developer Rob Savoye</title>
versus
<title>HPR4499: Greg Farough and Zo&Atilde;&laquo; Kooyman of the FSF interview Librephone lead developer Rob Savoye</title>

We support UTF-8 everywhere, so the solution is to not escape these characters at all.
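The two titles above are consistent with a double-encoding problem: the UTF-8 bytes of "ë" are 0xC3 0xAB, and read one byte at a time as Latin-1 those are "Ã" and "«", which is exactly what `&Atilde;&laquo;` names. Where in the pipeline this happens is not established in this thread. The Python sketch below only illustrates the round trip; the function name `undo_double_encoding` is hypothetical and this is not the hpr_generator code:

```python
import html


def undo_double_encoding(text):
    """Reverse 'UTF-8 bytes read as Latin-1, then entity-escaped'."""
    # Step 1: turn the entity references back into characters,
    # e.g. "Zo&Atilde;&laquo;" becomes "Zo" + U+00C3 + U+00AB.
    unescaped = html.unescape(text)
    try:
        # Step 2: map each character back to the single byte it came from,
        # then decode the byte sequence as UTF-8.
        return unescaped.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit this pattern; return it merely unescaped.
        return unescaped


print(undo_double_encoding("Zo&Atilde;&laquo; Kooyman"))
```

Note that `html.unescape` also expands `&amp;` and `&lt;`, so the result would need normal XML escaping of those characters again before being written into a feed.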

I also noticed that

<link>eps/hpr4499/index.html</link> is missing the FQDN part, and the enclosure is

<enclosure url="https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4500/hpr4500.mp3" length="9742008" type="audio/mpeg"/>

versus what is in hpr_total_mp3.rss

<enclosure url="http://hackerpublicradio.org/eps/hpr4500.mp3" length="9742008" type="audio/mpeg"/>
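For the relative `<link>`, one possible approach is to resolve it against the site root before writing the feed. The Python sketch below assumes the root is `https://hackerpublicradio.org/` (taken from the feed URLs above, not confirmed as the intended base) and uses a hypothetical helper name, `absolute_link`:

```python
from urllib.parse import urljoin

# Assumed site root; adjust if the feed should point elsewhere.
BASE_URL = "https://hackerpublicradio.org/"


def absolute_link(link, base=BASE_URL):
    """Resolve a possibly relative link against the base URL."""
    # urljoin leaves links that are already absolute unchanged.
    return urljoin(base, link)


print(absolute_link("eps/hpr4499/index.html"))
```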

Reference: HPR/hpr_generator#301