Episode: 3393
Title: HPR3393: We need to talk about XML
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3393/hpr3393.mp3
Transcribed: 2025-10-24 22:38:14

---

This is Hacker Public Radio episode 3393 for Wednesday, the 4th of August 2021.
Today's show is entitled "We Need to Talk About XML". It is hosted by Klaatu,
and is about 31 minutes long and carries a clean flag.
The summary is: An extensible markup language? This is too good to be true!
This episode of HPR is brought to you by archive.org.
Support universal access to all knowledge by heading over to archive.org
forward slash donate.
Hey everybody, thanks for listening to Hacker Public Radio. My name is Klaatu, and in this
episode I want to talk about XML. The next episode that you'll hear from me is probably
going to be about XMLStarlet. I didn't want to cover XMLStarlet before I'd talked about XML
in general, because not everyone is super familiar with XML, and even if you are, there may
be some things about XML that you don't really know. You might know it intuitively but
never have thought about it, so I'm going to talk about XML and try to get our heads wrapped
around that before diving into how to then parse XML. XML is a hierarchical markup language.
So it's a little bit like HTML, and in fact there are moments where those two are more
closely related than others. I'll explain that later. But it uses the same kind of
structure that HTML uses. If you've never seen HTML, just go to any web page, such as
hackerpublicradio.org, right-click, and view page source. You'll see lots of HTML there.
So it uses the same kind of opening and closing tags, and it's used to store and exchange data.
That was the intent of the format, as I understand it. It was certainly intended for data
storage. I believe historically it was considered an interchange format, although that might
have just been something that it grew into. I'm not sure whether it was designed specifically
as an interchange format. But XML uses those tags: open tag, type in some stuff, close tag.
That's the XML way. It is extremely flexible.
It is extensible. In fact, that's what XML stands for: Extensible Markup Language.
So it is extremely flexible and extensible, and it is used in everything from documentation
to graphics to the very way that you received this podcast, probably RSS or Atom. Those are
essentially broadcasting formats over the internet that announce that new data exists.
So really briefly here, let's look at a sample XML document, or you'll listen to it, anyway.
I'll try to include this in the show notes if I remember. It would open with the tag
identifying this document as an XML document, and the simplest version of that is just
angle bracket, xml, close angle bracket. So that would be the opening tag. Then you might
have some tag underneath that that delves down into what data you're actually storing.
So let's make one of, maybe, the solar system. We'll do a tag, I don't know, how about angle
bracket, sol, S-O-L, close angle bracket. Now you might think, surely XML does not have a tag
specifically named for the solar system. If that's true, then there must be innumerable
XML tags. And you'd be both right and wrong. XML doesn't natively have a tag called
sol for the solar system. It doesn't have a Milky Way tag. It doesn't have lots of tags.
In fact, XML natively doesn't really have tags. You get to make them up.
So unlike HTML, where, you know, you open up your HTML document with angle bracket, html,
close angle bracket. And then you do the head, and then the title, and then the close head,
and then the body, and then maybe a div and a paragraph tag and an image tag and so on.
You know what you have to choose from. You have a menu of choices there. And you know that
if you make up your own tag, HTML hopefully will produce an error. It will tell you that
that's not a valid tag. I haven't tried that lately, so I don't know how badly that would
break things. HTML seems to be still pretty forgiving, and it kind of defaults to sort of a
quirks mode where it's just like, well, I don't really understand what that is, I'll skip
over it and just try to ignore it. That sort of thing. So HTML is a little bit unpredictable
like that. XML, we'll get into what its policy is later. But in terms of what tags you have
to choose from in XML, you can just make them up on your own. And as long as everyone
working on the XML document that we're talking about mutually agrees, okay, those are the
tags that exist in this, what we call a schema, then that's perfectly acceptable.
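As a rough illustration of that kind of agreement, here is a minimal Python sketch that
checks a document against a hand-written table of agreed tags. The table, the function name
check_agreement, and the planet, name, and albedo tags are assumptions made up for the
sketch; this is not a real schema language such as DTD or XML Schema, only a way to picture
"everyone agrees which tags exist and where they may appear".

```python
import xml.etree.ElementTree as ET

# Hypothetical agreement: each known tag maps to the child tags allowed inside it.
AGREED_TAGS = {
    "xml": {"sol"},
    "sol": {"planet"},
    "planet": {"name", "albedo"},
    "name": set(),
    "albedo": set(),
}


def check_agreement(xml_text):
    """Return a list of complaints; an empty list means the document follows the agreement."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag not in AGREED_TAGS:
        problems.append(f"unknown tag <{root.tag}>")
    # Visit every element and compare its children with what the agreement allows.
    for parent in root.iter():
        allowed = AGREED_TAGS.get(parent.tag, set())
        for child in parent:
            if child.tag not in AGREED_TAGS:
                problems.append(f"unknown tag <{child.tag}>")
            elif child.tag not in allowed:
                problems.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return problems


GOOD = "<xml><sol><planet><name>Mercury</name></planet></sol></xml>"
BAD = "<xml><sol><moon/><planet><albedo><name>x</name></albedo></planet></sol></xml>"

print(check_agreement(GOOD))
print(check_agreement(BAD))
```

The first call should report nothing, and the second should complain about the made-up moon
tag and about a name tag sitting inside an albedo tag.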
And that's a very powerful mechanism that enables people to use XML for basically anything
they want, because there's no preset unless someone says this is the preset for this
document. Okay, so we've got angle bracket, xml, close angle bracket; angle bracket, sol,
close angle bracket; angle bracket, planet, close angle bracket; angle bracket, name, close
angle bracket; and then we'll type in the word Mercury, because that's the first planet from
the sun, and I guess that's the direction we'll go in right now. Close that name tag, and
that's angle bracket, slash name, close angle bracket. Then we can do, let's say, angle
bracket, albedo, close angle bracket, 0.11, because that's the albedo of Mercury, angle
bracket, slash albedo, close angle bracket. Those are two bits of information, right? We've
got the name of the planet and the albedo of the planet, and around those two bits of
information we're going to put planet tags. So we've opened our planet, then we put in name
and albedo, and then we're going to close the planet tag, so angle bracket, slash planet,
close angle bracket. And we could do that for every single one: planet, name, Venus, close
name, albedo, 0.7, close albedo, close planet; planet, name, Terra, or Earth, or whatever
you want to call it, close name, albedo, 0.39, close albedo, close planet. Let's say that's
all we want to get to. We just wanted to get up to the place where we live. So we're there
now, so we'll close the sol tag, angle bracket, slash sol, and then close the xml tag, angle
bracket, slash xml. And we're done. We have a complete XML document that ought to validate,
and I guess I could check that. I'll do xmllint sample.xml, and yeah, it looks good. It
looks like it's happy with my XML code, my document, so everything's good, and we're ready
to process that XML. With some application designed for processing XML, I would reasonably
expect it to function as designed.
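Written out, the dictated document might look like the string in the sketch below, which
also parses it with Python's standard xml.etree.ElementTree module as a stand-in for the
xmllint check. The layout and indentation are assumptions, since the episode only speaks
the tags aloud, and the parse only confirms that the text is well-formed. Note that the
xml tag here is an ordinary outermost element that the host chose; the optional XML
declaration that real-world files often start with is a separate thing.

```python
import xml.etree.ElementTree as ET

# The sample as dictated in the episode; whitespace and layout are assumed.
SAMPLE = """<xml>
  <sol>
    <planet>
      <name>Mercury</name>
      <albedo>0.11</albedo>
    </planet>
    <planet>
      <name>Venus</name>
      <albedo>0.7</albedo>
    </planet>
    <planet>
      <name>Terra</name>
      <albedo>0.39</albedo>
    </planet>
  </sol>
</xml>"""

# fromstring() raises xml.etree.ElementTree.ParseError if the text is not well-formed.
root = ET.fromstring(SAMPLE)

# Collect (name, albedo) pairs from every planet element in document order.
planets = [
    (planet.findtext("name"), float(planet.findtext("albedo")))
    for planet in root.iter("planet")
]

print(root.tag)
for name, albedo in planets:
    print(name, albedo)
```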
So let's talk a little bit about the components of our sample XML document. First of all,
the entity itself, the thing that we have when we've made an XML document, is called a
document, and that's an important term to keep in mind because you'll see it referenced
in different places. Sometimes you'll find something called the document object model, or
the DOM. You'll see that in relation to JavaScript and stuff too, but in XML there's a
document object model, and that's a view of the data that you're representing, in this
case in XML. So the document is not just the icon on your desktop after you've created the
file. Yes, we could call that a document, we could put it in our Documents folder, but the
document in terms of what parsers are looking at is essentially everything from the opening
xml tag to the closing xml tag. That's the document. You'll find similar definitions with,
for instance, YAML, which I think I did an episode on. YAML opens with those three dashes,
according to yamllint, and that's the beginning of a YAML document, and if you have three
dashes somewhere else in that file, then you're starting a new document. So it's important
for scoping to understand that this is the document: it contains the full tree of the data
that we're representing. Okay, so that's document.
The next idea in terms of parsing is nodes. In this example, a node would be sol, a node
would be planet, it could be name, it could be albedo. Those are all nodes in this XML
document as we've designed our schema. No, I'm sorry, name and albedo I think would not be
nodes. Yeah, that's true, they would not be nodes. I do feel like some people would call
them nodes casually, but parsers generally would not consider those nodes, because they
don't contain other tags. And that's important, because what they do contain is what's
called content, which makes them elements. So the name tag and the albedo tag are elements
in our document because they contain content. So sol is a node that contains other nodes
and elements, planet is a node that contains elements, name is an element because it
actually contains the data, it contains the word Mercury, or Venus, or Terra, and albedo is
an element because it contains content of 0.11, or 0.7, or 0.39, whatever. So those are the
four things that you need to think about when looking at XML programmatically: the document
itself, the nodes contained in the document, the elements contained in the nodes, and the
content of the elements. Okay, so when you're
coming up with your XML document, for whatever reason, you are designing or following a
schema. A schema, S-C-H-E-M-A, is essentially just the agreed-upon tags and/or a hierarchy
of those tags within a document. As I've just demonstrated to you, you can do that on your
own. You can say that this is a schema: our valid tags are sol, planet, name, and albedo.
But more than that, which I'm not going to do right now, we could write up a document
defining our schema and describing, for instance, where the tags are allowed to appear. So
for instance, maybe the name tag is a pretty flexible tag. Planets have names, but so does
a solar system, so instead of having a sol tag, we could say, let's just do system, and
then the system could have a name tag. So name would be a valid tag within system, or within
planet, but it would not be valid to put a name tag inside of, say, an albedo tag. That
would be silly. Or maybe not, maybe you do want a name tag to classify what kind of albedo
you're measuring. I don't know what the different kinds are, but I've seen different
terminology, you know, ice versus liquid, and so on. Specular sounds like a type of albedo,
I don't know. The point is, it would not be valid there,
but it would be valid here. So that kind of definition of hierarchy, and I guess the
parent-child relationship, that's a schema, and that's important to know about, because if
you're using something that someone actually has defined, then you'll want to understand
the schema so that you don't violate it left and right. A great example of this would be
SVG, Scalable Vector Graphics, which is, I think, a W3C specification. They've written out
exactly what kind of tags can exist, and which tags can contain other tags in a valid way.
And that's important, because you might have some kind of tag very specific to circles that
just could not apply to squares. I don't know if that's even mathematically reasonable, but
let's assume it is. There's something specific to a circle that wouldn't work with squares,
and so you wouldn't want that tag to even be allowed in this tag or that tag. So that's
important to understand. DocBook is another one. I've done an episode on DocBook, I think;
if not, I really should. I use DocBook all the time for documentation. It's just some guy,
Norm Walsh, decided to invent a schema, and it got popular, and now, well, I don't know if
it's very popular, but it is a pretty darn well-known technical documentation schema, and
it has very strict rules as to where you can put tags and where they're not allowed, and so
on. So those are schemas. And another thing to
wrap your head around is that document object model, the DOM. It's important because it's
another way of looking at your XML data. That's the thing about XML: once you've created
your XML document, you've put data into it, and you've structured it in a way that is
logical, there are lots of different ways to then view that data. One of them is as XML.
You just look at the tags, you see their relationship, and maybe if you've got a friendly
XML application, you can see, through indentation, the hierarchy of all the tags. That's one
way to look at it, and it's a valid way, and it's maybe perfectly suitable for you. But
there are other ways, and one popular way is, I think of it kind of as a flow chart, but
it's actually probably better to think of it as a family tree, because that demonstrates it
in a pretty familiar way. If you've ever looked at a family tree, then you get it, right?
You see, oh, those are my parents, and here's all the kids that they had, which includes
me, and here's the kids that I've had, and so on. And so this DOM can show you your data in
a tree view. So again, just looking at the sample planet data, you might have at the very
top your document, that's your entry point, the xml tag. Below that, right now we only have
one solar system in our database, as it were, so it would just be the sol tag, which,
again, is probably not the optimal name for that in retrospect. It probably should have
just been system, and then we could have done name, Sol. But for now, I'll just go with
what we've got. So we've got the sol tag, and within sol, if you'll recall, there were one,
two, three planets, so under sol we will have three children of sol, and it's going to be
planet, planet, planet. And then under each of those planet nodes, you have two little
children, name and albedo, name and albedo, name and albedo.
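The family-tree view described here can be sketched in a few lines of Python. This is only
an illustration, assuming the sample document from earlier in the episode; the function name
outline is made up, and it uses the standard library's ElementTree rather than a full DOM
implementation. ElementTree, for what it's worth, treats every tag as an Element whether or
not it has children, so the sketch simply prints any text an element carries next to its tag.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<xml><sol>"
    "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
    "<planet><name>Venus</name><albedo>0.7</albedo></planet>"
    "<planet><name>Terra</name><albedo>0.39</albedo></planet>"
    "</sol></xml>"
)


def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    text = (element.text or "").strip()
    label = element.tag + (f": {text}" if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(SAMPLE))
print("\n".join(tree_lines))
```

Each level of indentation is one generation: xml at the top, sol beneath it, three planet
children beneath sol, and a name and an albedo beneath each planet.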
It forks out pretty rapidly there, and then multiplies even further, because you've got
more children of children of a common parent. You can do that with a whole document. A
properly structured XML document, you could map out as, like I say, a family tree, and
represent as much data as you want. Certainly if you just do it on a notepad, that would be
one thing, where you just use the names of the nodes and the elements, but in applications
that have dynamic ways of representing data, you might even be able to crawl through your
document almost like index cards strung together, and you could see the bits of data that
you need in this larger tree format. So, lots of different ways to look at it. Another way
to represent that data, if you think about it, would be through the very familiar URI
scheme that you and I use all the time for internet websites, such as, what's an example
here, hackerpublicradio.org slash index underscore full dot php. That's actually not the
greatest example, because it's very short, but, you know, like example.com slash episodes
slash mygreatepisode.html. That shows inheritance.
We understand from that that we're on this domain called example.com, and then there's a
child of that domain, which is episodes. There might be another child of that domain, which
would be code, and you go to the code folder to see code samples, and you go to episodes to
hear the episode content, and then another sibling of those might be videos, and you go
there to see the sample videos of the code being written, whatever. So you've got all that
in a linear sort of view, and you can do the same thing with the sample document. Again,
you'd start at the top, so you've got your document. Starting at the top in this format
means a double slash. So slash slash, that's kind of like, go back to the beginning of the
document. And at the beginning, now we're in xml, we're in the document, and so the first
tag we would look at would be sol, and then slash planet, and then you have to make your
choice of what element you want to look at. So you want to look at the names, or you want
to look at the albedo, because obviously in this linear fashion, you can't fork off to lots
of different places. You have to choose your path to the element that you want to view. So
you might have slash slash sol slash planet slash name, or slash slash sol slash planet
slash albedo, and you would get either the name of the planet or the albedo of the planet,
depending on which directory, as it were, you went into. And that's a useful convention.
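The path notation just described is essentially XPath. As a hedged sketch, Python's standard
ElementTree module supports a limited subset of it, with paths written relative to the
element you start from (hence the leading dot). In XPath proper, a leading double slash
selects matching elements at any depth in the document. The sample string repeats the
episode's document; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<xml><sol>"
    "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
    "<planet><name>Venus</name><albedo>0.7</albedo></planet>"
    "<planet><name>Terra</name><albedo>0.39</albedo></planet>"
    "</sol></xml>"
)

root = ET.fromstring(SAMPLE)  # root is the outermost <xml> element

# Walk one explicit path: xml -> sol -> planet -> name.
names = [el.text for el in root.findall("./sol/planet/name")]

# ".//" means "at any depth below this element".
albedos = [el.text for el in root.findall(".//albedo")]

# A predicate narrows the path to the planet whose name is Venus.
venus_albedo = root.findtext("./sol/planet[name='Venus']/albedo")

print(names)
print(albedos)
print(venus_albedo)
```

Choosing the name path returns only names and choosing the albedo path returns only albedo
values, which matches the "pick your directory" idea in the episode.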
Sometimes you know the data that you want, and in something called XPath you'll use that
kind of notation so that you can zero in and say: all elements within this node called
name, show me the contents of those elements. Then you would get returned Mercury, Venus,
and Earth, and then you know what planets you're dealing with, and so on. Okay. I will get
into that more with XMLStarlet. I just want to introduce the idea that XML can be viewed
in lots of different ways. And now I will close out the episode with why I think XML is so
great. I'm really fond of XML, and I know that's a hard sell, because it has a reputation.
A lot of people struggle with it; even people who use it a lot struggle with it. I struggle
with it all the time. And believe it or not, that is one of the best selling points for me.
I really, really like the strictness of XML. And I say this from a place of deep and
earnest pain, having been dealt to me by formats like Markdown and RST and so on. And I
love those formats. I honestly do. I use AsciiDoc, which, you know, is Markdown-ish. I use
Markdown. I think they're great, because I love structure in text files. I really like
structure more than a lack of structure.
I mean, I didn't mean to say only in text files. There are lots of places where I like
structure, but I'm speaking specifically about text files. I like structure rather than
lack of structure, because it's easier to parse something structured. And XML is highly
structured, and that means that you can be very explicit about what you mean. So in XML, if
you are typing out, I don't know, a bulleted list, then you can be very clear about when
that bulleted list ends, or how many paragraphs are meant to be within one bullet versus
another bullet. And in Markdown and RST and such, it's not uncommon to attempt to generate
data like that, and then you have to break out of your bullet list format for a moment to
do, I don't know, a code sample or a code block or something that you want in your bullet
list, but your Markdown format, your non-markup format, decides, oh, you've thrown off the
predictability of the parsing. You've done something that I didn't anticipate, and so we're
going to break you out of your bullet list and then start a new bullet list later. I guess
when it's bullets, it's not that big of a deal, but when it's numbers, it's really
inconvenient. You can take your pick. There are lots of different examples of how implicit
notation, while better than a complete lack of structure and just doing whatever you want,
whenever you want, and trying to make parsers guess or make exemptions while parsing, does
kind of fall apart sometimes if you go too far outside of expectations. Whereas with XML,
you can just do exactly what you want. You still have to fit within the schema, you have to
be writing valid XML, but if you've chosen your schema well, then you can still do what you
want to do. It's just that you happen to be marking it up clearly while you're doing it.
So I really like the structure of XML. I also like the strictness of XML, and that seems
almost counterintuitive, but when I go to process an XML document and I get a flat refusal
from my parser to process it because there's an error in it, that's one of the best things
that could ever happen, because that means I catch the error before I push it out to the
world, or to the next person in the pipeline, where it immediately becomes their problem.
That's not what I want. I want to be able to catch the error before it leaves my desk and
fix it, and then pass it on to the next step in the pipeline, whether that's posting it to
the internet or handing off a document to someone else for further processing, or whatever.
It's just really good to be alerted about the bugs that you've accidentally introduced into
your document. So I would much rather have that early frustration of, why aren't you doing
what I want you to do, parser? I'd much rather have that in the privacy of my own office
than post it onto the internet and break my podcast feed again, or push it over to the next
person in the pipeline at work and have someone ask me if I needed some time to go learn
XML properly, because I've just broken the PDF build, whatever. It's just better for me to
suffer than for everyone after me to suffer.
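That flat refusal is easy to see for yourself. The following is a small sketch, assuming
Python's standard ElementTree parser rather than whatever toolchain the host uses; the
helper name check and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def check(xml_text):
    """Return None if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


GOOD = "<planet><name>Mercury</name></planet>"
BROKEN = "<planet><name>Mercury</planet>"  # the name tag is never closed

print(check(GOOD))
print(check(BROKEN))
```

The well-formed string passes silently, while the broken one is rejected outright with a
message pointing at the problem, so the mistake can be fixed before the document goes any
further down the pipeline.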
I also really like XML because it's really, really great at transforming, and this is
arguably not that big of a benefit anymore. I mean, transformers are a dime a dozen now.
Pandoc exists. It can transform practically anything. However, that said, XML really does
lend itself well to being that sort of dynamic interchange format, and for that reason, you
find it in lots of places. You'll find database dumps offering XML as an output option.
You'll find XML in LibreOffice documents, and in EPUBs there's XML in there, not a whole
lot of it, but there is some. There's XML, like I say, in SVGs, and lots of different
places. I just appreciate the concept that this is a predictable format. I'm not going to
say easy to parse, because nothing's easy to parse. I'm just going to say it's predictable.
It's a predictable format, and there's a lot of support for it, and even when there's not
a lot of support for a specific schema, you can usually figure it out, because there are so
many different ways to view that data, and there are lots of parsers for the data, so you
can generally transform that data into whatever you need. It's not always easy, but I think
in those cases it's not easy because nothing would be easy. It wouldn't matter whether it
was JSON, or XML, or a spreadsheet, or whatever else. Transforming data from one format to
another can be hard sometimes, because you have to map stuff, and you have to come up with
equivalents of things, and that can be difficult. But the XML toolchain tends to be pretty
robust, and I really appreciate that about XML.
Arguably, it's not XML's fault that it's well supported. I mean, if the toolchain exists,
then the toolchain exists, and I think if the toolchain went away, I would probably still
appreciate XML as a format. I don't know if I'd use it as much, because the support is
really nice, but that explicit format is quite nice. And just to be clear, I'm not saying
that you should only ever use XML for everything no matter what. I'm just saying that of
the many formats we have the luxury of choosing from in the open specification world, XML
is a good one. So don't shy away from it if you see it. If you can get to the point where
you understand it, which hopefully this episode and my next one on XMLStarlet will help
with, you start to see through all those messy tags and all the noise, or what looks like
noise, and you get through to the actual data, and it starts to make sense. And I guess
that's the other important thing to realize about XML that I have completely ignored: the
added context that it provides.
A lot of data formats have a perfectly well-structured way to store your data, but it's
only predictable, it's not self-descriptive, whereas XML can be quite self-descriptive.
For instance, the document that says xml, sol, planet, name, Mercury, albedo of 0.11, that
tells you exactly what data that is, because you've read the tags and you see everything in
relation to them. If you had some other document, you might see a list of clearly
identifiable planets, but then you would see numbers, like 0.11, 0.7, 0.39. What is that?
Is that their value in, I don't know, ounces of gold? Is it the rotational tilt of the
axis? Is it the average number of oxygen particles? I don't know, what is that number?
There's no way to necessarily tell unless someone designed the format or was careful about
how they listed it. I could imagine a JSON document, for instance, having key-value pairs
that would clearly identify, well, this is a planet, and this is the albedo value of that
planet, this is a planet, et cetera. So it can be done, but I find that XML, more or less
by default, depending on the designer, provides very, very important context to the data
that it stores, and I really, really, really love that as well.
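To make the contrast concrete, here is a short Python sketch that turns the sample planets
into a bare list of values and into the kind of key-value JSON imagined above. The sample
string, the variable names, and the choice of keys are assumptions for illustration; the
keys are simply borrowed from the XML tag names.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE = (
    "<xml><sol>"
    "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
    "<planet><name>Venus</name><albedo>0.7</albedo></planet>"
    "<planet><name>Terra</name><albedo>0.39</albedo></planet>"
    "</sol></xml>"
)

root = ET.fromstring(SAMPLE)

# Labelled records: each child tag name becomes the key for its value.
records = [
    {child.tag: child.text for child in planet}
    for planet in root.iter("planet")
]

# The same values with the labels thrown away: what is 0.11 supposed to be?
bare_rows = [list(record.values()) for record in records]

labelled_json = json.dumps(records)

print(bare_rows)
print(labelled_json)
```

The bare rows keep every value but lose the meaning of the numbers, while the labelled
version carries the same context that the XML tags provided.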
Okay, I think those are all the reasons that I love XML. I hope I've explained how XML
works, at least in theory, and in the next episode, we'll talk about ways to parse it.
What's your favorite data format? Why don't you record an episode about that for Hacker
Public Radio? And if you need inspiration, I'll just say a couple of key phrases here. JSON
is the worst data format I've ever encountered. YAML, horrible. It's the worst data format
other than JSON. Markdown. What could be worse than Markdown? It's a format for people who
don't really want to write. I can't think of anything else mean to say about other formats,
but hopefully this has angered you and will compel you to record an episode telling me and
everyone in the world why that format is your favorite and is far superior to XML. I don't
know if you can start a flame war so intentionally, but I mean, I'm trying here. If we
could do this, Hacker Public Radio only stands to benefit. Thanks for listening. Talk to
you next time.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club,
and is part of the binary revolution at binrev.com. If you have comments on today's show,
please email the host directly, leave a comment on the website, or record a follow-up
episode yourself. Unless otherwise stated, today's show is released under the Creative
Commons Attribution-ShareAlike 3.0 license.