Episode: 3393
Title: HPR3393: We need to talk about XML
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3393/hpr3393.mp3
Transcribed: 2025-10-24 22:38:14
---
This is Hacker Public Radio Episode 3393 for Wednesday, the 4th of August 2021. Today's show is entitled, We Need to Talk About XML. It is hosted by Klaatu, and is about 31 minutes long and carries a clean flag. The summary is: An extensible markup language. This is too good to be true. This episode of HPR is brought to you by archive.org. Support universal access to all knowledge by heading over to archive.org forward slash donate.
Hey everybody, thanks for listening to Hacker Public Radio. My name is Klaatu, and in this episode I want to talk about XML, and the next episode that you'll hear from me is probably going to be about XMLStarlet. I didn't want to cover XMLStarlet before I just kind of talked about XML in general, because not everyone is super familiar with XML, and even if you are, there may be some things about XML that you don't really know. You might sort of know it intuitively, but you might not have ever thought about it, so I'm going to talk about XML and try to get our heads wrapped around that before diving into how to then parse XML.
XML is a hierarchical markup language. So it's a little bit like HTML, and in fact there are moments where those two are more closely related than others. I'll explain that later. But it uses sort of the same kind of structure that HTML uses. If you've not ever seen HTML, then just go to any web page, such as hackerpublicradio.org, and right-click and view page source. You'll see lots of HTML there. So it uses the same kind of opening and closing tags, and it's used to store and exchange data. That was sort of the intent of the format as I understand it. It was intended for, well, certainly intended for data storage.
I believe historically it was kind of considered an interchange format, although that might have just been something that it sort of grew into. I'm not sure. I mean, like, specifically an interchange format, I don't know. But XML uses those tags: open tag, type in some stuff, close tag. That's kind of the XML way. It is extremely flexible. It is extensible. In fact, I think that's what XML actually stands for, extensible markup language. So it is extremely flexible and extensible, and it is used in everything from documentation to graphics to the very way that you received this podcast, probably RSS or Atom. Those are broadcasting formats, essentially, over the internet that announce that new data exists.
So really briefly here, let's look at a sample XML document, or you'll listen to it anyway. I'll try to include this in the show notes if I remember. But it would open with the tag identifying this document as an XML document. And the simplest version of that is just angle bracket, XML, close angle bracket. So that would be the opening tag. Then you might have some tag underneath that that sort of delves down into what data you're actually storing. So let's make one of maybe the solar system. So we'll do a tag. I don't know. How about angle bracket, sol, S-O-L, close angle bracket.
Now you might think, surely XML does not have a tag specifically named for the solar system. If that's true, then there must be innumerable XML tags. And well, you'd be both right and wrong. XML doesn't natively have a tag called sol for the solar system. It doesn't have a Milky Way tag. It doesn't have lots of tags. In fact, XML natively doesn't really have tags. You get to make them up. So unlike HTML where, you know, you open up your HTML document with angle bracket, HTML, close angle bracket. And then you do the head and then the title and then the close head and then the body and then maybe a div and a paragraph tag and an image tag and so on.
And you know kind of what you have to choose from. You kind of have a menu of choices there. And you know that if you make up your own tag, HTML hopefully will produce an error. It will tell you that that's not a valid tag. I haven't tried that lately, so I don't know how badly that would break things. HTML seems to be still pretty forgiving, and it kind of defaults, you know, to sort of a quirks mode where it's just like, well, I don't really understand what that is, I'll skip over it and just try to ignore it. That sort of thing. So HTML is a little bit unpredictable like that. XML, we'll get into what its policy is later.
But in terms of what tags you have to choose from in XML, you can just make them up on your own. And as long as everyone working on the XML document that we're talking about agrees mutually, okay, those are the tags that exist in this, what we call a schema, then that's perfectly acceptable. And that's a very powerful mechanism that enables people to use XML for basically anything they want, because there's no preset unless someone says this is the preset for this document.
Okay, so we've got angle bracket, XML, close angle bracket; angle bracket, sol, close angle bracket; angle bracket, planet, close angle bracket; angle bracket, name, close angle bracket; and then we'll type in the word Mercury, because that's the first planet from the sun, and I guess that's the direction we'll go in right now. Close that name tag, and that's angle bracket, slash name, close angle bracket. And then we can do, let's say, angle bracket, albedo, close angle bracket, 0.11, because that's the albedo of Mercury, angle bracket, slash albedo, close angle bracket. And those are two bits of information, right?
We've got the name of the planet and the albedo of the planet, and then around those two bits of information, we're going to surround them in planet tags. So we've opened our planet, and then we put name and albedo, and then we're going to close the planet tag, so angle bracket, slash planet, close angle bracket. And we could do that for every single one. We'd do planet, name, Venus, close name, albedo, 0.7, close albedo, close planet; planet, name, Terra, or Earth, or whatever you want to call it, close name, albedo, 0.39, close albedo, close planet. Let's say that's all we want to get to. We just wanted to get up to the place where we live, so we're there now. So we'll close the sol tag, angle bracket, slash sol, and then close the XML tag, angle bracket, slash XML.
And we're done. We have a complete XML document that ought to validate, and I guess I could check that. I'll do xmllint sample.xml, and yeah, it looks good. It looks like it's happy with my XML code, my document, so everything's good, and we're ready to process that XML. With some application designed for processing XML, I would reasonably expect it to function, you know, as designed.
Okay, so let's talk a little bit about the components of our sample XML document. First of all, the entity itself, the thing that we have when we've made an XML document, is called a document, and that's an important term to keep in mind because you'll see it referenced in different places. Sometimes you'll find something called the document object model, or the DOM. You'll see that in relation to JavaScript and stuff too, but in XML there's a document object model, and that's a view of the data that you're representing, in whatever you're representing it in, in this case, XML.
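For readers following along in text rather than audio, here is a minimal sketch of the dictated document, loaded with Python's standard-library xml.etree.ElementTree. The indentation, the choice of "Terra" for the third planet, and the variable names are assumptions for illustration; the tags and values are the ones read out in the episode. Parsing it is roughly the well-formedness check that xmllint performs when run without a schema.

```python
import xml.etree.ElementTree as ET

# The sample document as dictated in the episode. Indentation is only
# for readability; the <xml> wrapper follows the host's description.
SAMPLE = """\
<xml>
  <sol>
    <planet>
      <name>Mercury</name>
      <albedo>0.11</albedo>
    </planet>
    <planet>
      <name>Venus</name>
      <albedo>0.7</albedo>
    </planet>
    <planet>
      <name>Terra</name>
      <albedo>0.39</albedo>
    </planet>
  </sol>
</xml>
"""

# fromstring() raises ET.ParseError if the text is not well-formed XML.
root = ET.fromstring(SAMPLE)

# Collect the planet entries that sit directly under <sol>.
planets = root.findall("./sol/planet")

print(root.tag)
print(len(planets))
```

If the parse succeeds, the document is at least well-formed; checking it against an agreed schema is a separate step.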
So the document, it's not just the icon on your desktop after you've created the file. Yes, we could call that a document, we could put it in our documents folder, yes, that's a document. But the document in terms of what parsers are looking at and so on, that is, essentially, from the opening XML tag to the closing XML tag. That's the document. You'll find similar definitions with, for instance, YAML, which I think I did an episode on. YAML, you know, opens with those three dashes, according to yamllint, and that's the beginning of a YAML document, and if you have three dashes somewhere else in that document, then you're starting a new document. So it's important for scoping to understand that this is the document. It contains the full tree of the data that we're representing.
Okay, so that's document. The next idea in terms of parsing is nodes. In this example, a node would be sol, a node would be planet, it could be name, it could be albedo. Those are all nodes in this XML document, as we've designed our schema, those are nodes.
So nodes, no, I'm sorry, name and albedo I think would not be nodes. Yeah, that's true, they would not be nodes. I do feel like some people would call them nodes casually, but parsers generally would not consider those nodes because they don't contain other tags. And that's important because what they do contain is what's called content, which makes them elements. So the name tag and the albedo tag are elements in our document because they contain content. So sol is a node that contains other nodes and elements, planet is a node that contains elements, name is an element because it actually contains the data, it contains the word Mercury, or Venus, or Terra, and albedo is an element because it contains content of 0.11, or 0.7, or 0.39, whatever. So those are the four things that you need to think about when looking at XML programmatically: the document itself, the nodes contained in the document, the elements contained in the nodes, and the content of the elements.
Okay, so when you're coming up with your XML document, for whatever reason, you are designing or following a schema. And a schema, S-C-H-E-M-A, is essentially just the agreed-upon tags and the hierarchy of those tags within a document. And as I've just demonstrated to you, you can do that on your own. You can say that this is a schema, our valid tags are sol, planet, name, and albedo. But more than that, which I'm not going to do right now, we could write up a document defining our schema and describing, for instance, where the tags are allowed to appear. So for instance, maybe the name tag is a pretty flexible tag. Planets have names, but so does a solar system. So instead of having a sol tag, we could say, let's just do system, and then the system could have a name tag, and so name
would be a valid tag within system, or within planet, but it would not be valid to put a name tag inside of, say, an albedo tag. That would be silly. Or maybe not, maybe you do want a name tag to classify what kind of albedo you're measuring. I don't know what the different kinds are, but I've seen different terminology, you know, ice versus liquid, and so on. Specular sounds like a type of albedo, I don't know. Point is, it would not be valid there, but it would be valid here. So that kind of definition of hierarchy, and I guess the parent-child relationship, that's a schema. And that's important to know about, because if you're using something that someone actually has defined, then you'll want to understand the schema, so that you don't violate it left and right.
A great example of this would be SVG, Scalable Vector Graphics, which is I think a W3C specification. They've written out exactly what kind of tags can exist, and then which tags can contain other tags in a valid way. And that's important because, you know, you might have some kind of tag very specific to circles that just could not apply to squares. I don't know if that's even mathematically reasonable, but let's assume it is. There's something specific to a circle that wouldn't work with squares, and so you wouldn't want the tag even to be allowed in this tag or that tag. So that's important to understand.
DocBook is another one. I've done an episode on DocBook, I think. If not, I really should. I use DocBook all the time for documentation. And it's just some guy, Norm Walsh, decided to invent a schema, and it got popular, and now it's, well, I don't know if it's very popular, but it is a pretty darn well-known technical documentation schema, and it has very strict rules as to where you can put tags, and where they're not
allowed, and so on. So those are schema.
And another thing to kind of wrap your head around is that data object model, or document object model, anyway, the DOM. It's important because it's another way of looking at your XML data. And that's the thing about XML: once you've created your XML document, you've put data into it, and you've structured it in a way that is logical, there are lots of different ways to then view that data. One of them is as XML. You just look at the tags, you kind of see their relationship. Maybe, if you've got a friendly XML application, you can see through indentation the hierarchy of all the tags. And that's one way to look at it, and it's a valid way, and it's maybe perfectly suitable for you.
But there are other ways, and one popular way is, I think of it kind of as a flow chart, but it's actually probably better to think of it as a family tree, because that demonstrates it in a pretty familiar way to us. If you've ever looked at a family tree, then you get it, right? You see, oh, those are my parents, and here's all the kids that they had, which includes me, and here's the kids that I've had, and so on. And so this DOM model can show you your data in a tree view. So again, just looking at the sample planet data, you might have at the very top your document, that's your entry point, the XML tag. Below that, right now, we only have one solar system in our database, as it were, so it would just be the sol tag, which again is probably not the optimal name for that in retrospect. It probably should have just been system, and then we could have done, like, name, sol. But for now, I'll just go with what we've got. So we've got the sol tag, and within sol, if you'll recall, there were one, two,
three planets, so under sol, we will have three children of sol, and it's going to be planet, planet, planet. And then under each of those nodes, the planet, planet, planet, you have two little children, name and albedo, name and albedo, name and albedo. So it forks out pretty rapidly there, and then multiplies even further, because you've got more children of children of a common parent. And you can do that with a whole document. A properly structured XML document, you could map it out as, like I say, kind of a family tree, and represent as much data as you want. Certainly if you just do it on a notepad, that would be one thing, where you just use the names of the nodes and the elements. But in applications that have dynamic ways of representing data, you might even be able to crawl through your document almost as index cards strung together, and you could see the bits of data that you need in this larger tree format. So, lots of different ways to look at it.
Another way, if you think about it, to represent that data would be through the very familiar URI scheme that you and I use all the time for internet websites, such as, what's an example here, hackerpublicradio.org slash index_full.php. That's actually not the greatest example, because it's very, very short. But, you know, like example.com slash episodes slash mygreatepisode.html, that shows inheritance. We understand from that that we're on this domain called example.com, and then a child of that domain is episodes. There might be another child of that domain, which would be code, and you go to the code folder to see code samples, and you go to episodes to hear the episode content, and then another sibling of those might be videos, and you go there to see the sample videos of the code being written, whatever.
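The family-tree view described above can be sketched in a few lines of Python using the standard-library xml.etree.ElementTree. The function name outline and the compact, one-planet-per-line layout of the sample are assumptions for illustration. One caveat on terminology: in the W3C DOM, every element (including name and albedo) is a node, and text content is a node as well, so the node and element split used in the episode is informal.

```python
import xml.etree.ElementTree as ET

# The episode's sample data, written compactly (layout is an assumption).
SAMPLE = (
    "<xml><sol>"
    "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
    "<planet><name>Venus</name><albedo>0.7</albedo></planet>"
    "<planet><name>Terra</name><albedo>0.39</albedo></planet>"
    "</sol></xml>"
)

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(SAMPLE))
print("\n".join(tree_lines))
```

Each level of indentation is one generation: xml at the top, sol beneath it, three planet children, and a name and albedo under each planet.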
So you've got all that in a linear sort of view, and you can do the same thing with the sample document. So again, you'd kind of start at the top, so you've got your document. Starting at the top in this format says to do a double slash. So slash slash, that's kind of like, go back to the beginning of the document. So now we're in XML, we're in the document, and the first tag we would look at would be sol, and then slash planet, and then you have to make your choice of what element you want to look at. So you want to look at the names, or you want to look at the albedo, because obviously in this linear fashion you can't fork off to lots of different places. You have to choose your path to the element that you want to view. So you might have slash slash sol slash planet slash name, or slash slash sol slash planet slash albedo, and you would get either the name of the planet or the albedo of the planet, depending on which directory, as it were, you went into.
And that's a useful convention, because sometimes you know the data that you want. In something called XPath, you'll use that kind of notation so that you can zero in and say, on all elements within this node called name, show me the contents of those elements, and then you would get returned Mercury, Venus, and Earth, and then you know what planets you're dealing with, and so on. Okay. And I will get into that more with XMLStarlet. I just want to introduce the idea that XML can be viewed in lots of different ways.
And now I will close out the episode with why I think XML is so great. I'm really fond of XML, and I know that's a hard sell, because it kind of has a reputation. A lot of people struggle with it. Even people who use it a lot struggle with it. I struggle with it all the time.
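As a rough illustration of the path notation described above, Python's standard-library xml.etree.ElementTree understands a limited subset of XPath. It expects a relative path on an element, so the sketch below writes .//sol/planet/name rather than //sol/planet/name. The sample layout and variable names are assumptions, and the third planet is named Terra here, as in the dictated sample.

```python
import xml.etree.ElementTree as ET

# The episode's sample data, written compactly (layout is an assumption).
SAMPLE = (
    "<xml><sol>"
    "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
    "<planet><name>Venus</name><albedo>0.7</albedo></planet>"
    "<planet><name>Terra</name><albedo>0.39</albedo></planet>"
    "</sol></xml>"
)

root = ET.fromstring(SAMPLE)

# Follow the path sol -> planet -> name and keep the text of each match.
names = [element.text for element in root.findall(".//sol/planet/name")]

# Same path, but choosing albedo at the last step instead of name.
albedos = [float(element.text) for element in root.findall(".//sol/planet/albedo")]

print(names)
print(albedos)
```

Only the final step of the path changes between the two queries, which is the "choose your directory" idea from the episode.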
And believe it or not, that is one of the best selling points for me. I really, really like the strictness of XML. And I say this from a place of deep and earnest pain, having been dealt to me by formats like Markdown and RST, and so on. And I love those formats. I honestly do. I use AsciiDoc, which you know is Markdown-ish. I use Markdown. I think they're great, because I love structure in text files. I really like structure more than a lack of structure. I didn't mean to say only in text files. There are lots of places where I like structure, but I'm speaking specifically about text files. I like structure rather than lack of structure, because it's easier to parse something structured.
And XML is highly structured, and that means that you can be very explicit about what you mean. So in XML, if you are typing out, I don't know, a bulleted list, then you can be very clear about when that bullet list ends, or how many paragraphs are meant to be within one bullet versus another bullet. And in Markdown and RST and such, it's not uncommon to attempt to generate data like that, and then you have to break out of your bullet list format for a moment to do, I don't know, a code sample or a block quote or something that you want in your bullet list, but your Markdown format, your non-markup format, decides, oh, you've thrown off the predictability of the parsing. You've done something that I didn't anticipate, and so we're going to break you out of your bullet list and then start a new bullet list later. I guess when it's bullets, it's not that big of a deal, but when it's numbers, it's really inconvenient. You can take your pick; there are lots of different examples of implicit notation, which is better than a complete lack of structure and just doing whatever you want, whenever you want, and trying to make parsers just guess, or make exceptions while parsing.
It does kind of fall apart sometimes if you go too far outside of expectations, whereas in XML, you can just do exactly what you want. You still have to fit within the schema, you have to be writing valid XML, but if you've chosen your schema well, then you can still do what you want to do. It's just that you happen to be marking it up clearly while you're doing it. So I really like the structure of XML.
I also like the strictness of XML, and that seems almost counterintuitive, but the fact that when I go to process an XML document, and I get a flat refusal from my parser to process it because there's an error in it, that's one of the best things that could ever happen, because that means I catch the error before I push it out to the world, or to the next person in the pipeline, where it immediately becomes their problem. That's not what I want. I want to be able to catch the error before it leaves my desk and fix it, and then pass it on to the next step in the pipeline, whether that's posting it to the internet, or handing off a document to someone else for further processing, or whatever. It's just really good to be alerted about the bugs that you've accidentally introduced into your document. So I would much rather have that early frustration of, why aren't you doing what I want you to do, parser? I'd much rather have that in the privacy of my own office than to post it onto the internet and break my podcast feed again, or push it over to the next person in the pipeline at work, and have someone ask me if I needed some time to go learn XML properly, because I've just broken the PDF build there, whatever. It's just better for me to suffer than everyone after me to suffer.
I also really like XML because it's really, really great at transforming, and this is arguably not that big of a benefit anymore. I mean, transformers are a dime a dozen now. Pandoc exists. It can transform practically anything.
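The "flat refusal" described above is easy to see with Python's standard-library parser. This is only a sketch of a pre-publish check: the function name check and the two one-line documents are made up for illustration. ET.ParseError carries a position attribute holding the line and column where the parser gave up.

```python
import xml.etree.ElementTree as ET

def check(text):
    """Return None if text is well-formed XML, else the (line, column) of the error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None

good = "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
bad = "<planet><name>Mercury</name><albedo>0.11</planet>"  # albedo is never closed

print(check(good))
print(check(bad))
```

Running a check like this before publishing a feed or handing a document down the pipeline is one way to keep the error on your own desk.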
However, that said, XML actually really does lend itself well to that sort of dynamic interchange format, and for that reason, you find it in lots of places. You'll find database dumps offering XML as an output option. You'll find XML in LibreOffice documents, and in EPUBs there's XML in there, not a whole lot of it, but there is some. There's XML, like I say, in SVGs, and lots of different places. I just kind of appreciate the concept that this is a predictable, and I'm not going to say easy to parse, because nothing's easy to parse. I'm just going to say it's predictable. It's a predictable format, and there's a lot of support for it, and even when there's not a lot of support for that specific schema, you can usually figure it out, you know, because there are so many different ways to view that data, there are lots of parsers for the data, and so you can generally transform that data into whatever you need.
It's not always easy, but I think in those cases, it's not easy because nothing would be easy. It wouldn't matter whether it was JSON, or XML, or a spreadsheet, or whatever else. It's just transforming data from one format to another. It can be hard sometimes, because you have to map stuff, and you have to come up with equivalents of things, and that can be difficult. But the XML toolchain tends to be pretty robust, and I really appreciate that about XML. Arguably, that's not XML. It's not XML's fault that it's well-supported. I mean, if the toolchain exists, then the toolchain exists, and I think if the toolchain went away, I would probably still appreciate XML as a format. I don't know if I'd use it as much, because the support is really nice, but the format, that explicit format, is quite nice. And just to be clear, I'm not saying that you should only ever use XML for everything no matter what.
I'm just saying that of the many formats we have the luxury of choosing in the open specification world, XML is a good one. So don't shy away from it if you see it. If you can get to the point where you kind of understand it, which hopefully this episode and my next one on XMLStarlet will help you come to terms with, you start to see through all those messy tags and all the noise, or what looks like noise, and you get through to the actual data, and it starts to make sense.
And I guess that's the other important thing to realize about XML that I have completely ignored: the added context that it provides. A lot of data formats have a perfectly well-structured way to store your data, but it's only predictable, it's not self-descriptive, whereas XML can be quite self-descriptive. For instance, the document that says XML, sol, planet, name, Mercury, albedo, 0.11, that tells you exactly what data that is, because you've read the tags and you see everything in relation to it. If you had some other document, you might see a list of clearly identifiable planets, but you would see numbers, like 0.11, 0.7, 0.39. What is that? Is that their value in, I don't know, ounces of gold? Is it the rotational tilt of the axis? Is it the average number of oxygen particles? I don't know, what is that number? There's no way to necessarily tell unless someone designed a format or was careful about how they listed it. I could imagine a JSON document having key-value pairs that would clearly identify, well, this is a planet, and this is the albedo value of that planet. This is a planet, et cetera. So, I mean, it can be done, but I find that XML, more or less by default, it depends on the designer, but more or less by default, provides very, very important context to the data that it stores, and I really, really, really love that as well.
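To make the transformation and context points concrete, here is a small sketch that turns the sample XML into the kind of key-value JSON imagined above, using only Python's standard library. The function name planets_as_records and the shape of the JSON output are assumptions for illustration, not a format the host proposes.

```python
import json
import xml.etree.ElementTree as ET

# The episode's sample data, written compactly (layout is an assumption).
SAMPLE = (
    "<xml><sol>"
    "<planet><name>Mercury</name><albedo>0.11</albedo></planet>"
    "<planet><name>Venus</name><albedo>0.7</albedo></planet>"
    "<planet><name>Terra</name><albedo>0.39</albedo></planet>"
    "</sol></xml>"
)

def planets_as_records(text):
    """Map each <planet> to a dict whose keys are taken from the XML tag names."""
    root = ET.fromstring(text)
    records = []
    for planet in root.iter("planet"):
        records.append({
            "name": planet.findtext("name"),
            "albedo": float(planet.findtext("albedo")),
        })
    return records

records = planets_as_records(SAMPLE)
print(json.dumps(records, indent=2))
```

The tag names carry over as the JSON keys, so the number 0.11 stays labeled as an albedo rather than becoming a bare value in a list.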
Okay, I think those are all the reasons that I love XML. I hope I've explained how XML sort of works, at least in theory, and in the next episode, we'll talk about ways to parse it. What's your favorite data format? Why don't you record an episode about that for Hacker Public Radio? And if you need inspiration, I'll just say a couple of key phrases here. JSON is the worst data format I've ever encountered. YAML, horrible. It's the worst data format other than JSON. Markdown. What could be worse than Markdown? It's a format for people who don't really want to write. I can't think of anything else mean to say about other formats, but hopefully this has angered you and will compel you to record an episode telling me and everyone in the world why that format is your favorite and is far superior to XML. I don't know if you can start a flame war so intentionally, but I mean, I'm trying here. If we could do this, Hacker Public Radio only stands to benefit. Thanks for listening. Talk to you next time.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.