Episode: 2013
Title: HPR2013: Parsing XML in Python with Xmltodict
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2013/hpr2013.mp3
Transcribed: 2025-10-18 13:19:16
---
This is HPR episode 2013 entitled, Parsing XML in Python with Xmltodict, and is part
of the series, A Little Bit of Python. It is hosted by Klaatu and is about 14 minutes
long.
The summary is, a quick introduction to xmltodict, an XML parser for Python.
This episode of HPR is brought to you by AnHonestHost.com.
Get a 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hi everyone, this is Hacker Public Radio, my name is Klaatu.
I wanted to talk to you about parsing XML in Python with a module called xmltodict.
In another episode, I talk about the module untangle, and I've also talked about JSON parsing
in Python.
This is just basically another option.
It's very much an alternative to untangle.
It basically has the same goal, which is to get you away from trying to parse XML with
really manual, labor-intensive tools like lxml or even Beautiful Soup. Beautiful Soup
does great for HTML.
I don't love it for XML.
xmltodict is a little bit more polished than untangle and it's a little bit more feature
rich because it does have namespace support and things like that, which I'm not going
to get into because I really haven't done that much parsing with xmltodict with namespaces
yet.
I have something that I do need to parse with a namespace and I will be using xmltodict
for that, but I haven't really implemented it yet.
Right now it's in Beautiful Soup.
So let's take a quick example of XML.
Again, this will be the pseudo-DocBook code that is so close to my heart.
Let's open up a book tag, and then we'll give it a chapter, ID equals prologue,
title, The Beginning, close title, open a para.
This is the first paragraph, close the para, close chapter, and then another chapter
because most books have more than just one chapter.
So chapter, ID equals end, title, The Ending, close title, para, last paragraph, close
para, close chapter, close book.
So there's our nice and tidy little XML document.
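
Written out, the document just described might look like the following. This is a sketch reconstructed from the spoken description, so the capitalization and the exact paragraph text are assumptions. It is wrapped in a short Python snippet that saves it as example.xml and uses the standard library to confirm that it is well formed.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# The pseudo-DocBook sample, reconstructed from the spoken description.
# Capitalization and the exact paragraph text are assumptions.
EXAMPLE_XML = """<book>
  <chapter id="prologue">
    <title>The Beginning</title>
    <para>This is the first paragraph.</para>
  </chapter>
  <chapter id="end">
    <title>The Ending</title>
    <para>Last paragraph.</para>
  </chapter>
</book>
"""

# Save it under the file name used in the episode.
Path("example.xml").write_text(EXAMPLE_XML, encoding="utf-8")

# Sanity check: the standard library can parse it, so it is well formed.
root = ET.fromstring(EXAMPLE_XML)
print(root.tag, len(root.findall("chapter")))
```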
We'll call it example.xml and then we'll install xmltodict.
That is X-M-L-T-O-D-I-C-T, all one word, and you can install it from your
distribution repository or with pip install xmltodict or pip install dash dash user
xmltodict.
If you're doing it from a repository it's probably something like python dash xmltodict.
That's how you usually see that listed.
Okay.
We've got our example XML and we've got our parser installed and so here's how we do it.
So the idea of xmltodict, as the name suggests, is to take XML data and convert it into a dictionary,
into a Python ordered dictionary specifically, and that matters.
That's a little bit unique.
That's different than just a normal dictionary and if you've never used an OrderedDict
you might be in for some unpleasant surprises.
It's a little bit more strict and structured than just a normal Python dictionary, unfortunately.
So you have to kind of learn how to use it even if you feel, oh, I know dictionaries.
You may not know ordered dictionaries.
Either way, it's very, very similar to JSON.
So essentially, I mean, this is almost an XML to JSON converter.
I mean, it really is.
In fact, I think there's even a module, or not a module, but a function in here
in xmltodict where you can actually just dump it out to JSON, if I recall correctly.
Or maybe you have to do that with the json module, but either way, it turns into a Python dictionary
and from there you can basically just dump it into JSON.
So if you know JSON, then you basically know how to use xmltodict.
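
As a rough sketch of that last point, the standard json module can serialize whatever xmltodict.parse returns. The inline XML string is a shortened, assumed stand-in for the example document, and the sketch uses json.dumps rather than assuming any JSON helper inside xmltodict. Newer xmltodict releases may hand back a plain dict rather than an OrderedDict; json.dumps handles both.

```python
import json

import xmltodict

# A shortened stand-in for the example document (assumed content).
xml_text = (
    '<book><chapter id="prologue">'
    "<title>The Beginning</title>"
    "</chapter></book>"
)

data = xmltodict.parse(xml_text)

# json.dumps accepts dict and OrderedDict alike.
as_json = json.dumps(data, indent=2)
print(as_json)
```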
So in Python, we would do an import xmltodict and then we would do the same basic steps
as we did with JSON or with untangle.
It would be with open, parentheses, quote, example.xml, close quote, close parentheses,
as infile, colon, and then one line under, indented, data equals xmltodict dot parse,
parentheses, infile dot read, parentheses, close parentheses, close parentheses.
There you go. That just takes all of our XML in and dumps it into a dictionary.
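
Typed out, the dictated steps look roughly like this. The names infile and data come from the episode; the file content written at the top is an assumed reconstruction of the example, included only so the snippet runs on its own.

```python
from pathlib import Path

import xmltodict

# Recreate a small example.xml so this snippet is self-contained
# (content assumed from the spoken description).
Path("example.xml").write_text(
    "<book>"
    '<chapter id="prologue"><title>The Beginning</title>'
    "<para>This is the first paragraph.</para></chapter>"
    '<chapter id="end"><title>The Ending</title>'
    "<para>Last paragraph.</para></chapter>"
    "</book>",
    encoding="utf-8",
)

# The steps from the episode: open the file, read it, parse it.
with open("example.xml") as infile:
    data = xmltodict.parse(infile.read())

# The single root element becomes the only top-level key.
print(list(data.keys()))
```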
And you can see that for yourself. If you type in the word data, it will dump it all back out.
It'll tell you, hey, this is an OrderedDict.
And then it's like parentheses, square bracket, parentheses, quote, book, close quote, comma,
OrderedDict, parentheses, square bracket, and then parentheses again, quote,
chapter, and on and on and on. So it just keeps kind of going.
And that's as easy as it is. You've got a JSON object right there.
It's kind of cool. So from that stage, it's really just a matter of acting as
though you're dealing with JSON. It's really just data, square bracket, quote, book,
quote, close square bracket. And that would dump out the book element.
But we've already seen that, essentially. So let's do a chapter element.
So we could do data, square bracket, quote, book, quote,
close square bracket, square bracket, quote, chapter, close quote,
close square bracket. And then you would get OrderedDict, blah, blah, blah, quote, at ID,
prologue, title, The Beginning, para, this is the first paragraph, OrderedDict, quote, at ID,
end, quote, title, close quote, The Ending, para, last paragraph of the chapter.
So you get like little dictionaries just containing each chapter, essentially.
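
A minimal sketch of that lookup, using an inline copy of the example (its exact text is an assumption). Because two chapter elements share the same parent, xmltodict returns them as a list of dictionaries.

```python
import xmltodict

# Inline copy of the example document (exact text assumed).
EXAMPLE_XML = """<book>
  <chapter id="prologue">
    <title>The Beginning</title>
    <para>This is the first paragraph.</para>
  </chapter>
  <chapter id="end">
    <title>The Ending</title>
    <para>Last paragraph.</para>
  </chapter>
</book>
"""

data = xmltodict.parse(EXAMPLE_XML)

# Repeated <chapter> elements come back as a list, one dict per chapter.
chapters = data["book"]["chapter"]
for chapter in chapters:
    # dict() gives the same readable output for dict and OrderedDict.
    print(dict(chapter))
```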
Again, it's a lot of data to look at. But the good thing is that
you can keep drilling down into it just like with everything else.
And again, since there are two chapters in this example file,
you can specify which one you want. So data, square bracket, quote, book, quote, close square bracket,
square bracket, quote, chapter, quote, close square bracket, square bracket, zero, close square bracket.
That will just show you the first, the zeroth, chapter. And the same thing with a one
would show you the second chapter, chapter one by index, and all of its elements.
Or as I was about to say, you can continue to narrow down your focus. So yes, I know that the
element is in book. I know that it's in a chapter tag. It's redundant to see
all that information when you've had to use that information in order to get to the point
that you're looking at. So data, square bracket, book, close square bracket, and put quotes around those,
square bracket, chapter, close square bracket, square bracket, zero, close square bracket,
square bracket, quote, para, quote, close square bracket. That would show you just the contents of the
para tag of the first chapter, the zeroth chapter. And that would be, or the para value, I guess,
would be, this is the first paragraph. And you can do the same thing for chapter one.
It would be data, book, chapter, one, para. And that would show you, this is the last paragraph,
or whatever we had. So the syntax to get to where you want to go is a little
bit verbose. I mean, I don't feel like it's a whole lot more verbose than, let's say, untangle.
And certainly not that much more verbose than JSON. I mean, it's a dictionary.
It's just that you're stepping through these elements one by one. And you're getting results.
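
The drilling down described here, selecting a chapter by index and then its para, can be sketched like this. The inline XML is an assumed reconstruction of the example, and the variable names are only illustrative.

```python
import xmltodict

# Inline copy of the example document (exact text assumed).
EXAMPLE_XML = """<book>
  <chapter id="prologue">
    <title>The Beginning</title>
    <para>This is the first paragraph.</para>
  </chapter>
  <chapter id="end">
    <title>The Ending</title>
    <para>Last paragraph.</para>
  </chapter>
</book>
"""

data = xmltodict.parse(EXAMPLE_XML)

# Index 0 is the first chapter, index 1 the second.
first_chapter = data["book"]["chapter"][0]
second_chapter = data["book"]["chapter"][1]

# One more key narrows the result to the text of each <para>.
first_para = data["book"]["chapter"][0]["para"]
last_para = data["book"]["chapter"][1]["para"]
print(first_para)
print(last_para)
```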
And that's really what we want. You can look straight at attributes too. Of course,
this is XML. So it's not just a matter of stepping through a dictionary. There are
things embedded within the tags. For instance, in our chapters, we have an ID equals
prologue and we have an ID equals end, or ending, or whatever I put in there.
So if you want to get the attribute of a chapter because maybe that matters, you know,
you want to look at all the chapters, but only the ones with an ID attribute, you can do that.
And the special notation in xmltodict for attributes is the at symbol.
Makes sense if you think about it: at, attributes. So let's say we know that
some of our chapters have ID attributes. So we could say data, square bracket, quote, book, quote,
close square bracket, square bracket, quote, chapter, quote, close square bracket, square bracket, zero,
close square bracket, square bracket, quote, at ID, close quote, close square bracket. And that would
return the single word prologue. Or if we did that for one, it would return
the word end. Or we could iterate through that. We could say something like for C in range,
parentheses, zero, comma, two, close parentheses, colon, next line, indented, data, square bracket, quote,
book, close quote, close square bracket, square bracket, quote, chapter, close quote,
close square bracket, square bracket, C, so that's our iterator, the integer that we're iterating through,
close square bracket, square bracket, quote, at ID, quote, close square bracket, and we would get prologue
and end because the two chapters that we have are prologue and end.
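
A sketch of the attribute lookup and the loop, again on an assumed inline copy of the example. range(0, 2) yields 0 and 1, which are the two valid list indexes here; a range of one and two would skip the prologue and raise an IndexError on index 2. The list comprehension at the end is an added variation, not something from the episode, that avoids hard-coding the chapter count and keeps only chapters that carry an id.

```python
import xmltodict

# Inline copy of the example document (exact text assumed).
EXAMPLE_XML = """<book>
  <chapter id="prologue">
    <title>The Beginning</title>
    <para>This is the first paragraph.</para>
  </chapter>
  <chapter id="end">
    <title>The Ending</title>
    <para>Last paragraph.</para>
  </chapter>
</book>
"""

data = xmltodict.parse(EXAMPLE_XML)

# Attributes are stored under keys prefixed with "@".
first_id = data["book"]["chapter"][0]["@id"]
print(first_id)

# The loop from the episode, collecting the ids as it goes.
ids = []
for c in range(0, 2):
    ids.append(data["book"]["chapter"][c]["@id"])
print(ids)

# Variation: iterate over the chapters directly and skip any without an id.
ids_present = [ch["@id"] for ch in data["book"]["chapter"] if "@id" in ch]
```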
There you go. I don't know why I started with zero, really. I think I could have just done
one and two in that range. I could be wrong. Anyway, you can do more. You can get
just the contents of all of this stuff too, the part that we read as humans, you know,
inside the tags. If you want that, then you use the special
character hash. And I have only ever seen that as hash text. I have never seen it called
anything else. And that's because there's no further tag element for it.
You know, it's the CDATA. It's the thing that's
inside of all of these things that we're looking at. So it's kind of this weird,
nebulous sort of netherworld. But yeah, it's hash text. And that gives you the strings within
whatever element it is. And obviously, I mean, I'm just using this kind of pseudo-DocBook example
and the only thing in this example that would have text... no, that's not true. The title,
the title has text. The para would have text. Actually, I think with the para
we might only have access to that with a hash symbol if there was an attribute in there as well.
I haven't really tested that a whole lot. Either way, when you look at a dump of your dictionary,
you will see either one or two special entries, special keys, one with
the at symbol for attributes, and one possibly with a hash to denote that it's not an attribute,
it's the contents of the tag. So if our para had a class, for instance, para class equals quote
foo, close quote, and some text in the para element, like, this is the last chapter, then
we could call our attribute with the at symbol and the text inside of para with the hash.
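
A small sketch of the two cases, using the hypothetical para with class foo from the description above. With xmltodict's default settings, an element that has an attribute exposes its text under the "#text" key, while an element with no attributes comes back as a plain string. The variable names are illustrative only.

```python
import xmltodict

# A para with an attribute: the text moves under the "#text" key.
with_attr = xmltodict.parse('<para class="foo">This is the last chapter</para>')
print(with_attr["para"]["@class"])
print(with_attr["para"]["#text"])

# A para without attributes: the value is just the string itself.
without_attr = xmltodict.parse("<para>This is the last chapter</para>")
print(without_attr["para"])
```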
Something like title, which in my example so far has never had any attribute,
the only thing that you can call from title is the contents, The Beginning, The Ending, whatever I put
in the title. So you don't have to use a special notation for that. Oh, look, a little bit of a
quirk again. Yes, some of these modules that are simplifying XML do indeed have
some quirks, I guess, you know, some unique things about them that make them a little bit
of a hack, but again, it makes XML so much easier. It's honestly, honestly, honestly
worth it. It's worth every quirk. It's fine. It's a lot better.
It's a trade-off that you are probably willing to make if you've ever tried parsing XML with
anything. You probably already are kind of seeing how this could be very nice.
Now, on the other hand, obviously, if you're doing something where
you need more advanced features, then this may be too simple. Like I say,
xmltodict does have namespace support. It can dump back out into JSON. It can actually, you know,
go back to XML, of course. So I mean, it's not as simple, I don't think, as untangle,
but it is definitely simpler than something like lxml. The trade-offs are there and it's kind of up
to you whether or not you want to use it. I find both untangle and xmltodict a lot easier than the
defaults. I highly recommend them. I find them just really, really sublime when I'm parsing
fairly simple XML. It just makes things really, really easy. So check them out. If you're
working with XML and Python, they're worth looking at. Hopefully this has been informative. Thank you
for listening and I'll talk to you next time.
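
The trip back to XML mentioned near the end of the episode can be sketched with xmltodict.unparse, which takes a dictionary with a single root key. The inline XML is a shortened, assumed stand-in for the example, and pretty=True only adds indentation to the output.

```python
import xmltodict

# A shortened stand-in for the example document (assumed content).
xml_text = (
    "<book>"
    '<chapter id="prologue"><title>The Beginning</title></chapter>'
    '<chapter id="end"><title>The Ending</title></chapter>'
    "</book>"
)

data = xmltodict.parse(xml_text)

# Turn the dictionary back into an XML string.
xml_again = xmltodict.unparse(data, pretty=True)
print(xml_again)
```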
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast
network that releases shows every weekday, Monday through Friday. Today's show, like all our shows,
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded
by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution
at binrev.com. If you have comments on today's show, please email the host directly, leave a comment
on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is
released under the Creative Commons Attribution-ShareAlike 3.0 license.