Episode: 2013
Title: HPR2013: Parsing XML in Python with Xmltodict
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2013/hpr2013.mp3
Transcribed: 2025-10-18 13:19:16

---
This is HPR episode 2013, entitled "Parsing XML in Python with Xmltodict", and is part of the series "A Little Bit of Python". It is hosted by Klaatu and is about 14 minutes long.
The summary is: a quick introduction to xmltodict, an XML parser for Python.
This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hi everyone, this is Hacker Public Radio, my name is Klaatu. I wanted to talk to you about parsing XML in Python with a module called xmltodict. In another episode, I talked about the module untangle, and I've also talked about JSON parsing in Python. This is basically just another option. It's very much an alternative to untangle. It basically has the same goal, which is to get you away from trying to parse XML with really manual, labor-intensive tools like lxml or even Beautiful Soup. Beautiful Soup does great for HTML; I don't love it for XML.
xmltodict is a little bit more polished than untangle, and it's a little bit more feature rich, because it does have namespace support and things like that. I'm not going to get into that, because I really haven't done much parsing with xmltodict with namespaces yet. I have something that I do need to parse with a namespace, and I will be using xmltodict for that, but I haven't really implemented it yet. Right now it's in Beautiful Soup.
So let's take a quick example of XML. Again, this will be the pseudo-DocBook code that is so close to my heart. Let's open up a book tag, and then we'll give it a chapter with id equals "prologue": title, "The beginning", close title; open a para, "This is the first paragraph", close the para; close chapter. And then another chapter, because most books have more than just one chapter. So chapter id equals "end": title, "The ending", close title; para, "Last paragraph", close para; close chapter; close book. So there's our nice and tidy little XML document.
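Written out, the document described above might look like the following. This is a sketch reconstructed from the spoken description: the exact paragraph wording and the EXAMPLE_XML name are assumptions, and the snippet simply saves the text as example.xml in the current directory.

```python
from pathlib import Path

# The pseudo-DocBook sample from the episode, reconstructed from the
# spoken description (the paragraph wording is approximate).
EXAMPLE_XML = """<book>
  <chapter id="prologue">
    <title>The beginning</title>
    <para>This is the first paragraph</para>
  </chapter>
  <chapter id="end">
    <title>The ending</title>
    <para>This is the last paragraph</para>
  </chapter>
</book>
"""

# Save it under the file name used in the rest of the episode.
Path("example.xml").write_text(EXAMPLE_XML, encoding="utf-8")
```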
We'll call it example.xml, and then we'll install xmltodict. That is spelled x-m-l-t-o-d-i-c-t, and you can install it from your distribution's repository, or with pip install xmltodict, or pip install --user xmltodict. If you're installing it from a repository, it's probably called something like python-xmltodict. That's how you usually see it listed.
Okay. We've got our example XML and we've got our parser installed, so here's how we do it. The idea of xmltodict, as the name suggests, is to take XML data and convert it into a dictionary, into a Python ordered dictionary specifically, and that matters. That's a little bit unique. It's different from a normal dictionary, and if you've never used an OrderedDict, you might be in for some unpleasant surprises. It's a little bit more strict and structured than a normal Python dictionary, unfortunately. So you have to kind of learn how to use it, even if you feel like, oh, I know dictionaries. You may not know ordered dictionaries.
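For anyone who hasn't met OrderedDict before, here is a small standard-library sketch. OrderedDict is a dict subclass from the collections module, so the usual lookups work on it. The chapter value below is hand-built for illustration and is not actual xmltodict output; depending on the xmltodict version, parse() may hand back an OrderedDict or a plain dict, and the same key access applies to both.

```python
from collections import OrderedDict

# A hand-built stand-in for one parsed chapter (illustrative only).
chapter = OrderedDict([("@id", "prologue"), ("title", "The beginning")])

# Lookups, type checks, and iteration work as with a plain dict.
print(chapter["title"])
print(isinstance(chapter, dict))
print(list(chapter))

# Compared with a plain dict, only the contents matter, not the order.
print(chapter == {"title": "The beginning", "@id": "prologue"})
```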
Either way, it's very, very similar to JSON. Essentially, this is almost an XML-to-JSON converter. I mean, it really is. In fact, there's even a function in xmltodict, if I recall correctly, where you can just dump it out to JSON. Or maybe you have to do that with the json module, but either way, it turns into a Python dictionary, and from there you can basically just dump it into JSON. So if you know JSON, then you basically know how to use xmltodict.
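Since the parsed result is a dictionary, the standard json module can serialize it. A minimal sketch, assuming a shortened inline copy of the sample document (the xml_text and as_json names are illustrative); json.dumps here comes from the Python standard library rather than from xmltodict itself.

```python
import json

import xmltodict

# Shortened inline copy of the sample document (illustrative).
xml_text = """<book>
  <chapter id="prologue"><title>The beginning</title></chapter>
  <chapter id="end"><title>The ending</title></chapter>
</book>"""

data = xmltodict.parse(xml_text)

# The parsed result behaves like a dict, so json.dumps accepts it.
as_json = json.dumps(data, indent=2)
print(as_json)
```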
So in Python, we would do an import xmltodict, and then we would do the same basic steps as we did with JSON or with untangle. It would be with open('example.xml') as infile: and then, one line under and indented, data = xmltodict.parse(infile.read()). There you go. That takes all of our XML in and dumps it into a dictionary.
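The steps just described can be sketched as a short script. To keep it runnable on its own, this version first writes a copy of the sample document into a temporary directory; that location and the xml_path name are assumptions made for the sketch, while the open-and-parse part follows the spoken description.

```python
import tempfile
from pathlib import Path

import xmltodict

# Write a copy of the sample document so the sketch runs on its own.
xml_path = Path(tempfile.mkdtemp()) / "example.xml"
xml_path.write_text(
    '<book>'
    '<chapter id="prologue"><title>The beginning</title>'
    '<para>This is the first paragraph</para></chapter>'
    '<chapter id="end"><title>The ending</title>'
    '<para>This is the last paragraph</para></chapter>'
    '</book>',
    encoding="utf-8",
)

# Read the whole file and hand the text to xmltodict.parse().
with open(xml_path) as infile:
    data = xmltodict.parse(infile.read())

print(list(data.keys()))  # ['book']
```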
And you can see that for yourself: if you type in the word data, it will dump it all back out and tell you, hey, this is an OrderedDict. And then it's like OrderedDict([('book', OrderedDict([('chapter', and on and on and on. It just kind of keeps going. And that's as easy as it is. You've got a JSON object right there.
It's kind of cool. So from that stage, it's really just a matter of acting as though you're dealing with JSON. It's really just data['book'], and that would dump out the book element. But we've essentially already seen that, so let's do a chapter element. We could do data['book']['chapter'], and then you would get OrderedDict, blah, blah, blah, '@id', 'prologue', 'title', 'The beginning', 'para', 'This is the first paragraph', then OrderedDict, '@id', 'end', 'title', 'The ending', 'para', 'Last paragraph of the chapter'. So you get little dictionaries, each one containing a chapter, essentially.
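A small sketch of that lookup, using an inline copy of the sample document (the xml_text and chapters names are illustrative). The two sibling chapter elements come back as a list with one dictionary per chapter.

```python
import xmltodict

# Inline copy of the sample document (paragraph wording approximate).
xml_text = """<book>
  <chapter id="prologue">
    <title>The beginning</title>
    <para>This is the first paragraph</para>
  </chapter>
  <chapter id="end">
    <title>The ending</title>
    <para>This is the last paragraph</para>
  </chapter>
</book>"""

data = xmltodict.parse(xml_text)

# Two <chapter> siblings are collected into a list of dictionaries.
chapters = data["book"]["chapter"]
for chapter in chapters:
    print(dict(chapter))
```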
Again, it's a lot of data to look at. But the good thing is that you can keep drilling down into it, just like with everything else. And since there are two chapters in this example file, you can specify which one you want. So data['book']['chapter'][0] will show you just the first chapter, the zeroth chapter. And the same thing with a one would show you the second chapter and all of its elements.
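Picking a chapter by index can be sketched like this. The first part mirrors the lookups described above. The second part is an added caution rather than something from the episode: with only one chapter element, xmltodict returns a single dictionary instead of a list, so the integer index assumes repeated siblings. The "only" and "Solo" values are made up for the illustration.

```python
import xmltodict

xml_text = """<book>
  <chapter id="prologue"><title>The beginning</title></chapter>
  <chapter id="end"><title>The ending</title></chapter>
</book>"""

data = xmltodict.parse(xml_text)

# Index 0 is the first chapter, index 1 is the second.
first = data["book"]["chapter"][0]
second = data["book"]["chapter"][1]
print(first["title"])
print(second["title"])

# Caution: a lone <chapter> is returned as one dictionary, not a list.
single = xmltodict.parse(
    '<book><chapter id="only"><title>Solo</title></chapter></book>'
)
print(isinstance(single["book"]["chapter"], list))
```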
Or, as I was about to say, you can continue to narrow down your focus. Yes, I know that the element is in book, and I know that it's in a chapter tag. It's redundant to see all that information when you've had to use that information in order to get to the point that you're looking at. So data['book']['chapter'][0]['para'], with quotes around the key names, would show you just the contents of the para tag, or the para value, I guess, of the first chapter, the zeroth chapter. That would be "This is the first paragraph". And you can do the same thing for index one. It would be data['book']['chapter'][1]['para'], and that would show you "This is the last paragraph", or whatever we had. So the syntax to get to where you want to go is a little bit verbose. I don't feel like it's a whole lot more verbose than, let's say, untangle, and certainly not that much more verbose than JSON. I mean, it's a dictionary. You're just stepping through these elements one by one, and you're getting results.
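Drilling all the way down to the paragraph text looks like this in practice. A minimal sketch with an inline copy of the sample document; the paragraph wording is approximated from the spoken description.

```python
import xmltodict

xml_text = """<book>
  <chapter id="prologue">
    <title>The beginning</title>
    <para>This is the first paragraph</para>
  </chapter>
  <chapter id="end">
    <title>The ending</title>
    <para>This is the last paragraph</para>
  </chapter>
</book>"""

data = xmltodict.parse(xml_text)

# Each extra key or index narrows the result by one level.
first_para = data["book"]["chapter"][0]["para"]
last_para = data["book"]["chapter"][1]["para"]
print(first_para)
print(last_para)
```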
And that's really what we want. You can look straight at elements too. Of course, this is XML, so it's not just a matter of stepping through a dictionary; there are things embedded within the tags. For instance, in our chapters, we have an id equals "prologue" and an id equals "end", or "ending", or whatever I put in there. So if you want to get the attribute of a chapter, because maybe that matters, maybe you want to look at all the chapters but only the ones with an id attribute, you can do that. The special notation in xmltodict for attributes is the at symbol. It makes sense if you think about it: at, attributes. So let's say we know that some of our chapters have id attributes. We could say data['book']['chapter'][0]['@id'], and that would return the single word "prologue". Or if we did that for one, it would return the word "end". Or we could iterate through that. We could say something like for c in range(0, 2): and then, on the next line, indented, data['book']['chapter'][c]['@id'], where c is our iterator, the integer that we're iterating through. We would get "prologue" and "end", because the ids of the two chapters that we have are "prologue" and "end".
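The attribute lookup and the loop can be sketched as follows. The range(0, 2) loop follows the spoken version; since list indexes start at zero, it visits index 0 and index 1. The list comprehension after it is an alternative added here (the ids name is illustrative) that avoids hard-coding the number of chapters.

```python
import xmltodict

xml_text = """<book>
  <chapter id="prologue"><title>The beginning</title></chapter>
  <chapter id="end"><title>The ending</title></chapter>
</book>"""

data = xmltodict.parse(xml_text)

# Attributes are stored under keys prefixed with "@".
for c in range(0, 2):
    print(data["book"]["chapter"][c]["@id"])

# Same result without hard-coding how many chapters there are.
ids = [chapter["@id"] for chapter in data["book"]["chapter"]]
print(ids)
```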
There you go. I started with zero there because the list of chapters is indexed from zero, so a range from zero to two covers chapter zero and chapter one. Anyway, you can do more. You can get just the contents of all of this stuff too, the part that we read as humans, the part in between the tags. If you want that, then you use the special character hash. I have only ever seen that as #text; I have never seen it called anything else. And that's because there's no further tag element for it. It's like the CDATA. It's the thing that's inside of all of these things that we're looking at, so it's in this weird, nebulous sort of netherworld. But yeah, it's #text, and that gives you the strings within whatever element it is. Obviously, I'm just using this pseudo-DocBook example, and the only thing in this example that would have text... no, that's not true. The title has text. The para would have text. Actually, I think we might only have access to that with the hash symbol if there was an attribute in there as well. I haven't really tested that a whole lot. Either way, when you look at a dump of your dictionary, you will see either one or two special keys: one with the at symbol for attributes, and possibly one with a hash to denote that it's not an attribute, it's the contents of the tag. So if our para had a class, for instance para class equals "foo", and some text in the para element, like "This is the last chapter", then we could call our attribute with the at symbol and the text inside of para with the hash.
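A quick sketch of that case, using a single chapter whose para carries the class attribute from the example above. With xmltodict's default settings, an element that has both an attribute and text becomes a small dictionary with an "@" key and a "#text" key, while an attribute-free element such as title stays a plain string.

```python
import xmltodict

# One chapter whose <para> has both an attribute and text.
xml_text = """<chapter id="end">
  <title>The ending</title>
  <para class="foo">This is the last chapter</para>
</chapter>"""

data = xmltodict.parse(xml_text)
para = data["chapter"]["para"]

# Attribute via "@", text content via "#text".
print(para["@class"])
print(para["#text"])

# No attributes on <title>, so its value is just the string.
print(data["chapter"]["title"])
```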
Something like title, which in my example so far has never had any attribute: the only thing that you can call from title is the contents, "The beginning", "The ending", whatever I put in the title. So you don't have to use a special notation for that. Oh, look, a little bit of a quirk again. Yes, some of these modules that simplify XML do indeed have some quirks, some unique things about them that make them a little bit of a hack, but again, it makes XML so much easier. It's honestly worth it. It's worth every quirk. It's a lot better. It's a trade-off that you are probably willing to make if you've ever tried parsing XML with anything. You're probably already seeing how this could be very nice.
Now, on the other hand, obviously, if you're doing something where you need more advanced features, then this may be too simple. Like I say, xmltodict does have namespace support. It can dump back out into JSON. It can go back to XML, of course. So it's not as simple as untangle, I don't think, but it is definitely simpler than something like lxml. The trade-offs are there, and it's up to you whether or not you want to use it. I find both untangle and xmltodict a lot easier than the defaults. I highly recommend them. I find them really, really sublime when I'm parsing fairly simple XML. It makes things really, really easy. So check them out. If you're working with XML and Python, they're worth looking at. Hopefully this has been informative. Thank you for listening, and I'll talk to you next time.
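The trip back to XML mentioned in the episode is not demonstrated there. A minimal sketch of what it might look like with xmltodict.unparse, using a shortened one-chapter document; the xml_again name is illustrative.

```python
import xmltodict

data = xmltodict.parse(
    '<book><chapter id="prologue"><title>The beginning</title></chapter></book>'
)

# unparse() turns the dictionary back into an XML string.
xml_again = xmltodict.unparse(data, pretty=True)
print(xml_again)
```

Parsing xml_again a second time should give back the same dictionary, which is a handy sanity check for a round trip.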
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.