Episode: 2012
Title: HPR2012: Parsing XML in Python with Untangle
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2012/hpr2012.mp3
Transcribed: 2025-10-18 13:18:12
---
This is HPR episode 2012 entitled Parsing XML in Python with Untangle, and is part of the series A Little Bit of Python.
It is hosted by Klaatu and is about 21 minutes long. The summary is: a quick introduction to Untangle, an XML parser for Python.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hi everyone, this is Hacker Public Radio. My name is Klaatu, and I'm going to be talking about parsing XML in Python with a great little module.
The module is called Untangle. So, XML you probably know. If you don't know it, you certainly know one of its relatives. I don't know if "subset" is the right terminology anymore, and I don't want to get into political debates, but HTML is very similar to XML. At one point it was going to be literally an implementation of it: that was XHTML.
It was actually really horrible, but yeah, if you've seen HTML, let's just say you've basically seen XML. And if you really want to see XML, just go open up the RSS feed of Hacker Public Radio. You will see XML, because RSS is, in fact, an implementation or a schema of XML.
So is Atom. So it's everywhere. A lot of people hate it, actually. I will say that I'm a fan of XML. I know that's a crazy thing to say, but I really am. It's got a strangely bad reputation, and it's not one of those things where I don't understand why it has a bad reputation.
I see the badness of XML. I see the labor-intensive parsing of XML, and I understand that. I see how verbose it is, and I understand why that is considered maybe a little bit unnecessary sometimes. But at the same time, I do respect XML a lot, and I'm actually quite fond of it in many ways.
And the thing about XML that I really like is the explicitness. It doesn't leave a whole lot to guessing. It tells you exactly where everything is in relation to everything else. It is very explicit: it tells you exactly what things are,
and it tells you the attributes of those things. That's why it's verbose, because it is just so descriptive. And it is very strict. It is infamously strict: if you write bad XML, and we all have, then parsers will break. It is not tolerant of quirks really at all.
In fact, it's funny that HTML, one of its most popular close relatives, is so liberal and so tolerant of quirks, and XML is just so intolerant of them.
So that's what XML is. It's out there, you're going to encounter it, and sometimes you're going to actually want to use it, or at least that's how I've found things to be.
I've found use for XML on several occasions where JSON just isn't robust enough, or plain text just wouldn't do it justice, or whatever.
So XML has its uses, but as I said at the beginning, it can be a little bit difficult to parse, and honestly, I have found that in Python.
Some of the absolute most painful XML parsing I've ever done has been with the usual suspects of the Python module group. One of those usual suspects is lxml, which I want to say is the built-in one, but I don't think it actually is. I think you have to add that in.
There's lxml, and then there's Beautiful Soup. A lot of people talk about Beautiful Soup a lot.
And those are fine tools, don't get me wrong. I just personally find that when you open up the one that's supposed to be super user-friendly and really great, i.e. Beautiful Soup,
and you ask the internet how to locate some tag within your XML tree, it's giving you Beautiful Soup mixed with XPath commands.
I don't know, to me that misses the point of even using Python to parse XML. I mean, if you have to do XPath, why am I even bothering? That's how I feel.
And that's just my personal opinion. Other people, especially if you're very used to XPath, maybe you love it. Maybe that's just totally logical to you. To me, it's not. And so I went on a search for alternative parsers within Python.
And I found two good ones. In this episode, I'm going to talk about untangle. Untangle takes an XML document and converts it into a Python class.
I mean, it takes your XML and each element becomes an object, essentially, with a bunch of attributes. And you can probe that object, that class, for information about itself.
That probably doesn't make a whole lot of sense, but don't worry. It's actually really simple. And I can give you a really basic example. So let's say we have an XML document.
And I'm going to do a little bit of pseudo-DocBook here. It's probably not exactly correct, but it's close enough. So we'll say we have a book tag. And in the book tag, we have a chapter with an id of prologue.
And then in that chapter, we have a title, and the title is "The beginning". And then we close the title, and we close the chapter, and then we close the book.
So we have no real content here, except for a chapter and a title of that chapter. And all of that sits in the book element.
Okay, so that's a really basic, mostly compliant DocBook document. So if we take that into Python, then we can do an import untangle.
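For readers following along in text, here is a rough sketch of the sample document described above, held in a Python string. The exact indentation and the variable name EXAMPLE_XML are assumptions, since the episode only describes the file verbally. The standard library's xml.etree.ElementTree is used here only to confirm that the text is well-formed; it is not part of untangle.

```python
import xml.etree.ElementTree as ET

# Pseudo-DocBook sample as described in the episode: a book containing
# one chapter (id="prologue") with a single title. Saved to disk, this
# would be the example.xml file used later on.
EXAMPLE_XML = """<book>
  <chapter id="prologue">
    <title>
      The beginning
    </title>
  </chapter>
</book>
"""

# Sanity check only: parse the string and look at the root and chapter.
root = ET.fromstring(EXAMPLE_XML)
chapter = root.find("chapter")
print(root.tag)           # book
print(chapter.get("id"))  # prologue
```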
And untangle, by the way, is not included with Python. You will need to install it separately. I find that easiest to do with pip: pip install untangle, or pip install --user untangle.
Or of course, if your repository has it, python-untangle is usually what it's called, or I guess it could be something like pyuntangle, but I think I've usually seen it as py-untangle. Then you can just install it from your repository as well. I use pip because at work
I can't install things to the system. So I do pip install --user, whatever.
So that's untangle installation. So yeah, if we take our little sample XML document into Python, we can do an import untangle.
And then we can assign the XML document that we've just created to a variable, passing it through untangle's parser. So we'll do data = untangle.parse("example.xml").
So now all that XML data has been parsed and dumped into a variable called data.
And at this point, you can do data.book. And why is it data.book? Well, because data is the name of the
object that we've created, right? We did data = untangle.parse. So that's our object, data.
And then the .book is saying, well, I want to look in the attribute called book, which is our root element in the XML tree,
because that was the first element. Remember, it was book, chapter, title. So data.book would return the string Element(name = book, attributes = {},
cdata = ). So right there, in data.book, you get the name of the element, which is book, any attributes contained in that element, of which there are none,
and any content data, cdata, in that element, which again is nothing. That seems a little bit weird, because you think, but there's a chapter and a title in that book.
And that is true. And that is a little bit weird. But it's not giving you the entire tree, obviously. It's looking at that one element and at its own contents, essentially.
So since we do know that there's a chapter, you can do data.book.chapter. And that would return Element(name = chapter, attributes = {'id': 'prologue'},
cdata = ). So this time we got our chapter element. We called it that way. We said book.chapter.
So we got the name of the element, which is chapter, and we got the attributes, which is the word id, that's the key, and the value of that is prologue.
And that's just something that I made up. And of course, there's still no content here. It's all just element and attribute.
So we could drill down further and do data.book.chapter.title, which would return the string Element(name = title, attributes = {}, cdata = The beginning).
And the reason that it's like that is because, again, we've got the tag itself, the element called title, and we didn't give it any attributes in my example.xml.
And the content is the only content that our little book contains. It's the words "The beginning". So that's our book right now.
It's actually quite simple. As you can see, it just lines everything up as classes, and you can call them in the order that they appear in your document and get the data about them.
And that's quite nice. Now, you can also go in and extract individual bits of information. You don't have to get the whole big string of everything.
Getting the attributes returned when there are no attributes present is kind of wasteful, and getting the name of the element is a little bit weird
if, in order to call that element, you had to know the name. So let's say, for instance, that you know that there's a chapter, but you want just the value of its id attribute.
So you would do data.book.chapter['id']. So it's just like a dictionary. You're saying data.book.chapter, but just show me the id attribute of this thing that you're returning to me.
And that would of course tell you, oh, that's prologue. And similarly, we could say, okay, well now data.book.chapter.title
.cdata. And then I would add a .strip(), and we would get the string "The beginning".
There's some inconsistency there, it seems, because with data.book.chapter you're looking up the id key,
but to get the cdata, you're treating it basically like another attribute of this sort of class thing that you've got. So I'm not saying that untangle is fantastic in terms of consistency.
I'm just saying it's a lot easier than learning, well, XPath, to be honest.
So what if you have more than just a chapter and a title? We could expand our example.xml into: book, chapter id="prologue", title, "The beginning", close title, and then open a paragraph.
And we could say "This is the first paragraph", and then close the paragraph, and then close that chapter.
And then of course, most books have more than one chapter. So we could say, okay, chapter id="end", title, "The ending", close title, paragraph, "Last paragraph of this chapter of this book",
close paragraph, close chapter, close book. So now we've got this book document with two chapters, meaning that it also has two titles, and it also has two paragraphs.
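Written out, the expanded document might look roughly like the string below. The paragraph tag is assumed to be para, which is what the episode uses when accessing it later, and the exact paragraph wording is approximate. As before, xml.etree.ElementTree is only a standard-library sanity check, not part of untangle.

```python
import xml.etree.ElementTree as ET

# Two chapters, each with a title and one paragraph.
BOOK_XML = """<book>
  <chapter id="prologue">
    <title>The beginning</title>
    <para>This is the first paragraph.</para>
  </chapter>
  <chapter id="end">
    <title>The ending</title>
    <para>Last paragraph of this book.</para>
  </chapter>
</book>
"""

root = ET.fromstring(BOOK_XML)
chapter_ids = [chapter.get("id") for chapter in root.findall("chapter")]
print(chapter_ids)  # ['prologue', 'end']
```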
It's a little bit more realistic. So what you can do there is return to this kind of pseudo, almost-a-list syntax.
So you go data.book.chapter, and okay, which chapter do you want? Well, we want chapter zero. So we'll do data.book.chapter[0].
And that would return Element(name = chapter, attributes = {'id': 'prologue'}, cdata = ). And then data.book.chapter[1] would return Element(name = chapter, attributes = {'id': 'end'}, cdata = ), so again, nothing.
These still don't have content. And the chapter will never have content of its own, because it only contains other elements.
You can do essentially the same thing further down. We now have two chapters, as I said, and two paragraphs. So if we just said, okay, I want data.book.chapter.title, it's going to tell us that it has no idea what we're asking for.
But if we then say, okay, I want data.book.chapter[0].title.cdata.strip(), then we would get the string "The beginning".
And similarly, if we said data.book.chapter[1].title.cdata.strip(), then we would get the string "The ending".
So in other words, to sum that little snippet up, if you've got more than one of an element, and you usually will in your document, then you have to tell it which parent element you're looking at.
So if it's the first chapter, it has to be chapter zero; if it's the second chapter, chapter one; and so on.
There's a point at which that stops being as important, because, well, even in HTML, you can have a div inside of a div inside of a div.
But at some point you run out of nesting things. And that's when you can kind of drop the specificity.
That's a hard word to say. So yeah, just keep in mind that if you've got multiple parents, then you have to define which parent you're looking at when you want to look at certain children.
And you don't have to do title, obviously. You can do data.book.chapter[0].para.cdata.strip(), and you'd get the string
"This is the first paragraph". And if you did data.book.chapter[1].para.cdata.strip(), then you would get "Last paragraph of the book", or whatever I wrote in there.
And you can also, of course, iterate over all of this stuff. There are lots of different ways to do that. It gets a little bit complex to be honest.
So the super easy way would be doing something like count = [0, 1], or, of course, in real life, you could use a range or something.
And then you could say for tick in count:, and then, indented, print(data.book.chapter[tick]).
And that way you're essentially printing first data.book.chapter[0], and you would get all the information there. And then you would get data.book.chapter[1],
and you would get all the information from that. And obviously you could drill down further to just get certain elements or whatever you would need.
And I could see you needing to do that if you were parsing XML. You would probably want to read in information from specific tags from all the different occurrences of their parents.
So that's untangle. Is it perfect? I don't know if it's perfect. I really don't know. But I do know that it's a little bit wonky sometimes in terms of syntax. Like I pointed out earlier, there's the fact that you can call up the chapter id, or some element's attribute, by its key.
I don't feel like it's very clear as to when you can do that and when you can't. Although, if you look at the element and there's a dictionary in there, then yes, you can call it by dictionary key. I mean, technically speaking, yes, it does work.
Then there's the fact that cdata has no apparent type. It just says cdata = and then the strings that are contained in the tag contents.
That's a little bit weird. And in XML, very frequently you've got a lot of white space, and strangely, untangle preserves a lot of that white space, which usually is not what you want.
I mean, that's kind of the point of having tags. You can structure your document such that it looks nice to you, but the tags actually point to the important bits of information. So the fact that untangle does that is a little bit odd. And that's why I kept using .strip() to get rid of the surrounding white space on the tag contents. So there are quirks
to the untangle module, but it's really not that bad once you get used to it.
And it is a heck of a lot easier than trying to navigate the XML tree with really any other tool that I've used so far.
Again, my use of it is probably limited. If you are going to do fancier things with XML, this might not work for you. This might be too simple.
And that is the trade-off, and it's acknowledged in the untangle documentation: the idea is to make this really, really simple rather than to have this super robust, feature-full XML parser. It's more just about having a parser that works.
So it's very, very convenient. It's very nice. You should check it out. It's called untangle. It's available on the internet. Thank you for listening to this episode. My name is Klaatu. This has been Hacker Public Radio. Talk to you next time.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the Binary Revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.