Episode: 3632 Title: HPR3632: Intro to web scraping with Python Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3632/hpr3632.mp3 Transcribed: 2025-10-25 02:31:44 --- This is Hacker Public Radio Episode 3,632. For Tuesday 5 July 2022, today's show is entitled Intro to Web Scraping with Python. It is part of the series Programming 101. It is hosted by Klaatu and is about 32 minutes long. It carries a clean flag. The summary is: using Requests and Beautiful Soup to scrape websites. Hey everybody, this is Klaatu. At work, I was recently giving a lesson to an intern on how to get started doing some web scraping, because part of the intern's self-appointed assignment (I don't really know how internships work, I've never had one) was that they were going to learn how to do web scraping with Python. Specifically, the project centered around analyzing the word count of web pages and the image count on web pages, and I imagine probably other things like download times or something like that. I'm not really sure, but there was definitely a focus on analyzing the different HTML elements of a website, and so that was the prompt given to me to provide a lesson on, and I figured I could just record that lesson as well. This is a pretty straight translation of the lesson plan that I devised for this intern. I also learned in this experience that the term web scraping can have some negative connotations for some people, which I found interesting. It had never occurred to me before, but for some number of people, it seems, web scraping is associated with data harvesting, essentially. Certainly that is one use of web scraping, but obviously there are lots of uses for it, and in fact I made the argument, or rather it wasn't an argument, it was more of an explanation: I explained that web scraping really is no different from web browsing, and that the two terms are essentially the same. They are both downloading information off of the internet. One uses a graphical browser, and one uses code, Python code. Interestingly, web browsers are also written with code; the browser is just the interface by which you're looking at that content on the web. I found myself very frequently adapting my language as I spoke about this web scraping project to people, so that I would use the term web scraping, but I would also insert verbiage such as browsing the web with Python, or downloading information from the internet with Python, or just straight-up web browsing, except you're doing it with Python, or with code. That was something that I found interesting, a little bit of a culture shock, maybe that's a strong term for it, but a little wrinkle in culture, just from people who think about information on the internet in different ways. And definitely, whether you agree with it or not, you can hear how abrasive the term sounds: scrape. That is not usually a pleasant thing; you don't like it when things scrape against other things. So web scraping sounds almost malicious, but it doesn't have to be.
This little tutorial that I'm about to share with you right now is going to use mostly Beautiful Soup and a little bit of Requests. Of course, there are other libraries in Python that interface with the internet, so those aren't the only two avenues, but they are the avenues that I used, especially given the ultimate goal of parsing all those HTML tags. So browsing the web is something that we all do practically all day sometimes, and generally it's a very manual process. You open a web browser, you navigate to a URL, you go to a web page, you click on a link that you want to view, you view the page with your eyes, you see all the text or the images that you want to see, and then you go somewhere else. You click somewhere else in your browser. That's a legitimate way to browse the web, but it is, again, a very manual way to browse the web. If you want to, I don't know, copy some text off of a web page, then you have to select all that text, and a lot of times selecting the text is a little bit tricky, because the layout might not be conducive to a straight drag selection. Or possibly, if you have a hundred things to download, it can be very, very time consuming, and that can be problematic. Some people might think, well, there should be ways to automate that, and there are ways to automate that, and one of those ways is Python. Python has a couple of different libraries or modules that help with looking at information on the internet. The most basic one is called Requests, R-E-Q-U-E-S-T-S. It's a library. It is not actually part of the Python standard library, so if it isn't already on your system you can install it with python -m pip install requests. And I have found that if you use a good open source Python IDE, it will help with that. The one that I demonstrated was PyCharm, the Community Edition (don't get the non-community edition, that's not open source; get the Community Edition, it's got all the things that you're likely to need). Of course, there are other Python IDEs as well. I used to use one called Ninja IDE, but last I looked, it hadn't been updated recently. You can also use things like VSCodium or Atom, even though Atom is kind of on its way out now, unfortunately. There are lots of great little IDEs and text editors that you can use for quick and helpful Python programming. So with Requests, the most basic web browsing process you can do would be a Python file that starts with import requests. You've just imported the requests library, and requests is going to get a URL for you. Now, the URL that you're going to get is entirely up to you at this stage, and for brevity, I'm going to just hard-code it into this Python script. In real life, you would not want to do that. You would probably want to leave it open to an argument that you pass to the script when you launch it. So you would have my-download-script.py, and then some URL; you wouldn't even have to put the http:// part, you could just put, say, example.com. Then in your script, you would read whatever argument the user has provided, and you would go to that URL. But I'm going to leave that as a separate exercise.
If you don't know how to read parameters or arguments from the command line, then look into that; it's a good trick to know. So I'm going to create a variable here called data, D-A-T-A, and set it to the quoted string http://example.com. That's a URL. Like I say, normally you'd pass that in on the command line, but for now we're hard-coding it, and it's going to be put into a variable called data. The next variable I'm going to call page, P-A-G-E, and I'm going to set that equal to the result of requests.get(data). What that means is that the variable page is going to receive the output of requests.get. That's a function; get is a method within the requests library. So we're just telling the requests library to go get whatever is at whatever is in data. And that wasn't me misspeaking: whatever is at whatever is in data, meaning Python knows to translate the value of data, example.com, into a URL, and to go get it with requests. Finally, we do print(page.text). Once again, it's a little tricky, but we're printing page.text, and the reason we're able to say .text after the page variable is that page contains structured data. Requests didn't just grab a bunch of plain text and dump it into page; it provided structured data, and so we have access to segments of the page variable. We can look at just the text, or we can look at the response code, if we want to see: was it a 404? Was it a 200? That one is page.status_code. So there are different segments of page, and the only reason for that is that requests is programmed to hand page that information. That can be tricky; sometimes you want page to just be a bunch of text, because that's what you downloaded, right? But requests is a little bit more complex than that, for better or for worse. It depends on your requirements, but requests is quite nice. When it goes to a page, if it finds a 200 response, it stores that in a little segment of the variable, and if it encounters some text, it stores that, and so on. You can go to the Requests documentation and find out what other segments there are; the only two I know off the top of my head are text and status_code. Okay, so you've just created a script that will download all of the contents of example.com and then dump it rather unceremoniously into your terminal. If you save that script and run it with python ./my-download-script.py, you'll get the results: the contents of example.com in your terminal. Notice that it is not terribly structured. I mean, it's as structured as HTML is, but the output itself is probably all over the place. There's nothing necessarily pretty about the output. It really is just a code dump of whatever it found at example.com. So that's useful, and requests is nice for grabbing information. Certainly you could print the contents of a web page out like that and then maybe use grep or awk to parse your output.
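Put together, a minimal sketch of that script might look like this (the sys.argv handling is my addition to illustrate the command-line exercise mentioned above; it isn't dictated in the episode):

import sys
import requests

# Take the URL from the first command-line argument if one is given;
# otherwise fall back to the hard-coded URL from the episode.
# (The argv fallback is an assumption, added to illustrate the exercise.)
data = sys.argv[1] if len(sys.argv) > 1 else "http://example.com"

page = requests.get(data)    # returns a structured Response object
print(page.status_code)      # e.g. 200 or 404
print(page.text)             # the raw HTML, dumped to the terminal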
You could do that, but there is yet a more intelligent library, one that not only knows the difference between a response code and the contents of a page, but also understands the difference between an HTML tag, like <p> or <div>, and the contents of an HTML tag. So for instance, with <h1>Hello world</h1>, it knows the difference between the h1 and the title, like Hello world or example.com or whatever's inside the h1. It knows the difference between the text and the... well, is it really the style? The markup. The markup and the content. Okay, so to get started with that, you need to download and install Beautiful Soup as a Python module. The manual way of doing that is python -m pip install beautifulsoup4 (the package name is beautifulsoup4, all one word). But honestly, if you use a good IDE, it'll manage that for you. With something like PyCharm Community Edition, you set up a project, and then when you type in an import, it offers to download that import for you locally within your project environment. So you're not, for lack of a better word, corrupting or polluting, let's say polluting, the rest of your system with Beautiful Soup. Which, again, is not a bad thing to have; but you're not installing Beautiful Soup on your system, you're installing it within the virtual environment of this particular Python project, which is nice. It is legitimately nice, because that way you know exactly what your Python project absolutely requires to run. That's difficult to know if you installed Beautiful Soup three years ago and start using Beautiful Soup modules without ever remembering that you installed it. You just kind of think, well, it's Beautiful Soup, everyone will have that, right? Well, no, they won't. Just like your virtual environment doesn't have it. Okay, so let's look at essentially the same script that we just did with requests, except with Beautiful Soup as well as requests. So: from bs4 import BeautifulSoup (capital B, capital S), and import requests. We're getting both the Beautiful Soup module and the requests module. We'll do the same opening, essentially, except we'll speed things up a little. Instead of making a separate variable for the URL, we'll just do page = requests.get("http://example.com"). That's just downloading the page with requests and dumping all of that structured data into page. Then we're going to make some soup. So soup = BeautifulSoup(page.text, "html.parser"). Here we're doing a special Beautiful Soup call, and we're telling Beautiful Soup to look at the textual content of the variable page, which of course contains the results of what requests got from example.com.
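As a sketch, the setup described so far looks like this, assuming the beautifulsoup4 package is installed:

import requests
from bs4 import BeautifulSoup

# Download the page with requests...
page = requests.get("http://example.com")

# ...then hand the raw HTML text to Beautiful Soup's built-in HTML parser.
soup = BeautifulSoup(page.text, "html.parser")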
And then we're saying to filter that through, or maybe to interpret that through, the HTML parser, which is built into Beautiful Soup. The way that you would know to do that is you would go to the Beautiful Soup documentation, which is at beautiful-soup-4.readthedocs.io (beautiful, dash, soup, dash, the number four). You would go there and you could look at some tutorials, first of all, which would reiterate some of what I'm saying here. But you could also look at the documentation of what kinds of parsers it has and what kinds of library functions it has, and then you could use what you find in your own code. In other words, I'm not being comprehensive here; I'm just telling you the basic parsing abilities of Beautiful Soup, but there's much, much more that it can do. Okay, so for instance, we'll do my favorite, favorite incantation from Python (that was sarcasm, you heard): if __name__ == '__main__': and then, on the next line, indented, print(soup.prettify()). What that says is that if a user is launching this script intentionally, in other words, this script is not being called as a library from some other application, it is actually being used as a script, then print soup.prettify(). And soup, obviously, is our variable that contains the results of BeautifulSoup(page.text, "html.parser"), but we're running it through a little function, which we can think of as a filter in this case, called prettify, P-R-E-T-T-I-F-Y. In other words: print it, but make it pretty first. It's a little bit like a Unix pipe, in a way; you're sending it through a sort function or something like that. Except what you're really doing is telling Beautiful Soup to print the code out in such a way that the indentation is consistent and each tag has its own line. I think that's probably it, but it makes the output look pretty, which, in contrast to the raw output of requests, can be very useful. And certainly, if you were going to parse this with an external tool like awk or grep for whatever reason, it would be a lot easier to do that from the output of Beautiful Soup's prettify than from the raw output of requests, which might have unpredictable indentation, unpredictable line breaks or no line breaks, and so on. So prettify is one of the basic but really pleasant functions of Beautiful Soup. But there is more. For instance, what if you just wanted to (and this is the use case of this lesson, actually) find the paragraph tags? You just want to find the <p> elements. I mean, you can see the content as well, but you want to filter your output and have Beautiful Soup exclude everything but paragraphs.
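With the main guard added, the whole prettify script is roughly this:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

# Only runs when this file is executed as a script,
# not when it is imported as a library.
if __name__ == "__main__":
    print(soup.prettify())   # consistent indentation, one tag per line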
Well, Beautiful Soup is pretty well aware of HTML, because it is using its HTML parser. So you can do a print, for example: print(soup.p). Now, if you're used to Python, you may already see the problem with this, but you can do it. So instead of print(soup.prettify()), you do print(soup.p), and you get a paragraph tag: a paragraph or a sentence surrounded by <p> tags. Now, more than likely the web page that you are scraping doesn't contain just one paragraph tag. What you are seeing in your output is the first paragraph tag encountered on that page, which can sometimes be revealing for SEO and such. You could look at your first paragraph and realize that, as far as whatever search engine knows from its scrape, the first line of your homepage might be, I don't know, follow us on Face-slam or whatever. Who knows? It could be completely irrelevant to your site. So you could diagnose that with web scraping. What do you know, it's not just a malicious thing after all. So there's the paragraph tag, but it's but a single paragraph tag. If you want all the paragraph tags, then you can use a for loop. Now, in an ideal world you might as well make this into a function; I think you would probably try to make a function that could abstract away the element that you're looking for, but I didn't get that far, so this is still kind of hard-coded. You can make a function in Python with the keyword def. Yeah, D-E-F, that's what it is. It actually stands for define, I guess. Why they don't just use the word function, I don't know; we'll never understand why programming languages don't just say what they mean, but def is what we get. So: def loop_it():, and then on the next line, indented once, for tag in soup.find_all('p'):, and on the next line, indented again, print(tag). Okay, so for tag: there's nothing magical about the word tag, it's just something that I chose. I could have chosen for i, for item, for penguin. It doesn't matter. It's just some dynamically defined, quick and easy, disposable variable. We just need some place to hold what we find in soup with the function called find_all. That's a Beautiful Soup function. So soup.find_all('p') is saying: every time you find a paragraph, put it into a variable called tag, and hey, if you have a tag, print it. Now, if you have that function in your code, along with all the rest of the code that I've already talked about, and you run it, nothing will happen, because you're not actually calling that function yet. In order to make the function run, you have to tell your Python program to execute the code within the function. That's one of the advantages of functions: they don't happen unless you explicitly tell them to happen, and they don't happen until you tell them to happen. So in the if __name__ == '__main__': section, don't print something.
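Here's a sketch of that function, with the call wired into the main guard as described in the next step:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    # find_all('p') returns every paragraph element in the document
    for tag in soup.find_all("p"):
        print(tag)

if __name__ == "__main__":
    loop_it()   # the function only runs because we explicitly call it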
Just call loop_it(), or whatever you called your function. I called it loop_it because it seemed like an obvious name. Now run your code, and you'll see all the paragraphs in example.com. You can also get just the content. Remember I said that Beautiful Soup could separate the markup, like <p> or <div> or <img src="blah" />, from the actual content. So you could get just the text, the strings, the words on the page. The way you do that is in your loop_it function: for tag in soup.find_all('p'), then instead of just printing the tag (and remember, tag is just a variable name that I chose, it doesn't mean anything: when you find a p tag, put it in this variable called tag; it could be item, it could be fish, it could be whatever you want), you print tag.string. String, of course, is the programming lingo for what we would call a word: a string of letters, or characters, essentially. So now you're printing the content, and once you have the text of a web page, you can parse it further with standard Python string methods. In this particular case, the person asking me for this lesson wanted to learn how to count the words of a page. The quick and dirty way to do that would be: for tag in soup.find_all('p'), if tag.string is not None. This is important. It's if tag.string (that's the words of the p element) is not None, with a capital N. The reason we have to say this is that Beautiful Soup sometimes recognizes that there's a paragraph tag without content, but it doesn't just not print it; it assigns it a special value called None. Some other programming languages call it null or nil; Python calls it None, N-O-N-E, with a capital N. So we're saying: as long as it's not None, so there is content here, then (or not then, because Python doesn't use the word then, but colon, next line, indent) print(len(tag.string.split())). Len, L-E-N, as in length. What we're doing there is counting how many strings we get when we split the text on the default separator, whitespace, and that's essentially the word count. There might be cleaner ways to do that, to catch odd little exceptions or whatever, but that's a pretty quick and dirty way to get the word count of all the content of all the paragraph tags. Now, there might be other tags that have content you want to count, like h1s and h2s, or ordered lists that don't use paragraphs. You can use paragraphs in ordered lists, but let's say you didn't: you just added an ordered list and then the list items, and went without paragraphs. Who knows? You would have to take all that into account. And of course, by doing it the quick and dirty way that I just described, you're not getting a total of the words; you're getting it for each paragraph. So what if you wanted the total?
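A sketch of that quick-and-dirty per-paragraph word count:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    for tag in soup.find_all("p"):
        # skip paragraphs that Beautiful Soup reports as None
        if tag.string is not None:
            # split() breaks the text on whitespace;
            # len() counts the resulting words
            print(len(tag.string.split()))

if __name__ == "__main__":
    loop_it()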
Well, that's another loop trick, where you do a function, def loop_it():, next line, indent, and then you create a counter variable. By counter variable I don't mean counter as in anti; I mean counter as in I'm going to count now. Let's call it num, N-U-M, as in number: num = 0. We're setting our counter to zero. Then we do the same thing: for tag in soup.find_all('p'), if tag.string is not num... I said not num, didn't I? If tag.string is not None, N-O-N-E, colon, next line, indent again, num = num + len(tag.string.split()). Essentially we're doing the exact same thing, but instead of printing that number to your terminal every single time it encounters a p tag, we're adding it back to the running total. So num equals num plus the length of the tag string, split. And then at the end, outside of your for loop, back at the original indentation just inside the function, you print the grand total is num. And now you have your grand total printed at the very end of the run. Of course, there's a lot more information you can extract with Beautiful Soup and Python. You could, for instance, like I said (and I think this would probably be at the top of the list eventually), accept input. You would want to be able to feed your little script a URL dynamically as you launch it, so that you don't have to go into the code and update the value of the variable every time you want to change the URL that you're downloading. That would be something to look into. You could also count the number of images, because again, that was something that was specifically asked: how can we count images? You know how to single out the paragraph tag, so you also really know how to single out the image tag. Now, there's inheritance, and children of parent tags, and things like that, which you might want to take into account; they might help you filter things out or filter things in. So you would want to learn about the way Beautiful Soup views, or walks, the structure of the document. Once you can find the images that you want in a reliable way, you would maybe want to count them, and again, you know how to do that. Maybe that would be a separate function, or maybe you could find a way to use the loop_it function to find not just the paragraph tags but other tags as well, or maybe you could use it to do both of those things. I don't know if that would be smart; you might want a loop_it_p and a loop_it_image. Obviously, it depends on what you're doing with your program. But I think this definitely gives you a little bit of insight into how web scraping works. You could use this, for instance, to find all of the links to pictures on a website, or all of the links to media, videos or something, on a website. You would do that by zeroing in on a href tags, or video tags, or image tags, or some kind of audio tag, whatever. You would zero in on that, and then you would find the content, or the attribute of... well, it depends on what you're looking at.
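A sketch of the running-total version, plus hypothetical variants for the image-counting and link-listing ideas mentioned here (count_images and list_links are my illustrations, not code dictated in the episode):

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    num = 0   # the counter variable, starting at zero
    for tag in soup.find_all("p"):
        if tag.string is not None:
            num = num + len(tag.string.split())   # add to the running total
    # outside the for loop: print the total once, at the end
    print("The grand total is", num)

def count_images():
    # hypothetical variant: count the <img> tags instead
    print(len(soup.find_all("img")))

def list_links():
    # hypothetical variant: pull the href attribute out of each <a> tag
    for tag in soup.find_all("a"):
        print(tag.get("href"))

if __name__ == "__main__":
    loop_it()
    count_images()
    list_links()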
But yeah, so you would zero in on that, and then you would take a look at the part of that element that you need, whether it's the content or some other part of it. Then you could store that in a variable and process it later. Who knows what you want to do. But yeah, this is the start of web scraping with Python. It's a really pretty well documented system; I mean, Beautiful Soup is really, really strong for that. Whether it's the best or not, I don't know. Maybe someone else will record a Hacker Public Radio episode to tell us the way that they do web scraping. I don't do a lot of web scraping, truth be told; I do it in sprints. I'll do some web scraping for a week, and then I'll walk away from it and never bother again for another three years. So I do, but I don't. And when I do, I tend to either shell-script curl or use Beautiful Soup, because they're both really, really good for different things, I think. So yeah, that's web scraping 101, I guess. I hope that was useful to some people, or interesting. Thank you very much for listening, and I'll talk to you next time. You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.