Episode: 3632 Title: HPR3632: Intro to web scraping with Python Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3632/hpr3632.mp3 Transcribed: 2025-10-25 02:31:44 --- This is Hacker Public Radio Episode 3,632. For Tuesday 5 July 2022, today's show is entitled Intro to Web Scraping with Python. It is part of the series Programming 101. It is hosted by Klaatu and is about 32 minutes long. It carries a clean flag. The summary is: using Requests and Beautiful Soup to scrape websites. Hey everybody, this is Klaatu. At work, I was recently giving a lesson to an intern on how to get started doing some web scraping, because part of the intern's self-appointed assignment (I don't really know how internships work, I've never had one) was that they were going to learn how to do web scraping with Python. Specifically, the project centered around analyzing the word count of web pages and the image count on web pages, and I imagine probably other things like download times or something like that. I'm not really sure, but there was definitely a focus on analyzing the different HTML elements of a website, and so that was the prompt given to me to provide a lesson on, and I figured I could just record that lesson as well. This is a pretty straight translation of the lesson plan that I devised for this intern. I also learned in this experience that the term web scraping can have some negative connotations for some people, which I found interesting. It had never occurred to me before, but for some number of people, it seems, web scraping is associated with data harvesting, essentially. Certainly that is one use of web scraping, but obviously there are lots of uses for it, and in fact I made the argument, or rather it wasn't an argument, it was more of an explanation: I explained that web scraping really is no different from web browsing, and that the two terms are essentially the same. They are both downloading information off of the internet. One uses a graphical browser, and one uses code, Python code. Interestingly, web browsers are also written with code; the browser is just the interface by which you're looking at that content on the web. I found myself very frequently adapting my language as I spoke about this web scraping project to people, so that I would use the term web scraping, but I would also insert verbiage such as browsing the web with Python, or downloading information from the internet with Python, or just straight-up web browsing, except you're doing it with Python, or with code. That was something that I found interesting, a little bit of a culture shock, maybe that's a strong term for it, but a little wrinkle in culture, just from people who think about information on the internet in different ways. And definitely, whether you agree with it or not, you can hear how abrasive the term sounds: scrape. That is not usually a pleasant thing; you don't like it when things scrape against other things. So web scraping sounds almost malicious, but it doesn't have to be.
This little tutorial that I'm about to share with you right now is going to use mostly Beautiful Soup and a little bit of Requests. Of course, there are other libraries in Python that interface with the internet, so those aren't the only two avenues, but they are the avenues that I used, especially given the ultimate goal of parsing all those HTML tags. So browsing the web is something that we all do practically all day sometimes, and generally it's a very manual process. You open a web browser, you navigate to a URL, you go to a web page, you click on a link that you want to view, you view the page with your eyes, you see all the text or the images that you want to see, and then you go somewhere else. You click somewhere else in your browser. That's a legitimate way to browse the web, but it is, again, a very manual way to browse the web. If you want to, I don't know, copy some text off of a web page, then you have to select all that text, and a lot of times selecting the text is a little bit tricky, because the layout might not be conducive to a straight drag selection. Or possibly, if you have a hundred things to download, it can be very, very time consuming, and that can be problematic. Some people might think, well, there should be ways to automate that, and there are ways to automate that, and one of those ways is Python. Python has a couple of different libraries or modules that help with looking at information on the internet. The most basic one is called Requests, R-E-Q-U-E-S-T-S. It's a library. It is not actually part of the Python standard library, so if it isn't already on your system you can install it with python -m pip install requests. And I have found that if you use a good open source Python IDE, it will help with that. The one that I demonstrated was PyCharm, the Community Edition (don't get the non-community edition, that's not open source; get the Community Edition, it's got all the things that you're likely to need). Of course, there are other Python IDEs as well. I used to use one called Ninja IDE, but last I looked, it hadn't been updated recently. You can also use things like VSCodium or Atom, even though Atom is kind of on its way out now, unfortunately. There are lots of great little IDEs and text editors that you can use for quick and helpful Python programming. So with Requests, the most basic web browsing process you can do would be a Python file that starts with import requests. You've just imported the requests library, and requests is going to get a URL for you. Now, the URL that you're going to get is entirely up to you at this stage, and for brevity, I'm going to just hard-code it into this Python script. In real life, you would not want to do that. You would probably want to leave it open to an argument that you pass to the script when you launch it. So you would have my-download-script.py, and then some URL; you wouldn't even have to put the http:// part, you could just put, say, example.com. Then in your script, you would read whatever argument the user has provided, and you would go to that URL. But I'm going to leave that as a separate exercise.
If you don't know how to read parameters or arguments from the command line, then look into that; it's a good trick to know. So I'm going to create a variable here called data, D-A-T-A, and set it to the quoted string http://example.com. That's a URL. Like I say, normally you'd pass that in on the command line, but for now we're hard-coding it, and it's going to be put into a variable called data. The next variable I'm going to call page, P-A-G-E, and I'm going to set that equal to the result of requests.get(data). What that means is that the variable page is going to receive the output of requests.get. That's a function; get is a method within the requests library. So we're just telling the requests library to go get whatever is at whatever is in data. And that wasn't me misspeaking: whatever is at whatever is in data, meaning Python knows to translate the value of data, example.com, into a URL, and to go get it with requests. Finally, we do print(page.text). Once again, it's a little tricky, but we're printing page.text, and the reason we're able to say .text after the page variable is that page contains structured data. Requests didn't just grab a bunch of plain text and dump it into page; it provided structured data, and so we have access to segments of the page variable. We can look at just the text, or we can look at the response code, if we want to see: was it a 404? Was it a 200? That one is page.status_code. So there are different segments of page, and the only reason for that is that requests is programmed to hand page that information. That can be tricky; sometimes you want page to just be a bunch of text, because that's what you downloaded, right? But requests is a little bit more complex than that, for better or for worse. It depends on your requirements, but requests is quite nice. When it goes to a page, if it finds a 200 response, it stores that in a little segment of the variable, and if it encounters some text, it stores that, and so on. You can go to the Requests documentation and find out what other segments there are; the only two I know off the top of my head are text and status_code. Okay, so you've just created a script that will download all of the contents of example.com and then dump it rather unceremoniously into your terminal. If you save that script and run it with python ./my-download-script.py, you'll get the results: the contents of example.com in your terminal. Notice that it is not terribly structured. I mean, it's as structured as HTML is, but the output itself is probably all over the place. There's nothing necessarily pretty about the output. It really is just a code dump of whatever it found at example.com. So that's useful, and requests is nice for grabbing information. Certainly you could print the contents of a web page out like that and then maybe use grep or awk to parse your output.
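Put together, a minimal sketch of that script might look like this (the sys.argv handling is my addition to illustrate the command-line exercise mentioned above; it isn't dictated in the episode):

import sys
import requests

# Take the URL from the first command-line argument if one is given;
# otherwise fall back to the hard-coded URL from the episode.
# (The argv fallback is an assumption, added to illustrate the exercise.)
data = sys.argv[1] if len(sys.argv) > 1 else "http://example.com"

page = requests.get(data)    # returns a structured Response object
print(page.status_code)      # e.g. 200 or 404
print(page.text)             # the raw HTML, dumped to the terminal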
You could do that, but there is yet a more intelligent library, one that not only knows the difference between a response code and the contents of a page, but also understands the difference between an HTML tag, like <p> or <div>, and the contents of an HTML tag. So for instance, with <h1>Hello world</h1>, it knows the difference between the h1 and the title, like Hello world or example.com or whatever's inside the h1. It knows the difference between the text and the... well, is it really the style? The markup. The markup and the content. Okay, so to get started with that, you need to download and install Beautiful Soup as a Python module. The manual way of doing that is python -m pip install beautifulsoup4 (the package name is beautifulsoup4, all one word). But honestly, if you use a good IDE, it'll manage that for you. With something like PyCharm Community Edition, you set up a project, and then when you type in an import, it offers to download that import for you locally within your project environment. So you're not, for lack of a better word, corrupting or polluting, let's say polluting, the rest of your system with Beautiful Soup. Which, again, is not a bad thing to have; but you're not installing Beautiful Soup on your system, you're installing it within the virtual environment of this particular Python project, which is nice. It is legitimately nice, because that way you know exactly what your Python project absolutely requires to run. That's difficult to know if you installed Beautiful Soup three years ago and start using Beautiful Soup modules without ever remembering that you installed it. You just kind of think, well, it's Beautiful Soup, everyone will have that, right? Well, no, they won't. Just like your virtual environment doesn't have it. Okay, so let's look at essentially the same script that we just did with requests, except with Beautiful Soup as well as requests. So: from bs4 import BeautifulSoup (capital B, capital S), and import requests. We're getting both the Beautiful Soup module and the requests module. We'll do the same opening, essentially, except we'll speed things up a little. Instead of making a separate variable for the URL, we'll just do page = requests.get("http://example.com"). That's just downloading the page with requests and dumping all of that structured data into page. Then we're going to make some soup. So soup = BeautifulSoup(page.text, "html.parser"). Here we're doing a special Beautiful Soup call, and we're telling Beautiful Soup to look at the textual content of the variable page, which of course contains the results of what requests got from example.com.
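As a sketch, the setup described so far looks like this, assuming the beautifulsoup4 package is installed:

import requests
from bs4 import BeautifulSoup

# Download the page with requests...
page = requests.get("http://example.com")

# ...then hand the raw HTML text to Beautiful Soup's built-in HTML parser.
soup = BeautifulSoup(page.text, "html.parser")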
And then we're saying to filter that through, or maybe to interpret that through, the HTML parser, which is built into Beautiful Soup. The way that you would know to do that is you would go to the Beautiful Soup documentation, which is at beautiful-soup-4.readthedocs.io (beautiful, dash, soup, dash, the number four). You would go there and you could look at some tutorials, first of all, which would reiterate some of what I'm saying here. But you could also look at the documentation of what kinds of parsers it has and what kinds of library functions it has, and then you could use what you find in your own code. In other words, I'm not being comprehensive here; I'm just telling you the basic parsing abilities of Beautiful Soup, but there's much, much more that it can do. Okay, so for instance, we'll do my favorite, favorite incantation from Python (that was sarcasm, you heard): if __name__ == '__main__': and then, on the next line, indented, print(soup.prettify()). What that says is that if a user is launching this script intentionally, in other words, this script is not being called as a library from some other application, it is actually being used as a script, then print soup.prettify(). And soup, obviously, is our variable that contains the results of BeautifulSoup(page.text, "html.parser"), but we're running it through a little function, which we can think of as a filter in this case, called prettify, P-R-E-T-T-I-F-Y. In other words: print it, but make it pretty first. It's a little bit like a Unix pipe, in a way; you're sending it through a sort function or something like that. Except what you're really doing is telling Beautiful Soup to print the code out in such a way that the indentation is consistent and each tag has its own line. I think that's probably it, but it makes the output look pretty, which, in contrast to the raw output of requests, can be very useful. And certainly, if you were going to parse this with an external tool like awk or grep for whatever reason, it would be a lot easier to do that from the output of Beautiful Soup's prettify than from the raw output of requests, which might have unpredictable indentation, unpredictable line breaks or no line breaks, and so on. So prettify is one of the basic but really pleasant functions of Beautiful Soup. But there is more. For instance, what if you just wanted to (and this is the use case of this lesson, actually) find the paragraph tags? You just want to find the <p> elements. I mean, you can see the content as well, but you want to filter your output and have Beautiful Soup exclude everything but paragraphs.
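With the main guard added, the whole prettify script is roughly this:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

# Only runs when this file is executed as a script,
# not when it is imported as a library.
if __name__ == "__main__":
    print(soup.prettify())   # consistent indentation, one tag per line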
Well, Beautiful Soup is pretty well aware of HTML, because it is using its HTML parser. So you can do a print, for example: print(soup.p). Now, if you're used to Python, you may already see the problem with this, but you can do it. So instead of print(soup.prettify()), you do print(soup.p), and you get a paragraph tag: a paragraph or a sentence surrounded by <p> tags. Now, more than likely the web page that you are scraping doesn't contain just one paragraph tag. What you are seeing in your output is the first paragraph tag encountered on that page, which can sometimes be revealing for SEO and such. You could look at your first paragraph and realize that, as far as whatever search engine knows from its scrape, the first line of your homepage might be, I don't know, follow us on Face-slam or whatever. Who knows? It could be completely irrelevant to your site. So you could diagnose that with web scraping. What do you know, it's not just a malicious thing after all. So there's the paragraph tag, but it's but a single paragraph tag. If you want all the paragraph tags, then you can use a for loop. Now, in an ideal world you might as well make this into a function; I think you would probably try to make a function that could abstract away the element that you're looking for, but I didn't get that far, so this is still kind of hard-coded. You can make a function in Python with the keyword def. Yeah, D-E-F, that's what it is. It actually stands for define, I guess. Why they don't just use the word function, I don't know; we'll never understand why programming languages don't just say what they mean, but def is what we get. So: def loop_it():, and then on the next line, indented once, for tag in soup.find_all('p'):, and on the next line, indented again, print(tag). Okay, so for tag: there's nothing magical about the word tag, it's just something that I chose. I could have chosen for i, for item, for penguin. It doesn't matter. It's just some dynamically defined, quick and easy, disposable variable. We just need some place to hold what we find in soup with the function called find_all. That's a Beautiful Soup function. So soup.find_all('p') is saying: every time you find a paragraph, put it into a variable called tag, and hey, if you have a tag, print it. Now, if you have that function in your code, along with all the rest of the code that I've already talked about, and you run it, nothing will happen, because you're not actually calling that function yet. In order to make the function run, you have to tell your Python program to execute the code within the function. That's one of the advantages of functions: they don't happen unless you explicitly tell them to happen, and they don't happen until you tell them to happen. So in the if __name__ == '__main__': section, don't print something.
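Here's a sketch of that function, with the call wired into the main guard as described in the next step:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    # find_all('p') returns every paragraph element in the document
    for tag in soup.find_all("p"):
        print(tag)

if __name__ == "__main__":
    loop_it()   # the function only runs because we explicitly call it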
Just call loop_it(), or whatever you called your function. I called it loop_it because it seemed like an obvious name. Now run your code, and you'll see all the paragraphs in example.com. You can also get just the content. Remember I said that Beautiful Soup could separate the markup, like <p> or <div> or <img src="blah" />, from the actual content. So you could get just the text, the strings, the words on the page. The way you do that is in your loop_it function: for tag in soup.find_all('p'), then instead of just printing the tag (and remember, tag is just a variable name that I chose, it doesn't mean anything: when you find a p tag, put it in this variable called tag; it could be item, it could be fish, it could be whatever you want), you print tag.string. String, of course, is the programming lingo for what we would call a word: a string of letters, or characters, essentially. So now you're printing the content, and once you have the text of a web page, you can parse it further with standard Python string methods. In this particular case, the person asking me for this lesson wanted to learn how to count the words of a page. The quick and dirty way to do that would be: for tag in soup.find_all('p'), if tag.string is not None. This is important. It's if tag.string (that's the words of the p element) is not None, with a capital N. The reason we have to say this is that Beautiful Soup sometimes recognizes that there's a paragraph tag without content, but it doesn't just not print it; it assigns it a special value called None. Some other programming languages call it null or nil; Python calls it None, N-O-N-E, with a capital N. So we're saying: as long as it's not None, so there is content here, then (or not then, because Python doesn't use the word then, but colon, next line, indent) print(len(tag.string.split())). Len, L-E-N, as in length. What we're doing there is counting how many strings we get when we split the text on the default separator, whitespace, and that's essentially the word count. There might be cleaner ways to do that, to catch odd little exceptions or whatever, but that's a pretty quick and dirty way to get the word count of all the content of all the paragraph tags. Now, there might be other tags that have content you want to count, like h1s and h2s, or ordered lists that don't use paragraphs. You can use paragraphs in ordered lists, but let's say you didn't: you just added an ordered list and then the list items, and went without paragraphs. Who knows? You would have to take all that into account. And of course, by doing it the quick and dirty way that I just described, you're not getting a total of the words; you're getting it for each paragraph. So what if you wanted the total?
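A sketch of that quick-and-dirty per-paragraph word count:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    for tag in soup.find_all("p"):
        # skip paragraphs that Beautiful Soup reports as None
        if tag.string is not None:
            # split() breaks the text on whitespace;
            # len() counts the resulting words
            print(len(tag.string.split()))

if __name__ == "__main__":
    loop_it()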
Well, that's another loop trick, where you do a function, def loop_it():, next line, indent, and then you create a counter variable. By counter variable I don't mean counter as in anti; I mean counter as in I'm going to count now. Let's call it num, N-U-M, as in number: num = 0. We're setting our counter to zero. Then we do the same thing: for tag in soup.find_all('p'), if tag.string is not num... I said not num, didn't I? If tag.string is not None, N-O-N-E, colon, next line, indent again, num = num + len(tag.string.split()). Essentially we're doing the exact same thing, but instead of printing that number to your terminal every single time it encounters a p tag, we're adding it back to the running total. So num equals num plus the length of the tag string, split. And then at the end, outside of your for loop, back at the original indentation just inside the function, you print the grand total is num. And now you have your grand total printed at the very end of the run. Of course, there's a lot more information you can extract with Beautiful Soup and Python. You could, for instance, like I said (and I think this would probably be at the top of the list eventually), accept input. You would want to be able to feed your little script a URL dynamically as you launch it, so that you don't have to go into the code and update the value of the variable every time you want to change the URL that you're downloading. That would be something to look into. You could also count the number of images, because again, that was something that was specifically asked: how can we count images? You know how to single out the paragraph tag, so you also really know how to single out the image tag. Now, there's inheritance, and children of parent tags, and things like that, which you might want to take into account; they might help you filter things out or filter things in. So you would want to learn about the way Beautiful Soup views, or walks, the structure of the document. Once you can find the images that you want in a reliable way, you would maybe want to count them, and again, you know how to do that. Maybe that would be a separate function, or maybe you could find a way to use the loop_it function to find not just the paragraph tags but other tags as well, or maybe you could use it to do both of those things. I don't know if that would be smart; you might want a loop_it_p and a loop_it_image. Obviously, it depends on what you're doing with your program. But I think this definitely gives you a little bit of insight into how web scraping works. You could use this, for instance, to find all of the links to pictures on a website, or all of the links to media, videos or something, on a website. You would do that by zeroing in on a href tags, or video tags, or image tags, or some kind of audio tag, whatever. You would zero in on that, and then you would find the content, or the attribute of... well, it depends on what you're looking at.
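A sketch of the running-total version, plus hypothetical variants for the image-counting and link-listing ideas mentioned here (count_images and list_links are my illustrations, not code dictated in the episode):

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    num = 0   # the counter variable, starting at zero
    for tag in soup.find_all("p"):
        if tag.string is not None:
            num = num + len(tag.string.split())   # add to the running total
    # outside the for loop: print the total once, at the end
    print("The grand total is", num)

def count_images():
    # hypothetical variant: count the <img> tags instead
    print(len(soup.find_all("img")))

def list_links():
    # hypothetical variant: pull the href attribute out of each <a> tag
    for tag in soup.find_all("a"):
        print(tag.get("href"))

if __name__ == "__main__":
    loop_it()
    count_images()
    list_links()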
But yeah, so you would zero in on that, and then you would take a look at the part of that element that you need, whether it's the content or some other part of it. Then you could store that in a variable and process it later. Who knows what you want to do. But yeah, this is the start of web scraping with Python. It's a really pretty well documented system; I mean, Beautiful Soup is really, really strong for that. Whether it's the best or not, I don't know. Maybe someone else will record a Hacker Public Radio episode to tell us the way that they do web scraping. I don't do a lot of web scraping, truth be told; I do it in sprints. I'll do some web scraping for a week, and then I'll walk away from it and never bother again for another three years. So I do, but I don't. And when I do, I tend to either shell-script curl or use Beautiful Soup, because they're both really, really good for different things, I think. So yeah, that's web scraping 101, I guess. I hope that was useful to some people, or interesting. Thank you very much for listening, and I'll talk to you next time. You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.