Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
hpr_transcripts/hpr3632.txt (new file, 324 lines)

@@ -0,0 +1,324 @@
Episode: 3632
Title: HPR3632: Intro to web scraping with Python
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3632/hpr3632.mp3
Transcribed: 2025-10-25 02:31:44

---
This is Hacker Public Radio Episode 3,632, for Tuesday, 5 July 2022. Today's show is entitled "Intro to Web Scraping with Python". It is part of the series Programming 101. It is hosted by Klaatu and is about 32 minutes long. It carries a clean flag. The summary is: using requests and Beautiful Soup to scrape websites.
Hey everybody, this is Klaatu. At work, I was recently giving a lesson to an intern on how to get started doing some web scraping. Part of the intern's self-appointed assignment, and I don't really know how interns work, I've never had an internship, but part of that intern experience was that they were going to learn how to do web scraping with Python. Specifically, the project centered around analyzing the word count of web pages and the image count on web pages, and I imagine probably other things like download times or something like that. I'm not really sure, but there was definitely a focus on analyzing the elements, the different HTML elements, of a website, and so that was the prompt given to me to provide a lesson on. I figured I could just record that lesson as well. This is a pretty straight translation of the lesson plan that I devised for this intern.
I also learned in this experience that the term web scraping can have some negative connotations for some people, which I found interesting. It had never occurred to me before, but for a lot of people, or some number of people, it seems, web scraping is associated with data harvesting. Certainly that is one use of web scraping, but obviously there are lots of uses for it, and in fact I made the argument, or it wasn't an argument, it was more of an explanation, that web scraping really is no different than web browsing, and that the two terms are essentially the same. They are both downloading information off of the internet. One uses a graphical browser, and one uses code, Python code. Interestingly, web browsers are also written with code. It is just the interface by which you're looking at that content on the web.

I found myself very frequently adapting my language as I spoke about this web scraping project to people, so that I would use the term web scraping, but I would also insert verbiage such as browsing the web with Python, or downloading information from the internet with Python, or just straight-up web browsing, except you're doing it with Python, or with code, whatever. That was something that I found interesting, a little bit of a culture shock, maybe that's a strong term for it, but a little wrinkle in culture, from people who think about information on the internet in different ways. And whether you agree with it or not, you can hear how abrasive the term sounds: scrape. Scraping is usually not a pleasant thing. You don't like it when things scrape against other things, so web scraping sounds almost malicious, but it doesn't have to be.
This little tutorial that I'm about to share with you right now is going to use mostly Beautiful Soup and a little bit of requests. Of course, there are other libraries in Python that interface with the internet, so those aren't the only two avenues, but they are the avenues that I used, especially given the ultimate goal of parsing all those HTML tags.
So, browsing the web is something that we all do practically all day sometimes, and generally it's a very manual process. You open a web browser, you navigate to a URL, you go to a web page, you click on a link that you want to view, and then you view the page with your eyes. You see all the text that you want to see, or the images, and then you go somewhere else. You click somewhere else in your browser. That's a legitimate way to browse the web, but it is, again, a manual way to browse the web. It is something that you have to do by hand. If you want to, I don't know, copy some text off of a web page, then you have to select all that text, and a lot of times selecting the text is a little bit tricky because, you know, the layout might not be conducive to just a straight drag selection. Or possibly, if you have a hundred things to download, it can be very, very time consuming, and that can be problematic. Some people might think, well, there should be ways to automate that, and there are ways to automate that, and one of those ways is Python.
Python has a couple of different libraries or modules that help with looking at information on the internet. The most basic one is called requests, R-E-Q-U-E-S-T-S. It's a library. I think it's built into Python; I don't remember having to install it or anything, but you can install it with python -m pip install requests if it's not already there. And I have found that it helps if you use an IDE, a good open source Python IDE. The one that I demonstrated was PyCharm, the Community Edition. Don't get the non-community edition, that's not open source; get the Community Edition, it's got all the things that you're likely to need. Of course, there are other Python IDEs as well. I used to use one called Ninja IDE, but last I looked it hadn't been updated recently. And of course, you can use things like, I don't know, VS Code or Atom, even though Atom is kind of on its way out now, unfortunately. There are lots of great little IDEs or text editors that you can use for quick and helpful Python programming.
So with requests, the most basic web browsing process you can do would be a Python file with import requests. You've just imported the library requests, and requests is going to get a URL for you. Now, the URL that you're going to get is entirely up to you at this stage, and for brevity, I'm going to just hard code it into this Python script. In real life, you would not want to do that. You would probably want to leave it open as an argument that you pass to Python when you launch the script. So you'd have my_download_script.py, and then some URL: http colon slash slash, or well, you wouldn't even have to put all that, I guess, you could just put, like, I don't know, example.com. And then your script would import whatever argument the user has provided, and then go to that URL. But I'm going to leave that as a separate exercise. If you don't know how to import, or rather read, parameters or arguments from the command line, then look into that. It's a good trick to know.
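For what it's worth, here is a minimal sketch of that exercise using Python's built-in sys module (the script name and structure here are just for illustration, and it jumps ahead slightly to the requests call described next):

import sys

import requests

if __name__ == '__main__':
    # sys.argv[1] is the first argument after the script name
    data = sys.argv[1]
    page = requests.get(data)
    print(page.text)

You would run it as something like: python my_download_script.py http://example.com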
So I'm going to create a variable here called data, D-A-T-A, equals quote, http colon slash slash example.com, close quote. That's a URL. Like I say, normally you'd pass that in through the command, but for now we're hard coding it in, and it's going into a variable called data. The next one I'm going to call page, P-A-G-E, and I'm going to set that equal to the result of requests.get, parentheses, data, close parentheses. What that means is that the variable page is going to receive the output of requests.get. Now, .get is a method within the requests library: requests.get, parentheses, data. We're just telling the library requests to go get whatever is at whatever is in data, and that wasn't me misspeaking, I was saying whatever is at whatever is in data, meaning Python knows to translate the variable data, example.com, into a URL, and to go get it with requests.

Finally, we print, parentheses, page.text, close parentheses. Once again, it's a little tricky, but we're doing a print of page.text, and the reason we're able to say .text after the page variable is because page contains structured data. Requests didn't just grab a bunch of plain text and dump it into page; it provided structured data. And so we have access to segments of the page variable. We can look at just the text. We can look at the response code, like if we want to see, was it a 404? Was it a 200? I think it's page.response or something like that. So there are different segments of page, and the only reason that is, is because requests is programmed to hand page that information. That's really tricky. Sometimes you want page to just be a bunch of text, because that's what you probably downloaded, right? I mean, you downloaded a bunch of text, but requests is a little bit more complex than that, for better or for worse. It depends on your requirements, I guess, but requests is quite nice. It understands, when it goes to a page, if it finds a 200 response, it stores that in a little segment of the variable, and if it encounters some text, it stores that, and so on. You can go to the Python documentation, look up the requests library, and find out what other segments there are. The only two I know off the top of my head are text and response.
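Assembled, the three-line script described above looks like this. (Two small notes from the documentation: requests is actually a third-party library rather than part of the standard library, and the attribute that holds the response code is status_code.)

import requests

# hard-coded URL; normally you'd pass this in from the command line
data = "http://example.com"

# tell requests to go get whatever is at the URL stored in data
page = requests.get(data)

# page holds structured data: .text is the page body,
# and .status_code is the response code (200, 404, and so on)
print(page.text)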
Okay, so you've just created a script that will download all of the contents of example.com and then dump it rather unceremoniously into your terminal. If you save that script and run it, you know, with python ./my_download_script.py, then you'll get the results. You'll get the contents of example.com in your terminal. Notice that it is not terribly structured. I mean, it's as structured as HTML is, but the output itself is probably kind of all over the place. There's nothing necessarily pretty about the output. It really is just a code dump of whatever it found at example.com. Okay, so that's useful, and requests is nice for grabbing information.
And certainly you could print the contents of a web page out like that and then maybe use grep or awk to parse your output. You could do that, but there is a more intelligent library that not only knows the difference between a response code and the contents of a page, but also understands the difference between an HTML tag, like angle bracket p angle bracket or angle bracket div angle bracket, and the contents of an HTML tag. So for instance, with angle bracket h1 close angle bracket, hello world, or example.com, or whatever, close h1, it knows the difference between the h1 and the title, like hello world or example.com or whatever's in the h1. It knows the difference between the text and the style. Well, is it really the style? I don't know: the markup. The markup and the content.
Okay, so to get started with that, you need to download and install Beautiful Soup as a Python module. And again, the manual way of doing that is python -m pip install bs4, I think, or maybe beautifulsoup, all one word. Yeah, I think it's beautifulsoup, all one word. But honestly, if you use a good IDE, it'll manage that for you, especially something like PyCharm Community Edition. You set up a project, and then when you type in an import, it offers to download that import for you locally, within your project environment. So you're not, for lack of a better word, corrupting or polluting, let's say polluting, the rest of your system with Beautiful Soup. Which, I mean, it's not bad to have Beautiful Soup, but you're not installing Beautiful Soup on your system; you're installing it within the virtual environment of this particular Python project, which is nice. It is legitimately nice, because that way you know exactly what your Python project absolutely requires to run. That's difficult if you installed Beautiful Soup three years ago and start using Beautiful Soup modules without ever thinking about whether you've really installed it. You just kind of think, well, yeah, it's Beautiful Soup, everyone will have that, right? Well, no, they won't. Just like your virtual environment doesn't have it.
Okay, so let's look at essentially the same script that we just did with requests, except with Beautiful Soup, except also with requests. So: from bs4, like Beautiful Soup 4, from bs4 import BeautifulSoup, capital B, capital S, and import requests. We're getting both the Beautiful Soup module and the requests module. And we'll do the same opening, essentially, except we'll kind of speed things up a little bit. So instead of making a separate variable for data and the contents of the data, we'll just do page equals requests.get, and then parentheses, quote, http colon slash slash example.com, close quote, close parentheses. We're just downloading the page with requests and dumping all of that structured data into page. Then we're going to make some soup. So soup equals BeautifulSoup, capital B, capital S, parentheses, page.text, so again we're looking at the text part of the downloaded content, comma, quote, html.parser, close quote, close parentheses.
Here we're doing a special Beautiful Soup call, telling Beautiful Soup to look at the textual content of the variable page, which of course contains the results of what requests got from example.com, and then we're saying to filter that through, or maybe to interpret that through, the HTML parser, which is built into Beautiful Soup. And the way that you would know to do that is you would go to the Beautiful Soup documentation, which is something like beautifulsoup.readthedocs.io. Is that it? No, it's beautiful dash soup dash four, the number four: beautiful-soup-4.readthedocs.io. That's what it is. So you would go there, and you could look at some tutorials, first of all, which would reiterate some of what I'm saying here, but you could also look at the documentation of what kinds of parsers it has and what kinds of library functions it has, and then you could use what you find in your own code. In other words, I'm not being comprehensive here. I'm just telling you the basic parsing abilities of Beautiful Soup, but there's much, much more that it can do.
Okay, so for instance, we'll do, and this is my favorite, favorite incantation from Python, that was sarcasm you heard: if space underscore underscore name underscore underscore space equals equals space single quote underscore underscore main underscore underscore single quote, colon, then next line, indent: print parentheses soup dot prettify parentheses parentheses, close parentheses. What that says is that if a user is launching this script intentionally, in other words, this script is not being called as a library from some other application, it is actually being used as a script, then print soup.prettify. And soup, obviously, is our variable that contains the output of BeautifulSoup, page.text, html.parser, but we're running it through a little function, which we can kind of think of as a filter in this case, a little filter called prettify, P-R-E-T-T-I-F-Y.
In other words: print it, but make it pretty first. It's a little bit like a Unix pipe, in a way. You're kind of sending it through a sort function or something like that, except what you're really doing is telling Beautiful Soup to print the code out in such a way that the indentation is consistent and each tag has its own line. I think that's probably it, but it makes it look pretty, which in contrast to the raw output of requests can be very useful. And certainly, if you were going to, for whatever reason, parse this with an external tool like awk or grep, it would be a lot easier to do that from the output of Beautiful Soup's prettify than from the raw output of requests, which might have unpredictable indentation, unpredictable line breaks or no line breaks, and so on. So prettify is one of the basic but really pleasant functions of Beautiful Soup.
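Put together, the Beautiful Soup version of the script described above looks like this:

from bs4 import BeautifulSoup
import requests

# download the page and dump the structured data into page
page = requests.get("http://example.com")

# interpret the textual content through the built-in HTML parser
soup = BeautifulSoup(page.text, "html.parser")

if __name__ == '__main__':
    # prettify() re-indents the markup consistently, one tag per line
    print(soup.prettify())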
But there is more. For instance, and this is the use case of this lesson, actually, what if you wanted to just find the paragraph tags? You just want to find the angle bracket p close angle bracket elements, those HTML elements. I mean, you can see the content as well, but you want to filter your output and have Beautiful Soup exclude everything but paragraphs. Well, Beautiful Soup is pretty well aware of HTML, because it is using its HTML parser. So you can do, for example, print parentheses soup dot p, close parentheses. Now, if you're used to Python, you already see the problem with this, but you can do that. So instead of print soup.prettify, you do print soup.p, and you get a paragraph tag. You get a paragraph or a sentence that's surrounded by angle bracket p, close angle bracket.
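That is, the print line becomes:

# prints only the first <p> element Beautiful Soup finds on the page
print(soup.p)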
Now, more than likely the web page that you are scraping doesn't contain just one paragraph tag. What you are seeing in your output is, I guess, the first paragraph tag encountered on that page, which can sometimes be revealing for SEO and stuff. You could look at your first paragraph and realize that, as far as whatever search engine knows from its scrape, the first line of your homepage might be, I don't know, follow us on Face Slam or whatever. Who knows? It could be completely irrelevant to your site. So you could diagnose that with web scraping. So what do you know, it's not just a malicious thing after all. So there's the paragraph tag, but it's only a single paragraph tag.
If you want all the paragraph tags, then you can use a for loop. Now, you might as well make this into a function. In an ideal world, I think you would probably try to make a function that could abstract away the element that you're looking for, but I didn't get that far, so this is still kind of hard-coded. You can make a function in Python with the keyword def. Yeah, D-E-F, that's what it is. It actually stands for define, I guess. Why they don't just use the word function, I don't know. We'll never understand why programming languages don't just say what they mean, but def is what we get. So: def space, I don't know, loop_it, parentheses parentheses, colon, and then next line, indent once: for tag in soup.find underscore all, parentheses, single quote, p, close single quote, close parentheses, colon, next line, indent, indent: print tag.
Okay, so for tag: there's nothing magical about the word tag. It's just something that I chose. I could have chosen for i, for item, for penguin. It doesn't matter. It's just some dynamically defined, quick and easy, disposable variable. We just need some place to hold what we find in soup with the function called find_all. That's a Beautiful Soup function: soup.find underscore all, and then parentheses, quote, p, close quote, close parentheses. It's just saying, every time you find a paragraph, put it into a variable called tag, and hey, if you have a tag, print it. Now, if you have that function in your code, and you've got all the rest of the code that I've already talked about, then, well, if you run it, nothing will happen, because you're not actually calling that function yet.
In order to make that function run, you have to tell your Python program to execute the code within the function. That's kind of one of the advantages of functions: they don't happen unless you explicitly tell them to happen, and they don't happen until you tell them to happen. So in the if underscore underscore name underscore underscore space equals equals space single quote underscore underscore main underscore underscore close single quote colon section, don't print something. Just do loop_it, parentheses parentheses, or whatever you called your function. I called it loop_it because it seemed like an obvious name. Now run your code, and you'll see all the paragraphs in example.com.
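The complete script at this point, with the function and the call, looks roughly like this:

from bs4 import BeautifulSoup
import requests

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    # find_all('p') returns every <p> element on the page;
    # 'tag' is just a disposable loop variable, the name doesn't matter
    for tag in soup.find_all('p'):
        print(tag)

if __name__ == '__main__':
    loop_it()  # nothing happens until the function is actually called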
Now, you can also get just the content. Remember, I said that Beautiful Soup could separate the markup, like angle bracket p angle bracket, or angle bracket div angle bracket, or angle bracket img src equals blah, close angle bracket, well, with a forward slash and then close, or whatever that is. It can tell markup from the actual content. So you could get just the text, the strings, the words on the page. And the way that you do that is, in your function, your loop_it function, for tag in soup.find_all p, instead of just printing the tag, and remember, tag is just a variable that I chose, it doesn't mean anything, it's just, when you find a p tag, put it in this variable called tag, so it could be item, it could be fish, it could be whatever you want it to be, but I called it tag, so for tag in soup.find_all p, print tag dot string. String, of course, being the programming lingo for what we'd call a word: a string of letters, essentially, or characters. And so now you're printing the content. Once you have the text of a web page, you could parse it further with standard Python string libraries.
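In code, the change is just the loop body, from print(tag) to print(tag.string):

def loop_it():
    for tag in soup.find_all('p'):
        # .string leaves out the markup and prints just the text content
        print(tag.string)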
In this particular case, what the person asking me for this lesson wanted was to learn how to count the words of a page. The quick and dirty way to do that would be: for tag in soup.find_all p, if tag.string is not None. So this is important. It's if, and then tag, that's again our variable, dot string, so that's the words of the p element, is not None, with a capital N. And the reason that we have to say this is because Beautiful Soup sometimes recognizes that there's a paragraph tag without content, but it doesn't just not print it. It assigns it a special value called None. Some other programming languages call it null or nil. This one is called None, N-O-N-E, with a capital N. So we're saying, as long as it's not None, so there is content here, then, or not then, because Python doesn't use the word then, but it's colon, next line, indent: print parentheses len, L-E-N, as in length, parentheses, tag dot string dot split, parentheses parentheses, close parentheses, close parentheses. What we're doing there is counting the length of each string as we split the string on the default character of a space, and that's essentially the word count. There might be cleaner ways to do that, to catch, I don't know, odd little exceptions or whatever, but that's a pretty quick and dirty way to get the word count of all the content of all the paragraph tags.

Now, there might be other tags that have content that you want to count, like h1's and h2's, or, I don't know, some ordered lists that don't use paragraphs. You can use paragraphs in an ordered list, but let's say you didn't, you just did an ordered list and then the list items, and you went without paragraphs. Who knows? You would have to take all that into account. And of course, by doing it the quick and dirty way that I just described, you're not getting a total of the words; you're getting it for each paragraph.
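The quick-and-dirty per-paragraph word count, as described:

def loop_it():
    for tag in soup.find_all('p'):
        # empty <p> tags come back as None, so skip those
        if tag.string is not None:
            # split() breaks the text on whitespace; len() counts the pieces
            print(len(tag.string.split()))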
So what if you wanted the total? Well, that's another loop trick, where you do a function, def space loop_it parentheses parentheses colon, next line, indent, and then you have to create a sort of counter variable. And by counter variable, I don't mean counter as in against, as in anti; I mean counter as in, I'm going to count now. So let's call it num, N-U-M, as in number: num equals zero. We're setting our counter to zero. Then we do the same thing: for tag in soup.find_all p, if tag.string is not num, I said not num, didn't I? If tag.string is not None, N-O-N-E, colon, next line, indent, indent, indent: num equals num plus the length, len, of tag dot string dot split, parentheses parentheses. So essentially, we're doing the exact same thing, but instead of printing that number to your terminal every single time it encounters a p tag, we're adding it back to the running total. Num equals num plus the length of the tag string split. And then at the end, outside of your for loop, so back at your original indentation just under the function, you print: the grand total is, num. And now you have your grand total printed at the very end.
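And the running-total version:

def loop_it():
    num = 0  # the counter variable, starting at zero
    for tag in soup.find_all('p'):
        if tag.string is not None:
            # add this paragraph's word count to the running total
            num = num + len(tag.string.split())
    # back outside the loop, print the total once at the end
    print("The grand total is", num)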
Of course, there's a lot more information you can extract with Beautiful Soup and Python. Like I said, probably top of the list eventually would be to accept input. You would want to be able to feed your little script a URL dynamically as you launch it, so that you don't have to go into the code and update the value of the variable every time you want to change the URL that you're downloading. So that would be something to look into. You could also count the number of images, because again, that was something that was specifically asked: how can we count images? You know how to single out the paragraph tag, so you also really know how to single out the image tag. Now, there's inheritance, and children of parent tags, and things like that, that you might want to take into account. It might help you filter things out, or filter things in. So you would want to learn about the way Beautiful Soup views, or walks, the structure of the document. And once you find the images that you want in a reliable way, you would maybe want to count them.
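A minimal sketch of that, extrapolating from the paragraph loop (this wasn't spelled out in the lesson; note that <img> tags have no text content, so you count the elements themselves):

def count_images():
    # find_all('img') returns a list of every <img> element,
    # so its length is the image count
    print(len(soup.find_all('img')))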
And again, you know how to do that. Maybe that would be a separate function, or maybe you could find a way to use the loop_it function to not just find the paragraph tags, but also possibly to find other tags, or maybe you could use it to do both of those things. I don't know if that would be smart. You might want, like, a loop_it for p and a loop_it for image. I don't know. Obviously, it depends on what you're doing with your program. But I think this definitely gives you a little bit of insight into how web scraping works. You could use this, for instance, to find all of the links to pictures on a website, or all of the links to media, videos or something, on a website. And you would do that by zeroing in on the a href tags, or video tags, or image tags, or audio, some kind of audio tag, whatever. You would zero in on that, and then you would find the content, or the attribute, well, it depends on what you're looking at. But yeah, you zero in on that, and then you take a look at the part of that element that you need to look at, whether it's the content or some other part of it. And then you could store that in a variable and process it, or store it and then process it later. Who knows what you want to do. But yeah, this is the start to web scraping with Python. It's a really pretty well documented system. I mean, Beautiful Soup is really, really strong for that. Whether it's the best or not, I don't know. Maybe someone else will record a Hacker Public Radio episode to tell us the way that they do web scraping.
I don't do a lot of web scraping, truth be told. I do it in sort of sprints, you know. I'll do some web scraping for a week, and then I'll walk away from it and never bother again for another three years. So I do, but I don't. And when I do, I tend to either just shell script curl or use Beautiful Soup, because they're both really, really good for different things, I think. So yeah, that's web scraping 101, I guess. I hope that was useful to some people, or interesting. And thank you very much for listening. I'll talk to you next time.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.