Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
hpr_transcripts/hpr3632.txt (new file, 324 lines)

@@ -0,0 +1,324 @@
Episode: 3632
Title: HPR3632: Intro to web scraping with Python
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3632/hpr3632.mp3
Transcribed: 2025-10-25 02:31:44

---
This is Hacker Public Radio Episode 3,632, for Tuesday, 5 July 2022. Today's show is entitled "Intro to Web Scraping with Python". It is part of the series Programming 101. It is hosted by Klaatu and is about 32 minutes long. It carries a clean flag. The summary is: using requests and Beautiful Soup to scrape websites.
Hey everybody, this is Klaatu. At work, I was recently giving a lesson to an intern on how to get started doing some web scraping. Part of the intern's self-appointed assignment, and I don't really know how interns work, I've never had an internship, but part of that intern experience was that they were going to learn how to do web scraping with Python. Specifically, the project centered around analyzing the word count of web pages and the image count on web pages, and I imagine probably other things like download times or something like that. I'm not really sure, but there was definitely a focus on analyzing the elements, the different HTML elements, of a website, and so that was the prompt given to me to provide a lesson on. I figured I could just record that lesson as well. This is a pretty straight translation of the lesson plan that I devised for this intern.
I also learned in this experience that the term web scraping can have some negative connotations for some people, which I found interesting. It had never occurred to me before, but for a lot of people, or some number of people, it seems, web scraping is associated with data harvesting. Certainly that is one use of web scraping, but obviously there are lots of uses for it, and in fact I made the argument, or it wasn't an argument, it was more of an explanation, that web scraping really is no different than web browsing, and that the two terms are essentially the same. They are both downloading information off of the internet. One uses a graphical browser, and one uses code, Python code. Interestingly, web browsers are also written with code. It is just the interface by which you're looking at that content on the web.

I found myself very frequently adapting my language as I spoke about this web scraping project to people, so that I would use the term web scraping, but I would also insert verbiage such as browsing the web with Python, or downloading information from the internet with Python, or just straight-up web browsing, except you're doing it with Python, or with code, whatever. That was something that I found interesting, a little bit of a culture shock, maybe that's a strong term for it, but a little wrinkle in culture, from people who think about information on the internet in different ways. And whether you agree with it or not, you can hear how abrasive the term sounds: scrape. Scraping is usually not a pleasant thing. You don't like it when things scrape against other things, so web scraping sounds almost malicious, but it doesn't have to be.
This little tutorial that I'm about to share with you right now is going to use mostly Beautiful Soup and a little bit of requests. Of course, there are other libraries in Python that interface with the internet, so those aren't the only two avenues, but they are the avenues that I used, especially given the ultimate goal of parsing all those HTML tags.
So, browsing the web is something that we all do practically all day sometimes, and generally it's a very manual process. You open a web browser, you navigate to a URL, you go to a web page, you click on a link that you want to view, and then you view the page with your eyes. You see all the text that you want to see, or the images, and then you go somewhere else. You click somewhere else in your browser. That's a legitimate way to browse the web, but it is, again, a manual way to browse the web. It is something that you have to do by hand. If you want to, I don't know, copy some text off of a web page, then you have to select all that text, and a lot of times selecting the text is a little bit tricky because, you know, the layout might not be conducive to just a straight drag selection. Or possibly, if you have a hundred things to download, it can be very, very time consuming, and that can be problematic. Some people might think, well, there should be ways to automate that, and there are ways to automate that, and one of those ways is Python.
Python has a couple of different libraries or modules that help with looking at information on the internet. The most basic one is called requests, R-E-Q-U-E-S-T-S. It's a library. I think it's built into Python; I don't remember having to install it or anything, but you can install it with python -m pip install requests if it's not already there. And I have found that it helps if you use an IDE, a good open source Python IDE. The one that I demonstrated was PyCharm, the Community Edition. Don't get the non-community edition, that's not open source; get the Community Edition, it's got all the things that you're likely to need. Of course, there are other Python IDEs as well. I used to use one called Ninja IDE, but last I looked it hadn't been updated recently. And of course, you can use things like, I don't know, VS Code or Atom, even though Atom is kind of on its way out now, unfortunately. There are lots of great little IDEs or text editors that you can use for quick and helpful Python programming.
So with requests, the most basic web browsing process you can do would be a Python file with import requests. You've just imported the library requests, and requests is going to get a URL for you. Now, the URL that you're going to get is entirely up to you at this stage, and for brevity, I'm going to just hard code it into this Python script. In real life, you would not want to do that. You would probably want to leave it open as an argument that you pass to Python when you launch the script. So you'd have my_download_script.py, and then some URL: http colon slash slash, or well, you wouldn't even have to put all that, I guess, you could just put, like, I don't know, example.com. And then your script would import whatever argument the user has provided, and then go to that URL. But I'm going to leave that as a separate exercise. If you don't know how to import, or rather read, parameters or arguments from the command line, then look into that. It's a good trick to know.
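For what it's worth, here is a minimal sketch of that exercise using Python's built-in sys module (the script name and structure here are just for illustration, and it jumps ahead slightly to the requests call described next):

import sys

import requests

if __name__ == '__main__':
    # sys.argv[1] is the first argument after the script name
    data = sys.argv[1]
    page = requests.get(data)
    print(page.text)

You would run it as something like: python my_download_script.py http://example.com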
So I'm going to create a variable here called data, D-A-T-A, equals quote, http colon slash slash example.com, close quote. That's a URL. Like I say, normally you'd pass that in through the command, but for now we're hard coding it in, and it's going into a variable called data. The next one I'm going to call page, P-A-G-E, and I'm going to set that equal to the result of requests.get, parentheses, data, close parentheses. What that means is that the variable page is going to receive the output of requests.get. Now, .get is a method within the requests library: requests.get, parentheses, data. We're just telling the library requests to go get whatever is at whatever is in data, and that wasn't me misspeaking, I was saying whatever is at whatever is in data, meaning Python knows to translate the variable data, example.com, into a URL, and to go get it with requests.

Finally, we print, parentheses, page.text, close parentheses. Once again, it's a little tricky, but we're doing a print of page.text, and the reason we're able to say .text after the page variable is because page contains structured data. Requests didn't just grab a bunch of plain text and dump it into page; it provided structured data. And so we have access to segments of the page variable. We can look at just the text. We can look at the response code, like if we want to see, was it a 404? Was it a 200? I think it's page.response or something like that. So there are different segments of page, and the only reason that is, is because requests is programmed to hand page that information. That's really tricky. Sometimes you want page to just be a bunch of text, because that's what you probably downloaded, right? I mean, you downloaded a bunch of text, but requests is a little bit more complex than that, for better or for worse. It depends on your requirements, I guess, but requests is quite nice. It understands, when it goes to a page, if it finds a 200 response, it stores that in a little segment of the variable, and if it encounters some text, it stores that, and so on. You can go to the Python documentation, look up the requests library, and find out what other segments there are. The only two I know off the top of my head are text and response.
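Assembled, the three-line script described above looks like this. (Two small notes from the documentation: requests is actually a third-party library rather than part of the standard library, and the attribute that holds the response code is status_code.)

import requests

# hard-coded URL; normally you'd pass this in from the command line
data = "http://example.com"

# tell requests to go get whatever is at the URL stored in data
page = requests.get(data)

# page holds structured data: .text is the page body,
# and .status_code is the response code (200, 404, and so on)
print(page.text)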
Okay, so you've just created a script that will download all of the contents of example.com and then dump it rather unceremoniously into your terminal. If you save that script and run it, you know, with python ./my_download_script.py, then you'll get the results. You'll get the contents of example.com in your terminal. Notice that it is not terribly structured. I mean, it's as structured as HTML is, but the output itself is probably kind of all over the place. There's nothing necessarily pretty about the output. It really is just a code dump of whatever it found at example.com. Okay, so that's useful, and requests is nice for grabbing information.
And certainly you could print the contents of a web page out like that and then maybe use grep or awk to parse your output. You could do that, but there is a more intelligent library that not only knows the difference between a response code and the contents of a page, but also understands the difference between an HTML tag, like angle bracket p angle bracket or angle bracket div angle bracket, and the contents of an HTML tag. So for instance, with angle bracket h1 close angle bracket, hello world, or example.com, or whatever, close h1, it knows the difference between the h1 and the title, like hello world or example.com or whatever's in the h1. It knows the difference between the text and the style. Well, is it really the style? I don't know: the markup. The markup and the content.
Okay, so to get started with that, you need to download and install Beautiful Soup as a Python module. And again, the manual way of doing that is python -m pip install bs4, I think, or maybe beautifulsoup, all one word. Yeah, I think it's beautifulsoup, all one word. But honestly, if you use a good IDE, it'll manage that for you, especially something like PyCharm Community Edition. You set up a project, and then when you type in an import, it offers to download that import for you locally, within your project environment. So you're not, for lack of a better word, corrupting or polluting, let's say polluting, the rest of your system with Beautiful Soup. Which, I mean, it's not bad to have Beautiful Soup, but you're not installing Beautiful Soup on your system; you're installing it within the virtual environment of this particular Python project, which is nice. It is legitimately nice, because that way you know exactly what your Python project absolutely requires to run. That's difficult if you installed Beautiful Soup three years ago and start using Beautiful Soup modules without ever thinking about whether you've really installed it. You just kind of think, well, yeah, it's Beautiful Soup, everyone will have that, right? Well, no, they won't. Just like your virtual environment doesn't have it.
Okay, so let's look at essentially the same script that we just did with requests, except with Beautiful Soup, except also with requests. So: from bs4, like Beautiful Soup 4, from bs4 import BeautifulSoup, capital B, capital S, and import requests. We're getting both the Beautiful Soup module and the requests module. And we'll do the same opening, essentially, except we'll kind of speed things up a little bit. So instead of making a separate variable for data and the contents of the data, we'll just do page equals requests.get, and then parentheses, quote, http colon slash slash example.com, close quote, close parentheses. We're just downloading the page with requests and dumping all of that structured data into page. Then we're going to make some soup. So soup equals BeautifulSoup, capital B, capital S, parentheses, page.text, so again we're looking at the text part of the downloaded content, comma, quote, html.parser, close quote, close parentheses.
Here we're doing a special Beautiful Soup call, telling Beautiful Soup to look at the textual content of the variable page, which of course contains the results of what requests got from example.com, and then we're saying to filter that through, or maybe to interpret that through, the HTML parser, which is built into Beautiful Soup. And the way that you would know to do that is you would go to the Beautiful Soup documentation, which is something like beautifulsoup.readthedocs.io. Is that it? No, it's beautiful dash soup dash four, the number four: beautiful-soup-4.readthedocs.io. That's what it is. So you would go there, and you could look at some tutorials, first of all, which would reiterate some of what I'm saying here, but you could also look at the documentation of what kinds of parsers it has and what kinds of library functions it has, and then you could use what you find in your own code. In other words, I'm not being comprehensive here. I'm just telling you the basic parsing abilities of Beautiful Soup, but there's much, much more that it can do.
Okay, so for instance, we'll do, and this is my favorite, favorite incantation from Python, that was sarcasm you heard: if space underscore underscore name underscore underscore space equals equals space single quote underscore underscore main underscore underscore single quote, colon, then next line, indent: print parentheses soup dot prettify parentheses parentheses, close parentheses. What that says is that if a user is launching this script intentionally, in other words, this script is not being called as a library from some other application, it is actually being used as a script, then print soup.prettify. And soup, obviously, is our variable that contains the output of BeautifulSoup, page.text, html.parser, but we're running it through a little function, which we can kind of think of as a filter in this case, a little filter called prettify, P-R-E-T-T-I-F-Y.
In other words: print it, but make it pretty first. It's a little bit like a Unix pipe, in a way. You're kind of sending it through a sort function or something like that, except what you're really doing is telling Beautiful Soup to print the code out in such a way that the indentation is consistent and each tag has its own line. I think that's probably it, but it makes it look pretty, which in contrast to the raw output of requests can be very useful. And certainly, if you were going to, for whatever reason, parse this with an external tool like awk or grep, it would be a lot easier to do that from the output of Beautiful Soup's prettify than from the raw output of requests, which might have unpredictable indentation, unpredictable line breaks or no line breaks, and so on. So prettify is one of the basic but really pleasant functions of Beautiful Soup.
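Put together, the Beautiful Soup version of the script described above looks like this:

from bs4 import BeautifulSoup
import requests

# download the page and dump the structured data into page
page = requests.get("http://example.com")

# interpret the textual content through the built-in HTML parser
soup = BeautifulSoup(page.text, "html.parser")

if __name__ == '__main__':
    # prettify() re-indents the markup consistently, one tag per line
    print(soup.prettify())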
But there is more. For instance, and this is the use case of this lesson, actually, what if you wanted to just find the paragraph tags? You just want to find the angle bracket p close angle bracket elements, those HTML elements. I mean, you can see the content as well, but you want to filter your output and have Beautiful Soup exclude everything but paragraphs. Well, Beautiful Soup is pretty well aware of HTML, because it is using its HTML parser. So you can do, for example, print parentheses soup dot p, close parentheses. Now, if you're used to Python, you already see the problem with this, but you can do that. So instead of print soup.prettify, you do print soup.p, and you get a paragraph tag. You get a paragraph or a sentence that's surrounded by angle bracket p, close angle bracket.
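That is, the print line becomes:

# prints only the first <p> element Beautiful Soup finds on the page
print(soup.p)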
Now, more than likely the web page that you are scraping doesn't contain just one paragraph tag. What you are seeing in your output is, I guess, the first paragraph tag encountered on that page, which can sometimes be revealing for SEO and stuff. You could look at your first paragraph and realize that, as far as whatever search engine knows from its scrape, the first line of your homepage might be, I don't know, follow us on Face Slam or whatever. Who knows? It could be completely irrelevant to your site. So you could diagnose that with web scraping. So what do you know, it's not just a malicious thing after all. So there's the paragraph tag, but it's only a single paragraph tag.
If you want all the paragraph tags, then you can use a for loop. Now, you might as well make this into a function. In an ideal world, I think you would probably try to make a function that could abstract away the element that you're looking for, but I didn't get that far, so this is still kind of hard-coded. You can make a function in Python with the keyword def. Yeah, D-E-F, that's what it is. It actually stands for define, I guess. Why they don't just use the word function, I don't know. We'll never understand why programming languages don't just say what they mean, but def is what we get. So: def space, I don't know, loop_it, parentheses parentheses, colon, and then next line, indent once: for tag in soup.find underscore all, parentheses, single quote, p, close single quote, close parentheses, colon, next line, indent, indent: print tag.
Okay, so for tag: there's nothing magical about the word tag. It's just something that I chose. I could have chosen for i, for item, for penguin. It doesn't matter. It's just some dynamically defined, quick and easy, disposable variable. We just need some place to hold what we find in soup with the function called find_all. That's a Beautiful Soup function: soup.find underscore all, and then parentheses, quote, p, close quote, close parentheses. It's just saying, every time you find a paragraph, put it into a variable called tag, and hey, if you have a tag, print it. Now, if you have that function in your code, and you've got all the rest of the code that I've already talked about, then, well, if you run it, nothing will happen, because you're not actually calling that function yet.
In order to make that function run, you have to tell your Python program to execute the code within the function. That's kind of one of the advantages of functions: they don't happen unless you explicitly tell them to happen, and they don't happen until you tell them to happen. So in the if underscore underscore name underscore underscore space equals equals space single quote underscore underscore main underscore underscore close single quote colon section, don't print something. Just do loop_it, parentheses parentheses, or whatever you called your function. I called it loop_it because it seemed like an obvious name. Now run your code, and you'll see all the paragraphs in example.com.
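The complete script at this point, with the function and the call, looks roughly like this:

from bs4 import BeautifulSoup
import requests

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html.parser")

def loop_it():
    # find_all('p') returns every <p> element on the page;
    # 'tag' is just a disposable loop variable, the name doesn't matter
    for tag in soup.find_all('p'):
        print(tag)

if __name__ == '__main__':
    loop_it()  # nothing happens until the function is actually called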
Now, you can also get just the content. Remember, I said that Beautiful Soup could separate the markup, like angle bracket p angle bracket, or angle bracket div angle bracket, or angle bracket img src equals blah, close angle bracket, well, with a forward slash and then close, or whatever that is. It can tell markup from the actual content. So you could get just the text, the strings, the words on the page. And the way that you do that is, in your function, your loop_it function, for tag in soup.find_all p, instead of just printing the tag, and remember, tag is just a variable that I chose, it doesn't mean anything, it's just, when you find a p tag, put it in this variable called tag, so it could be item, it could be fish, it could be whatever you want it to be, but I called it tag, so for tag in soup.find_all p, print tag dot string. String, of course, being the programming lingo for what we'd call a word: a string of letters, essentially, or characters. And so now you're printing the content. Once you have the text of a web page, you could parse it further with standard Python string libraries.
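In code, the change is just the loop body, from print(tag) to print(tag.string):

def loop_it():
    for tag in soup.find_all('p'):
        # .string leaves out the markup and prints just the text content
        print(tag.string)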
In this particular case, what the person asking me for this lesson wanted was to learn how to count the words of a page. The quick and dirty way to do that would be: for tag in soup.find_all p, if tag.string is not None. So this is important. It's if, and then tag, that's again our variable, dot string, so that's the words of the p element, is not None, with a capital N. And the reason that we have to say this is because Beautiful Soup sometimes recognizes that there's a paragraph tag without content, but it doesn't just not print it. It assigns it a special value called None. Some other programming languages call it null or nil. This one is called None, N-O-N-E, with a capital N. So we're saying, as long as it's not None, so there is content here, then, or not then, because Python doesn't use the word then, but it's colon, next line, indent: print parentheses len, L-E-N, as in length, parentheses, tag dot string dot split, parentheses parentheses, close parentheses, close parentheses. What we're doing there is counting the length of each string as we split the string on the default character of a space, and that's essentially the word count. There might be cleaner ways to do that, to catch, I don't know, odd little exceptions or whatever, but that's a pretty quick and dirty way to get the word count of all the content of all the paragraph tags.

Now, there might be other tags that have content that you want to count, like h1's and h2's, or, I don't know, some ordered lists that don't use paragraphs. You can use paragraphs in an ordered list, but let's say you didn't, you just did an ordered list and then the list items, and you went without paragraphs. Who knows? You would have to take all that into account. And of course, by doing it the quick and dirty way that I just described, you're not getting a total of the words; you're getting it for each paragraph.
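The quick-and-dirty per-paragraph word count, as described:

def loop_it():
    for tag in soup.find_all('p'):
        # empty <p> tags come back as None, so skip those
        if tag.string is not None:
            # split() breaks the text on whitespace; len() counts the pieces
            print(len(tag.string.split()))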
So what if you wanted the total? Well, that's another loop trick, where you do a function, def space loop_it parentheses parentheses colon, next line, indent, and then you have to create a sort of counter variable. And by counter variable, I don't mean counter as in against, as in anti; I mean counter as in, I'm going to count now. So let's call it num, N-U-M, as in number: num equals zero. We're setting our counter to zero. Then we do the same thing: for tag in soup.find_all p, if tag.string is not num, I said not num, didn't I? If tag.string is not None, N-O-N-E, colon, next line, indent, indent, indent: num equals num plus the length, len, of tag dot string dot split, parentheses parentheses. So essentially, we're doing the exact same thing, but instead of printing that number to your terminal every single time it encounters a p tag, we're adding it back to the running total. Num equals num plus the length of the tag string split. And then at the end, outside of your for loop, so back at your original indentation just under the function, you print: the grand total is, num. And now you have your grand total printed at the very end.
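And the running-total version:

def loop_it():
    num = 0  # the counter variable, starting at zero
    for tag in soup.find_all('p'):
        if tag.string is not None:
            # add this paragraph's word count to the running total
            num = num + len(tag.string.split())
    # back outside the loop, print the total once at the end
    print("The grand total is", num)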
Of course, there's a lot more information you can extract with Beautiful Soup and Python. Like I said, probably top of the list eventually would be to accept input. You would want to be able to feed your little script a URL dynamically as you launch it, so that you don't have to go into the code and update the value of the variable every time you want to change the URL that you're downloading. So that would be something to look into. You could also count the number of images, because again, that was something that was specifically asked: how can we count images? You know how to single out the paragraph tag, so you also really know how to single out the image tag. Now, there's inheritance, and children of parent tags, and things like that, that you might want to take into account. It might help you filter things out, or filter things in. So you would want to learn about the way Beautiful Soup views, or walks, the structure of the document. And once you find the images that you want in a reliable way, you would maybe want to count them.
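A minimal sketch of that, extrapolating from the paragraph loop (this wasn't spelled out in the lesson; note that <img> tags have no text content, so you count the elements themselves):

def count_images():
    # find_all('img') returns a list of every <img> element,
    # so its length is the image count
    print(len(soup.find_all('img')))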
And again, you know how to do that. Maybe that would be a separate function, or maybe you could find a way to use the loop_it function to not just find the paragraph tags, but also possibly to find other tags, or maybe you could use it to do both of those things. I don't know if that would be smart. You might want, like, a loop_it for p and a loop_it for image. I don't know. Obviously, it depends on what you're doing with your program. But I think this definitely gives you a little bit of insight into how web scraping works. You could use this, for instance, to find all of the links to pictures on a website, or all of the links to media, videos or something, on a website. And you would do that by zeroing in on the a href tags, or video tags, or image tags, or audio, some kind of audio tag, whatever. You would zero in on that, and then you would find the content, or the attribute, well, it depends on what you're looking at. But yeah, you zero in on that, and then you take a look at the part of that element that you need to look at, whether it's the content or some other part of it. And then you could store that in a variable and process it, or store it and then process it later. Who knows what you want to do. But yeah, this is the start to web scraping with Python. It's a really pretty well documented system. I mean, Beautiful Soup is really, really strong for that. Whether it's the best or not, I don't know. Maybe someone else will record a Hacker Public Radio episode to tell us the way that they do web scraping.
I don't do a lot of web scraping, truth be told. I do it in sort of sprints, you know. I'll do some web scraping for a week, and then I'll walk away from it and never bother again for another three years. So I do, but I don't. And when I do, I tend to either just shell script curl or use Beautiful Soup, because they're both really, really good for different things, I think. So yeah, that's web scraping 101, I guess. I hope that was useful to some people, or interesting. And thank you very much for listening. I'll talk to you next time.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.