Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr1694.txt (new file, 164 lines)
Episode: 1694
Title: HPR1694: My APOD downloader
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1694/hpr1694.mp3
Transcribed: 2025-10-18 07:48:34

---
This is HPR Episode 1694 entitled My APOD Downloader. It is hosted by Dave Morris and is about 22 minutes long. The summary is: my simple Perl script to download the Astronomy Picture of the Day each day.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello everyone, this is Dave Morris. My HPR episode today is called My APOD Downloader, which is pretty cryptic. APOD stands for Astronomy Picture of the Day. You've probably heard of the Astronomy Picture of the Day. It's a website; it's existed since 1995 and it's provided by NASA in combination with Michigan Technological University, and it's created and managed by Robert Nemiroff and Jerry Bonnell. The FAQ on the site says the APOD archive contains the largest collection of annotated astronomical images on the internet. I think it's pretty cool, and I really like some of the images, being a bit of an enthusiast for things astronomical. So let me tell you about the downloader.
I'm a KDE user and, as a consequence, I quite like a moderate amount of bling. I'm also old fashioned, so I suppose that fits as well. I quite like to have a picture on my desktop, and I like to use KDE's ability to rotate the wallpaper pictures every so often, so I want a collection of images. To this end I download the Astronomy Picture of the Day on my server every day and make the images available through an NFS-mounted volume, so I can see them on various machines I have.

So in 2012 I wrote a Perl script to do this downloading. This was one of my earliest forays into scraping of websites. And excuse the noises off; the cat is insistent, she wants to join in with me, and I've closed the door on her, so she's definitely not happy about it.
So I used a fairly primitive HTML parsing technique. I'm not a great fan of web things; again, that shows I'm fairly old fashioned, I suppose. It seems a clunky way to get hold of information programmatically. Anyway, this was a challenge I wanted to take up, and I've been improving this script over the intervening years. Now I use a Perl module called HTML::TreeBuilder, which I think is a lot better at parsing HTML.

The version of the script that I actually use myself includes a Perl module, Image::Magick, which is an interface to the awesome ImageMagick image manipulation (I can't say that word) image manipulation software suite. If you've never looked at this, it is amazingly cool. It's got lots of tools in it: it's a library and it's also got loads of commands. They let you do some pretty amazing things with images: build GIFs, split GIFs apart, all sorts of animated GIF stuff, resize and transform images in all manner of wonderful ways. So in the version I use, I annotate the downloaded image with the title of the image, which I've parsed from the HTML, and I do that so that when I see them come up on my screen I know what they are.
But the script I'm offering here is called CollectApodSimple, and it doesn't use ImageMagick. I thought it best to omit this and give you a simpler version, because installing ImageMagick can be a little bit difficult, or rather, I guess, the installation of the Perl module. I might be wrong, I don't know, but certainly I've had problems with it, and I thought it was probably best, if you wanted to follow this, not to give you the task of fiddling around with this stuff.

There's also the fact that I've maybe not perfected this annotation stuff as well as I could have done, and there are issues with it. If the image is a reasonable size, not a very great resolution, then the title looks great; but if it's a very, very high resolution image, then the title is absolutely minute, you can't read it, and I haven't yet worked out how to fix that.

So the more advanced script, called CollectApod, and the one I'm talking about today, the simple one, are both available on the Gitorious repository, and there's a link in the show notes to where you can get them. They're actually in a repository with various other odds and ends that I've written for HPR over the years, so you'd probably be best to download the whole lot: clone the whole Git repository (it's not very big) and then either pick out the bits you want and throw the rest away, or just live with this small amount of space being used up.
So let's talk about the code then. If you're a Perl user, if you have any understanding of Perl at all, you'll probably look at this script and think it's pretty simple, and it is pretty simple. Basically all it does is work out the date for which you want the image; normally this would default to today's date, but you can also ask for dates in the past (or in the future, but you won't get them). It downloads the HTML, after having built the URL from the date and the other details. That HTML contains the title, which is not relevant in this case (the other script uses it), and it contains the image, or at least a link to the image. The script then finds this image among the various links in the page and downloads the image to wherever you have defined the drop place to be.

So what I've done is I've included a listing of the script with annotations. It's pretty heavily commented anyway, but the annotations are there to try and explain what the different sections do. You can't really use the script as it stands in the listing; I suppose you could cut and paste it if you wanted to, but you could just go and get the Git repository, the Gitorious repository, if you want to try running it. So I'm just going to read through the various additions I've made, the annotations, so that you can get some idea of what it's doing.

The scripts that I write always start with a standard preamble, and you can skip over that: it's just a big comment. There are three modules that are required by the script.
There's one called LWP::UserAgent, and this is a vintage Perl module for performing web downloads and all manner of web-related activities. This one actually identifies itself as an agent when it's doing the download. DateTime is the next module; that's just a thing for generating dates in various formats. That's another venerable module. And the one I mentioned before, HTML::TreeBuilder, is the parser for the HTML.

So that's just the preamble. There's a bit of other stuff that follows on from this: there are various variables that are used. The critical one, should you ever want to use this, is a variable called image base, and it defines where you want the image to be placed. It's a directory, it should be a directory. In my particular case it's using the environment variable HOME, concatenating it with the directory Backgrounds/apod. So all of my images get dropped in there, and that's actually the mount point, that's the mounted volume, the mounted directory I should say, that I use on my server.
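As a rough sketch (not the actual script; the variable name and the exact path are illustrative), the preamble and the setting just described look something like this in Perl:

    #!/usr/bin/perl
    # The three modules the script needs, plus the target directory.
    use strict;
    use warnings;

    use LWP::UserAgent;       # performs the web downloads
    use DateTime;             # generates dates in the required format
    use HTML::TreeBuilder;    # parses the downloaded HTML

    # Where downloaded images are dropped (name and path are examples)
    my $imagebase = "$ENV{HOME}/Backgrounds/apod";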
So the script collects, potentially, a date: it looks for a date on the command line. If it's not defined, then it will just build one. The date must be in YYMMDD format, in other words two-digit year, two-digit month, two-digit day. So if you type it in yourself, it's got to be in that fixed format. If the script generates it, it generates that form from the current day. If the script doesn't get a date in this format, then it will abort.

This date is then used to build the URL, whose last element simply consists of the letters 'ap' followed by this date and '.html'. In case you're interested, when you actually click on the APOD site itself, the URL you see ends with astropix.html. The format of that is slightly different from the one that the script is going for, so I don't actually download that one with the script. There's another version of this which is in the archival format, because all of these pictures are archived back to the original one, back in '95 I think it was. They're all archived on the site, so they all conform to this ap<datestamp>.html format. There'll be an example of what it might look like in the show notes and in the annotation to the script.
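Sketched in Perl, the date handling and URL construction just described might look like this (the variable names and the apod.nasa.gov base URL are assumptions, not taken from the script):

    # Take a YYMMDD date from the command line, or default to today (UTC)
    my $date = shift(@ARGV)
        // DateTime->now( time_zone => 'UTC' )->strftime('%y%m%d');

    # Abort unless the date is in the fixed two-digit year/month/day format
    die "Date must be in YYMMDD format\n" unless $date =~ /^\d{6}$/;

    # The archive pages all end in ap<datestamp>.html
    my $url = "https://apod.nasa.gov/apod/ap${date}.html";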
So, having constructed the URL, the script does a lot of declarations and generally fiddles about. There's a lot of debug stuff in here, which is switched on in the released version, so you get to see what it's going to download and where it's going to put it and everything. You can easily switch that off by editing the script to change the variable DEBUG, capital D-E-B-U-G, on line 44; you can change that to zero and it'll shut up.
So, having gone through all this stuff, we then come on to the bit which does the download. That's lines 111 to 114, and it's using this LWP::UserAgent that I mentioned. It pulls the page down; if it was successful, if the download was successful, then the HTML is actually in a data structure in memory, and it's simply passed to HTML::TreeBuilder, which builds this rather exciting multi-layered structure of Perl data which can then be examined. So, assuming that all of that has worked, the download was successful and the parse has begun.
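In Perl terms, the download-and-parse step works roughly like this (a sketch with illustrative names, not the actual lines 111 to 114 of the script):

    # Fetch the page; the user agent identifies itself when downloading
    my $ua = LWP::UserAgent->new( agent => 'apod_downloader' );
    my $response = $ua->get($url);
    die 'Download failed: ' . $response->status_line . "\n"
        unless $response->is_success;

    # The HTML is now in memory; hand it to HTML::TreeBuilder for parsing
    my $tree = HTML::TreeBuilder->new_from_content( $response->decoded_content );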
The script then loops through looking for links, 'a' tags, in the HTML. There are going to be lots of them, because there's usually lots and lots of links out of the document, and one of them, which is actually part of an image tag usually, contains a pointer to the image. The image that you see on the website is not the one that I'm actually interested in: it's a smaller version that's embedded in the page. There's usually a much, much bigger version of this image that you get if you click on the image on the web page, so it's that one I'm after. So I simply find all of the links and look at each one to see if it contains .jpg or .png on the end of it. If it does, then the loop stops, because we reckon we've found it. Obviously this is fairly primitive: if there are other images of any sort on the page, it will only get the first one, which might not be the one you want. But it's been pretty reliable; I've been running this for years and it's done its job pretty well. So you're welcome to go and hack this around if you wish, and let me know what you do if you do that, I'd be interested.

So this loop is from lines 141 to 148, and at the end of it we hope we've found something. We might not have found an image at all, which is possible because the page might contain an animated GIF ('gif', 'jif', however you say that) or a video, and we're not interested in either of those. So there's a check down in lines 153 to 155 that says if the loop stopped, but didn't stop because it hit an image, then we can't go on, there's nothing else to do: exit, abort. If, on the other hand, we have one that looks like an image, then we will pick out the URL and get ready to download it.
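A minimal sketch of that loop in Perl, assuming the names from the earlier sketches (on the real page the link is usually relative, so it would need joining onto the site's base URL):

    # Walk the <a> tags and stop at the first href ending in .jpg or .png
    my $imgurl;
    for my $anchor ( $tree->look_down( _tag => 'a' ) ) {
        my $href = $anchor->attr('href') or next;
        if ( $href =~ /\.(?:jpg|png)$/i ) {
            $imgurl = $href;
            last;
        }
    }

    # If the loop never hit an image link, there is nothing more to do
    die "No image link found on the page\n" unless defined $imgurl;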
So there's some stuff, some statements, which are preparing the URL. One of the things I do, just because I'm fussy in that way, is that in some cases, for some reason or other, the images end in .JPG in capitals, and I always convert that to lowercase, partly because the viewers that I use seem to ignore capital JPGs, or at least they have done at some point in the past. So I do that, and then, having done that, all we need to do is to make a request, an HTTP request, using the LWP module I mentioned before, to download this particular file, this image I should say. It is simply downloaded straight to the file that it's destined for. I didn't actually explain that too well: the image file, the file that is going to get the image stored in it, is made up from the path that I mentioned earlier on with the name of the file stuck on the end of it, and the name of the file is extracted from the last element of the URL, so nothing very sophisticated there. This will either succeed or fail; if it succeeds, a message is printed (in debug mode, that is) explaining that the file's downloaded, and if it failed, then the script will abort with an error message.
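Sketched in Perl, assuming the variables from the earlier sketches, that final step might look like this (again illustrative, not the script's own code):

    # The file name is the last element of the URL; normalise a .JPG ending
    ( my $filename = $imgurl ) =~ s{^.*/}{};
    $filename =~ s/\.JPG$/.jpg/;

    # Destination is the image base directory plus the file name
    my $target = "$imagebase/$filename";

    # Download straight into the destination file and report the outcome
    my $result = $ua->get( $imgurl, ':content_file' => $target );
    die 'Image download failed: ' . $result->status_line . "\n"
        unless $result->is_success;
    print "Downloaded $imgurl to $target\n";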
So really, that's all there is to it. I normally run this from a cron job on my server, which runs 24/7, and it runs at some weird time of the day; I can't remember when I run it. I think I discovered that the image doesn't actually get put up until sometime in the early morning UTC, and I think I run this at about seven in the morning or something, seven UTC that is, so there's plenty of time for it to have been put up and settled down and everything, and I download it then.
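For reference, a cron entry along those lines might look like this (the path and script name here are just examples):

    # m h dom mon dow   command
    0 7 * * * $HOME/bin/collect_apod_simple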
So really, that's all there is to it. I hope you find it interesting and possibly useful, and that you get to play around with it. There's a bunch of links in the show notes to the various things I've mentioned, and a link to the Gitorious repository where all of this stuff lives. So I hope you find it interesting; if you do, let me know. Okay, thanks, bye now.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.