Episode: 1694
Title: HPR1694: My APOD downloader
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1694/hpr1694.mp3
Transcribed: 2025-10-18 07:48:34

---

This is HPR Episode 1694 entitled My APOD Downloader.
It is hosted by Dave Morris and is about 22 minutes long.
The summary is: my simple Perl script to download the Astronomy Picture of the Day each day.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello everyone, this is Dave Morris.
My HPR episode today is called My APOD Downloader, which is pretty cryptic.
APOD stands for Astronomy Picture of the Day.
You've probably heard of the Astronomy Picture of the Day.
It's a website. It's existed since 1995 and it's provided by NASA
in combination with Michigan Technological University and it's created and managed by Robert
Nemiroff and Jerry Bonnell. The FAQ on the site says,
"the APOD archive contains the largest collection of annotated astronomical images on the internet."
And I think it's pretty cool and I really like some of the images, being a bit of an enthusiast
for things astronomical. So let me tell you about the downloader.
I'm a KDE user and as a consequence I quite like a moderate amount of bling.
I'm also old fashioned so I suppose that fits as well.
I quite like to have a picture on my desktop and I like to use KDE's ability to rotate
the wallpaper pictures every so often. So I want a collection of images.
So to this end I download the Astronomy Picture of the Day on my server every day and make
the images available through an NFS-mounted volume so I can see them on the various machines I have.
So in 2012 I wrote a Perl script to do this downloading.
This is one of the earliest forays that I made into scraping of websites, and excuse the noises
off; the cat is insistent that she wants to join in with me and I've closed the door on her.
So she's definitely getting.
So I used a fairly primitive HTML parsing technique.
I'm not a great fan of web things. Again, shows I'm fairly old fashioned I suppose.
Seems a clunky way to get hold of information programmatically.
Anyway, this was a challenge I wanted to take up and so I've been improving this script over
the intervening years and now I use a Perl module called HTML::TreeBuilder, which I think is a lot
better at parsing HTML. The version of the script that I actually use myself includes a Perl module,
Image::Magick, which interfaces to the awesome ImageMagick image manipulation, I can't say that
word, image manipulation software suite. If you've never looked at this, it is amazingly cool.
It's got lots of tools in it. It's a library and it's also got loads of commands that let you do
some pretty amazing things with images. Build GIFs, split GIFs apart, all sorts of animated
GIFs, resize and transform images in all manner of wonderful ways. So in the version I use I annotate
the downloaded image with the title of the image, which I have parsed from the HTML,
and I do that so that when I see them come up on my screen I know what they are.
But the script I'm offering here is called CollectApodSimple and it doesn't use ImageMagick.
I've thought it best to omit this and give you a simpler version because installing ImageMagick
can be a little bit difficult, or more, I guess, the installation of the Perl module. I might be
wrong, I don't know, but certainly I've had problems with it and I thought it was probably best
not to give you, if you wanted to follow this, the task of fiddling around with this
stuff. There's also the fact that I've maybe not perfected this annotation stuff as well as I
could have done and there are issues with it. If the image is a sort of reasonable size,
not a very great resolution, then the title looks great, but if it's a very, very high resolution
image then the title is absolutely minute, you can't read it, and I haven't yet worked out how
to fix that. So this more advanced script called CollectApod and the one I'm talking about today, the
simple one, are both available on the Gitorious repository and there's a link in the show notes
to where you can get them. They're actually in a repository with various other odds and ends
that I've written for HPR over the years so you'd probably be best to download the whole
lot, clone the whole Git repository, it's not very big, and then either pick out the bits you want
and throw the rest away or just live with this small amount of space being used up.
So let's talk about the code then. If you're a Perl user, if you have any
understanding of Perl of any sort, you'll probably look at this script and think
it's pretty simple. It is pretty simple. Basically all it does is work out the date for which
you want the image. Normally this would default to today's date, but you can also ask for dates in
the past, or in the future, but you won't get them. It downloads the HTML after having built
the URL from the date and the other details, and having pulled in that HTML, that's the thing
that contains the title, which is not relevant in this case though it is used in the other
one, and it contains the image, or at least a link to the image, and the script then finds this image
in the various links in the page and downloads the image, putting it where you have defined
the drop place to be. So what I've done is I've included a listing of the script with annotations.
It's pretty heavily commented anyway but the annotations are there to try and explain what
the different sections do. You can't really use the script as it stands. I suppose you could
cut and paste it if you wanted to, but you could just go and get the Git repository, the
Gitorious repository, if you want to try running it. So I'm just going to read through the various
additions I've made, the annotations I've made to this, so that you can get some idea of
what it's doing. Perl scripts that I write always start with a standard preamble and you can skip
over that, it's just a big comment, and there are three modules that are required by the script.
There's one called LWP::UserAgent and this is a vintage Perl module for performing web downloads
and all manner of web-related activities. This one actually identifies itself as an agent when
it's doing the download. DateTime is the next module; that's just a thing for generating dates in
various formats. That's another venerable module, and the one I mentioned before, HTML::TreeBuilder,
is the parser for the HTML. So that's just the preamble. There's a bit of
other stuff that follows on from this. There's various variables that are used, and
the critical one, should you ever want to use this, is a variable called image
base and it defines where you want the image to be placed. It's a directory, it should be a
directory. In my particular case it's using the environment variable HOME, concatenating it with
the directory backgrounds slash apod. So all of my images get dropped in there and that's
actually the mount point. That's the mountable volume, the mountable directory I should say,
that I use on my server. So the script potentially collects a date: it looks for a date
on the command line. If it's not defined then it will just build one. The date must be in
YYMMDD format, in other words two-digit year, two-digit month, two-digit day. So if you type it
in yourself it's got to be in that fixed format. If the script generates it, it generates that
form from the current day. If the script doesn't get a date in this format then it
will abort. So this date is then used to build the URL, where the last
element of it consists of the letters ap followed by this date, dot html. In case you're
interested, when you actually click on the APOD site itself the URL you see ends with astropix dot html.
The format of that is slightly different from the one that the script is going for.
So I don't actually download that one with the script. There's another version of this
which is in the sort of archival format, because all of these pictures are archived back to
the original one back in '95, I think I said it was. They're all archived on the site so
they all conform to this ap date stamp dot html format. So there'll be an example of what it might
look like in the show notes there and the annotation to the script. So having constructed
the URL, in the script there's a lot of declarations and generally fiddling about.
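
The date check and URL construction described above can be sketched roughly as follows. The script in the episode is Perl; this is only a Python analogue, not the author's code. The base URL and the function name apod_page_url are assumptions for illustration, since the episode only gives the last element of the URL (ap, the YYMMDD date, then .html).

```python
import re
from datetime import date

# Assumed base URL of the APOD archive; the episode does not spell it out.
APOD_BASE = "https://apod.nasa.gov/apod"


def apod_page_url(yymmdd=None, today=None):
    """Return the archive page URL for a YYMMDD date (default: today's date)."""
    if yymmdd is None:
        # No date supplied, so build one from the current day.
        today = today or date.today()
        yymmdd = today.strftime("%y%m%d")
    # Anything other than exactly six ASCII digits is rejected (the script aborts).
    if not re.fullmatch(r"[0-9]{6}", yymmdd):
        raise ValueError(f"date must be in YYMMDD format, got {yymmdd!r}")
    return f"{APOD_BASE}/ap{yymmdd}.html"
```

A caller would pass the command-line argument if there is one, for example apod_page_url("141201"), and treat the ValueError as the cue to exit with an error message.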
There's a lot of debug stuff in here which is switched on in the released version
so you get to see what it's going to download and where it's going to put it and everything.
You can easily switch that off by editing the thing to change the variable DEBUG, capital
D E B U G, on line 44; you can change that to zero and it'll shut up. So having gone through all this
stuff we then come on to the bit which does the download. That's lines 111 to 114 and it's
using this LWP::UserAgent that I mentioned. It pulls the page down; if the
download was successful then the HTML is actually in a data structure in memory and it's simply
passed to HTML::TreeBuilder which builds this rather exciting multi-layered structure of
Perl data which can then be examined. So assuming that all of that has worked, the download was
successful and the parse has begun, the script then loops through looking for links,
A tags in HTML. There are going to be lots of them because there's lots and lots of links out of
the document usually, and one of them, which is actually part of an image tag, usually contains
a pointer to the image. The image that you see on the website is not the one that I'm actually
interested in. It's a smaller version of it, so it's actually embedded in the page. There's usually
a much, much bigger version of this image that you get if you click on the image on the web page.
So it's that one I'm after. So I simply find all of the links and look at each one to see if it
contains .jpg or .png on the end of it. And if it does then the loop stops because we reckon we've
found it. Obviously this is fairly primitive. If there are other images of any sort on the page
it will only get the first one, which might not be the one you want. But it's been pretty reliable.
I've been running this for years and it's done its job pretty well.
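
The link loop just described could look something like this in outline. Again this is a Python sketch using the standard library's html.parser rather than the Perl HTML::TreeBuilder module the script uses, and the names LinkCollector and first_image_link are made up for illustration. The match is assumed to be case-insensitive, since some links end in .JPG in capitals.

```python
import re
from html.parser import HTMLParser

# Link targets ending in .jpg or .png, in either case (assumption).
IMAGE_RE = re.compile(r"\.(jpg|png)$", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_image_link(html):
    """Return the first link that looks like an image, or None if there is none."""
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        if IMAGE_RE.search(href):
            return href  # stop at the first match, as the script's loop does
    return None
```

A result of None corresponds to the case where the day's page holds a video or an animation instead of a still image, so there is nothing to download.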
So you're welcome to go and hack this around if you wish, and let me know what you do if you
do that. I'd be interested. So this loop is from lines 141 to 148 and at the end of that
we've either found one or not found an image at all, which is possible because the page might contain
a GIF, an animated GIF, however you say that, or a video. We're not interested in either of those.
So there's a check down in lines 153 to 155 that says if the loop stopped but didn't stop because it
hit an image then we can't go on, there's nothing else to do: exit, abort. So if on the other hand
we have one that looks like an image then we will pick out the URL and get ready to download it.
So there's some stuff, some statements which are preparing the URL. One of the things I do, just
because I'm fussy in that way, is that in some cases, for some reason or other, the images end
in JPG in capitals and I always convert them to lowercase, partly because the viewers that I
use seem to ignore capital JPGs, or at least they have done at some point in the past. So I do that,
and then having done that all we need to do is to make a request, an HTTP request.
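
As a rough illustration of preparing the URL, here is a Python sketch under several assumptions: that the link found in the page may be relative to the page URL, that the lower-casing is applied to the local file name rather than to the URL being requested, and that the file is saved under the last path element of the URL inside the image base directory. The function name prepare_download is hypothetical.

```python
import os
import posixpath
from urllib.parse import urljoin, urlsplit


def prepare_download(page_url, href, image_base):
    """Return (absolute image URL, local destination path) for a found link."""
    # Resolve a possibly relative link against the page it came from.
    image_url = urljoin(page_url, href)
    # The local file name is the last element of the URL path.
    filename = posixpath.basename(urlsplit(image_url).path)
    # Normalise the extension so that .JPG is stored as .jpg.
    stem, ext = posixpath.splitext(filename)
    return image_url, os.path.join(image_base, stem + ext.lower())
```

The image base argument plays the role of the directory variable mentioned earlier, for example the HOME environment variable joined with backgrounds/apod.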
This is using the LWP library I mentioned before, the module. Make a request to download
this particular file, this image I should say, and this is simply downloaded straight to the file
that it's destined for, and I probably didn't actually explain that too well. The image file, the file that
is going to get the image stored in it, is made up from the path that I mentioned earlier on
with the name of the file stuck on the end of it, and the name of the file is extracted from the
last element of the URL, so there's nothing very sophisticated there. So this will either
succeed or fail. If it succeeds a message is printed, in debug mode that is, explaining
that the file's downloaded. If it failed then the script will abort with an error message. So really that's
all there is to it. I normally run this from a cron job on my server, which runs 24/7,
and it runs at some weird time of the day, I can't remember when I run it. I think I discovered that
the image doesn't actually get put up until sometime in the early morning UTC and I think I run this
at about seven in the morning or something, seven UTC that is,
so there's plenty of time for it to have been put up and settled down and everything,
and I download it then. So really that's all there is to it. I hope you find that
interesting and possibly useful and get to play around with it. There's a bunch of links
in the show notes to the various things I've mentioned and a link to the Gitorious
repository where all of this stuff lives. So I hope you find it interesting; if you do, let me know.
OK, thanks, bye now.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast then click on our contribute link to find out
how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the
Infonomicon Computer Club and it's part of the binary revolution at binrev.com. If you have
comments on today's show please email the host directly, leave a comment on the website or record
a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative
Commons Attribution-ShareAlike 3.0 license.