Episode: 1694 Title: HPR1694: My APOD downloader Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1694/hpr1694.mp3 Transcribed: 2025-10-18 07:48:34

---

This is HPR Episode 1694, entitled My APOD Downloader. It is hosted by Dave Morris and is about 22 minutes long. The summary is: my simple Perl script to download the Astronomy Picture of the Day each day.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair, at AnHonestHost.com.

Hello everyone, this is Dave Morris. My HPR episode today is called My APOD Downloader, which is pretty cryptic. APOD stands for Astronomy Picture of the Day. You've probably heard of the Astronomy Picture of the Day: it's a website, it's existed since 1995, and it's provided by NASA in combination with Michigan Technological University. It's created and managed by Robert Nemiroff and Jerry Bonnell. The FAQ on the site says the APOD archive contains the largest collection of annotated astronomical images on the internet. I think it's pretty cool, and I really like some of the images, being a bit of an enthusiast for things astronomical.

So let me tell you about the downloader. I'm a KDE user, and as a consequence I quite like a moderate amount of bling. I'm also old-fashioned, so I suppose that fits as well. I quite like to have a picture on my desktop, and I like to use KDE's ability to rotate the wallpaper pictures every so often, so I want a collection of images. To this end I download the Astronomy Picture of the Day on my server every day and make the images available through an NFS-mounted volume, so I can see them on the various machines I have.

So in 2012 I wrote a Perl script to do this downloading. This was one of my earliest forays into scraping of websites. And excuse the noises off; the cat insists she wants to join in with me, and I've closed the door on her.
So she's definitely not getting in. Anyway, I used a fairly primitive HTML parsing technique. I'm not a great fan of web things, which again shows I'm fairly old-fashioned, I suppose; it seems a clunky way to get hold of information programmatically. But this was a challenge I wanted to take up, and I've been improving this script over the intervening years. Now I use a Perl module called HTML::TreeBuilder, which I think is a lot better at parsing HTML.

The version of the script that I actually use myself includes a Perl module, Image::Magick, which is an interface to the awesome ImageMagick image-manipulation software suite. If you've never looked at this, it is amazingly cool. It's got lots of tools in it: it's a library, and it's also got loads of commands that let you do some pretty amazing things with images. Build GIFs, split animated GIFs apart, resize and transform images in all manner of wonderful ways. So in the version I use, I annotate the downloaded image with the title of the image, which I've parsed from the HTML, and I do that so that when the images come up on my screen I know what they are.

But the script I'm offering here is called CollectApodSimple, and it doesn't use Image::Magick. I thought it best to omit this and give you a simpler version, because installing ImageMagick can be a little bit difficult, and more particularly, I guess, the installation of the Perl module. I might be wrong, I don't know, but certainly I've had problems with it, and I thought it was probably best, if you wanted to follow this, not to give you the task of fiddling around with this stuff. There's also the fact that I've maybe not perfected this annotation stuff as well as I could have done, and there are issues with it.
If the image is a reasonable size, not a very great resolution, then the title looks great; but if it's a very, very high resolution image, then the title is absolutely minute, you can't read it, and I haven't yet worked out how to fix that.

So this more advanced script, called CollectApod, and the one I'm talking about today, the simple one, are both available in the Gitorious repository, and there's a link in the show notes to where you can get them. They're actually in a repository with various other odds and ends that I've written for HPR over the years, so you'd probably be best to download the whole lot: clone the whole Git repository, which is not very big, and then either pick out the bits you want and throw the rest away, or just live with the small amount of space being used up.

So let's talk about the code then. If you're a Perl user, or have any understanding of Perl at all, you'll probably look at this script and think it's pretty simple, and it is pretty simple. Basically, all it does is work out the date for which you want the image. Normally this defaults to today's date, but you can also ask for dates in the past (or in the future, but you won't get those). It downloads the HTML, after having built a URL from the date and the other details. The HTML it pulls in contains the title (which is not relevant in this case, since this version doesn't use it) and the image, or at least a link to the image. The script then finds this image among the various links in the page and downloads it to wherever you have defined the drop place to be.

So what I've done is include a listing of the script with annotations. It's pretty heavily commented anyway, but the annotations are there to try and explain what the different sections do. You can't really use the script as it stands.
I suppose you could cut and paste it if you wanted to, but you could just go and get the Git repository, the Gitorious repository, if you want to try running it. So I'm just going to read through the various annotations I've made, so that you can get some idea of what it's doing.

All scripts that I write start with a standard preamble, and you can skip over that; it's just a big comment. There are three modules that are required by the script. There's one called LWP::UserAgent, and this is a venerable Perl module for performing web downloads and all manner of web-related activities. This one actually identifies itself as an agent when it's doing the download. DateTime is the next module; that's just a thing for generating dates in various formats. That's another venerable module. And the one I mentioned before, HTML::TreeBuilder, is the parser for the HTML.

So that's just the preamble. There's a bit of other stuff that follows on from this: various variables that are used to configure things. The critical one, should you ever want to use this, is a variable called imagebase, and it defines where you want the image to be placed. It should be a directory. In my particular case it's using the environment variable HOME, concatenating it with the directory Backgrounds/apod. So all of my images get dropped in there, and that's actually the mount point, the mountable directory I should say, that I use on my server.

So the script then collects a date: it looks for a date on the command line, and if one isn't defined then it will just build one. The date must be in YYMMDD format, in other words two-digit year, two-digit month, two-digit day. So if you type it in yourself, it's got to be in that fixed format. If the script generates it, it generates that form from the current day.
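The script itself is Perl, using the DateTime module; as a rough sketch of the same date handling in Python (the function name and error message here are my own, not the script's):

```python
from datetime import date, datetime
import sys

def apod_date(arg=None):
    """Return a date in the two-digit YYMMDD form the APOD archive uses.

    With no argument, format today's date; otherwise insist the argument
    is already valid YYMMDD and abort if it isn't, mirroring the way the
    Perl script validates a date given on the command line.
    """
    if arg is None:
        # Build the date from the current day: %y gives a two-digit year.
        return date.today().strftime("%y%m%d")
    try:
        # Parsing with the same format validates both shape and values.
        datetime.strptime(arg, "%y%m%d")
    except ValueError:
        sys.exit(f"Bad date '{arg}': expected YYMMDD")
    return arg
```

Running `apod_date("150101")` just returns the string unchanged, while `apod_date("2015-01-01")` aborts with an error.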
If the script doesn't get a date in this format, then it will abort. This date is then used to build the URL, the last element of which consists of the letters "ap" followed by this date and ".html". In case you're interested, when you actually visit the APOD site itself, the URL you see ends with astropix.html. The format of that is slightly different from the one that the script is going for, so I don't actually download that one with the script. There's another version of each page, in the archival format, because all of these pictures are archived, back to the original one in 1995 or so. They're all archived on the site, and they all conform to this "ap" date-stamp ".html" format. There's an example of what it might look like in the show notes and in the annotation to the script.

So, having constructed the URL, there's a lot of declarations and general fiddling about. There's a lot of debug stuff in here, which is switched on in the released version, so you get to see what it's going to download and where it's going to put it and everything. You can easily switch that off by editing the script to change the variable DEBUG (capital D-E-B-U-G) on line 44; change it to zero and it'll shut up.

Having gone through all this stuff, we then come on to the bit which does the download. That's lines 111 to 114, and it's using this LWP::UserAgent that I mentioned. It pulls the page down, and if the download was successful, then the HTML is in a data structure in memory, and it's simply passed to HTML::TreeBuilder, which builds a rather exciting multi-layered structure of Perl data which can then be examined. So, assuming that all of that has worked, the download was successful and the parse has been done, the script then loops through looking for links, "a" tags, in the HTML.
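To illustrate the URL construction and the download step just described, here's a minimal Python equivalent. The real script uses LWP::UserAgent; the base URL below is APOD's current address, and the User-Agent string is my own invention, not the one the script sends:

```python
from urllib.request import Request, urlopen

# APOD's current address; the archive pages date back to 1995.
APOD_BASE = "https://apod.nasa.gov/apod/"

def archive_url(yymmdd):
    """Build the archive-style page URL: 'ap' + YYMMDD + '.html'."""
    return f"{APOD_BASE}ap{yymmdd}.html"

def fetch_page(url):
    """Download a page, identifying ourselves with a User-Agent header
    much as LWP::UserAgent does (this agent string is made up)."""
    req = Request(url, headers={"User-Agent": "apod-fetch-sketch/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

So `archive_url("150101")` yields "https://apod.nasa.gov/apod/ap150101.html", and the fetched HTML would then be handed to a parser.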
There are going to be lots of them, because there are usually lots and lots of links out of the document, and one of them, which is actually part of an image tag, usually contains a pointer to the image. The image that you see on the website is not the one that I'm actually interestedted in: that's a smaller version embedded in the page. There's usually a much, much bigger version of this image that you get if you click on the image on the web page, and it's that one I'm after. So I simply find all of the links and look at each one to see if it has .jpg or .png on the end of it. If it does, then the loop stops, because we reckon we've found it. Obviously this is fairly primitive: if there are other images of any sort on the page, it will only get the first one, which might not be the one you want. But it's been pretty reliable. I've been running this for years, and it's done its job pretty well. You're welcome to go and hack this around if you wish, and let me know what you do if you do that; I'd be interested.

So this loop is from lines 141 to 148, and at the end of it we may not have found an image at all, which is possible because the page might contain an animated GIF (or GIF, however you say that) or a video, and we're not interested in either of those. So there's a check in lines 153 to 155 that says: if the loop stopped, but didn't stop because it hit an image, then we can't go on, there's nothing else to do, so exit, abort. If, on the other hand, we have something that looks like an image, then we pick out the URL and get ready to download it. There are some statements which are preparing the URL. One of the things I do, just because I'm fussy that way, is that in some cases, for some reason or other, the image names end in JPG in capitals, and I always convert that to lowercase, partly because the viewers that I use seem to ignore capital JPGs, or at least they have done at some point in the past.
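The link-scanning loop above can be sketched in Python using the standard library's html.parser as a stand-in for HTML::TreeBuilder. The class name and the sample HTML snippet are made up for illustration:

```python
from html.parser import HTMLParser

class FirstImageLink(HTMLParser):
    """Scan <a href=...> tags and remember the first one whose target
    ends in .jpg or .png, mirroring the loop in the Perl script."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        # Once an image link is found, ignore everything after it.
        if tag != "a" or self.found:
            return
        href = dict(attrs).get("href", "")
        if href.lower().endswith((".jpg", ".png")):
            # Fold a capital extension (.JPG) to lowercase, as the
            # script does before saving.
            root, dot, ext = href.rpartition(".")
            self.found = root + dot + ext.lower()

# A made-up fragment resembling an APOD page: an archive link, then
# the link to the full-size image.
page = '<a href="archivepix.html">Archive</a><a href="image/1501/eagle.JPG">big</a>'
parser = FirstImageLink()
parser.feed(page)
# parser.found is now "image/1501/eagle.jpg"
```

If the page contains only a video or an animated GIF, `found` stays `None`, which corresponds to the abort check the script performs after the loop.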
So I do that, and then, having done that, all we need to do is make an HTTP request, using the LWP module I mentioned before, to download this particular file, this image I should say, and it's simply downloaded straight to the file that it's destined for. I didn't actually explain that too well: the image file, the file that is going to get the image stored in it, is made up from the path that I mentioned earlier on with the name of the file stuck on the end of it, and the name of the file is extracted from the last element of the URL. So there's nothing very sophisticated there. This will either succeed or fail: if it succeeds, a message is printed (in debug mode, that is) explaining that the file has been downloaded; if it fails, then the script will abort with an error message.

So really, that's all there is to it. I normally run this from a cron job on my server, which runs 24/7, and it runs at some weird time of day. I can't remember when I run it; I think I discovered that the image doesn't actually get put up until some time in the early morning UTC, and I think I run this at about seven in the morning, seven UTC that is, so there's plenty of time for it to have been put up and settled down and everything, and I download it then.

So really, that's all there is to it. I hope you find that interesting, and possibly useful, and get to play around with it. There's a bunch of links in the show notes to the various things I've mentioned, and a link to the Gitorious repository where all of this stuff lives. So I hope you find it interesting; if you do, let me know. Okay, thanks, bye now.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a
podcast, then click on our Contributing link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.