You have probably heard of the Astronomy Picture of the Day (APOD) site. It has existed since 1995, is provided by NASA and Michigan Technological University (MTU) and is created and managed by Robert Nemiroff and Jerry Bonnell. The FAQ on the site says "The APOD archive contains the largest collection of annotated astronomical images on the internet".
Being a KDE user I quite like a moderate amount of bling, and I particularly like to have a picture on my desktop. I like to rotate my wallpaper pictures every so often, so I want to have a collection of images. To this end I download the APOD on my server every day and make the images available through an NFS-mounted volume.
In 2012 I wrote a Perl script to perform the download, using a fairly primitive HTML parsing method. This script has been improved over the intervening years and now uses the Perl module HTML::TreeBuilder which I believe is much better at parsing HTML.
The version of the script I use myself also includes the Perl module Image::Magick which interfaces to the awesome ImageMagick image manipulation software suite. I use this to annotate the downloaded image with the title parsed from the HTML so I know what it is.
The script I am presenting here is called collect_apod_simple and does not use ImageMagick. I chose to omit it because the installation of this suite and the related Perl module can be difficult. Also, I do not feel that the annotation always works as well as it could, and I have not yet found the time to correct this shortcoming.
A version of the more advanced script (called collect_apod) is available in the same place as collect_apod_simple should you wish to give it a try. Both scripts are available on GitLab under the link https://gitlab.com/davmo/hprmisc.
If you are acquainted with Perl you'll probably find this script quite simple. All it really does is:
Get or compute the date string for building the APOD URL
Download the HTML on the selected APOD page
Look for an image being used as a link
Download the image being linked to and save it where requested
The following is a numbered listing with annotations. There are several comments in the script itself, but the annotations are there to make the various sections as clear as possible.
1 #!/usr/bin/env perl
2 #===============================================================================
3 #
4 # FILE: collect_apod_simple
5 #
6 # USAGE: ./collect_apod_simple [YYMMDD]
7 #
8 # DESCRIPTION: Downloads the current Astronomy Picture of the Day or that
9 # relating to the formatted date provided as an argument. In
10 # this context "current" can mean two URLs: .../astropix.html or
11 # .../apYYMMDD.html. We now *do not* download the
12 # .../astropix.html version since it has a different HTML
13 # layout.
14 #
15 # OPTIONS: ---
16 # REQUIREMENTS: ---
17 # BUGS: ---
18 # NOTES: Based on 'collect_apod' but without the Image::Magick stuff,
19 # for simplicity and for release to the HPR community
20 # AUTHOR: Dave Morriss (djm), Dave.Morriss@gmail.com
21 # VERSION: 0.0.1
22 # CREATED: 2015-01-02 19:58:01
23 # REVISION: 2015-01-03 23:00:27
24 #
25 #===============================================================================
26
27 use 5.010;
28 use strict;
29 use warnings;
30 use utf8;
31
32 use LWP::UserAgent;
33 use DateTime;
34 use HTML::TreeBuilder 5 -weak;
35
Lines 32-34 define the modules the script uses:
LWP::UserAgent is used to perform the web downloads
DateTime generates and formats the default date
HTML::TreeBuilder parses HTML
36 #
37 # Version number (manually incremented)
38 #
39 our $VERSION = '0.0.1';
40
41 #
42 # Set to 0 to be more silent
43 #
44 my $DEBUG = 1;
45
46 #
47 # Script name
48 #
49 ( my $PROG = $0 ) =~ s|.*/||mx;
50
51 #-------------------------------------------------------------------------------
52 # Edit this to your needs
53 #-------------------------------------------------------------------------------
54 #
55 # Where the script will download the picture. Edit this to where you want
56 #
57 my $image_base = "$ENV{HOME}/Backgrounds/apod";
58
59 #-------------------------------------------------------------------------------
60 # Nothing needs editing below here
61 #-------------------------------------------------------------------------------
62
63 #
64 # Get the argument or default it
65 #
66 my $arg = shift;
67 unless ( defined($arg) ) {
68 #
69 # APOD wants a date in YYMMDD format
70 #
71 my $dt = DateTime->now;
72 $arg = sprintf( "%02i%02i%02i",
73 substr( $dt->year, -2 ),
74 $dt->month, $dt->day );
75 }
76
77 #
78 # Check the argument is a valid date in YYMMDD format
79 #
80 die "Usage: $PROG [YYMMDD]\n" unless ( $arg =~ /^\d{6}$/ );
81
Lines 66-80 collect the date from the command line, or if none is given generate the correctly formatted date. If a date in an invalid format is given the script aborts.
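The test on line 80 only confirms that the argument is six digits, so a string such as 150230 would pass. As a minimal sketch of a stricter check, the following uses only core modules (POSIX and Time::Local). The helper names default_apod_date and is_valid_apod_date are hypothetical and not part of the script, and the century rule (95-99 means 19xx, anything else 20xx) is an assumption based on APOD starting in 1995.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Hypothetical helper: build the default YYMMDD string from an epoch time
# (defaults to now), using UTC as DateTime->now does.
sub default_apod_date {
    my ($epoch) = @_;
    $epoch //= time;
    return strftime( '%y%m%d', gmtime($epoch) );
}

# Hypothetical helper: true only if the argument is six digits that also
# name a real calendar date.
sub is_valid_apod_date {
    my ($arg) = @_;
    return 0 unless defined $arg && $arg =~ /^(\d{2})(\d{2})(\d{2})$/;
    my ( $yy, $mm, $dd ) = ( $1, $2, $3 );

    # Assumption: 95-99 means 1995-1999, everything else means 20xx
    my $year = $yy >= 95 ? 1900 + $yy : 2000 + $yy;

    # timegm croaks on an out-of-range day or month, so trap that
    return eval { timegm( 0, 0, 0, $dd, $mm - 1, $year ); 1 } ? 1 : 0;
}

for my $candidate ( default_apod_date(), '150106', '150230' ) {
    printf "%s is %s\n", $candidate,
        is_valid_apod_date($candidate) ? 'valid' : 'not valid';
}
```

A check like this could replace the regular expression test if invalid dates need to be rejected before any request is made. As it stands, the script relies on the web server to reject a nonexistent page.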
82 #
83 # Make an URL depending on the argument
84 #
85 my $apod_base = "http://apod.nasa.gov/apod";
86 my $apod_URL = "$apod_base/ap$arg.html";
87
Lines 85-86 define the APOD URL for the chosen date. This will look like http://apod.nasa.gov/apod/ap150106.html for 2015-01-06 for example.
88 #
89 # General declarations
90 #
91 my ( $image_URL, $image_file );
92 my ( $tree, $title );
93 my ( $url, $element, $attr, $tag );
94
95 #
96 # Enable Unicode mode
97 #
98 binmode STDOUT, ":encoding(UTF-8)";
99 binmode STDERR, ":encoding(UTF-8)";
100
101 if ($DEBUG) {
102 print "Base URL: $apod_base\n";
103 print "APOD URL: $apod_URL\n";
104 print "Image base: $image_base\n";
105 print "\n";
106 }
107
108 #
109 # Get the HTML page, pretending to be some unknown User Agent
110 #
111 my $ua = LWP::UserAgent->new;
112 $ua->agent("MyApp/0.1");
113
114 my $req = HTTP::Request->new( GET => $apod_URL );
115
116 my $res = $ua->request($req);
117 if ( $res->is_success ) {
118 print "GET request successful\n" if $DEBUG;
119
120 #
121 # Parse the HTML we got back
122 #
123 $tree = HTML::TreeBuilder->new;
124 $tree->parse_content( $res->content_ref );
125
Lines 111-116 set up the user agent and download the APOD web page. If the download was successful, the HTML is parsed with HTML::TreeBuilder in lines 123 and 124.
126 #
127 # Get and display the title in debug mode
128 #
129 if ($DEBUG) {
130 if ( $title = $tree->look_down( _tag => 'title' ) ) {
131 $title = $title->as_trimmed_text();
132 print "Found title: $title\n" if $title;
133 }
134 }
135
136 #
137 # Look for the image. This is expected to be the href attribute of an <a>
138 # tag. The image we see on the page is merely a link to this (usually)
139 # larger image.
140 #
141 for ( @{ $tree->extract_links('a') } ) {
142 ( $url, $element, $attr, $tag ) = @$_;
143 if ($DEBUG) {
144 print "Found: $url\n" if $url;
145 }
146 last unless defined($url);
147 last if ( $url =~ /\.(jpg|png)$/i );
148 }
149
Lines 141-148 consist of a loop which walks through the parsed HTML looking for <a> tags. The loop ends at the first tag whose link ends in .jpg or .png, in other words when the tag references an image URL.
150 #
151 # Abort if no image (it might be a video or a GIF)
152 #
153 die "Image URL not found\n"
154 unless defined($url)
155 && $url =~ /\.(jpg|png)$/i;
156
Lines 153-155 check that an image URL was actually found. Some days the APOD site might host a YouTube video or some other animated display. The script is not interested in these since they are no use as wallpaper.
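The second test is needed because the loop in lines 141-148 leaves the last link it examined in $url even when nothing matched. The selection rule can be shown on its own, separate from the HTML parsing. This is a sketch using the core List::Util module; the helper name first_image_link and the sample links are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical helper: return the first link ending in .jpg or .png
# (case-insensitive), or undef when no link qualifies.
sub first_image_link {
    my @urls = @_;
    return first { defined($_) && /\.(?:jpg|png)$/i } @urls;
}

# Sample links, invented for this example
my @links = (
    'archivepix.html',
    'image/1501/example_big.jpg',
    'image/1501/example.png',
);

my $found = first_image_link(@links);
print defined $found
    ? "Image link: $found\n"
    : "No image link found\n";
```

Returning undef when there is no match makes a separate check after the search unnecessary, at the cost of an extra helper.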
157 $image_URL = "$apod_base/$url";
158
159 #
160 # Extract the final part of the URL for the file name. We usually get
161 # a JPEG, sometimes with a shouty extension, which we change.
162 #
163 ( $image_file = $image_URL ) =~ s|.*/||mx;
164 ( $image_file = "$image_base/$image_file" ) =~ s/JPG$/jpg/mx;
165
166 if ($DEBUG) {
167 print "Image URL: $image_URL\n";
168 print "Image file: $image_file\n";
169 }
170
171 #
172 # Abort if the file already exists (the script already ran?)
173 #
174 die "File $image_file already exists\n" if ( -f $image_file );
175
Lines 157-174 prepare the image URL and make a file name to hold the image. If a file of that name already exists the script aborts.
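The substitution on line 164 only changes an upper-case JPG extension, so a name ending in .PNG or .Jpg is kept as it is. As a sketch of one alternative, the following lower-cases whatever extension is present. The helper name local_image_path and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an image URL into a local file path under
# $image_base, lower-casing the file extension.
sub local_image_path {
    my ( $image_base, $image_url ) = @_;

    # Keep only the part after the final slash
    ( my $name = $image_url ) =~ s{.*/}{};

    # Lower-case the extension, whatever it is
    $name =~ s{\.([^.]+)$}{'.' . lc $1}e;

    return "$image_base/$name";
}

# Sample values, invented for this example
print local_image_path(
    '/home/user/Backgrounds/apod',
    'http://apod.nasa.gov/apod/image/1501/Example_big.JPG'
), "\n";
```

Note that line 157 builds the image URL by joining the base URL and the link, which assumes the link on the page is relative. An absolute link would need different handling.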
176 #
177 # Set up the GET request for the image
178 #
179 $req = HTTP::Request->new( GET => $image_URL );
180
181 #
182 # Download the image to the (possibly renamed) image file
183 #
184 $res = $ua->request( $req, $image_file );
185 if ( $res->is_success ) {
186 print "Downloaded to $image_file\n" if $DEBUG;
187 }
188 else {
189 #
190 # The image download failed
191 #
192 die $res->status_line, " ($image_URL)\n";
193 }
194
Lines 179-193 download the image to a file. If the download fails the script aborts with the HTTP status line.
195 }
196 else {
197 #
198 # We failed to get the web page
199 #
200 die $res->status_line, " ($apod_URL)\n";
201 }
202
203 exit;
204
205 # vim: syntax=perl:ts=8:sw=4:et:ai:tw=78:fo=tcrqn21:fdm=marker
I hope you find the script interesting and/or useful.
HTML::TreeBuilder Perl module http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/TreeBuilder.pm
ImageMagick image manipulation software suite http://www.imagemagick.org/