<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style type="text/css">code{white-space: pre;}</style>
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body>
<h1 id="my-apod-downloader">My APOD Downloader</h1>
<h2 id="astronomy-picture-of-the-day">Astronomy Picture of the Day</h2>
<p>You have probably heard of the <a href="http://apod.nasa.gov/apod/astropix.html">Astronomy Picture of the Day (APOD)</a> site. It has existed since 1995, is provided by <a href="http://en.wikipedia.org/wiki/NASA">NASA</a> and <a href="http://en.wikipedia.org/wiki/Michigan_Technological_University">Michigan Technological University (MTU)</a> and is created and managed by <a href="http://www.mtu.edu/physics/department/faculty/nemiroff/">Robert Nemiroff</a> and <a href="http://antwrp.gsfc.nasa.gov/htmltest/jbonnell/www/bonnell.html">Jerry Bonnell</a>. The FAQ on the site says <em>"The APOD archive contains the largest collection of annotated astronomical images on the internet"</em>.</p>
<h2 id="the-downloader">The Downloader</h2>
<p>Being a KDE user I quite like a moderate amount of bling, and I particularly like to have a picture on my desktop. I like to rotate my wallpaper pictures every so often, so I want to have a collection of images. To this end I download the APOD on my server every day and make the images available through an NFS-mounted volume.</p>
<p>In 2012 I wrote a Perl script to perform the download, using a fairly primitive HTML parsing method. This script has been improved over the intervening years and now uses the Perl module <a href="http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/TreeBuilder.pm"><code>HTML::TreeBuilder</code></a> which I believe is much better at parsing HTML.</p>
<p>The version of the script I use myself also includes the Perl module <code>Image::Magick</code> which interfaces to the awesome <a href="http://www.imagemagick.org/"><code>ImageMagick</code></a> image manipulation software suite. I use this to annotate the downloaded image with the title parsed from the HTML so I know what it is.</p>
<p>The script I am presenting here is called <code>collect_apod_simple</code> and does not use <code>ImageMagick</code>. I chose to omit it because the installation of this suite and the related Perl module can be difficult. Also, I do not feel that the annotation always works as well as it could, and I have not yet found the time to correct this shortcoming.</p>
<p>A version of the more advanced script (called <code>collect_apod</code>) is available in the same place as <code>collect_apod_simple</code> should you wish to give it a try. Both scripts are available on <em>GitLab</em> under the link <a href="https://gitlab.com/davmo/hprmisc" class="uri">https://gitlab.com/davmo/hprmisc</a>.</p>
<h2 id="the-code">The Code</h2>
<p>If you are acquainted with Perl you'll probably find this script quite simple. All it really does is:</p>
<ul>
<li><p>Get or compute the date string for building the APOD URL</p></li>
<li><p>Download the HTML of the selected APOD page</p></li>
<li><p>Look for a link that points to an image file</p></li>
<li><p>Download the image being linked to and save it where requested</p></li>
</ul>
<p>The following is a numbered listing with annotations. There are several comments in the script itself, but the annotations are there to try to make the various sections as clear as possible.</p>
<pre><code>  1  #!/usr/bin/env perl
  2  #===============================================================================
  3  #
  4  #         FILE: collect_apod_simple
  5  #
  6  #        USAGE: ./collect_apod_simple [YYMMDD]
  7  #
  8  #  DESCRIPTION: Downloads the current Astronomy Picture of the Day or that
  9  #               relating to the formatted date provided as an argument. In
 10  #               this context "current" can mean two URLs: .../astropix.html or
 11  #               .../apYYMMDD.html. We now *do not* download the
 12  #               .../astropix.html version since it has a different HTML
 13  #               layout.
 14  #
 15  #      OPTIONS: ---
 16  # REQUIREMENTS: ---
 17  #         BUGS: ---
 18  #        NOTES: Based on 'collect_apod' but without the Image::Magick stuff,
 19  #               for simplicity and for release to the HPR community
 20  #       AUTHOR: Dave Morriss (djm), Dave.Morriss@gmail.com
 21  #      VERSION: 0.0.1
 22  #      CREATED: 2015-01-02 19:58:01
 23  #     REVISION: 2015-01-03 23:00:27
 24  #
 25  #===============================================================================
 26
 27  use 5.010;
 28  use strict;
 29  use warnings;
 30  use utf8;
 31
 32  use LWP::UserAgent;
 33  use DateTime;
 34  use HTML::TreeBuilder 5 -weak;
 35</code></pre>
<hr />
<p>Lines 32-34 load the modules the script uses:</p>
<ul>
<li><a href="http://search.cpan.org/dist/libwww-perl/lib/LWP/UserAgent.pm"><code>LWP::UserAgent</code></a> performs the web downloads</li>
<li><a href="http://search.cpan.org/~drolsky/DateTime-1.18/lib/DateTime.pm"><code>DateTime</code></a> generates and formats the default date</li>
<li><a href="http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/TreeBuilder.pm"><code>HTML::TreeBuilder</code></a> parses HTML</li>
</ul>
<hr />
<pre><code> 36  #
 37  # Version number (manually incremented)
 38  #
 39  our $VERSION = '0.0.1';
 40
 41  #
 42  # Set to 0 to be more silent
 43  #
 44  my $DEBUG = 1;
 45
 46  #
 47  # Script name
 48  #
 49  ( my $PROG = $0 ) =~ s|.*/||mx;
 50
 51  #-------------------------------------------------------------------------------
 52  # Edit this to your needs
 53  #-------------------------------------------------------------------------------
 54  #
 55  # Where the script will download the picture. Edit this to the location you want
 56  #
 57  my $image_base = "$ENV{HOME}/Backgrounds/apod";
 58
 59  #-------------------------------------------------------------------------------
 60  # Nothing needs editing below here
 61  #-------------------------------------------------------------------------------
 62
 63  #
 64  # Get the argument or default it
 65  #
 66  my $arg = shift;
 67  unless ( defined($arg) ) {
 68      #
 69      # APOD wants a date in YYMMDD format
 70      #
 71      my $dt = DateTime->now;
 72      $arg = sprintf( "%02i%02i%02i",
 73          substr( $dt->year, -2 ),
 74          $dt->month, $dt->day );
 75  }
 76
 77  #
 78  # Check the argument looks like a date in YYMMDD format (six digits)
 79  #
 80  die "Usage: $PROG [YYMMDD]\n" unless ( $arg =~ /^\d{6}$/ );
 81</code></pre>
<hr />
<p>Lines 66-80 take the date from the command line or, if none is given, generate today's date in the required format. If the argument is not exactly six digits the script aborts with a usage message.</p>
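<p>As an aside, the default date can also be produced with the core <code>POSIX</code> module rather than <code>DateTime</code>. The following is only a sketch and not part of the script: the helper names <code>today_yymmdd</code> and <code>looks_like_yymmdd</code> are invented here, and it uses local time whereas <code>DateTime-&gt;now</code> returns UTC. The check is a little stricter than the six-digit test but still does not verify month lengths.</p>
<pre><code>use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: today's date as YYMMDD, taken from the local clock
sub today_yymmdd {
    return strftime( '%y%m%d', localtime );
}

# Hypothetical helper: true if the string is six digits with a month of
# 01-12 and a day of 01-31; month lengths are not checked
sub looks_like_yymmdd {
    my ($date) = @_;
    return 0 unless defined $date;
    my ( $mm, $dd ) = $date =~ /\A\d\d(\d\d)(\d\d)\z/ or return 0;
    return 0 if $mm == 0 or $mm > 12;
    return 0 if $dd == 0 or $dd > 31;
    return 1;
}

my $arg = shift // today_yymmdd();
die "Usage: $0 [YYMMDD]\n" unless looks_like_yymmdd($arg);
print "Date string: $arg\n";</code></pre>
<p>Run without an argument this prints today's date string; with an argument such as <code>151301</code> it stops with the usage message because 13 is not a month.</p>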
<hr />
<pre><code> 82  #
 83  # Make an URL depending on the argument
 84  #
 85  my $apod_base = "http://apod.nasa.gov/apod";
 86  my $apod_URL = "$apod_base/ap$arg.html";
 87</code></pre>
<hr />
<p>Lines 85-86 build the APOD URL for the chosen date. For 2015-01-06, for example, this is <code>http://apod.nasa.gov/apod/ap150106.html</code>.</p>
<hr />
<pre><code> 88  #
 89  # General declarations
 90  #
 91  my ( $image_URL, $image_file );
 92  my ( $tree, $title );
 93  my ( $url, $element, $attr, $tag );
 94
 95  #
 96  # Enable Unicode mode
 97  #
 98  binmode STDOUT, ":encoding(UTF-8)";
 99  binmode STDERR, ":encoding(UTF-8)";
100
101  if ($DEBUG) {
102      print "Base URL: $apod_base\n";
103      print "APOD URL: $apod_URL\n";
104      print "Image base: $image_base\n";
105      print "\n";
106  }
107
108  #
109  # Get the HTML page, pretending to be some unknown User Agent
110  #
111  my $ua = LWP::UserAgent->new;
112  $ua->agent("MyApp/0.1");
113
114  my $req = HTTP::Request->new( GET => $apod_URL );
115
116  my $res = $ua->request($req);
117  if ( $res->is_success ) {
118      print "GET request successful\n" if $DEBUG;
119
120      #
121      # Parse the HTML we got back
122      #
123      $tree = HTML::TreeBuilder->new;
124      $tree->parse_content( $res->content_ref );
125</code></pre>
<hr />
<p>Lines 111-116 set up the user agent and the request, then download the APOD web page. If the download was successful the HTML is parsed with <code>HTML::TreeBuilder</code> in lines 123 and 124.</p>
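<p>To try the parsing step without touching the network, a literal string can stand in for the downloaded page. This is a minimal sketch under that assumption: the HTML fragment and its file names are invented for illustration, and <code>new_from_content</code> is used in place of the script's separate <code>new</code> and <code>parse_content</code> calls.</p>
<pre><code>use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Invented stand-in for the page body that LWP would return
my $html = '&lt;html&gt;&lt;head&gt;&lt;title&gt; APOD: 2015 January 6 &lt;/title&gt;&lt;/head&gt;'
    . '&lt;body&gt;&lt;a href="image/1501/example.jpg"&gt;picture&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;';

# Build the tree in one step from the string
my $tree = HTML::TreeBuilder->new_from_content($html);

# Find the title element, if any, and reduce it to trimmed text
my $title_element = $tree->look_down( _tag => 'title' );
my $title = $title_element ? $title_element->as_trimmed_text : '';

print "Found title: $title\n";</code></pre>
<p>The same <code>look_down</code> call is what the script uses, in the next part of the listing, to report the page title in debug mode.</p>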
<hr />
<pre><code>126      #
127      # Get and display the title in debug mode
128      #
129      if ($DEBUG) {
130          if ( $title = $tree->look_down( _tag => 'title' ) ) {
131              $title = $title->as_trimmed_text();
132              print "Found title: $title\n" if $title;
133          }
134      }
135
136      #
137      # Look for the image. This is expected to be the href attribute of an &lt;a&gt;
138      # tag. The image we see on the page is merely a link to this (usually)
139      # larger image.
140      #
141      for ( @{ $tree->extract_links('a') } ) {
142          ( $url, $element, $attr, $tag ) = @$_;
143          if ($DEBUG) {
144              print "Found: $url\n" if $url;
145          }
146          last unless defined($url);
147          last if ( $url =~ /\.(jpg|png)$/i );
148      }
149</code></pre>
<hr />
<p>Lines 141-148 consist of a loop which walks through the links in the parsed HTML, looking at each &lt;a&gt; tag in turn. The loop ends at the first link whose URL ends in <code>.jpg</code> or <code>.png</code>.</p>
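<p>The link search can be tried on its own with a small piece of HTML. The sketch below is not the script's code: <code>first_image_link</code> is a hypothetical helper, the HTML and file names are made up, and unlike the script's loop it simply skips links that do not match rather than leaving the last URL in a shared variable.</p>
<pre><code>use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: return the first link target ending in .jpg or .png,
# or nothing if there is no such link
sub first_image_link {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Each entry is [ link, element, attribute name, tag name ]
    for my $link ( @{ $tree->extract_links('a') } ) {
        my ( $url, $element, $attr, $tag ) = @$link;
        return $url if defined $url and $url =~ /\.(?:jpg|png)\z/i;
    }
    return;
}

# Invented fragment: one ordinary link followed by one image link
my $html = '&lt;a href="archivepix.html"&gt;Archive&lt;/a&gt;'
    . '&lt;a href="image/1501/example.JPG"&gt;picture&lt;/a&gt;';

my $found = first_image_link($html);
print defined $found ? "Image link: $found\n" : "No image link\n";</code></pre>
<p>Returning nothing when no link matches lets the caller make the "no image today" decision with a single <code>defined</code> test.</p>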
<hr />
<pre><code>150      #
151      # Abort if no image (it might be a video or a GIF)
152      #
153      die "Image URL not found\n"
154          unless defined($url)
155          &amp;&amp; $url =~ /\.(jpg|png)$/i;
156</code></pre>
<hr />
<p>Lines 153-155 check that an image URL was actually found. On some days the APOD site hosts a YouTube video or some other animated display instead. The script is not interested in these since they are no use as wallpaper.</p>
<hr />
<pre><code>157      $image_URL = "$apod_base/$url";
158
159      #
160      # Extract the final part of the URL for the file name. We usually get
161      # a JPEG, sometimes with a shouty extension, which we change.
162      #
163      ( $image_file = $image_URL ) =~ s|.*/||mx;
164      ( $image_file = "$image_base/$image_file" ) =~ s/JPG$/jpg/mx;
165
166      if ($DEBUG) {
167          print "Image URL: $image_URL\n";
168          print "Image file: $image_file\n";
169      }
170
171      #
172      # Abort if the file already exists (the script already ran?)
173      #
174      die "File $image_file already exists\n" if ( -f $image_file );
175</code></pre>
<hr />
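<p>Lines 157-174 build the full image URL and the name of the file that will hold the image, changing an upper-case <code>JPG</code> extension to lower case. The script aborts if that file already exists.</p>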
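<p>The file name handling can also be written with the core <code>File::Basename</code> module instead of a substitution. This is just an alternative sketch: <code>local_image_path</code> is a hypothetical helper, and the directory and URL passed to it are made up for the example.</p>
<pre><code>use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: build the local path for an image URL, lower-casing
# a trailing "JPG" extension as the script does
sub local_image_path {
    my ( $image_base, $image_url ) = @_;
    my $name = basename($image_url);    # final component of the URL path
    $name =~ s/JPG\z/jpg/;
    return "$image_base/$name";
}

# Invented directory and URL, for illustration only
print local_image_path( '/tmp/apod',
    'http://apod.nasa.gov/apod/image/1501/example.JPG' ), "\n";</code></pre>
<p>Keeping this in a small function makes it easy to check the renaming rule without downloading anything.</p>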
<hr />
<pre><code>176      #
177      # Set up the GET request for the image
178      #
179      $req = HTTP::Request->new( GET => $image_URL );
180
181      #
182      # Download the image to the (possibly renamed) image file
183      #
184      $res = $ua->request( $req, $image_file );
185      if ( $res->is_success ) {
186          print "Downloaded to $image_file\n" if $DEBUG;
187      }
188      else {
189          #
190          # The image download failed
191          #
192          die $res->status_line, " ($image_URL)\n";
193      }
194</code></pre>
<hr />
<p>Lines 179-193 download the image to the file, aborting with the HTTP status line if the download fails.</p>
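<p><code>LWP::UserAgent</code> also offers a shorter way to save a response body straight to a file, using the <code>:content_file</code> option of <code>get</code>. The sketch below shows that form as an alternative to the script's request object; <code>save_url</code> is a hypothetical helper, the 30 second timeout is an arbitrary choice, and the download only runs when a URL and a file name are given on the command line.</p>
<pre><code>use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: save a URL to a file and return the HTTP::Response
sub save_url {
    my ( $ua, $url, $file ) = @_;
    return $ua->get( $url, ':content_file' => $file );
}

# Only attempt a download when called as: script URL FILE
if ( @ARGV == 2 ) {
    my ( $url, $file ) = @ARGV;
    my $ua = LWP::UserAgent->new( timeout => 30 );
    my $res = save_url( $ua, $url, $file );
    die $res->status_line, " ($url)\n" unless $res->is_success;
    print "Downloaded to $file\n";
}</code></pre>
<p>As in the script, the response should still be checked with <code>is_success</code>, since a failed request does not raise an error by itself.</p>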
<hr />
<pre><code>195  }
196  else {
197      #
198      # We failed to get the web page
199      #
200      die $res->status_line, " ($apod_URL)\n";
201  }
202
203  exit;
204
205  # vim: syntax=perl:ts=8:sw=4:et:ai:tw=78:fo=tcrqn21:fdm=marker</code></pre>
<p>I hope you find the script interesting and/or useful.</p>
<h2 id="links">Links</h2>
<ul>
<li>Wikipedia entry <a href="http://en.wikipedia.org/wiki/Astronomy_Picture_of_the_Day" class="uri">http://en.wikipedia.org/wiki/Astronomy_Picture_of_the_Day</a></li>
<li>Astronomy Picture of the Day <a href="http://apod.nasa.gov/apod/astropix.html" class="uri">http://apod.nasa.gov/apod/astropix.html</a></li>
<li>NASA <a href="http://en.wikipedia.org/wiki/NASA" class="uri">http://en.wikipedia.org/wiki/NASA</a></li>
<li>Michigan Technological University (MTU) <a href="http://en.wikipedia.org/wiki/Michigan_Technological_University" class="uri">http://en.wikipedia.org/wiki/Michigan_Technological_University</a></li>
<li>Robert Nemiroff <a href="http://www.mtu.edu/physics/department/faculty/nemiroff/" class="uri">http://www.mtu.edu/physics/department/faculty/nemiroff/</a></li>
<li>Jerry Bonnell <a href="http://antwrp.gsfc.nasa.gov/htmltest/jbonnell/www/bonnell.html" class="uri">http://antwrp.gsfc.nasa.gov/htmltest/jbonnell/www/bonnell.html</a></li>
<li><code>HTML::TreeBuilder</code> Perl module <a href="http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/TreeBuilder.pm" class="uri">http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/TreeBuilder.pm</a></li>
<li><code>ImageMagick</code> image manipulation software suite <a href="http://www.imagemagick.org/" class="uri">http://www.imagemagick.org/</a></li>
<li><em>GitLab</em> link <a href="https://gitlab.com/davmo/hprmisc" class="uri">https://gitlab.com/davmo/hprmisc</a></li>
</ul>
<!--
vim: syntax=markdown:ts=8:sw=4:ai:et:tw=78:fo=tcqn:fdm=marker
-->
</body>
</html>