Move under www to ease rsync

2025-10-29 10:51:15 +01:00
parent 2bb22c7583
commit 30ad62e938
890 changed files with 0 additions and 0 deletions

@@ -0,0 +1,15 @@
#!/usr/bin/awk -f
{
    a[l] = $0
    l++
    print NR" "$0
}
END{
    print "Numeric subscripts:"
    for (i = l - 1; i >= 0; i--)
        print i": "a[i]
    print "Actual subscripts:"
    for (i in a)
        print i": "a[i]
}

@@ -0,0 +1,12 @@
#!/usr/bin/awk -f
{
    lines[NR] = $0
}
END{
    for (i in lines) {
        split(lines[i],flds,/ *, */,seps)
        for (j in flds)
            printf "|%s| (%s)\n",flds[j],seps[j]
    }
}

Binary file not shown.

@@ -0,0 +1,325 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="author" content="Dave Morriss">
<title>Gnu Awk - Part 10 (HPR Show 2526)</title>
<style type="text/css">code{white-space: pre;}</style>
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="stylesheet" href="http://hackerpublicradio.org/css/hpr.css">
</head>
<body id="home">
<div id="container" class="shadow">
<header>
<h1 class="title">Gnu Awk - Part 10 (HPR Show 2526)</h1>
<h2 class="author">Dave Morriss</h2>
<hr/>
</header>
<main id="maincontent">
<article>
<header>
<h1>Table of Contents</h1>
<nav id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#a-bit-more-about-arrays">A bit more about arrays</a><ul>
<li><a href="#a-recap">A recap</a></li>
<li><a href="#using-numbers-as-array-subscripts">Using numbers as array subscripts</a></li>
<li><a href="#what-if-the-subscript-is-uninitialised">What if the subscript is uninitialised?</a></li>
<li><a href="#deleting-array-elements">Deleting array elements</a></li>
<li><a href="#splitting-strings-into-arrays">Splitting strings into arrays</a><ul>
<li><a href="#split"><code>split</code></a></li>
</ul></li>
</ul></li>
<li><a href="#real-world-examples">Real-world Examples</a><ul>
<li><a href="#scanning-a-log-file">Scanning a log file</a></li>
<li><a href="#parsing-a-tab-delimited-file">Parsing a tab-delimited file</a></li>
</ul></li>
<li><a href="#links">Links</a></li>
</ul>
</nav>
</header>
<h2 id="introduction">Introduction</h2>
<p>This is the tenth episode of the "<a href="http://hackerpublicradio.org/eps/hpr0094" title="Learning Awk">Learning Awk</a>" series which is being produced by <a href="http://hackerpublicradio.org/correspondents.php?hostid=300" title="Mr. Young">Mr. Young</a> and myself.</p>
<p>In this episode I want to talk more about the use of arrays in GNU Awk and then I want to examine some real-world examples of the use of <code>awk</code>.</p>
<h2 id="a-bit-more-about-arrays">A bit more about arrays</h2>
<h3 id="a-recap">A recap</h3>
<p>We know from earlier in the series that arrays in <code>awk</code> are <em>associative</em>. That is, the index used to refer to an element is a string. The contents of each array element may be a number or a string (or nothing). An associative array is also called a <em>hash</em>. An array <em>index</em> is also referred to as a <em>subscript</em>.</p>
<p>We also know that array elements are referred to with expressions such as:</p>
<p><em>array</em>[<em>index</em>]</p>
<p>so, <code>fruit[&quot;apple&quot;]</code> means the element of the array <code>fruit</code> which is indexed by the string <code>&quot;apple&quot;</code>. The index value is actually an expression, so it can be arbitrarily complex, such as:</p>
<pre><code>ind1 = &quot;app&quot;
ind2 = &quot;le&quot;
print fruit[ind1 ind2]</code></pre>
<p>Here the two strings <code>&quot;app&quot;</code> and <code>&quot;le&quot;</code> are concatenated to make the index <code>&quot;apple&quot;</code>.</p>
<p>We saw earlier in the series that the presence of an array element is checked with an expression using:</p>
<p><em>index</em> <code>in</code> <em>array</em></p>
<p>So an example might be:</p>
<pre><code>if (&quot;apple&quot; in fruit)
    print fruit[&quot;apple&quot;]</code></pre>
<p>Looping through the elements of an array is achieved with the specialised <code>for</code> statement as we saw in an earlier episode:</p>
<pre><code>for (ind in fruit)
    print fruit[ind]</code></pre>
<h3 id="using-numbers-as-array-subscripts">Using numbers as array subscripts</h3>
<p>In <code>awk</code> array subscripts are always strings. If a number is used then this is converted into a string. This is not a problem with statements like the following:</p>
<pre><code>data[42] = 8388607</code></pre>
<p>The integer number <code>42</code> is converted into the string <code>&quot;42&quot;</code> and everything works as normal.</p>
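<p>As a small illustration (a sketch, not part of the original show notes), an element stored with a numeric subscript can be fetched with the equivalent string subscript:</p>
<pre><code>awk &#39;BEGIN{ data[42] = 8388607; print data[&quot;42&quot;]; print (&quot;42&quot; in data) }&#39;</code></pre>
<p>This should print <code>8388607</code> followed by <code>1</code>, because both forms name the same element.</p>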
<p>However, GNU <code>awk</code> also accepts octal and hexadecimal constants in the program text. For example, in common with many other programming languages, a leading zero denotes an octal number, making <code>data[052]</code> the same as <code>data[42]</code> (because decimal 42 is octal 52).</p>
<p>Also <code>data[0x2A]</code> is the same as <code>data[42]</code> because hexadecimal <code>2A</code> is decimal 42.</p>
<p>The way in which numbers are converted into strings in <code>awk</code> is important to understand. A built-in variable called <code>CONVFMT</code> defines the conversion for floating point numbers. Behind the scenes the function <code>sprintf</code> is used. (This is like <code>printf</code> which we saw in episode 9, but it returns a formatted string rather than printing anything.)</p>
<p>The default value for <code>CONVFMT</code> is <code>&quot;%.6g&quot;</code> which means (according to the manual) to <em>print a number in either scientific notation or in floating-point notation, whichever uses fewer characters</em>. The <code>.6</code> is the precision, which for the <code>g</code> format means that at most six significant digits are produced. Integer values are a special case: they are always converted as integers, whatever the value of <code>CONVFMT</code>. The setting of <code>CONVFMT</code> can be adjusted in the script if desired.</p>
<p>Knowing this, the index can be determined in cases like this:</p>
<pre><code>$ awk &#39;BEGIN{ x=100/3; data[x]=&quot;custard&quot;; print x, data[x] }&#39;
33.3333 custard</code></pre>
<p>However, things get a little weird in this case:</p>
<pre><code>$ awk &#39;BEGIN{ x=1000000000/3; data[x]=&quot;prunes&quot;; print x, data[x] }&#39;
3.33333e+08 prunes</code></pre>
<p>The thing to be careful of is adjusting <code>CONVFMT</code> between storing and retrieving an array element!</p>
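<p>The following sketch shows the kind of problem this can cause. The value <code>&quot;%.2f&quot;</code> is just an arbitrary alternative format chosen for the demonstration:</p>
<pre><code>awk &#39;BEGIN{
    x = 100/3
    data[x] = &quot;custard&quot;    # stored under the subscript &quot;33.3333&quot;
    CONVFMT = &quot;%.2f&quot;       # x now converts to &quot;33.33&quot; instead
    if (x in data)
        print &quot;found&quot;
    else
        print &quot;not found&quot;
    for (i in data)
        print &quot;existing subscript: &quot; i
}&#39;</code></pre>
<p>The element is still in the array under its original subscript, but the lookup uses the newly formatted string and so should report <code>not found</code>.</p>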
<h3 id="what-if-the-subscript-is-uninitialised">What if the subscript is uninitialised?</h3>
<p>The <a href="https://www.gnu.org/software/gawk/manual/html_node/index.html" title="GNU Awk User&#39;s Guide">GNU Awk User's Guide</a> mentions this. An uninitialised variable treated as a number is zero, but treated as a string is a null string <code>&quot;&quot;</code>. The following script is in the file <a href="hpr2526_awk10_ex1.awk">awk10_ex1.awk</a> which may be downloaded:</p>
<pre><code>#!/usr/bin/awk -f
{
    a[l] = $0
    l++
    print NR&quot; &quot;$0
}
END{
    print &quot;Numeric subscripts:&quot;
    for (i = l - 1; i &gt;= 0; i--)
        print i&quot;: &quot;a[i]
    print &quot;Actual subscripts:&quot;
    for (i in a)
        print i&quot;: &quot;a[i]
}</code></pre>
<p>This can lead to unexpected results:</p>
<pre><code>$ echo -e &quot;A\nB\nC&quot; | ./awk10_ex1.awk
1 A
2 B
3 C
Numeric subscripts:
2: C
1: B
0:
Actual subscripts:
: A
0:
1: B
2: C</code></pre>
<p>The variable <code>l</code> is used as the index to the array <code>a</code>. It is uninitialised the first time it is used so the string it provides is an empty string, which is a valid array index. Then it is incremented and from then on it takes numeric values. The main rule prints each line as it receives it just to prove it's actually seeing all three lines.</p>
<p>In the <code>END</code> rule the array is printed (in reverse order) using numeric indexes 2, 1 and zero. There is nothing in element zero.</p>
<p>Then the array is printed again using the "<em>index in array</em>" method. Notice how the letter <code>A</code> is there with an empty index. Notice also that there is an element with index zero too. That was created in the previous loop since accessing a non-existent array element creates it!</p>
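<p>The difference between testing for an element and referring to it can be seen in this short sketch. The function <code>count</code> is a hypothetical helper written for the demonstration, not something from the original script:</p>
<pre><code>awk &#39;
function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }
BEGIN{
    if (0 in a) print &quot;not reached&quot;     # a membership test creates nothing
    print count(a)
    if (a[0] == &quot;&quot;) print &quot;a[0] is empty&quot;   # a reference creates the element
    print count(a)
}&#39;</code></pre>
<p>This should print <code>0</code>, then <code>a[0] is empty</code>, then <code>1</code>.</p>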
<p>Had the two lines in the main rule been replaced as shown the outcome would have been more predictable:</p>
<pre><code> a[l] = $0
l++</code></pre>
<p>Replacement:</p>
<pre><code> a[l++] = $0</code></pre>
<p>Remembering that <code>l++</code> returns the value of <code>l</code> then increments it, this forces the first value returned to be zero because it is a numeric expression.</p>
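<p>A one-line sketch of the replacement in use, fed with the same three letters as before (<code>printf</code> is used here rather than <code>echo -e</code> purely for portability):</p>
<pre><code>printf &#39;A\nB\nC\n&#39; | awk &#39;{ a[l++] = $0 } END{ for (i = 0; i &lt; l; i++) print i&quot;: &quot;a[i] }&#39;</code></pre>
<p>This time the subscripts should run from 0 to 2 with no empty or spurious elements.</p>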
<h3 id="deleting-array-elements">Deleting array elements</h3>
<p>There is a <code>delete</code> statement which can delete a given array element. For example, in the above demonstration of subscript issues, the spurious element could have been deleted with:</p>
<pre><code>delete a[0]</code></pre>
<p>The generic format is:</p>
<p><code>delete</code> <em>array</em>[<em>index</em>]</p>
<p>We already saw that array elements with empty subscripts or empty values can exist in an array, so we know that making an element empty does not delete it.</p>
<p>An entire array can be deleted with the generic statement:</p>
<p><code>delete</code> <em>array</em></p>
<p>The array remains declared but is empty, so re-using its name as an ordinary (scalar) variable after using <code>delete</code> on it will result in an error.</p>
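<p>Here is a minimal sketch of both forms of <code>delete</code>. The array contents are made up and <code>count</code> is a hypothetical helper which counts the elements:</p>
<pre><code>awk &#39;
function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }
BEGIN{
    a[&quot;x&quot;] = 1; a[&quot;y&quot;] = 2
    a[&quot;y&quot;] = &quot;&quot;          # emptying the value does not remove the element
    print count(a)
    delete a[&quot;y&quot;]        # this does
    print count(a)
    delete a             # remove everything that is left
    print count(a)
}&#39;</code></pre>
<p>The three counts printed should be 2, 1 and 0.</p>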
<h3 id="splitting-strings-into-arrays">Splitting strings into arrays</h3>
<p>There are two functions in <code>awk</code> which generate arrays from strings by splitting them up by some criterion. The functions are: <code>split</code> and <code>patsplit</code>. We will look at <code>split</code> in this episode and <code>patsplit</code> in a subsequent one.</p>
<h4 id="split"><code>split</code></h4>
<p>The general format of the <code>split</code> function is:</p>
<p><code>split</code>(<em>string</em>, <em>array</em> [ , <em>fieldsep</em> [ , <em>seps</em> ] ])</p>
<p>The first two arguments are mandatory but the last two are optional.</p>
<p>The function divides <em>string</em> into pieces separated by <em>fieldsep</em> and stores the pieces in <em>array</em> and the separator strings in the <em>seps</em> array (a GNU Awk extension).</p>
<p>Successive pieces are placed in <em>array</em><code>[1]</code>, <em>array</em><code>[2]</code>, and so on. The <em>array</em> is emptied before the splitting begins.</p>
<p>If <em>fieldsep</em> is omitted then the value of the built-in variable <code>FS</code> is used, so <code>split</code> can be seen as a method of generating fields from a string in a similar way to the main field processing that <code>awk</code> performs. If <em>fieldsep</em> is provided then it is treated as a regular expression (again in the same way as <code>FS</code>).</p>
<p>The <em>seps</em> array is used to hold each of the separators. If <em>fieldsep</em> is a single space then any leading white space goes into <em>seps</em><code>[0]</code> and any trailing white space goes into <em>seps</em><code>[n]</code>, where <code>n</code> is the number of elements in <em>array</em>.</p>
<p>The function <code>split</code> returns the number of pieces placed in <em>array</em>.</p>
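<p>As a minimal sketch of the three-argument form, the return value can be used to walk through the pieces in order. The date string and the array name <code>parts</code> are just made up for the illustration:</p>
<pre><code>awk &#39;BEGIN{
    n = split(&quot;2018/02/19&quot;, parts, &quot;/&quot;)
    print n
    for (i = 1; i &lt;= n; i++)
        print i&quot;: &quot;parts[i]
}&#39;</code></pre>
<p>This should report 3 pieces, stored in <code>parts[1]</code> to <code>parts[3]</code>. Using a counting loop rather than <code>for (i in parts)</code> guarantees the pieces come out in their original order.</p>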
<h5 id="example-of-using-split">Example of using <code>split</code></h5>
<p>The following script is in the file <a href="hpr2526_awk10_ex2.awk">awk10_ex2.awk</a> which may be downloaded:</p>
<pre><code>#!/usr/bin/awk -f
{
    lines[NR] = $0
}
END{
    for (i in lines) {
        split(lines[i],flds,/ *, */,seps)
        for (j in flds)
            printf &quot;|%s| (%s)\n&quot;,flds[j],seps[j]
    }
}</code></pre>
<p>It reads lines into an array called <code>lines</code> using the record number as the index. In the <code>END</code> rule it processes this array, splitting each line into another array called <code>flds</code> and the separators into an array called <code>seps</code>.</p>
<p>The <em>fieldsep</em> value is a regular expression consisting of a comma surrounded by any number of spaces. The <code>flds</code> array is printed in delimiters to demonstrate that any leading and trailing spaces have been removed. The <code>seps</code> array is appended to each output line enclosed in parentheses so you can see what was captured there.</p>
<p>Here is what happens when the script is run:</p>
<pre><code>$ echo -e &quot;A,B,C\nD, E ,F&quot; | ./awk10_ex2.awk
|A| (,)
|B| (,)
|C| ()
|D| (, )
|E| ( ,)
|F| ()</code></pre>
<h2 id="real-world-examples">Real-world Examples</h2>
<p>The following example scripts are not specifically about the use of arrays in <code>awk</code>. This is more of an attempt to demonstrate some real-world <code>awk</code> scripts for reference.</p>
<h3 id="scanning-a-log-file">Scanning a log file</h3>
<p>I have a script I wrote to add tags and summaries to HPR episodes that have none. I seem to mention this project every month on the Community News show! The script receives email messages with updates, and keeps a log with lines that look like this as it processes them:</p>
<pre><code>2018/02/19 04:17:21 [INFO] Moving /home/dave/MailSpool/episode-736.eml to &#39;processed&#39;
2018/02/19 04:17:21 [INFO] 736:tags:summary</code></pre>
<p><small>Note: if you are wondering about the times, they are local to the server on which the script is run, which is based in California, USA. I run things from the UK timezone (UTC or UTC+1).</small></p>
<p>I like to add a report on the number of tags and summaries processed each month to the Community News show notes, so I wanted to scan this log file for the month's total.</p>
<p>Originally I used a pipeline with <code>grep</code> and <code>wc</code> but the task is well suited to <code>awk</code>. This was my solution (with added line numbers for reference):</p>
<pre><code> 1: awk &#39;
 2:     BEGIN{
 3:         re = &quot;^&quot; strftime(&quot;%Y/%m/&quot;) &quot;.. .* [0-9]{1,4}:&quot;
 4:         count = 0
 5:     }
 6:     $0 ~ re {
 7:         printf &quot;%02d: %s\n&quot;,++count,$0
 8:     }
 9:     END{
10:         print &quot;Additions&quot;,count
11:     }
12: &#39; process_mail_tags.log</code></pre>
<ul>
<li>In the <code>BEGIN</code> rule (lines 2-5) a regular expression is defined in the variable <code>re</code>.
<ul>
<li>This starts with a <code>^</code> character which anchors the expression to the start of the line.</li>
<li>This is followed by part of the date generated with the built-in function <code>strftime</code>. Here we generate the current year, a slash, the current month number and another slash.</li>
<li>Two dots follow which cater for the day number, then there is a space and <code>.*</code> meaning zero or more characters.</li>
<li>This is followed by a space then between one and four digits. This matches the show number after the <code>[INFO]</code> part.</li>
<li>The expression ends with a colon which matches the one after the show number.</li>
</ul></li>
<li><p>In the same rule a variable <code>count</code> is initialised to zero (not strictly necessary but good programming practice).</p></li>
<li><p>The main rule for processing the input file (lines 6-8) matches each line against the regular expression. If it matches, the line is printed preceded by the new value of <code>count</code> (which is incremented before being printed).</p></li>
<li><p>The <code>END</code> rule (lines 9-11) prints the final value of <code>count</code>.</p></li>
</ul>
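<p>To see the regular expression at work without needing the real log file, here is a sketch which feeds in the two sample log lines shown earlier. Two things are assumptions made for this demonstration: the month is passed in through a hypothetical variable called <code>month</code> rather than coming from <code>strftime</code>, and the <code>{1,4}</code> interval is spelled out longhand so that the sketch does not depend on interval support:</p>
<pre><code>printf &#39;%s\n&#39; \
    &quot;2018/02/19 04:17:21 [INFO] Moving /home/dave/MailSpool/episode-736.eml to &#39;processed&#39;&quot; \
    &quot;2018/02/19 04:17:21 [INFO] 736:tags:summary&quot; |
awk -v month=&quot;2018/02/&quot; &#39;
    BEGIN{
        # Same shape as the expression above, with a fixed month
        re = &quot;^&quot; month &quot;.. .* [0-9][0-9]?[0-9]?[0-9]?:&quot;
    }
    $0 ~ re { count++ }
    END{ print &quot;Additions&quot;, count + 0 }
&#39;</code></pre>
<p>Only the second line has a space, a show number and a colon after the date, so the result should be <code>Additions 1</code>. Passing the month in this way would also allow a report for a month other than the current one.</p>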
<p>Running this towards the end of February 2018 we get:</p>
<pre><code>01: 2018/02/05 01:19:09 [INFO] 788:summary:tags
02: 2018/02/05 05:17:27 [INFO] 1683:tags
03: 2018/02/05 06:15:16 [INFO] 1663:tags
04: 2018/02/05 06:15:16 [INFO] 1666:tags
05: 2018/02/05 06:15:16 [INFO] 1668:tags
06: 2018/02/05 06:15:16 [INFO] 1669:tags
07: 2018/02/05 06:22:43 [INFO] 1693:tags
08: 2018/02/05 06:52:13 [INFO] 1550:tags
09: 2018/02/05 06:52:13 [INFO] 1551:tags
10: 2018/02/05 06:52:13 [INFO] 1552:tags
11: 2018/02/05 06:52:13 [INFO] 1554:tags
12: 2018/02/05 06:52:13 [INFO] 1556:tags
13: 2018/02/05 06:52:13 [INFO] 1559:tags
14: 2018/02/05 14:33:46 [INFO] 1540:tags
15: 2018/02/05 14:33:46 [INFO] 1541:tags
16: 2018/02/05 14:33:46 [INFO] 1543:tags
17: 2018/02/05 14:33:46 [INFO] 1547:tags
18: 2018/02/05 14:33:46 [INFO] 1549:tags
19: 2018/02/17 11:44:56 [INFO] 798:tags:summary
20: 2018/02/18 02:55:53 [INFO] 0021:summary:tags
21: 2018/02/19 04:17:21 [INFO] 736:tags:summary
22: 2018/02/25 03:32:45 [INFO] 1480:tags
23: 2018/02/25 03:32:45 [INFO] 1489:summary:tags
Additions 23</code></pre>
<p>Of course, I would not run this <code>awk</code> script on the command line as shown here. I'd place it in a Bash script to simplify the typing, but I will not demonstrate that here.</p>
<h3 id="parsing-a-tab-delimited-file">Parsing a tab-delimited file</h3>
<p>I am currently looking after the process of uploading HPR episodes to the Internet Archive (IA) - <code>archive.org</code>. To manage this I use a Python library called <em>internetarchive</em> and a command line tool called <em>ia</em>. The <code>ia</code> tool lets me interrogate the archive, returning data about shows that have been uploaded as well as allowing me to upload and change them.</p>
<p>In some cases I find it necessary to replace the audio formats which have been generated automatically by <code>archive.org</code> with copies generated by the HPR software. This is because we want to ensure these audio files contain metadata (<em>audio tags</em>). The shows generated by <code>archive.org</code> are converted from the WAV file we upload in a process referred to as <em>derivation</em>, and contain no metadata.</p>
<p>I needed to be able to tell which HPR episodes had derived audio and which had original audio. The <code>ia</code> tool could do this but in a format which was difficult to parse, so I wrote an <code>awk</code> script to do it for me.</p>
<p>The data I needed to parse consists of tab-delimited lines. The first line contains the names of all of the columns. However, some of the columns were not always present or were in different orders, so this required a little more work to parse.</p>
<p>Here is a sample of the input file format:</p>
<pre><code>$ ia list -va hpr2450 | head -3
name sha1 format btih height source length width mtime crc32 size bitrate original md5
hpr2450.afpk b71f63ef1e8c359b3f0f7a546835919a8a7889da Columbia Peaks derivative 1513450216 656e162d 107184 hpr2450.wav 0ace3e0ae96510a85bee6dda3b69ab78
hpr2450.flac cd917c46eaf22f0ec0253bd018b475380e83ce7e Flac 0 derivative 738.08 0 1515280267 e7934979 27556168 hpr2450.wav 7a9b716932b33a2e6713ae3f4e23d24d</code></pre>
<p>The following script, called <code>parse_ia_audio.awk</code>, was what I produced to parse this data.</p>
<pre><code>#!/usr/bin/awk -f
#-------------------------------------------------------------------------------
# Process tab-delimited data from the Internet Archive with a field name
# header, reporting particular fields. The algorithm is general though this
# instance is specific.
#
# In this case we extract only the audio files
#
# This script is meant to be used thus:
#      $ ia list -va hpr2450 | ./parse_ia_audio.awk
#      hpr2450.flac    derivative
#      hpr2450.mp3     derivative
#      hpr2450.ogg     derivative
#      hpr2450.opus    original
#      hpr2450.spx     original
#      hpr2450.wav     original
#
#-------------------------------------------------------------------------------
BEGIN {
    FS = &quot;\t&quot;
}
#
# Read the header line and collect the fields into an array such that a search
# by field name returns the field number.
#
NR == 1 {
    for (i = 1; i &lt;= NF; i++) {
        fld[$i] = i
    }
}
#
# Read the rest of the data, reporting only the lines relating to audio files
# and print the fields &#39;name&#39; and &#39;source&#39;
#
NR &gt; 1 &amp;&amp; $(fld[&quot;name&quot;]) ~ /[^.]\.(flac|mp3|ogg|opus|spx|wav)$/ {
    printf &quot;%-15s %s\n&quot;,$(fld[&quot;name&quot;]),$(fld[&quot;source&quot;])
}</code></pre>
<p>The <code>BEGIN</code> rule defines the field delimiter as the TAB character.</p>
<p>The first rule runs only when the first record is encountered. This is the header with the names of the columns (fields). A <code>for</code> loop scans the fields which have been split up by <code>awk</code>'s usual field splitting. The fields are named <code>$1</code>, <code>$2</code> etc. The variable <code>i</code> increments from 1 to however many fields there are in the record and stores the field numbers in the array <code>fld</code> indexed by the contents of the field.</p>
<p>The end result will be:</p>
<pre><code>fld[&quot;name&quot;] = 1
fld[&quot;sha1&quot;] = 2
fld[&quot;format&quot;] = 3
etc</code></pre>
<p>The second rule is invoked if two conditions are met:</p>
<ul>
<li>The record number is greater than 1</li>
<li>The field numbered whatever the header <code>&quot;name&quot;</code> returned (1 in the example) ends with one of <code>flac</code>, <code>mp3</code>, <code>ogg</code>, <code>opus</code>, <code>spx</code>, <code>wav</code></li>
</ul>
<p>This rule prints the fields indexed by the column names <code>&quot;name&quot;</code> and <code>&quot;source&quot;</code>. The first comment in the script shows what this will look like.</p>
<p>Note the use of expressions like:</p>
<pre><code>$(fld[&quot;name&quot;])</code></pre>
<p>Here <code>awk</code> will find the value stored in <code>fld[&quot;name&quot;]</code> (1 in the example data) and will reference the field called <code>$(1)</code>, which is another way of writing <code>$1</code>. The parentheses are necessary to remove ambiguity.</p>
<p>So, the script is just printing columns for certain selected lines, but is able to cope with the columns being in different positions at different times because it prints them "by name".</p>
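<p>The "by name" idea can be tried out with a cut-down sketch. The shell function <code>by_name</code> and the two tiny tab-delimited samples are made up for this demonstration. They are not real output from the <code>ia</code> tool:</p>
<pre><code># Hypothetical helper: print the &#39;name&#39; and &#39;source&#39; columns wherever they are
by_name() {
    awk -F &#39;\t&#39; &#39;
        NR == 1 { for (i = 1; i &lt;= NF; i++) fld[$i] = i; next }
        { printf &quot;%s %s\n&quot;, $(fld[&quot;name&quot;]), $(fld[&quot;source&quot;]) }
    &#39;
}

# The same two columns in different positions
printf &#39;name\tsize\tsource\nhpr2450.wav\t107184\toriginal\n&#39; | by_name
printf &#39;source\tname\nhpr2450.wav.tmp\n&#39; &gt; /dev/null
printf &#39;source\tname\noriginal\thpr2450.wav\n&#39; | by_name</code></pre>
<p>Both calls to <code>by_name</code> should print <code>hpr2450.wav original</code>, even though the columns arrive in a different order each time.</p>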
<p>Most of the queries handled by the Internet Archive API return JSON-format results (not something that <code>awk</code> can easily parse), but for some reason this one returns a varying tab-delimited file. Still, <code>awk</code> was able to come to the rescue!</p>
<h2 id="links">Links</h2>
<ul>
<li><a href="https://www.gnu.org/software/gawk/manual/html_node/index.html"><em>GNU Awk User's Guide</em></a></li>
<li>Previous shows in this series on HPR:
<ul>
<li><a href="http://hackerpublicradio.org/eps/hpr2114">"<em>Gnu Awk - Part 1</em>"</a> - episode 2114</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2129">"<em>Gnu Awk - Part 2</em>"</a> - episode 2129</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2143">"<em>Gnu Awk - Part 3</em>"</a> - episode 2143</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2163">"<em>Gnu Awk - Part 4</em>"</a> - episode 2163</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2184">"<em>Gnu Awk - Part 5</em>"</a> - episode 2184</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2238">"<em>Gnu Awk - Part 6</em>"</a> - episode 2238</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2330">"<em>Gnu Awk - Part 7</em>"</a> - episode 2330</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2438">"<em>Gnu Awk - Part 8</em>"</a> - episode 2438</li>
<li><a href="http://hackerpublicradio.org/eps/hpr2476">"<em>Gnu Awk - Part 9</em>"</a> - episode 2476</li>
</ul></li>
<li>Resources:
<ul>
<li><a href="hpr2526_full_shownotes.epub">ePub version of these notes</a></li>
<li><a href="hpr2526_full_shownotes.pdf">PDF version of these notes</a></li>
<li><a href="hpr2526_awk10_ex1.awk">awk10_ex1.awk</a></li>
<li><a href="hpr2526_awk10_ex2.awk">awk10_ex2.awk</a></li>
</ul></li>
</ul>
</article>
</main>
</div>
</body>
</html>

Binary file not shown.