<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="author" content="Dave Morriss">
<title>Gnu Awk - Part 2 (HPR Show 2129)</title>
<style type="text/css">code{white-space: pre;}</style>
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link rel="stylesheet" href="http://hackerpublicradio.org/css/hpr.css">
</head>
<body id="home">
<div id="container" class="shadow">
<header>
<h1 class="title">Gnu Awk - Part 2 (HPR Show 2129)</h1>
<h2 class="author">Dave Morriss</h2>
<hr/>
</header>
<main id="maincontent">
<article>
<header>
<h1>Table of Contents</h1>
<nav id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#simple-awk-usage-recap">Simple Awk Usage Recap</a><ul>
<li><a href="#invoking-awk">Invoking Awk</a></li>
<li><a href="#what-awk-does">What Awk does</a></li>
<li><a href="#awk-program">Awk program</a></li>
</ul></li>
<li><a href="#more-about-fields-and-records">More about fields and records</a></li>
<li><a href="#more-about-printing">More about printing</a></li>
<li><a href="#more-about-awk-programs">More about Awk programs</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#links">Links</a></li>
</ul>
</nav>
</header>
<h2 id="introduction">Introduction</h2>
<p>This is the second episode in a series where <a href="http://hackerpublicradio.org/correspondents.php?hostid=300" title="Mr. Young">Mr. Young</a> and I will be looking at the <code>AWK</code> language (more particularly its GNU variant <code>gawk</code>). It is a comprehensive interpreted scripting language designed to be used for manipulating text.</p>
<p>The name <strong><code>AWK</code></strong> comes from the names of the authors: <em>Alfred V. </em><strong>A</strong><em>ho</em>, <em>Peter J. </em><strong>W</strong><em>einberger</em>, and <em>Brian W. </em><strong>K</strong><em>ernighan</em>. The original version of <code>AWK</code> was written in 1977<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> at AT&amp;T Bell Laboratories. See the <a href="https://www.gnu.org/software/gawk/manual/html_node/History.html#History" title="History of awk and gawk"><em>GNU Awk User’s Guide</em></a> for the full history of <code>awk</code> and <code>gawk</code>.</p>
<p>Strictly the name of the language is <code>AWK</code> in capitals, but the command that is typed to invoke it is <code>awk</code> or <code>gawk</code>, so I will use the lower-case version throughout these notes unless it is important to differentiate the two. Nowadays, on most Linux distributions, <code>awk</code> and <code>gawk</code> are synonyms referring to GNU Awk.</p>
<p>I first encountered <code>awk</code> in the late 1980s when I was working on a Digital Equipment Corporation (DEC) VAXCluster running OpenVMS. This operating system did not have any very good ways of manipulating text without writing a compiled program, which was something I frequently needed to do. A version of <code>gawk</code> was ported to OpenVMS around this time, which I installed. For me <code>gawk</code> (and <code>sed</code>) totally changed the way I was able to work on OpenVMS at that time.</p>
<h2 id="simple-awk-usage-recap">Simple Awk Usage Recap</h2>
<h3 id="invoking-awk">Invoking Awk</h3>
<p>As we saw in the <a href="http://hackerpublicradio.org/eps/hpr2114" title="Gnu Awk - Part 1">last episode</a>, <code>awk</code> is invoked on the command line as:</p>
<pre><code>$ awk [options] &#39;program&#39; inputfile1 inputfile2...</code></pre>
<ul>
<li><code>awk</code> is the command</li>
<li><code>[options]</code> are the options accepted by the command, one of which, <code>-F</code>, was introduced in the last episode</li>
<li><code>program</code> is the <code>awk</code> program enclosed in single quotes; this may be preceded by <code>-e</code> (like <code>sed</code>) to make it clear that the program follows (where it might otherwise be ambiguous)</li>
<li><code>inputfile1</code> is the first file to be processed; there may be many; if the character <code>-</code> is given instead of a filename, data is expected on standard input</li>
</ul>
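<p>As a minimal sketch of the last point, the command below sends two lines to <code>awk</code> on standard input and gives <code>-</code> in place of a filename. The sample lines are made up for illustration (in the style of <code>file1.txt</code>):</p>
<pre><code>$ printf &#39;apple red 4\nbanana yellow 6\n&#39; | awk &#39;{ print $1 }&#39; -
apple
banana</code></pre>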
<h3 id="what-awk-does">What Awk does</h3>
<p>Awk views its input data as a series of “<em>records</em>” (usually newline-delimited lines), where each record contains a series of “<em>fields</em>”. A field is a component of a record delimited by a “<em>field separator</em>”.</p>
<p>In the last episode field separators were <strong>whitespace</strong> (spaces, TABs and newlines), which is the default, or a comma (<code>-F &quot;,&quot;</code> or <code>-F,</code>).</p>
<p>One of the features of <code>awk</code> is that it treats multiple <strong>space</strong> separators as one, as we saw in the last episode. There were multiple spaces between many of the fields of the test file.</p>
<p>Other separators are not treated this way, so with the following example record, assuming that the field separator is a comma, three fields are found, with the second one being of zero length:</p>
<pre><code>a,,b</code></pre>
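<p>The difference can be checked with a quick sketch that prints <code>NF</code>, the variable in which <code>awk</code> keeps the number of fields found in the record (described later in these notes). The input strings are made up for illustration:</p>
<pre><code>$ echo &#39;a,,b&#39; | awk -F, &#39;{ print NF }&#39;
3
$ echo &#39;a   b&#39; | awk &#39;{ print NF }&#39;
2</code></pre>
<p>With the comma separator the empty second field is counted, whereas with the default separator the run of three spaces counts as a single separator.</p>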
<h3 id="awk-program">Awk program</h3>
<p>As we saw in the last episode, an <code>awk</code> program consists of a series of <em>rules</em> where each rule consists of:</p>
<pre><code>pattern { action }</code></pre>
<p>Normally each rule begins on a new line in the program (though this is not mandatory). There are program components other than rules, but we’ll deal with these later on.</p>
<p>In a rule <code>pattern</code> is used to identify a line in some way, and <code>{ action }</code> defines what will be done to the line which has been matched by the pattern. Patterns can be simple comparisons, regular expressions, combinations of the two and quite a few other things that will be covered throughout the series.</p>
<p>A pattern may be omitted, in which case the action is applied to every record. Also, a rule can consist only of a pattern, in which case the entire record is written as if the action was <code>{ print }</code> (which means print the record).</p>
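<p>The two shortened forms of a rule might look like this in practice. This is a sketch using two made-up input lines in the style of <code>file1.txt</code>; the first command has a pattern but no action, the second has an action but no pattern:</p>
<pre><code>$ printf &#39;apple red 4\nbanana yellow 6\n&#39; | awk &#39;/^a/&#39;
apple red 4
$ printf &#39;apple red 4\nbanana yellow 6\n&#39; | awk &#39;{ print $2 }&#39;
red
yellow</code></pre>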
<p>Awk programs are essentially <em>data driven</em> in that actions depend on the data, so they are quite a bit different from programs in many other programming languages.</p>
<h2 id="more-about-fields-and-records">More about fields and records</h2>
<p>As was covered in <a href="http://hackerpublicradio.org/eps/hpr2114" title="Gnu Awk - Part 1">episode 1</a>, once Awk has separated an input record into fields they are stored as numbered entities. These are available by using a dollar sign followed by a number. So, <code>$1</code> refers to field 1, <code>$2</code> field 2, and so on. The variable <code>$0</code> refers to the entire record in an un-split state.</p>
<p>The number after a dollar sign is actually an expression, so <code>$2</code> and <code>$(1+1)</code> mean the same thing. This is an example of an arithmetic expression, and is a useful feature of <code>awk</code>.</p>
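<p>A small sketch of this, using a single made-up input line. The semicolons simply separate the three <code>print</code> statements within the action:</p>
<pre><code>$ echo &#39;apple red 4&#39; | awk &#39;{ print $2; print $(1+1); print $(3-2) }&#39;
red
red
apple</code></pre>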
<p>There is a special variable called <strong><code>NF</code></strong> in which <code>awk</code> stores the number of fields it has found in the current record. This can be printed or used in tests as shown in the following example (which uses <a href="hpr2129_file1.txt" title="file1.txt"><code>file1.txt</code></a> introduced in episode 1):</p>
<pre><code>$ awk &#39;{ print $0 &quot; (&quot; NF &quot;)&quot; }&#39; file1.txt | head -3
name color amount (3)
apple red 4 (3)
banana yellow 6 (3)</code></pre>
<p>(Note that we used <code>head -3</code> to truncate the output here.)</p>
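<p>Since <code>NF</code> can also be used in tests, a rule can select records by their field count. The following is a sketch with made-up input in which the second line has only two fields; the same <code>print</code> statement is used, but only for records with fewer than three fields:</p>
<pre><code>$ printf &#39;apple red 4\nplum purple\n&#39; | awk &#39;NF &lt; 3 { print $0 &quot; (&quot; NF &quot;)&quot; }&#39;
plum purple (2)</code></pre>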
<p>The way in which <code>print</code> works in <code>awk</code> is: it takes a series of items, which may be variables or strings, and when these are written one after another with no commas between them they are concatenated together. Here we have <code>$0</code>, the record itself, followed by a string containing a space and an open parenthesis, the <code>NF</code> variable, and another string containing a close parenthesis.</p>
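<p>To see the concatenation in isolation, the sketch below prints two fields written side by side. For comparison, the second command separates them with a comma, a form of <code>print</code> not covered in these notes, in which case <code>awk</code> writes a space (by default) between the items. The input line is made up for illustration:</p>
<pre><code>$ echo &#39;apple red 4&#39; | awk &#39;{ print $1 $2 }&#39;
applered
$ echo &#39;apple red 4&#39; | awk &#39;{ print $1, $2 }&#39;
apple red</code></pre>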
<p>As well as counting fields per record, <code>awk</code> also counts input records. The record number is held in the variable <strong><code>NR</code></strong>, and this can be used in the same way as we have seen with <code>NF</code>. For example, to print the record number before each line we could write:</p>
<pre><code>$ awk &#39;{ print NR &quot;: &quot; $0 }&#39; file1.txt
1: name color amount
2: apple red 4
3: banana yellow 6
4: strawberry red 3
5: grape purple 10
6: apple green 8
7: plum purple 2
8: kiwi brown 4
9: potato brown 9
10: pineapple yellow 5</code></pre>
<p>Note that writing the above with no spaces other than the one after <code>print</code> is completely acceptable (though potentially less clear):</p>
<pre><code>$ awk &#39;{print NR&quot;: &quot;$0}&#39; file1.txt</code></pre>
<p>In the audio I wasn’t sure about this, but I have since checked.</p>
<h2 id="more-about-printing">More about printing</h2>
<p>So far we have seen the <code>print</code> statement and have found that it is a little awkward to use to print a mixture of fixed text and variables. In particular, there is no interpolation of variables into strings as can be seen in other scripting languages (e.g. Bash).</p>
<p>There is also a <code>printf</code> statement in Awk. This is similar to <code>printf</code> in <em>C</em> and <em>Bash</em>. It takes a <em>format</em> argument followed by a comma-separated list of items. The argument list may be enclosed in parentheses.</p>
<pre><code>printf format, item1, item2, ...</code></pre>
<p>The format argument (or <em>format string</em>) defines how each of the other arguments is to be output. It uses <em>format specifiers</em> to do this, amongst which are <code>%s</code> which means “output a string” and <code>%d</code> for outputting a whole decimal number. For example, the following <code>printf</code> statement outputs the record followed by a parenthesised number of fields:</p>
<pre><code>printf &quot;%s (%d)\n&quot;,$0,NF</code></pre>
<p>Note that, unlike <code>print</code>, no newline is generated unless requested explicitly. The escape sequence <code>\n</code> does this.</p>
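<p>Run from the command line, the <code>printf</code> statement might be used as follows. This is a sketch with two made-up input lines in the style of <code>file1.txt</code>:</p>
<pre><code>$ printf &#39;apple red 4\nbanana yellow 6\n&#39; | awk &#39;{ printf &quot;%s (%d)\n&quot;,$0,NF }&#39;
apple red 4 (3)
banana yellow 6 (3)</code></pre>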
<p>There are more <em>format specifiers</em> and more features of <code>printf</code> to be described, and these will be covered later in the series.</p>
<h2 id="more-about-awk-programs">More about Awk programs</h2>
<p>So far we have seen examples of simple <code>awk</code> programs written on the command line. For more complex programs it is usually preferable to place them in files. The option <code>-f FILE</code> may be used to invoke such a file containing a program. File <a href="hpr2129_example1.awk" title="example1.awk"><code>example1.awk</code></a>, included with this episode, is an example of this and holds the following:</p>
<pre><code>/^a/ { print &quot;A: &quot; $0 }
/^b/ { print &quot;B: &quot; $0 }</code></pre>
<p>This would be run as follows:</p>
<pre><code>$ awk -f example1.awk file1.txt
A: apple red 4
B: banana yellow 6
A: apple green 8</code></pre>
<p>It is the convention to give such files the extension <code>.awk</code> to make it clear that they hold an Awk program. This is not mandatory but it gives a useful clue to file managers and editors as to what the file is.</p>
<p>As you will have seen if you followed the <code>sed</code> series and other HPR episodes on scripting, an Awk program file can be made into a script by adding a <code>#!</code> line at the top and making it executable. The file <a href="hpr2129_example2.awk" title="example2.awk"><code>example2.awk</code></a> has been included with this episode to demonstrate this feature. It looks like this:</p>
<div class="sourceCode"><table class="sourceCode awk numberLines"><tr class="sourceCode"><td class="lineNumbers"><pre>1
2
3
4
5
</pre></td><td class="sourceCode"><pre><code class="sourceCode awk"><span class="co">#!/usr/bin/awk -f</span>
<span class="co">#</span>
<span class="co"># Print all but line 1 with the line number on the front</span>
<span class="co">#</span>
NR &gt; <span class="dv">1</span> <span class="kw">{</span> <span class="kw">printf</span> <span class="st">&quot;%d: %s\n&quot;</span>,NR,<span class="dt">$0</span> <span class="kw">}</span></code></pre></td></tr></table></div>
<p>Note that we added the path to where the <code>awk</code> program may be found, and <code>-f</code>, to the first line. Without the option, <code>awk</code> would treat the name of the script file as the program itself rather than reading the program from the file.</p>
<p>Note also that lines 2-4 are comments. Line 5 is the program which prints each line with a line number, but only if the number is greater than 1. Thus the header line is not printed.</p>
<p>The Awk file must be made executable for this to work:</p>
<pre><code>$ chmod u+x example2.awk</code></pre>
<p>Then it can be invoked as follows (assuming it is in the current directory):</p>
<pre><code>$ ./example2.awk file1.txt
2: apple red 4
3: banana yellow 6
4: strawberry red 3
5: grape purple 10
6: apple green 8
7: plum purple 2
8: kiwi brown 4
9: potato brown 9
10: pineapple yellow 5</code></pre>
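<p>When the script is run this way the system starts the interpreter named on the <code>#!</code> line and passes it the name of the script, so the invocation above should be equivalent to running <code>awk</code> with <code>-f</code> explicitly. A sketch, assuming <code>awk</code> is installed as <code>/usr/bin/awk</code> as the script expects:</p>
<pre><code>$ /usr/bin/awk -f ./example2.awk file1.txt | head -3
2: apple red 4
3: banana yellow 6
4: strawberry red 3</code></pre>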
<h2 id="summary">Summary</h2>
<p>This episode covered:</p>
<ul>
<li>Awk’s concept of <em>records</em> and <em>fields</em></li>
<li>How spaces as field separators are different from any other separators</li>
<li>How an Awk program is made up of <code>pattern { action }</code> rules</li>
<li>How fields are referred to by a dollar sign followed by a numeric expression</li>
<li>The variables <code>NF</code> and <code>NR</code> which hold the number of fields and the record number</li>
<li>The <code>print</code> and <code>printf</code> statements</li>
<li>Awk program files and the <code>-f</code> option</li>
<li>Executable Awk scripts</li>
</ul>
<h2 id="links">Links</h2>
<ul>
<li><em>GNU Awk User’s Guide</em>: <a href="https://www.gnu.org/software/gawk/manual/html_node/index.html" class="uri">https://www.gnu.org/software/gawk/manual/html_node/index.html</a></li>
<li><em>Awk - A Tutorial and Introduction</em>: <a href="http://www.grymoire.com/Unix/Awk.html" class="uri">http://www.grymoire.com/Unix/Awk.html</a></li>
<li>Wikipedia article on <em>AWK</em>: <a href="https://en.wikipedia.org/wiki/AWK" class="uri">https://en.wikipedia.org/wiki/AWK</a></li>
<li>Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger (1988). <em>The AWK Programming Language</em>. Addison-Wesley Publishing Company. ISBN 9780201079814</li>
<li>Previous show on HPR:
<ul>
<li>“<em>Gnu Awk - Part 1</em>”: <a href="http://hackerpublicradio.org/eps/hpr2114" class="uri">http://hackerpublicradio.org/eps/hpr2114</a></li>
</ul></li>
<li>Resources:
<ul>
<li>Example data file 1 - whitespace delimited “<code>file1.txt</code>”: <a href="hpr2129_file1.txt" class="uri">hpr2129_file1.txt</a></li>
<li>Example data file 1 - comma delimited “<code>file1.csv</code>”: <a href="hpr2129_file1.csv" class="uri">hpr2129_file1.csv</a></li>
<li>Example Awk file 1 “<code>example1.awk</code>”: <a href="hpr2129_example1.awk" class="uri">hpr2129_example1.awk</a></li>
<li>Example Awk file 2 “<code>example2.awk</code>”: <a href="hpr2129_example2.awk" class="uri">hpr2129_example2.awk</a></li>
</ul></li>
</ul>
<!--
vim: syntax=markdown:ts=8:sw=4:ai:et:tw=78:fo=tcqn:fdm=marker
-->
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>I said 1997 in the audio, not 1977. Doh!<a href="#fnref1">↩</a></p></li>
</ol>
</section>
</article>
</main>
</div>
</body>
</html>