<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="author" content="Dave Morriss">
<title>Gnu Awk - Part 2 (HPR Show 2129)</title>
<style type="text/css">code{white-space: pre;}</style>
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link rel="stylesheet" href="http://hackerpublicradio.org/css/hpr.css">
</head>
<body id="home">
<div id="container" class="shadow">
<header>
<h1 class="title">Gnu Awk - Part 2 (HPR Show 2129)</h1>
<h2 class="author">Dave Morriss</h2>
<hr/>
</header>
<main id="maincontent">
<article>
<header>
<h1>Table of Contents</h1>
<nav id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#simple-awk-usage-recap">Simple Awk Usage Recap</a><ul>
<li><a href="#invoking-awk">Invoking Awk</a></li>
<li><a href="#what-awk-does">What Awk does</a></li>
<li><a href="#awk-program">Awk program</a></li>
</ul></li>
<li><a href="#more-about-fields-and-records">More about fields and records</a></li>
<li><a href="#more-about-printing">More about printing</a></li>
<li><a href="#more-about-awk-programs">More about Awk programs</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#links">Links</a></li>
</ul>
</nav>
</header>
<h2 id="introduction">Introduction</h2>
<p>This is the second episode in a series where <a href="http://hackerpublicradio.org/correspondents.php?hostid=300" title="Mr. Young">Mr. Young</a> and I will be looking at the <code>AWK</code> language (more particularly its GNU variant <code>gawk</code>). It is a comprehensive interpreted scripting language designed to be used for manipulating text.</p>
<p>The name <strong><code>AWK</code></strong> comes from the names of the authors: <em>Alfred V. </em><strong>A</strong><em>ho</em>, <em>Peter J. </em><strong>W</strong><em>einberger</em>, and <em>Brian W. </em><strong>K</strong><em>ernighan</em>. The original version of <code>AWK</code> was written in 1977<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> at AT&amp;T Bell Laboratories. See the <a href="https://www.gnu.org/software/gawk/manual/html_node/History.html#History" title="History of awk and gawk"><em>GNU Awk User’s Guide</em></a> for the full history of <code>awk</code> and <code>gawk</code>.</p>
<p>Strictly the name of the language is <code>AWK</code> in capitals, but the command that is typed to invoke it is <code>awk</code> or <code>gawk</code>, so I will use the lower-case version throughout these notes unless it is important to differentiate the two. Nowadays, on most Linux distributions, <code>awk</code> and <code>gawk</code> are synonyms referring to GNU Awk.</p>
<p>I first encountered <code>awk</code> in the late 1980s when I was working on a Digital Equipment Corporation (DEC) VAXCluster running OpenVMS. This operating system did not have any very good ways of manipulating text without writing a compiled program, which was something I frequently needed to do. A version of <code>gawk</code> was ported to OpenVMS around this time, which I installed. For me <code>gawk</code> (and <code>sed</code>) totally changed the way I was able to work on OpenVMS at that time.</p>
<h2 id="simple-awk-usage-recap">Simple Awk Usage Recap</h2>
<h3 id="invoking-awk">Invoking Awk</h3>
<p>As we saw in the <a href="http://hackerpublicradio.org/eps/hpr2114" title="Gnu Awk - Part 1">last episode</a>, <code>awk</code> is invoked on the command line as:</p>
<pre><code>$ awk [options] &#39;program&#39; inputfile1 inputfile2...</code></pre>
<ul>
<li><code>awk</code> is the command</li>
<li><code>[options]</code> are the options accepted by the command, one of which, <code>-F</code>, was introduced in the last episode</li>
<li><code>program</code> is the <code>awk</code> program enclosed in single quotes; in <code>gawk</code> this may be preceded by <code>-e</code> (like <code>sed</code>) to make it clear that the program follows (where it might otherwise be ambiguous)</li>
<li><code>inputfile1</code> is the first file to be processed; there may be many; if the character <code>-</code> is given instead of a filename, data is expected on standard input</li>
</ul>
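<p>As a small illustration of the last point, the sketch below pipes two made-up lines into <code>awk</code> and gives <code>-</code> in place of a filename. The data and the variable name <code>first_fields</code> are assumptions for the example, not part of the episode's files.</p>

```shell
# Feed two lines to awk on standard input; "-" stands in for a filename.
# The input lines are invented for this illustration.
first_fields=$(printf 'apple red 4\nbanana yellow 6\n' | awk '{ print $1 }' -)

# Show the first field of each record
echo "$first_fields"
```

<p>With no filename at all <code>awk</code> also reads standard input, so the <code>-</code> is mostly useful when mixing piped data with named files.</p>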
<h3 id="what-awk-does">What Awk does</h3>
<p>Awk views its input data as a series of “<em>records</em>” (usually newline-delimited lines), where each record contains a series of “<em>fields</em>”. A field is a component of a record delimited by a “<em>field separator</em>”.</p>
<p>In the last episode field separators were <strong>whitespace</strong> (spaces, TABs and newlines), which is the default, or a comma (<code>-F &quot;,&quot;</code> or <code>-F,</code>).</p>
<p>One of the features of <code>awk</code> is that it treats multiple <strong>space</strong> separators as one, as we saw in the last episode. There were multiple spaces between many of the fields of the test file.</p>
<p>Other separators are not treated this way, so with the following example record, assuming that the field separator is a comma, three fields are found, with the second one being of zero length:</p>
<pre><code>a,,b</code></pre>
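<p>The difference can be checked with a quick experiment. This is only a sketch: it prints the second field between square brackets so that an empty field is visible, and the variable names are made up for the example.</p>

```shell
# Comma separator: the empty field between the two commas is field 2.
comma_second=$(echo 'a,,b' | awk -F, '{ print "[" $2 "]" }')

# Default separator: the run of spaces counts as one separator,
# so field 2 is "b".
space_second=$(echo 'a    b' | awk '{ print "[" $2 "]" }')

echo "comma: $comma_second"
echo "space: $space_second"
```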
<h3 id="awk-program">Awk program</h3>
<p>As we saw in the last episode, an <code>awk</code> program consists of a series of <em>rules</em> where each rule consists of:</p>
<pre><code>pattern { action }</code></pre>
<p>Normally each rule begins on a new line in the program (though this is not mandatory). There are program components other than rules, but we’ll deal with these later on.</p>
<p>In a rule <code>pattern</code> is used to identify a line in some way, and <code>{ action }</code> defines what will be done to the line which has been matched by the pattern. Patterns can be simple comparisons, regular expressions, combinations of the two and quite a few other things that will be covered throughout the series.</p>
<p>A pattern may be omitted, in which case the action is applied to every record. Also, a rule can consist only of a pattern, in which case the entire record is written as if the action was <code>{ print }</code> (which means print the record).</p>
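<p>Both shortened forms of a rule can be tried on a few lines of input. The following is a minimal sketch using three made-up records in the style of <code>file1.txt</code>; the variable names are assumptions for the example.</p>

```shell
# Three sample records, invented for this illustration
data=$(printf 'apple red 4\nbanana yellow 6\nplum purple 2\n')

# Pattern only: records matching /an/ are printed, as if by { print }
pattern_only=$(printf '%s\n' "$data" | awk '/an/')

# Action only: the action is applied to every record
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

echo "$pattern_only"
echo "$action_only"
```

<p>Only the <code>banana</code> record contains <code>an</code>, so the first command prints one line, while the second prints the first field of all three records.</p>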
<p>Awk programs are essentially <em>data driven</em> in that actions depend on the data, so they are quite a bit different from programs in many other programming languages.</p>
<h2 id="more-about-fields-and-records">More about fields and records</h2>
<p>As was covered in <a href="http://hackerpublicradio.org/eps/hpr2114" title="Gnu Awk - Part 1">episode 1</a>, once Awk has separated an input record into fields they are stored as numbered entities. These are available by using a dollar sign followed by a number. So, <code>$1</code> refers to field 1, <code>$2</code> field 2, and so on. The variable <code>$0</code> refers to the entire record in an un-split state.</p>
<p>The number after a dollar sign is actually an expression, so <code>$2</code> and <code>$(1+1)</code> mean the same thing. This is an example of an arithmetic expression, and is a useful feature of <code>awk</code>.</p>
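<p>A quick way to convince yourself of this is to run both forms on the same record. This is just a sketch with a made-up record and variable names:</p>

```shell
# One sample record, invented for this illustration
record='apple red 4'

# Field 2 written as a literal number
literal=$(echo "$record" | awk '{ print $2 }')

# Field 2 written as an arithmetic expression
computed=$(echo "$record" | awk '{ print $(1+1) }')

echo "literal: $literal, computed: $computed"
```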
<p>There is a special variable called <strong><code>NF</code></strong> in which <code>awk</code> stores the number of fields it has found in the current record. This can be printed or used in tests as shown in the following example (which uses <a href="hpr2129_file1.txt" title="file1.txt"><code>file1.txt</code></a> introduced in episode 1):</p>
<pre><code>$ awk &#39;{ print $0 &quot; (&quot; NF &quot;)&quot; }&#39; file1.txt | head -3
name color amount (3)
apple red 4 (3)
banana yellow 6 (3)</code></pre>
<p>(Note that we used <code>head -3</code> to truncate the output here.)</p>
<p>The way in which <code>print</code> works here is: the variables and strings written next to one another are concatenated into a single string, which <code>print</code> then writes followed by a newline. Here we have <code>$0</code>, the record itself, followed by a string containing a space and an open parenthesis, the <code>NF</code> variable, and another string containing a close parenthesis.</p>
<p>As well as counting fields per record, <code>awk</code> also counts input records. The record number is held in the variable <strong><code>NR</code></strong>, and this can be used in the same way as we have seen with <code>NF</code>. For example, to print the record number before each line we could write:</p>
<pre><code>$ awk &#39;{ print NR &quot;: &quot; $0 }&#39; file1.txt
1: name color amount
2: apple red 4
3: banana yellow 6
4: strawberry red 3
5: grape purple 10
6: apple green 8
7: plum purple 2
8: kiwi brown 4
9: potato brown 9
10: pineapple yellow 5</code></pre>
<p>Note that writing the above with no spaces other than the one after <code>print</code> is completely acceptable (though potentially less clear):</p>
<pre><code>$ awk &#39;{print NR&quot;: &quot;$0}&#39; file1.txt</code></pre>
<p>In the audio I wasn’t sure about this, but I have since checked.</p>
<h2 id="more-about-printing">More about printing</h2>
<p>So far we have seen the <code>print</code> statement and have found that it is a little awkward to use to print a mixture of fixed text and variables. In particular, there is no interpolation of variables into strings as can be seen in other scripting languages (e.g. Bash).</p>
<p>There is also a <code>printf</code> statement in Awk. This is similar to <code>printf</code> in <em>C</em> and <em>Bash</em>. It takes a <em>format</em> argument followed by a comma-separated list of items. The argument list may be enclosed in parentheses.</p>
<pre><code>printf format, item1, item2, ...</code></pre>
<p>The format argument (or <em>format string</em>) defines how each of the other arguments is to be output. It uses <em>format specifiers</em> to do this, amongst which are <code>%s</code> which means “output a string” and <code>%d</code> for outputting a whole decimal number. For example, the following <code>printf</code> statement outputs the record followed by a parenthesised number of fields:</p>
<pre><code>printf &quot;%s (%d)\n&quot;,$0,NF</code></pre>
<p>Note that, unlike <code>print</code>, no newline is generated unless requested explicitly. The escape sequence <code>\n</code> does this.</p>
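<p>The sketch below compares the two statements on made-up input, and also shows what happens when the <code>\n</code> is left out of the format. The variable names are assumptions for the example.</p>

```shell
# One sample record, invented for this illustration
line='apple red 4'

# print adds the newline itself
with_print=$(echo "$line" | awk '{ print $0 " (" NF ")" }')

# printf needs an explicit \n in the format string
with_printf=$(echo "$line" | awk '{ printf "%s (%d)\n", $0, NF }')

# Without \n the records run together on one line
joined=$(printf 'a\nb\n' | awk '{ printf "%s", $0 }')

echo "$with_print"
echo "$with_printf"
echo "$joined"
```

<p>The first two commands produce the same line, while the third joins the two input records into <code>ab</code>.</p>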
<p>There are more <em>format specifiers</em> and more features of <code>printf</code> to be described, and these will be covered later in the series.</p>
<h2 id="more-about-awk-programs">More about Awk programs</h2>
<p>So far we have seen examples of simple <code>awk</code> programs written on the command line. For more complex programs it is usually preferable to place them in files. The option <code>-f FILE</code> may be used to invoke such a file containing a program. File <a href="hpr2129_example1.awk" title="example1.awk"><code>example1.awk</code></a>, included with this episode, is an example of this and holds the following:</p>
<pre><code>/^a/ { print &quot;A: &quot; $0 }
/^b/ { print &quot;B: &quot; $0 }</code></pre>
<p>This would be run as follows:</p>
<pre><code>$ awk -f example1.awk file1.txt
A: apple red 4
B: banana yellow 6
A: apple green 8</code></pre>
<p>It is the convention to give such files the extension <code>.awk</code> to make it clear that they hold an Awk program. This is not mandatory but it gives a useful clue to file managers and editors as to what the file is.</p>
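<p>If you do not have the episode's files to hand, the same idea can be tried with a throwaway program file. This is only a sketch: the temporary directory, the file name <code>demo.awk</code> and the three input records are made up for the example, though the two rules are the same as in <code>example1.awk</code>.</p>

```shell
# Create a scratch directory for the hypothetical program file
tmpdir=$(mktemp -d)

# Write the two rules, one per line, into demo.awk
printf '%s\n' '/^a/ { print "A: " $0 }' '/^b/ { print "B: " $0 }' > "$tmpdir/demo.awk"

# Run the program file with -f, reading sample records from standard input
result=$(printf 'apple red 4\nbanana yellow 6\nplum purple 2\n' | awk -f "$tmpdir/demo.awk")
echo "$result"

# Tidy up
rm -r "$tmpdir"
```

<p>The <code>plum</code> record matches neither pattern, so nothing is printed for it.</p>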
<p>As you will have seen if you followed the <code>sed</code> series and other HPR episodes on scripting, an Awk program file can be made into a script by adding a <code>#!</code> line at the top and making it executable. The file <a href="hpr2129_example2.awk" title="example2.awk"><code>example2.awk</code></a> has been included with this episode to demonstrate this feature. It looks like this:</p>
<div class="sourceCode"><table class="sourceCode awk numberLines"><tr class="sourceCode"><td class="lineNumbers"><pre>1
2
3
4
5
</pre></td><td class="sourceCode"><pre><code class="sourceCode awk"><span class="co">#!/usr/bin/awk -f</span>
<span class="co">#</span>
<span class="co"># Print all but line 1 with the line number on the front</span>
<span class="co">#</span>
NR &gt; <span class="dv">1</span> <span class="kw">{</span> <span class="kw">printf</span> <span class="st">&quot;%d: %s\n&quot;</span>,NR,<span class="dt">$0</span> <span class="kw">}</span></code></pre></td></tr></table></div>
<p>Note that we added the path to where the <code>awk</code> program may be found, and <code>-f</code>, to the first line. Without the option, <code>awk</code> will not read the rest of the file.</p>
<p>Note also that lines 2-4 are comments. Line 5 is the program which prints each line with a line number, but only if the number is greater than 1. Thus the header line is not printed.</p>
<p>The Awk file must be made executable for this to work:</p>
<pre><code>$ chmod u+x example2.awk</code></pre>
<p>Then it can be invoked as follows (assuming it is in the current directory):</p>
<pre><code>$ ./example2.awk file1.txt
2: apple red 4
3: banana yellow 6
4: strawberry red 3
5: grape purple 10
6: apple green 8
7: plum purple 2
8: kiwi brown 4
9: potato brown 9
10: pineapple yellow 5</code></pre>
<h2 id="summary">Summary</h2>
<p>This episode covered:</p>
<ul>
<li>Awk’s concept of <em>records</em> and <em>fields</em></li>
<li>How spaces as field separators are different from any other separators</li>
<li>How an Awk program is made up of <code>pattern { action }</code> rules</li>
<li>How fields are referred to by a dollar sign followed by a numeric expression</li>
<li>The variables <code>NF</code> and <code>NR</code> which hold the number of fields and the record number</li>
<li>The <code>print</code> and <code>printf</code> statements</li>
<li>Awk program files and the <code>-f</code> option</li>
<li>Executable Awk scripts</li>
</ul>
<h2 id="links">Links</h2>
<ul>
<li><em>GNU Awk User’s Guide</em>: <a href="https://www.gnu.org/software/gawk/manual/html_node/index.html" class="uri">https://www.gnu.org/software/gawk/manual/html_node/index.html</a></li>
<li><em>Awk - A Tutorial and Introduction</em>: <a href="http://www.grymoire.com/Unix/Awk.html" class="uri">http://www.grymoire.com/Unix/Awk.html</a></li>
<li>Wikipedia article on <em>AWK</em>: <a href="https://en.wikipedia.org/wiki/AWK" class="uri">https://en.wikipedia.org/wiki/AWK</a></li>
<li>Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger (1988). <em>The AWK Programming Language</em>. Addison-Wesley Publishing Company. ISBN 9780201079814</li>
<li>Previous show on HPR:
<ul>
<li>“<em>Gnu Awk - Part 1</em>”: <a href="http://hackerpublicradio.org/eps/hpr2114" class="uri">http://hackerpublicradio.org/eps/hpr2114</a></li>
</ul></li>
<li>Resources:
<ul>
<li>Example data file 1 - whitespace delimited “<code>file1.txt</code>”: <a href="hpr2129_file1.txt" class="uri">hpr2129_file1.txt</a></li>
<li>Example data file 1 - comma delimited “<code>file1.csv</code>”: <a href="hpr2129_file1.csv" class="uri">hpr2129_file1.csv</a></li>
<li>Example Awk file 1 “<code>example1.awk</code>”: <a href="hpr2129_example1.awk" class="uri">hpr2129_example1.awk</a></li>
<li>Example Awk file 2 “<code>example2.awk</code>”: <a href="hpr2129_example2.awk" class="uri">hpr2129_example2.awk</a></li>
</ul></li>
</ul>
<!--
vim: syntax=markdown:ts=8:sw=4:ai:et:tw=78:fo=tcqn:fdm=marker
-->
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>I said 1997 in the audio, not 1977. Doh!<a href="#fnref1">&#8617;</a></p></li>
</ol>
</section>
</article>
</main>
</div>
</body>
</html>