
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="generator" content="pandoc">
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
  <meta name="author" content="Dave Morriss">
  <title>Introduction to sed - part 2 (HPR Show 1986)</title>
  <style type="text/css">code{white-space: pre;}</style>
  <!--[if lt IE 9]>
    <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
  <![endif]-->
  <link rel="stylesheet" href="http://hackerpublicradio.org/css/hpr.css">
</head>

<body id="home">
<div id="container" class="shadow">
<header>
<h1 class="title">Introduction to sed - part 2 (HPR Show 1986)</h1>
<h2 class="author">Dave Morriss</h2>
<hr/>
</header>

<main id="maincontent">
<article>
<header>
<h1>Table of Contents</h1>
<nav id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#command-line-options">Command line options</a></li>
<li><a href="#more-about-the-s-command">More about the <strong>s</strong> command</a><ul>
<li><a href="#regular-expressions">Regular expressions</a><ul>
<li><a href="#one-or-more-of-the-preceding">One or more of the preceding</a></li>
<li><a href="#zero-or-one-of-the-preceding">Zero or one of the preceding</a></li>
<li><a href="#a-fixed-number-of-the-preceding">A fixed number of the preceding</a></li>
<li><a href="#between-i-and-j-of-the-preceding">Between <em>i</em> and <em>j</em> of the preceding</a></li>
<li><a href="#from-i-or-more-of-the-preceding">From <em>i</em> or more of the preceding</a></li>
<li><a href="#grouping-a-regexp">Grouping a regexp</a></li>
<li><a href="#alternative-regexps">Alternative regexps</a></li>
<li><a href="#greediness">Greediness</a></li>
</ul></li>
<li><a href="#replacement">Replacement</a><ul>
<li><a href="#back-references">Back references</a></li>
<li><a href="#case-manipulation">Case manipulation</a></li>
</ul></li>
<li><a href="#flags">Flags</a><ul>
<li><a href="#the-number-flag">The <em>number</em> flag</a></li>
<li><a href="#the-p-flag">The <strong>p</strong> flag</a></li>
<li><a href="#the-i-and-i-flags">The <strong>I</strong> and <strong>i</strong> flags</a></li>
</ul></li>
</ul></li>
<li><a href="#gnu-extensions-for-escapes-in-regular-expressions">GNU Extensions for Escapes in Regular Expressions</a></li>
<li><a href="#examples">Examples</a><ul>
<li><a href="#example-1">Example 1</a></li>
<li><a href="#example-2">Example 2</a></li>
<li><a href="#example-3">Example 3</a></li>
<li><a href="#example-4">Example 4</a></li>
<li><a href="#example-5">Example 5</a></li>
<li><a href="#example-6">Example 6</a></li>
</ul></li>
<li><a href="#links">Links</a></li>
</ul>
</nav>
</header>
<h2 id="introduction">Introduction</h2>
<p>In the <a href="http://hackerpublicradio.org/eps/hpr1976" title="Introduction to sed - part 1">last episode</a> we looked at <code>sed</code> at the simplest level. We looked at three command-line options and the '<em>s</em>' command. We introduced the idea of basic <em>regular expressions</em>.</p>
<p>In this episode we will cover all of these topics in more detail.</p>
<p>We are looking at GNU <code>sed</code> in this series. This version contains many extensions to POSIX <code>sed</code>. These extensions provide many more features, but <code>sed</code> scripts written this way are not portable.</p>
<p>This episode uses two new data files called <a href="hpr1986_sed_demo2.txt" title="hpr1986_sed_demo2.txt"><code>sed_demo2.txt</code></a> and <a href="hpr1986_sed_demo3.txt" title="hpr1986_sed_demo3.txt"><code>sed_demo3.txt</code></a> in the various demonstrations and examples.</p>
<h2 id="command-line-options">Command line options</h2>
<p>We looked at the <code>-e</code> and <code>-f</code> options in the last episode. We will look at several more of the available options this time, but will not cover everything. Refer to the <a href="https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed" title="Invocation">GNU manual</a> for the full list.</p>
<dl>
<dt><code>-n</code> <em>or</em> <code>--quiet</code> <em>or</em> <code>--silent</code></dt>
<dd><p>By default, <code>sed</code> prints out the pattern space at the end of each cycle through the script (see &quot;<a href="http://hackerpublicradio.org/eps/hpr1976#how-sed-works" title="How sed works">How sed works</a>&quot; in the last episode). These options disable this automatic printing, and <code>sed</code> only produces output when explicitly told to via the '<em>p</em>' flag or command (see &quot;<a href="#the-p-flag">The <strong>p</strong> flag</a>&quot; below).</p>
</dd>
<dt><code>-i[SUFFIX]</code> <em>or</em> <code>--in-place[=SUFFIX]</code></dt>
<dd><p>This option allows <code>sed</code> to edit files in place. If a suffix is specified the original file is renamed by appending the suffix, and the edited file given the original name. This provides a way of creating a backup of the original. If no suffix is given the original file is replaced by the edited file.</p>
<p>By default <code>sed</code> treats the input files on the command line as a single stream of data. When the <code>-i</code> option is used the files are treated separately (see the <code>-s</code> option).</p>
<p>If the suffix contains a '*' symbol then this is replaced by the current file name. See <a href="#example-1">Example 1</a> below for how to use this.</p>
</dd>
<dt><code>--follow-symlinks</code></dt>
<dd><p>This option is relevant to the <code>-i</code> option and is available only on systems that support symbolic links. If it is specified and the file being edited is a symbolic link, the link is followed and the actual file is edited. If it is omitted (the default), the link is replaced by a regular file holding the edited text and the actual file is not changed.</p>
</dd>
<dt><code>-s</code> <em>or</em> <code>--separate</code></dt>
<dd><p>By default <code>sed</code> treats the input files on the command line as a single stream of data. This GNU <code>sed</code> extension causes the command to consider them as separate files. The relevance of this will become apparent in later episodes.</p>
</dd>
<dt><code>-r</code> <em>or</em> <code>--regexp-extended</code></dt>
<dd><p>By default <code>sed</code> uses basic regular expressions, but this GNU extension allows the use of extended regular expressions (those allowed by <code>egrep</code>). Standard <code>sed</code> uses backslashes to denote many special characters. In extended mode these backslashes are not required. However, the result is not portable.</p>
</dd>
</dl>
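<p>The effect of <code>--follow-symlinks</code> is easiest to see by trying it. The following is a minimal sketch, assuming GNU <code>sed</code> and a system with <code>mktemp</code> and <code>ln -s</code>. The file names and variable names are made up for the illustration.</p>

```shell
# Work in a scratch directory so nothing real is touched
tmp=$(mktemp -d)
echo "hello" > "$tmp/real1.txt"
echo "hello" > "$tmp/real2.txt"
ln -s real1.txt "$tmp/link1.txt"
ln -s real2.txt "$tmp/link2.txt"

# With --follow-symlinks the target is edited and the link survives
sed -i --follow-symlinks -e 's/hello/goodbye/' "$tmp/link1.txt"
# Without it the link is replaced by a regular file; the target is untouched
sed -i -e 's/hello/goodbye/' "$tmp/link2.txt"

followed_target=$(cat "$tmp/real1.txt")
if [ -L "$tmp/link1.txt" ]; then followed_link="symlink"; else followed_link="regular file"; fi
default_target=$(cat "$tmp/real2.txt")
if [ -L "$tmp/link2.txt" ]; then default_link="symlink"; else default_link="regular file"; fi

echo "followed: target=$followed_target, link is a $followed_link"
echo "default:  target=$default_target, link is a $default_link"

# Tidy up the scratch directory
rm -rf "$tmp"
```

<p>In the first case the target should contain 'goodbye' and the link should still be a link. In the second the target should still contain 'hello' and the link should have become a regular file.</p>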
<h2 id="more-about-the-s-command">More about the <strong>s</strong> command</h2>
<h3 id="regular-expressions">Regular expressions</h3>
<p>Regular expressions in <code>sed</code> can be more complex than those we looked at in the last episode, allowing much greater flexibility. The new meta-characters we'll look at this time all start with a backslash. Many other Unix tools that use regular expressions do the same, but others do not. This can be confusing, so it's important to be aware of the differences.</p>
<table>
<thead>
<tr class="header">
<th style="text-align: left;">Expression  </th>
<th style="text-align: left;">Meaning</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>\+</strong></td>
<td style="text-align: left;">Similar to <strong>*</strong> but matches a sequence of one or more instances of the preceding item</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>\?</strong></td>
<td style="text-align: left;">Similar to <strong>*</strong> but matches a sequence of zero or one instance of the preceding item</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>\{i\}</strong></td>
<td style="text-align: left;">Matches exactly <code>i</code> sequences (<code>i</code> is a decimal integer)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>\{i,j\}</strong></td>
<td style="text-align: left;">Matches between <code>i</code> and <code>j</code> sequences, inclusive</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>\{i,\}</strong></td>
<td style="text-align: left;">Matches <code>i</code> or more sequences, inclusive</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>\(regexp\)</strong></td>
<td style="text-align: left;">Groups the inner <em>regexp</em>. Allows it to be followed by a postfix operator, or can be used for back references (see below)</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>regexp1\|regexp2</strong></td>
<td style="text-align: left;">Matches <em>regexp1</em> or <em>regexp2</em>, <strong>\|</strong> is used to separate alternatives</td>
</tr>
</tbody>
</table>
<!-- \** -->
<h4 id="one-or-more-of-the-preceding">One or more of the preceding</h4>
<p>Using the '\+' modifier matches sequences of variable length starting with one instance. So, using an example from the last episode:</p>
<pre><code>s/a\+bc/def/</code></pre>
<p>Here the sequence being matched is 'abc', 'aabc', 'aaabc' and so forth. It does not match 'bc' since there has to be at least one 'a'.</p>
<p>This is a GNU extension.</p>
<h4 id="zero-or-one-of-the-preceding">Zero or one of the preceding</h4>
<p>The '\?' modifier matches zero or one of the preceding expression. So, considering the following example:</p>
<pre><code>s/a\?bc/def/</code></pre>
<p>This matches 'bc' and 'abc' because zero or one 'a' is specified.</p>
<p>This is a GNU extension.</p>
<h4 id="a-fixed-number-of-the-preceding">A fixed number of the preceding</h4>
<p>Using the '\{i\}' modifier we specify a fixed number of the preceding expression:</p>
<pre><code>s/a\{3\}bc/def/</code></pre>
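<p>This matches 'aaabc' since exactly three 'a' characters are needed before the 'bc'. The regexp is not anchored, so in a longer run such as 'aaaabc' it matches the last three 'a' characters and the 'bc', leaving the first 'a' untouched.</p>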
<h4 id="between-i-and-j-of-the-preceding">Between <em>i</em> and <em>j</em> of the preceding</h4>
<p>Using the '\{i,j\}' modifier we specify a number of the preceding expression between lower and upper bounds:</p>
<pre><code>s/a\{1,5\}bc/def/</code></pre>
<p>This matches 'abc', 'aabc', 'aaabc', 'aaaabc' and 'aaaaabc'; that is, between 1 and 5 'a' characters followed by 'bc'.</p>
<h4 id="from-i-or-more-of-the-preceding">From <em>i</em> or more of the preceding</h4>
<p>Using the '\{i,\}' modifier we specify a number of the preceding expression from a lower value to an undefined upper limit:</p>
<pre><code>s/a\{1,\}bc/def/</code></pre>
<p>This matches 'abc', 'aabc' and so on, with no limit to the number of 'a' characters. This is the same as:</p>
<pre><code>s/a\+bc/def/</code></pre>
<p>However, the lower limit does not have to be 1.</p>
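<p>The repetition operators described so far can be compared side by side. This is a small sketch assuming GNU <code>sed</code>; the input strings and variable names are invented for the purpose. Note that none of the regexps are anchored, so surplus 'a' characters at the front are simply left alone.</p>

```shell
# Each variable holds what is left after the substitution
plus_none=$(echo "bc" | sed -e 's/a\+bc/def/')            # no 'a', so no match
opt_zero=$(echo "bc" | sed -e 's/a\?bc/def/')             # zero 'a' is allowed
exact_three=$(echo "aaabc" | sed -e 's/a\{3\}bc/def/')    # exactly three 'a'
exact_long=$(echo "aaaabc" | sed -e 's/a\{3\}bc/def/')    # four 'a': one is left over
range_long=$(echo "aaaaaaabc" | sed -e 's/a\{1,5\}bc/def/') # seven 'a': two are left over
at_least=$(echo "aaaaaaabc" | sed -e 's/a\{2,\}bc/def/')  # two or more takes them all

echo "$plus_none"    # bc
echo "$opt_zero"     # def
echo "$exact_three"  # def
echo "$exact_long"   # adef
echo "$range_long"   # aadef
echo "$at_least"     # def
```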
<h4 id="grouping-a-regexp">Grouping a regexp</h4>
<p>So far the modifiers we have seen have been applied to single characters. However, with grouping we can apply them to a more complex expression. The group is enclosed in <strong>\(</strong> and <strong>\)</strong>. For example:</p>
<pre><code>s/\(abc\)*def/ghi/</code></pre>
<p>Here the complete regex matches 'def', 'abcdef', 'abcabcdef' and so forth with multiple instances of 'abc'.</p>
<p>Each group is numbered by <code>sed</code> simply by counting <strong>\(</strong> occurrences. This allows references to be made to these sub-expressions as we will see shortly.</p>
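<p>To see what difference the grouping makes, here is a short sketch (assuming GNU <code>sed</code>, with made-up inputs) which runs the grouped regexp and an ungrouped variant against the same text, then applies an interval to a group:</p>

```shell
# With the group, '*' repeats the whole of 'abc'
grouped=$(echo "abcabcdef" | sed -e 's/\(abc\)*def/ghi/')
# Without it, '*' applies only to the 'c', so the first 'abc' is not matched
ungrouped=$(echo "abcabcdef" | sed -e 's/abc*def/ghi/')
# Any of the repetition operators can follow a group
twice=$(echo "abababx" | sed -e 's/\(ab\)\{2\}/[&]/')

echo "$grouped"    # ghi
echo "$ungrouped"  # abcghi
echo "$twice"      # [abab]abx
```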
<h4 id="alternative-regexps">Alternative regexps</h4>
<p>It is possible to build a regexp with alternative sub-expressions separated by the characters <strong>\|</strong>. For example, say the intention is to match either 'Hello World' or 'Goodbye World' without an exclamation mark at the end, and to add one. The following might be tried as a first attempt:</p>
<pre><code>$ echo &quot;Hello World&quot; | sed -e &#39;s/Hello\|Goodbye World/&amp;!/&#39;
Hello! World
$ echo &quot;Goodbye World&quot; | sed -e &#39;s/Hello\|Goodbye World/&amp;!/&#39;
Goodbye World!</code></pre>
<p>Those results might be unexpected. What has happened is that <code>sed</code> has just matched the 'Hello' in the first case, and so the replacement '&amp;!' has resulted in an exclamation mark being placed after this word. However, it has matched 'Goodbye World' in the second case so the exclamation mark has been placed as we expected.</p>
<p>To match either 'Hello' or 'Goodbye' we need grouping:</p>
<pre><code>$ echo &quot;Hello World&quot; | sed -e &#39;s/\(Hello\|Goodbye\) World/&amp;!/&#39;
Hello World!
$ echo &quot;Goodbye World&quot; | sed -e &#39;s/\(Hello\|Goodbye\) World/&amp;!/&#39;
Goodbye World!</code></pre>
<p>The number of alternatives may be more than two:</p>
<pre><code>$ echo &quot;Farewell World&quot; | sed -e &#39;s/\(Hello\|Goodbye\|Farewell\) World/&amp;!/&#39;
Farewell World!</code></pre>
<p>This meta-character is a GNU extension.</p>
<h4 id="greediness">Greediness</h4>
<p>The way that <code>sed</code> matches a regexp is sometimes a little unexpected. This is because of what is referred to as &quot;<em>greediness</em>&quot;, where more is matched than might be predicted.</p>
<p>The following is taken from the GNU manual:</p>
<p><em>Note that the regular expression matcher is greedy, i.e., matches are attempted from left to right and, if two or more matches are possible starting at the same character, it selects the longest.</em></p>
<p>For example, say we are trying to process the example file for this episode <a href="hpr1986_sed_demo2.txt" title="hpr1986_sed_demo2.txt"><code>sed_demo2.txt</code></a>, looking for a word starting with capital 'H' at the start of a line. It would be tempting to use a regexp such as '^H.\+ ' meaning a line starting with capital 'H' up to a space. In the example below we enclose what was matched by square brackets, printing out only the lines that matched (see the sections entitled &quot;<a href="#command-line-options">Command line options</a>&quot; for the '-n' option and &quot;<a href="#the-p-flag">The <strong>p</strong> flag</a>&quot; below):</p>
<pre><code>$ sed -ne &#39;s/^H.\+ /[&amp;]/p&#39; sed_demo2.txt
[Hacker Public Radio (HPR) is an Internet Radio show (podcast) that ]releases
[HPR&quot; for more ]information.
[Hacker Public Radio is dedicated to sharing knowledge. We do ]not</code></pre>
<p>The regexp matcher has matched everything from the leading 'H' to the last space on the line.</p>
<p>One technique for limiting this behaviour is shown below:</p>
<pre><code>$ sed -ne &#39;s/^H[^ ]\+ /[&amp;]/p&#39; sed_demo2.txt
[Hacker ]Public Radio (HPR) is an Internet Radio show (podcast) that releases
[HPR&quot; ]for more information.
[Hacker ]Public Radio is dedicated to sharing knowledge. We do not</code></pre>
<p>Here, rather than following the 'H' with a dot (any character) we use a list in square brackets. The list is negated by using a circumflex, so it means &quot;<em>not space</em>&quot;. So, here we are looking for a capital 'H' at the start of a line followed by one or more &quot;<em>not spaces</em>&quot; then a space. Notice how this has constrained the greediness.</p>
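<p>The same technique works with any delimiter. The following sketch, which assumes GNU <code>sed</code> and uses an invented comma-separated string, marks what each regexp matched with square brackets:</p>

```shell
csv="alpha,beta,gamma"

# The dot matches commas too, so the match runs on to the last comma
greedy=$(echo "$csv" | sed -e 's/^.*,/[&]/')
# A negated list cannot cross a comma, so the match stops at the first one
limited=$(echo "$csv" | sed -e 's/^[^,]*,/[&]/')

echo "$greedy"   # [alpha,beta,]gamma
echo "$limited"  # [alpha,]beta,gamma
```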
<h3 id="replacement">Replacement</h3>
<p>Last time we saw the use of <strong>&amp;</strong> meaning the whole of the text which matched the <strong>REGEXP</strong> part of the command.</p>
<h4 id="back-references">Back references</h4>
<p>As we saw earlier, there is also a way of referring to a matching group. We use <strong>\<em>n</em></strong> <!-- \\* -->where <em>n</em> is a number between 1 and 9 which refers to the <em>n</em>th group between <strong>\(</strong> and <strong>\)</strong> delimiters (as discussed above under &quot;<a href="#grouping-a-regexp">Grouping a regexp</a>&quot;).</p>
<p>For example:</p>
<pre><code>$ echo &quot;Hacker Public Radio&quot; | sed -e &#39;s/\(.\+\) \(.\+\) \(.\+\)/\3 \2 \1/&#39;
Radio Public Hacker</code></pre>
<p>Here we look for three groups of characters separated by a single space and we group each one. We then replace them in the order 3, 2, 1, resulting in the words being printed in reverse order.</p>
<p>Interestingly, these <em>back references</em> can be used inside the regexp itself:</p>
<pre><code>$ echo &quot;Run Lola Run&quot; | sed -e &#39;s/\(.\+\) \(.\+\) \1/\2 \1 \1/&#39;
Lola Run Run</code></pre>
<p>Here the first group matches the first &quot;Run&quot;, and we use it as the last element of the regexp. We could have made it a group:</p>
<pre><code>$ echo &quot;Run Lola Run&quot; | sed -e &#39;s/\(.\+\) \(.\+\) \(\1\)/\2 \3 \1/&#39;
Lola Run Run</code></pre>
<p>There is no point in doing this since the result is the same yet it makes <code>sed</code> work harder.</p>
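<p>Two further uses of back references are sketched here, assuming GNU <code>sed</code>. The inputs and variable names are invented. The first swaps the two sides of a 'key=value' pair. The second uses a back reference inside the regexp to find a word which has been typed twice, using the word boundary escapes <code>\&lt;</code>, <code>\&gt;</code> and <code>\w</code> which are GNU extensions.</p>

```shell
# Swap the text on either side of the '=' sign
swapped=$(echo "colour=blue" | sed -e 's/\(.*\)=\(.*\)/\2=\1/')

# Replace a doubled word by a single copy of it
deduped=$(echo "Paris in the the spring" | sed -e 's/\<\(\w\+\) \1\>/\1/')

echo "$swapped"  # blue=colour
echo "$deduped"  # Paris in the spring
```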
<h4 id="case-manipulation">Case manipulation</h4>
<p>GNU <code>sed</code> provides a means of changing the case of the replacement text using the sequences <strong>\L</strong>, <strong>\l</strong>, <strong>\U</strong>, <strong>\u</strong> and <strong>\E</strong>.</p>
<dl>
<dt><strong>\L</strong></dt>
<dd>Turn the replacement to lowercase until a \U or \E is found,
</dd>
<dt><strong>\l</strong></dt>
<dd>Turn the next character to lowercase,
</dd>
<dt><strong>\U</strong></dt>
<dd>Turn the replacement to uppercase until a \L or \E is found,
</dd>
<dt><strong>\u</strong></dt>
<dd>Turn the next character to uppercase,
</dd>
<dt><strong>\E</strong></dt>
<dd>Stop case conversion started by \L or \U.
</dd>
</dl>
<p>When used in conjunction with grouping the following results may be obtained (from Ken's script for the Community News perhaps):</p>
<pre><code>$ echo &quot;Hacker Public Radio&quot; |\
    sed -e &#39;s/\(.\+\) \(.\+\) \(.\+\)/\U\1 \L\1 \U\2 \L\2 \U\3 \L\3/&#39;
HACKER hacker PUBLIC public RADIO radio</code></pre>
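<p>The single-character forms and <strong>\E</strong> can be tried in the same way. This is a sketch assuming GNU <code>sed</code>, with made-up inputs. It also uses the <code>\b</code> and <code>\w</code> escapes, which are GNU extensions.</p>

```shell
# \u affects only the next character: capitalise the start of each word
title=$(echo "hacker public radio" | sed -e 's/\b\w/\u&/g')

# Keep the first letter of each word as it is and lowercase the rest
tidy=$(echo "HACKER PUBLIC RADIO" | sed -e 's/\(\w\)\(\w*\)/\1\L\2/g')

# \E stops the conversion started by \U, so the second word is unchanged
partial=$(echo "hello world" | sed -e 's/\(hello\) \(world\)/\U\1\E \2/')

echo "$title"    # Hacker Public Radio
echo "$tidy"     # Hacker Public Radio
echo "$partial"  # HELLO world
```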
<h3 id="flags">Flags</h3>
<p>We saw the '<em>g</em>' flag in the last episode, which makes the substitution apply to <strong>all</strong> matches on each line rather than just the first. We will look at some other flags in this episode, but some of the more advanced features will be omitted here.</p>
<h4 id="the-number-flag">The <em>number</em> flag</h4>
<p>There is also a <em>number</em> flag which applies the substitution only to the <em>number</em>th match. For example:</p>
<pre><code>$ echo &quot;eeny, meeny, miny&quot; | sed -e &#39;s/ny/\U&amp;/2&#39;
eeny, meeNY, miny</code></pre>
<p>Here the match is for 'ny', and the replacement is the matching text forced to upper case (see &quot;<a href="#case-manipulation">Case manipulation</a>&quot; above). However, we restrict the substitution to just the second match, as you can see from the result.</p>
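<p>A simpler sketch of the same flag is shown below, assuming GNU <code>sed</code> and an invented input. It also shows the <em>number</em> flag combined with '<em>g</em>', which the GNU manual describes as replacing the <em>number</em>th match and all of those after it.</p>

```shell
text="one two one two one"

# Replace only the second 'one'
second=$(echo "$text" | sed -e 's/one/1/2')
# Replace the second 'one' and every one after it (GNU sed)
from_second=$(echo "$text" | sed -e 's/one/1/2g')

echo "$second"       # one two 1 two one
echo "$from_second"  # one two 1 two 1
```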
<h4 id="the-p-flag">The <strong>p</strong> flag</h4>
<p>This causes the result of the substitution to be printed. More precisely, it causes the <em>pattern space</em> to be printed if the substitution was made.</p>
<p>Normally this happens anyway, but when the <strong>-n</strong> command line option has been selected (see &quot;<a href="#command-line-options">Command line options</a>&quot;) nothing is printed unless the script explicitly requests it.</p>
<pre><code>$ sed -n -e &#39;s/Hacker /Hobby /p&#39; sed_demo2.txt
Hobby Public Radio (HPR) is an Internet Radio show (podcast) that releases
Hobby Public Radio is dedicated to sharing knowledge. We do not</code></pre>
<p>Only the lines where 'Hacker ' was replaced by 'Hobby ' are reported.</p>
<h4 id="the-i-and-i-flags">The <strong>I</strong> and <strong>i</strong> flags</h4>
<p>These flags are a GNU <code>sed</code> extension. They cause the regexp to be case-insensitive. Both forms of this flag have the same meaning.</p>
<pre><code>$ sed -n -e &#39;s/hacker /Hobby /ip&#39; sed_demo2.txt
Hobby Public Radio (HPR) is an Internet Radio show (podcast) that releases
Hobby Public Radio is dedicated to sharing knowledge. We do not</code></pre>
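<p>If the example file is not to hand, the combination of <strong>-n</strong>, '<em>p</em>' and '<em>I</em>' can be tried on a few lines of invented text. This is a sketch assuming GNU <code>sed</code>:</p>

```shell
# Three made-up lines; only the first two contain 'hacker' in some case
matched=$(printf 'Hacker news\nhacker culture\nnothing here\n' |
    sed -n -e 's/hacker/Hobby/Ip')

# Only the two lines where a substitution was made are printed
echo "$matched"
```

<p>The third line is not printed because no substitution took place on it.</p>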
<h2 id="gnu-extensions-for-escapes-in-regular-expressions">GNU Extensions for Escapes in Regular Expressions</h2>
<p>GNU <code>sed</code> contains a way of referencing (or producing) special characters. These are documented in the <a href="https://www.gnu.org/software/sed/manual/sed.html#Escapes" title="GNU Extensions for Escapes in Regular Expressions">GNU Manual</a> (under the same title as this section). We will not look at all of these in this series, but will touch on some of the more generally useful ones.</p>
<dl>
<dt><strong>\n</strong></dt>
<dd>Produces or matches a newline (ASCII 10).
</dd>
<dt><strong>\t</strong></dt>
<dd>Produces or matches a horizontal tab (ASCII 9).
</dd>
</dl>
<p>There are also escapes which match a particular character class which are valid only in regular expressions. These are mentioned here because they can be very useful, as we will see in the examples:</p>
<dl>
<dt><strong>\w</strong></dt>
<dd>Matches any <em>word</em> character. A <em>word</em> character is any letter or digit or the underscore character.
</dd>
<dt><strong>\W</strong></dt>
<dd>Matches any <em>non-word</em> character.
</dd>
<dt><strong>\b</strong></dt>
<dd>Matches a word boundary; that is it matches if the character to the left is a <em>word</em> character and the character to the right is a <em>non-word</em> character, or vice-versa.
</dd>
<dt><strong>\&lt;</strong> <strong>\&gt;</strong></dt>
<dd>(These are not very clear in the <code>sed</code> documentation but are available). These are alternative ways of denoting word boundaries, with <strong>\&lt;</strong> being used for the left boundary and <strong>\&gt;</strong> for the right.
</dd>
<dt><strong>\B</strong></dt>
<dd>Matches everywhere but on a word boundary; that is it matches if the character to the left and the character to the right are either both <em>word</em> characters or both <em>non-word</em> characters.
</dd>
</dl>
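<p>A few of these escapes are sketched below on short invented strings, assuming GNU <code>sed</code>. Each result is captured in a variable whose name is made up for the illustration.</p>

```shell
# \t matches a tab in the regexp
tab_to_space=$(printf 'a\tb\n' | sed -e 's/\t/ /')

# \b on both sides: only the stand-alone word 'cat' is changed
whole_word=$(echo "cat concat cats" | sed -e 's/\bcat\b/dog/g')

# \< matches only at the start of a word, so 'cats' changes but 'concat' does not
word_start=$(echo "cat concat cats" | sed -e 's/\<cat/dog/g')

# \B requires that there is no boundary, so only the 'cat' inside 'concat' matches
inside_word=$(echo "cat concat" | sed -e 's/\Bcat/DOG/')

# \W matches the hyphen but not the underscore, which is a word character
non_word=$(echo "foo_bar-baz" | sed -e 's/\W/ /g')

echo "$tab_to_space"  # a b
echo "$whole_word"    # dog concat cats
echo "$word_start"    # dog concat dogs
echo "$inside_word"   # cat conDOG
echo "$non_word"      # foo_bar baz
```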
<h2 id="examples">Examples</h2>
<h3 id="example-1">Example 1</h3>
<p>This example shows the use of the <code>-i</code> option:</p>
<pre><code>$ for f in {A..C}; do echo $RANDOM &gt; $f; done
$ sed -i&#39;saved_*.sav&#39; -e &#39;s/4/@/g&#39; {A..C}
$ cat {A..C}
1@855
2@593
@217
$ cat saved_{A..C}.sav
14855
24593
4217</code></pre>
<p>The first line generates three files called <code>A</code>, <code>B</code> and <code>C</code> using <em>brace expansion</em> in a <code>for</code> loop. Each file contains a random number. The second line runs <code>sed</code> against these files replacing any instance of the digit 4 by an '@' symbol. The third line shows the contents of these three files. Backups of their original contents are held in files called <code>saved_A.sav</code>, <code>saved_B.sav</code> and <code>saved_C.sav</code>. Their contents are shown by the final <code>cat</code> command.</p>
<h3 id="example-2">Example 2</h3>
<p>The second example file <a href="hpr1986_sed_demo3.txt" title="hpr1986_sed_demo3.txt"><code>sed_demo3.txt</code></a> contains statistics pulled from the HPR website. Imagine that we are writing a Bash script to parse this, and we want the number of days to the next free slot in a variable. The line in question looks like this:</p>
<pre><code>Days to next free slot: 8</code></pre>
<p>There are two lines beginning with the word 'Days' so we have to be careful:</p>
<pre><code>$ DTNFS=&quot;$(sed -ne &#39;s/^Days to[^:]\+:[\t ]\+\([0-9]\+\)/\1/p&#39; sed_demo3.txt)&quot;
$ echo &quot;DTNFS=$DTNFS&quot;
DTNFS=8</code></pre>
<p>The regexp starts with <code>'^Days to'</code> which makes it match the target line. After this come some other words and a colon. We'll represent this with <code>'[^:]\+:'</code> meaning one or more &quot;<em>not colons</em>&quot; followed by a colon. Then there are what look like spaces or could be a tab character (Hint: it's actually a tab). For safety's sake we'll represent this as <code>'[\t ]\+'</code> meaning one or more of tab or space. Then we have a regexp group consisting of <code>'[0-9]\+'</code> meaning one or more digits.</p>
<p>If this matches then we'll have a back reference to the group which we can return -- 8 in this case. The overall <code>sed</code> command uses the <code>'-n'</code> option suppressing printing and the '<em>s</em>' command uses the '<em>p</em>' flag to print just the matched line.</p>
<p>The output from the <code>sed</code> command is returned in a command substitution and is used to set the variable <code>DTNFS</code>. This is echoed in this fragment to show what was returned.</p>
<p>It is possible that the <code>sed</code> command could return nothing, in which case the variable would be set to an empty string. An actual Bash script doing this should check for this eventuality and take appropriate action.</p>
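<p>One way such a check might look is sketched here. It assumes GNU <code>sed</code>. The function name <code>days_to_next_free_slot</code> is hypothetical, and the input lines are invented stand-ins for the real statistics file.</p>

```shell
# Hypothetical helper: reads statistics on standard input and prints the
# number of days to the next free slot, or nothing if the line is absent
days_to_next_free_slot() {
    sed -ne 's/^Days to[^:]\+:[\t ]\+\([0-9]\+\)/\1/p'
}

# Made-up input: one case where the line is present and one where it is not
found=$(printf 'Days to next free slot:\t8\n' | days_to_next_free_slot)
missing=$(printf 'No matching line here\n' | days_to_next_free_slot)

for value in "$found" "$missing"; do
    if [ -z "$value" ]; then
        # Nothing was returned, so report it rather than carry on
        echo "Could not find the number of days" >&2
    else
        echo "DTNFS=$value"
    fi
done
```

<p>The test <code>[ -z "$value" ]</code> is true when the string is empty, which is what the command substitution returns when <code>sed</code> prints nothing.</p>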
<h3 id="example-3">Example 3</h3>
<p>In this example we use the '\n' escape we examined earlier (backslash 'n' meaning <em>newline</em>):</p>
<pre><code>$ sed -e &#39;s/\(Hacker\) \(Public\) \(Radio\) /\1\n\2\n\3\n/&#39; sed_demo2.txt | head -4
Hacker
Public
Radio
(HPR) is an Internet Radio show (podcast) that releases</code></pre>
<p>We simply looked for the words &quot;Hacker Public Radio&quot;, grouping each of them so that they could be back referenced, and output them each followed by a newline. We used the <code>head</code> command to view just the first 4 lines produced by this <code>sed</code> command.</p>
<p>You might have expected that the following would join all the lines of the file together, but that doesn't happen:</p>
<pre><code>$ sed -e &#39;s/\n//&#39; sed_demo2.txt</code></pre>
<p>That is because <code>sed</code> places one line at a time into the <em>pattern space</em>, removing the trailing newline. Then it applies the script to it and (unless the '-n' option was used) prints it out with a trailing newline.</p>
<p>We will look at ways in which actions like line concatenation can be achieved in a later episode.</p>
<h3 id="example-4">Example 4</h3>
<p>We saw the '<code>-r</code>' (<code>--regexp-extended</code>) option earlier in this episode. If we were to use this in conjunction with <a href="#example-3">Example 3</a> we would write the following:</p>
<pre><code>$ sed -r -e &#39;s/(Hacker) (Public) (Radio) /\1\n\2\n\3\n/&#39; sed_demo2.txt | head -4
Hacker
Public
Radio
(HPR) is an Internet Radio show (podcast) that releases</code></pre>
<p>This is a useful feature, but it needs to be used with caution because it is specific to GNU <code>sed</code> and not portable.</p>
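<p>The same applies to the other meta-characters covered in this episode. The sketch below, assuming GNU <code>sed</code> and using invented inputs, runs a basic and an extended form of the same regexp. It also shows that in extended mode a backslash makes a parenthesis literal.</p>

```shell
# Basic form: grouping and alternation need backslashes
basic=$(echo "Goodbye World" | sed -e 's/\(Hello\|Goodbye\) World/&!/')
# Extended form: the same regexp without them
extended=$(echo "Goodbye World" | sed -r -e 's/(Hello|Goodbye) World/&!/')

# Intervals lose their backslashes too
interval=$(echo "aaabc" | sed -r -e 's/a{3}bc/def/')

# In extended mode \( and \) match literal parentheses
literal=$(echo "f(x)" | sed -r -e 's/\(x\)/(y)/')

echo "$basic"     # Goodbye World!
echo "$extended"  # Goodbye World!
echo "$interval"  # def
echo "$literal"   # f(y)
```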
<h3 id="example-5">Example 5</h3>
<p>One task often needed when processing text is to remove leading and trailing spaces. With <code>sed</code> you might expect the following would work:</p>
<pre><code>$ echo &quot;    Hello World!      &quot; | sed -e &#39;s/^ *\| *$//&#39;
Hello World!</code></pre>
<p>At first glance it seems to, until you test it by enclosing the result of the trimming in visible characters:</p>
<pre><code>$ echo &quot;    Hello World!      &quot; | sed -e &#39;s/^ *\| *$//;s/^/&lt;/;s/$/&gt;/&#39;
&lt;Hello World!      &gt;</code></pre>
<p>In this case <code>sed</code> has stopped after the first match. This is an example where the '<em>g</em>' flag is needed to make <code>sed</code> repeat the match and substitution:</p>
<pre><code>$ echo &quot;    Hello World!      &quot; | sed -e &#39;s/^ *\| *$//g;s/^/&lt;/;s/$/&gt;/&#39;
&lt;Hello World!&gt;</code></pre>
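<p>An alternative which avoids both the alternation and the '<em>g</em>' flag is to use two separate '<em>s</em>' commands, one for each end of the line. This is only a sketch, assuming GNU <code>sed</code> for the <code>\t</code> escape in the second command. The variable names are made up.</p>

```shell
# One substitution for the leading spaces, another for the trailing ones
trimmed=$(echo "    Hello World!      " | sed -e 's/^ *//' -e 's/ *$//')

# The same idea extended to cope with tabs as well as spaces
trimmed_tabs=$(printf '\t  Hello World! \t\n' | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//')

# Enclose the results in visible characters to prove the spaces have gone
echo "<$trimmed>"
echo "<$trimmed_tabs>"
```

<p>Both lines of output should show the text with nothing between it and the enclosing characters.</p>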
<h3 id="example-6">Example 6</h3>
<p>In the audio I said that I would be demonstrating the use of word boundaries in an example. I had forgotten to add it at the time of recording, so this one is not described in the podcast.</p>
<p>Really, this is a piece of extreme silliness, but it does demonstrate word boundaries. It is being run on the example file from the last episode.</p>
<pre><code>$ sed -e &#39;s/\&lt;[A-Z]\w*\&gt;/Chicken/g;s/\b[a-z]\w*\b/chicken/g&#39; sed_demo1.txt</code></pre>
<p>The example consists of two '<em>s</em>' commands separated by a semicolon. The first matches any word that begins with a capital letter, using the <em>\&lt;</em> and <em>\&gt;</em> word boundaries and the <em>\w</em> expression. It replaces each occurrence it finds with an alternative capitalised word, using the '<em>g</em>' flag to ensure this happens.</p>
<p>The second '<em>s</em>' command does the same for lower-case words but uses the <em>\b</em> word boundary instead.</p>
<p>I will leave you to try it yourself.</p>
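<p>If you do not have the example file available, the same command can be tried on a single invented line, as in this sketch. It assumes GNU <code>sed</code>. The <code>LC_ALL=C</code> setting is an addition made here so that the ranges <code>[A-Z]</code> and <code>[a-z]</code> behave the same way regardless of locale.</p>

```shell
# Words starting with a capital become 'Chicken', the rest become 'chicken'
chickens=$(echo "Hello world from Sed" |
    LC_ALL=C sed -e 's/\<[A-Z]\w*\>/Chicken/g;s/\b[a-z]\w*\b/chicken/g')

echo "$chickens"  # Chicken chicken chicken Chicken
```

<p>The second command does not alter the 'Chicken' words produced by the first because the 'h' following the capital 'C' is not at a word boundary.</p>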
<h2 id="links">Links</h2>
<ul>
<li><em>Introduction to sed - part 1</em>: <a href="http://hackerpublicradio.org/eps/hpr1976" class="uri">http://hackerpublicradio.org/eps/hpr1976</a></li>
<li>GNU <code>sed</code> manual:
<ul>
<li>HTML Manual: <a href="https://www.gnu.org/software/sed/manual/sed.html" class="uri">https://www.gnu.org/software/sed/manual/sed.html</a></li>
<li>Section on <em>Invocation</em>: <a href="https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed" class="uri">https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed</a></li>
<li>Section on <em>Escapes in Regular expressions</em>: <a href="https://www.gnu.org/software/sed/manual/sed.html#Escapes" class="uri">https://www.gnu.org/software/sed/manual/sed.html#Escapes</a></li>
</ul></li>
<li>&quot;<em>Sed - An Introduction and Tutorial</em>&quot; by Bruce Barnett: <a href="http://www.grymoire.com/Unix/Sed.html" class="uri">http://www.grymoire.com/Unix/Sed.html</a></li>
<li>Wikipedia entry for <code>sed</code>: <a href="https://en.wikipedia.org/wiki/Sed" class="uri">https://en.wikipedia.org/wiki/Sed</a></li>
<li>Example files for processing:
<ul>
<li><a href="hpr1986_sed_demo2.txt" class="uri">hpr1986_sed_demo2.txt</a> (extracted from <a href="http://hackerpublicradio.org/about.php" class="uri">http://hackerpublicradio.org/about.php</a>)</li>
<li><a href="hpr1986_sed_demo3.txt" class="uri">hpr1986_sed_demo3.txt</a> (downloaded from <a href="http://hackerpublicradio.org/stats.php" class="uri">http://hackerpublicradio.org/stats.php</a>)</li>
</ul></li>
</ul>
<!--
vim: syntax=markdown:ts=8:sw=4:ai:et:tw=78:fo=tcqn:fdm=marker
-->
</article>
</main>
</div>
</body>
</html>