Episode: 2610
Title: HPR2610: Gnu Awk - Part 12
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2610/hpr2610.mp3
Transcribed: 2025-10-19 06:32:25

---

This is HPR Episode 2610 titled Gnu Awk Part 12 and is part of the series Learning Awk.
It is hosted by Dave Morris and is about 34 minutes long and carries an explicit flag.
The summary is: advanced use of arrays.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello everybody. Welcome to Hacker Public Radio.
This is Dave Morris and I'm doing part 12 of the Gnu Awk series, or Learning Awk as we're
calling it. I started talking about arrays in episode 10 and I thought I would
continue that in this episode, looking at some of the advanced elements of arrays.
Now the stuff I'm talking about today is, I think, largely specific to the Gnu version of Awk.
That means if you're using an Awk which is not the enhanced Gnu version then some of these features
might not be available, so you need to check to be sure. So I'm talking about arrays,
but I'm also going to finish off with an example of using Awk to solve a problem that I had.
This is not related to arrays; I just thought it would be useful to have some real-world
examples of using Awk rather than the fairly sterile examples that you tend to find in these
episodes otherwise. So I'm going to start by talking about patsplit. I mentioned the split function
in the last episode I did on this, episode 10, but there's a more powerful function for splitting
strings into array elements, and it's called patsplit because it splits according to patterns.
It takes a series of arguments. The first one is the string which is to be chopped up,
and it's chopped up according to the third argument, called fieldpat, which
defines the way in which the string is to be split. The pieces are put into
an array, which is the second argument. There will be separators between each of the fields,
or there may be anyway,
and they are put into a further argument, an array which is called seps in the
example. We'll look at this in a bit more detail. This is very similar to the way that split works,
and you can see the examples I gave you there. But the main difference from split is that
this fieldpat argument, the third one, is a regular expression which defines the field rather
than the separator. So I've got a bunch of examples here and I've gone for splitting up
comma-separated stuff. What I've done here is to write a script which deals with $0,
the input record. I'm just giving the example one record, but this would work with multiple records of
course. I'm using $0, and you could just as well have ignored patsplit and used the standard
field-splitting mechanism, but we haven't really covered how you can do that using a regular expression.
There is a built-in variable called FPAT, which is similar to FS, which would do that,
but it hasn't been covered yet; we'll be adding that into the series a bit later on.
So I've got a bunch of examples which I've called awk12_ex, then a number, then .awk.
This one is ex1, and what we're doing here is using patsplit to split $0 into an array called
a. We're doing it by finding fields which consist of zero to any number of non-commas. So
the regular expression is in slashes, and it's open square bracket, then a circumflex, which means
not, then a comma, close square bracket, asterisk: /[^,]*/. That means zero to any number of characters which
are not commas. So in other words, anything that consists of things which are not commas, followed
by a comma, followed by not commas and a comma, will fit that. Then, having split it, there's a loop
which goes for (i in a), remembering that that's the way that you walk through an array, and it
prints out the value of the array a indexed by i. So if we feed it "An apple a day keeps the
doctor away" with commas instead of spaces, then the output is the same sentence with spaces
in between. I'm printing them out without newlines and then putting a newline on the end so
you see an actual string. I've used a similar sort of approach throughout these examples.
In the example in the notes, not the thing that you can download, which is the awk script,
I've shown the process of setting a bash variable x to the
string "An apple a day keeps the doctor away", then using bash's editing features to replace all
the spaces by commas, and then feeding that to the awk script, which then removes them again.
It's a silly example but I hope you get the idea from that. Now, if you wanted to do a
more complex regular expression, example two shows that. This example takes the
string "A bird in the hand is worth two in the bush", with the words separated by commas, but I turned the word "bird" into
"red bird" enclosed in double quotes. In standard CSV format you can have
elements of the comma-separated list which contain spaces, or indeed commas, enclosed in
double quotes, so I've just emulated that. Then when it's printed out it's printed with each of
these elements separated by spaces, and I put angle brackets around each one just to make them
stand out more clearly, and you can see that the "red bird" string is one entity.
The regular expression consists of two sub-expressions enclosed in parentheses with a vertical bar
in between them, so it's an "or" type expression. The first one is the same as in the previous
example, a series of zero or more not-commas, if you like to put it that way. The second one
looks for a double-quoted string containing one or more things which are not double quotes. This
technique of saying "the thing that encloses a string, followed by any number of characters which
are not the enclosing character" is one you'll often see in regular expressions. So
that works fine with the example, as you will see, and that's ex2. Then in ex3 we've got an
example where the pattern is quite simple, but what we're doing here is saving the separators.
So the patsplit is simply using a series of one or more letters, capital or lowercase;
any sequence which matches that is the field pattern. We're saving the
separators in an array s. The script prints out all of the elements of the array which were captured
by splitting, and of course I've called that array a because I've not got much imagination: no newlines,
just spaces in between the elements, followed by a newline at the end. Then there's a similar loop to
print out the contents of the array s. I say similar, but it's not quite the same, because
this time it's a counted loop. When you run patsplit it returns the number of fields
that it found, and I captured that in a variable called flds, short for fields. I use that in the
loop, setting i equal to one, then adding one to it while it's less than or equal to that
number of fields. That prints out all of the separators and then it puts a newline on the end.
So the result is you get the words in the sentence fed to it, followed by a line containing
all the separators. What I fed to it was the expression "Grinning like a Cheshire cat", where each
word is separated by a number of hyphens. So the first thing you see is "Grinning like a Cheshire cat"
separated by spaces, followed by all of the different runs of hyphens separated by spaces. It just so happens
that each sequence of hyphens is the same length as the word
before it, and I just wrote a little script to do that, which I've included in the notes here, but
I've marked it "skip unless you're really interested", so I won't read this one out. You can dig into it
if you really want to. It is available for download if you want to grab it and mess around with it.
Now, the printing of the array s doesn't begin at 0, it begins at 1, but there is a zeroth element,
because patsplit captures any separator prior to the first field. Well, there isn't
one in this case so I didn't bother to print it, but it's worth bearing in mind because it can be
of interest. Okay, that's all I'm going to say about patsplit. Let's move on to sorting arrays.
Basically there are two main ways to do this. The first one is to use an extension in Gnu Awk,
which is a built-in array variable called PROCINFO, all in uppercase. The element of the array
has the index sorted_in and, because that's a string, it has to be in double quotes.
So PROCINFO, square bracket, quote, sorted_in, quote, close square bracket, that is PROCINFO["sorted_in"], is the magic
variable which can be used to control how arrays get sorted. In the original version of Awk, the
non-Gnu version, arrays came back in an arbitrary order when you looped through them,
so sorting them could be a bit of a pain. I know this because that was one of the things I had
to do in my early computing career; when I started to use Awk there was no sorting built in.
The thing you put into the PROCINFO element is a predefined string which begins with an
at sign and consists of various keywords. The default one is @unsorted, which means that
the array elements come back, as in standard Awk, in an arbitrary order. Then there's a bunch of others, and
I won't read them all out because there's quite a number; there's a little table I put together of
them. Take, for example, one that I quite like to use and one I've used in the example, which is
@val_str_asc. That stands for val, the values of the array as opposed to
the indices; str, treat them as strings; asc, in ascending order. The notes here say "order by element
values in ascending order; scalar values are compared as strings". So whatever values are
found in the array elements, be they numbers or strings, will be treated as strings and sorted
accordingly. This can be quite useful. I certainly would have been more than delighted to
have had this when I had various tasks to do using Awk back in my career. Setting this value
determines the sort order, but it has to be done before the loop begins scanning; you can't change it during
the loop, while the loop is scanning. What's more important, perhaps, is that whenever you set
this value, PROCINFO["sorted_in"], it's effective throughout the entire script. There's no sort of
scoping or localization. So if you have an Awk script that's printing
arrays in several places, they're all going to be sorted in this way. You can change the value
between instances of writing them out, of course, and you can also switch it off by setting
it to @unsorted, but it has a wider effect than might be obvious. There's a bit more to
this thing than I've mentioned here, and I've just alluded to it, because arrays can be more complicated
than we've seen so far. Also, this PROCINFO["sorted_in"] can contain the name of a function
which will control the sorting of the array for you; it's a function that you have to define yourself.
We haven't looked at user-defined functions yet, and I'm not sure whether we will go into
this when we do get to that point. I've pointed to the Gnu Awk manual, section 8.1.6, which
covers this in a lot of detail, so if you really need to use this then that's the place to go.
So there's an example which is called ex4, another downloadable one, and it consists of a
BEGIN rule, and in the BEGIN rule PROCINFO["sorted_in"] is set to @val_str_asc,
the one I mentioned before. Then we just use split to split $0 into an array.
It's doing the split by whitespace, which is the same sort of split you would get by default
anyway, but if you split stuff in the usual way with Awk you can't easily sort it
because it doesn't go into an array. Then the script prints out the elements of the array, and they will
come out in sorted order. It uses for (i in a) as we've used before, and it prints out
the value of i and then the value of the array a indexed by i. I've fed it the string
"An Englishman's home is his castle". Because it's sorted on the value, it comes out as "An
Englishman's castle his home is", and you'll see the indexes are not in sorted order but the values
are in sorted order, alphabetically sorted with the capitalized words before the lowercase ones.
I think that's quite a potentially useful thing. I have certainly needed this
sorting capability in the past to capture frequency information from bits of data.
Frequencies were often quite important bits of knowledge in the environment I
worked in, and having a sorted list of frequencies was often a useful thing to have. So sorting
in this way, or alphabetically sorting the names that you were doing frequency
counts on, or something like that, was often quite a desirable thing for various reasons. So let's
now look at the functions which are available for array sorting. b-yeezi mentioned them when he
did his review of string functions in episode 11. The functions we're going to look at are called
asort and asorti. Now the two functions have pretty much the same arguments. I have listed them
separately and described each of the arguments separately in the notes. The arguments are the source,
which is the array that you're going to be sorting; the second one is called
dest, which is optional, and is the place you're going to put the results of the
sort; and the third argument, how, which again is optional, is a way in which you can define the type of
sort. The how argument, not too surprisingly, can be any of the strings that we've already seen,
these @unsorted and @val_str_desc and asc and so forth, that we saw in the context of the
PROCINFO stuff. Anyway, let's look at the examples; I'll whizz through these fairly quickly. There are three
of them. The first one is ex5, and I've made them all fairly trivial. I've defined an array
and called it a, and the array is indexed by the numbers one, two and three, and in it I've just put
names: Jones was the first one, X (I think I meant Mr. X when I wrote that), and Smith. Then I
used asort on that array and printed out the results. So using asort on that array a, which
has got the indexes one, two and three, results in nothing very exciting.
I had to do a double take on this one, example five, because the array is being loaded up with indexes
one, two and three, with the strings Jones, X and Smith. Then it's being sorted, and the sort will cause
the values to be sorted, so when it's printed out you get Jones, Smith and X in alphabetical order,
but the indexes have been changed to be one, two and three against Jones, Smith and X. So in
other words, potentially the indexes are completely destroyed and are replaced by the numbers one,
two and three, or whatever's appropriate for the number of elements. It's not very
obvious in the example, I apologize for that, but it's an odd thing to do in some respects because
you're taking an array which has got indexes, which one assumes are important, and it's reassigning
the indexes. So one stays as Jones; two, instead of being X, becomes Smith; and three, instead of being
Smith, becomes X. So it's a slightly odd thing to do, I guess you'd say. I think the
PROCINFO method is better in many ways. In example six I have done the same thing, Jones, X and Smith
in an array a, but instead of using numeric indexes I've given them the characters a, b and c.
What I've done this time is used asort on a, but I've said the destination is to be b. In doing
that, asort first of all copies a into b and then sorts b and does its stuff with b. So
there's a loop which loops through the array b and it prints out 1 Jones, 2 Smith, 3 X,
the same as before, but then the second loop prints out the array a, which goes a Jones, b X and
c Smith. So in that case the indexes of a have not been destroyed, whereas they were messed up in the
first one. Example ex7 uses the other function, asorti, and it creates an array a
where the indexes are the strings third, second and first: third is Jones, second is X, first
is Smith. Then asorti(a) is called, and when that's printed out you see that what you've got is an ordering
of the indexes, but the actual values have been thrown away. You might wonder what on earth that's for;
I think I might be able to explain that in a moment. Example eight, number four in this group,
uses asorti but with the dest argument so that you don't destroy the original. It's pretty
much the same, except that it prints out the result by using one array to index the other.
So it's sorting, with asorti, the array a into b, and then it loops through b,
so for (i in b), and then for each element it prints out the array b indexed by i, and then it
prints out the array a indexed by b[i]. So the result is first: Smith, second: X,
third: Jones. So effectively it sorted them into the correct order by index without messing
up any of the data. Or rather, the data has been messed up in the array b, but the results in b
have been useful in indexing a. I hope that makes sense. I have to say that when I was having to
do sorting of this sort of stuff myself, using just basic sorting algorithms in Awk, that
was the technique that I used, but it's a little odd until you get the
idea of it. That's why I think Awk is changing these arrays: because it's assuming
you're going to use them as ways of indexing the original data. The next example, the fifth one
in this particular group, ex9, uses the same sort of idea: the three elements in an array indexed
by a, b and c. It's using asort from a into b, but it's using the how argument, and it's
using @val_str_desc, so descending by value, treating them as strings.
Then it's using array b in the loop, and it's printing out the index of b and then
the contents of b. Element a of a was Jones, element b was X, element c was Smith, and when it
comes to print these out you see X, Smith and Jones listed in that order, descending order,
with the indexes one, two and three. This is useful, but I would offer that the
use is moderately limited. Okay, I've got a section here entitled "Yet more about arrays", but
it's really just to say I'm not going to do any more about arrays just now. There is more to be said.
There is a sort of multi-dimensional array capability in Awk. Original Awk had this
and it's still available in Gnu Awk, and there's a considerable enhancement in that you can have
arrays as array elements too in Gnu Awk, but I'm not sure that we're going to be covering
these topics in this series. There's loads of information about this in the Gnu Awk manual if
you want to dig deeper. However, if we receive any requests to cover this
area in more depth then we'll reconsider doing something about it. I must admit I have never
used multi-dimensional arrays nor arrays of arrays, which can be arbitrarily deep.
There are some quite reasonable facilities for manipulating them and walking through them,
but it's not a thing I've ever used in Awk. If I was to do that I would use a different
scripting language, I have to say. Time is marching on, so let me go quickly through my real-world Awk
example. This is about things I do, and one of the things I do is to process show notes for
HPR, which are sent in with episodes. Many people send in their show notes as plain text, which is fine,
but we need HTML for loading into the HPR database. So what I do with them is this: I've got a series of
scripts with which I check them, pull the notes out of the file that we get from the form,
and I edit them to fix any errors, turn them into Markdown, and then generate HTML using
a tool called Pandoc. As part of that process I look at the HTML that's generated locally.
I've grabbed this and I'm working on my local workstation, and I make a copy of the HTML
in a format which is easy to browse. Pandoc is good at doing this; it turns the Markdown
into standalone HTML which I can view in a browser, and it looks pretty much how it will
look when it's on the HPR site. So that's the point at which I can say "oops, there's a mistake here"
and go and fix it and move on from there. To make the HTML copy I want for viewing locally,
Pandoc has recently changed to the extent that you need to provide further information. The further
information is a couple of lines of metadata which has to be in a format known as YAML. YAML is
a sort of simplistic data format which is quite well defined but simple to produce
and human readable and so forth. There are alternative ways to do it, but I'm using the YAML option.
So the way this should look is there should be two lines of metadata with three hyphens above and
three full stops below. The two lines consist of, first, the word title in lowercase, a colon,
a space, then the title of the show, which has to be enclosed in quotes, or at least I
enclose it in quotes anyway. The second one is author, colon, space, then the name of the host, and
I enclose that in quotes too. That's used to generate headers in the final document. This is
just for my own benefit. So I wrote an Awk script to generate this YAML metadata and I've
embedded that in the bash script that I use to run Pandoc. I've included this bit of Awk in the
notes here and it consists of 14 lines. This is part of another script, as I said. The first line
is awk, space, -f, then a space and a hyphen character, then the name of a variable,
and the output is redirected into an output file, again defined by a variable. The first
variable is called $RAWFILE, the second variable is called $TMP1; that's a temporary file that will
be thrown away afterwards. But the end of the line, and this is where we're digressing a little bit
from Awk into some of the areas of bash, consists of a thing called a here document, or heredoc. A heredoc is the
way in which you tell bash there is some data that's to be stored in a
file or given to a program. In order to do this you need to use two less-than signs followed
by a word. The word has to have no spaces in it, I think; it can contain other
characters. I usually just make it a series of letters. This particular one I've called ENDAWK,
all in capitals. I put it in quotes, and I'll mention this in a moment. Everything from that line
up to a line that only consists of ENDAWK, starting in column one, is data to be chewed up by Awk.
The awk command uses -f, which is telling Awk where the program file is to come from,
the script itself, and the argument to -f is a hyphen. That hyphen means get it from
the standard input channel, so it's telling Awk effectively that what follows is
the program. It's just a convenient way of including an Awk script in another script
immediately after the invocation of awk. You can put the whole thing in quotes, but if the script
itself uses quotes things get really convoluted. This particular case includes both single and
double quotes, so using quotes to enclose it would be a real pain. The quotes around the
heredoc terminator tell bash not to interpolate any dollar-sign variables in the data.
By default it will actually scan this data, and if it finds dollar something it will assume it's
the name of a bash variable and it will interpolate it. If you put the heredoc terminator
in single quotes then it won't do that, and I've got dollars and stuff in this script. The script
itself begins with a BEGIN rule, and the BEGIN rule simply prints out the three hyphens that we
need to start the thing. It ends, on line 13 (I put line numbers on this one for ease of
reference), with an END rule which prints out the three full stops at the end of the metadata.
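
The overall shape of that invocation, with only the BEGIN and END rules filled in, might look like the sketch below. The temporary files and the one-line input are hypothetical stand-ins created just so the sketch runs on its own; the variable names RAWFILE and TMP1 and the terminator ENDAWK follow the names as spoken in the episode, with the capitalization assumed.

```shell
# Hypothetical stand-ins for the files the host calls $RAWFILE and $TMP1
RAWFILE=$(mktemp)
TMP1=$(mktemp)
printf 'Title:\tAn example show\n' > "$RAWFILE"

# "-f -" tells awk to read its program from standard input, and the
# here document supplies that input. Quoting the terminator ('ENDAWK')
# stops the shell expanding anything that looks like $something.
awk -f - "$RAWFILE" > "$TMP1" <<'ENDAWK'
BEGIN { print "---" }
END   { print "..." }
ENDAWK

out=$(cat "$TMP1")
echo "$out"
rm -f "$RAWFILE" "$TMP1"
```

With no main rules in place the input file is read but produces nothing, so the output is just the opening and closing marker lines of the metadata block.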
Then there are two regular expressions in the main script, and these are things which are going
to be matched against the input data. The first one is circumflex, Title, colon, with a capital T,
and what this is meant to do is to match the string Title: which is in the input file,
where that input file is the one that's come from the HPR server and contains the data that's
been fed in by the host submitting the show, which has been turned into this file. One of
the items on the form is the title of the show, so we're looking for the result
of that. The rule itself uses the sub function, which we looked at in the previous show,
which matches the string circumflex, Title, colon, circumflex being the start of line as you'll
remember, and uses backslash s after that, because that means a generic whitespace sequence, or a
single whitespace character I should say, so that's a space or a tab. I think most of these, when they're
returned, consist of one tab, but I'm not quite sure, so I just did this to be safe. The second
argument to sub is simply an empty string. What it's saying is: from the line that
comes in, the one that begins Title:, chop off the bit that says Title: and is followed by
whitespace, removing it entirely, so all that's left is the actual title. Then the next line, line five, is a gsub.
gsub, you'll recall, is a means of doing multiple substitutions on a line, and here it's looking for
single quotes, and if it finds any it will replace them by two single quotes. That's because,
in YAML, if you've enclosed a string in single quotes, and you're wise to do so, then if you want to embed
single quotes within it they have to be doubled. So that's what it's doing, ready for YAML.
Then finally, on line six, it prints the string title: in lowercase, followed by one
space, followed by the final result of these bits of editing in single quotes, followed by a newline.
It's actually printing $0, which is the entire line that's been matched by the
regular expression and then edited. The second rule begins with a regular
expression looking for the host name, and that's doing the same sort of thing except looking for
the name of the particular host in this file that's come back from the form on
the HPR website. It's doing pretty much the same thing; I won't explain it again because
it's pretty much identical. When that's finished, the result should be that the four lines of
metadata are in the file whose name is in the variable TMP1. Then a bit later on in the
bash script there's a long line which calls Pandoc to do the necessary thing. As for its
arguments, I've printed them all out here in the notes, but I don't really think I should explain
them because I'm not sure anybody's interested. Essentially it's giving Pandoc two data files,
called $TMP1 and $EXTRACT, which it's to process, producing its results in a
full HTML file. Along the way it's told Pandoc to include the HPR CSS, which it's grabbed from
the website, so it means that the HTML it produces looks identical to the sort of HTML that
the HPR website generates itself. That took a lot of explanation, but it's really not a very
complex Awk script. I thought it might be of interest to see the sort of thing that Awk gets used for,
at least the way I use it. It also shows an example of using a bash heredoc, which people might
not be that up to speed with. So that's it, that's the end of my show today. All of the
examples I've mentioned are included with the show, and there's an ePub version of the notes.
Okay then, bye bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it
really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club,
and is part of the binary revolution at binrev.com. If you have comments on today's show,
please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons
Attribution-ShareAlike 3.0 license.