Episode: 2610
Title: HPR2610: Gnu Awk - Part 12
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2610/hpr2610.mp3
Transcribed: 2025-10-19 06:32:25

---

This is HPR episode 2610, titled Gnu Awk - Part 12, and is part of the series Learning Awk. It is hosted by Dave Morris, is about 34 minutes long, and carries an explicit flag. The summary is: advanced use of arrays.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello everybody. Welcome to Hacker Public Radio. This is Dave Morris, and I'm doing part 12 of the Gnu Awk series, or Learning Awk as we're calling it. I started talking about arrays in episode 10, and I thought I would continue that in this episode by looking at some of the advanced elements of arrays.

Now, the stuff I'm talking about today is, I think, largely specific to the Gnu version of Awk. That means if you're using an Awk which is not the enhanced Gnu version, then some of these features might not be available, so you need to check to be sure. So I'm talking about arrays, but I'm also going to finish off with an example of using Awk to solve a problem that I had.

This is not related to arrays; I just thought it would be useful to have some real-world examples of using Awk, rather than the fairly sterile examples that you tend to find in these episodes otherwise. So I'm going to start by talking about patsplit. I mentioned the split function in the last episode I did on this, episode 10, but there's a more powerful function for splitting strings into array elements, and it's called patsplit because it splits according to patterns.

It takes a series of arguments. The first one is the string which is to be chopped up, and it's going to be chopped up according to the third argument, which is called fieldpat and defines the way in which the string is to be split. The pieces are put into an array, which is the second argument. There will be separators between each of the fields, or there may be anyway. Yeah, I think there would have to be, wouldn't there?

And they are put into a further argument, which is an array, denoted by seps in the example. We'll look at this in a bit more detail. This is very similar to the way that split works, and you can see the examples I gave you there. But the main difference from split is that this fieldpat argument, the third one, is a regular expression which defines the field rather than the separator.

So I've got a bunch of examples here, and I've gone for splitting up comma-separated stuff. What I've done here is to write a script which deals with $0, the input record. I'm just giving the example one record, but this would work with multiple records of course. I'm using $0, and you could just as well have ignored patsplit and used the standard splitting mechanism, but we haven't really covered how you can do that using a regular expression. There is a built-in variable called FPAT, which is similar to FS, which would do that, but it hasn't been covered yet. We'll be adding that into the series a bit later on.

So I've got a bunch of examples which I've called awk12_ex, then a number, then .awk. This one is ex1, and what we're doing here is using patsplit to split $0 into an array called a. We're doing it by finding fields which consist of zero to any number of non-commas. The regular expression is in slashes, and it's an open square bracket, then a circumflex, which means not, then a comma, a close square bracket and an asterisk: /[^,]*/. That means zero to any number of characters which are not commas. So, in other words, anything that consists of things which are not commas, followed by a comma, followed by not-commas and a comma, will fit that.

Having split it, there's a loop which goes for (i in a), remembering that that's the way you walk through an array, and it prints out the value of the array a indexed by i. So if we feed it "An apple a day keeps the doctor away" with commas instead of spaces, then the output is the same sentence with spaces in between. I'm printing the elements out without newlines and then putting a newline on the end, so you see an actual string. I've used a similar sort of approach throughout these examples.

In the example in the notes, as opposed to the Awk script that you can download, I've shown the process of setting a Bash variable x to the string "An apple a day keeps the doctor away", then using Bash's editing features to replace all the spaces by commas, and then feeding that to the Awk script, which removes them again. It's a silly example, but I hope you get the idea from that.

Now, if you wanted to use a more complex regular expression, example two shows that. This example takes the string "A bird in the hand is worth two in the bush", but I turned the word "bird" into "red bird", enclosed in double quotes, with the words separated by commas. In standard CSV format you can have elements of the comma-separated list which contain spaces, or indeed commas, enclosed in double quotes, so I've just emulated that. When it's printed out, each of these elements is separated by spaces, and I put angle brackets around each one just to make them stand out more clearly. You can see that the "red bird" string is one entity.

The regular expression consists of two sub-expressions enclosed in parentheses with a vertical bar between them, so it's an "or" type expression. The first one is the same as in the previous example: a series of zero or more non-commas, if you like to put it that way. The second one looks for a double-quoted string containing one or more things which are not double quotes. This technique of giving the character that encloses a string, followed by any number of characters which are not the enclosing character, is one you'll often see in regular expressions. That works fine with the example, as you will see, and that's ex2.

Then in ex3 we've got an example where the pattern is quite simple, but what we're doing here is saving the separators. The patsplit is simply using a series of one or more letters, capital or lowercase, so any sequence which matches that is the field pattern. We're saving the separators in an array s. The script prints out all of the elements of the array which were captured by splitting, and of course I've called that array a because I've not got much imagination: no newlines, just spaces between the elements, followed by a newline at the end. Then there's a similar loop to print out the contents of the array s. I say similar, but it's not quite the same, because this time it's a counted loop. When you run patsplit it returns the number of fields that it found, and I captured that in a variable called flds, short for fields. I use that in the loop, setting i equal to one, then adding one to it while it's less than or equal to that number of fields. That prints out all of the separators, and then it puts a newline on the end.

The result is that you get the words in the sentence fed to it, followed by a line containing all the separators. What I fed to it was the expression "Grinning like a Cheshire cat", where each word is separated by a number of hyphens. So the first thing you see is "Grinning like a Cheshire cat" separated by spaces, followed by all of the different runs of hyphens separated by spaces. It just so happens that each run of hyphens is the same length as the word before it. I wrote a little Bash script to do that, which I've included in the notes, but I've marked it "skip unless you're really interested", so I won't read this one out. You can dig into it if you really want to; it is available for download if you want to grab it and mess around with it.

Now, the printing of the array s doesn't begin at 0, it begins at 1, but there is a zeroth element, because patsplit captures any separator prior to the first field. Well, there isn't one in this case, so I didn't bother to print it, but it's worth bearing in mind because it can be of interest. Okay, that's all I'm going to say about patsplit. Let's move on to sorting arrays.

Basically there are two main ways to do this. The first one is to use an extension in Gnu Awk, which is a built-in array variable called PROCINFO, all in uppercase. The element of the array has the index sorted_in, and because that's a string it has to be in double quotes. So PROCINFO, square bracket, quote, sorted_in, quote, close square bracket, that is PROCINFO["sorted_in"], is the magic variable which can be used to control how arrays get sorted. In the original, non-Gnu version of Awk, arrays came back in an arbitrary order when you looped through them, so sorting them could be a bit of a pain. I know this because that was one of the things I had to do in my early computing career; when I started to use Awk there was no sorting built in.

The thing you put into the PROCINFO element is a predefined string which begins with an at sign and consists of various keywords. The default one is @unsorted, which means that the array elements come back, as in standard Awk, in an arbitrary order. Then there's a bunch of others, and I'm not going to read them all out because there's quite a number; I've put together a little table of them. Take, for example, one that I quite like to use, and which I've used in the example: @val_str_asc. That stands for val, the values of the array as opposed to the indices; str, treat them as strings; asc, in ascending order. The notes here say: order by element values in ascending order; scalar values are compared as strings. So whatever values are found in the array elements, be they numbers or strings, will be treated as strings and sorted accordingly. This can be quite useful. I certainly would have been more than delighted to have had this when I had various tasks to do using Awk back in my career.

Setting this value determines the sort order, but it has to be set before the loop that scans the array begins; you can't change it while the loop is scanning. What's more important, perhaps, is that whenever you set PROCINFO["sorted_in"] it's effective throughout the entire script. There's no sort of scoping or localization, so if you have an Awk script that's printing arrays in several places, they're all going to be sorted in this way. You can change the value between the places where you print things out, of course, and you can also switch it off by setting it to @unsorted, but it has a wider effect than might be obvious.

There's a bit more to this than I've mentioned here, and I've just alluded to it, because arrays can be more complicated than we've seen so far. Also, PROCINFO["sorted_in"] can contain the name of a function which will perform sorting on the array for you; it's a function that you have to define. We haven't looked at user-defined functions yet, and I'm not sure whether we will go into this when we do get to that point. I've pointed to the Gnu Awk manual, section 8.1.6, which covers this in a lot of detail, so if you really need to use this then that's the place to go.

So there's an example which is called ex4, another downloadable one, and it consists of a BEGIN rule in which PROCINFO["sorted_in"] is set to @val_str_asc, the one I mentioned before. Then we just use split to split $0 into an array. It's doing the split by space, which is the same sort of split you would get by default anyway, but if you split stuff in the usual way with Awk you can't easily sort it, because it doesn't go into an array. Then the script prints out the elements of the array, and they will come out in sorted order. It uses for (i in a), as we've used before, and prints out the value of i and then the value of the array a indexed by i. I've fed it the string "An Englishman's home is his castle". Because it's sorted on the value, it comes out as "An Englishman's castle his home is", and you'll see that the indexes are not in sorted order but the values are: alphabetically sorted, with the capitalized words before the lowercase ones.

I think that's quite a potentially useful thing. I have certainly used this sorting capability in the past to capture frequency information from bits of data. Frequencies were often quite important bits of knowledge in the environment I worked in, and having a sorted list of frequencies was often a useful thing to have. So sorting in this way, or alphabetically sorting the names that you were doing frequency counts on, or something like that, was often quite desirable for various reasons.

So let's now look at the functions which are available for array sorting. b-yeezi mentioned them in his review of string functions in episode 11. The functions we're going to look at are called asort and asorti. The two functions have pretty much the same arguments; I have listed them separately and described each of the arguments separately in the notes. The arguments are: source, which is the array that you're going to be sorting; the second one, called dest, which is optional and is the place where you're going to put the results of the sort; and the third argument, how, which again is optional and is a way in which you can define the type of sort. The how argument, not too surprisingly, can be any of the strings that we've already seen, such as @unsorted, @val_str_desc, @val_str_asc and so forth, that we saw in the context of the PROCINFO stuff.

Anyway, let's look at the examples; I'll whizz through these fairly quickly. There are three of them. The first one is ex5, and I've made them all fairly trivial. I've defined an array called a, indexed by the numbers one, two and three, and in it I've just put names: Jones was the first one, then X (I was thinking of "Mr X" when I wrote that, I think) and Smith. Then I used asort on that array and printed out the results. So using asort on that array a, which has the indexes one, two and three, results in nothing very exciting.

I had to do a double take on this one, example five, because the array is being loaded up with indexes one, two and three, with the strings Jones, X and Smith. Then it's being sorted, and the sort will cause the values to be sorted, so when it's printed out you get Jones, Smith and X in alphabetical order, but the indexes have been changed to be one, two and three against Jones, Smith and X. In other words, the indexes are potentially completely destroyed and are replaced by the numbers one, two and three, or whatever's appropriate for the number of elements. It's not very obvious in the examples, and I apologize for that, but it's an odd thing to do in some respects, because you're taking an array which has got indexes which one assumes are important, and it's reassigning the indexes. So one stays as Jones; two, instead of being X, becomes Smith; and three, instead of being Smith, becomes X. It's a slightly odd thing to do, I guess you'd say. I think the PROCINFO method is better in many ways.

In example six I have done the same thing, Jones, X and Smith in an array a, but instead of using numeric indexes I've given them the characters a, b and c. What I've done this time is use asort on a, but I've said the destination is to be b. In doing that, asort first of all copies a into b, and then sorts b and does its stuff with b. So there's a loop which loops through the array b, and it prints out 1 Jones, 2 Smith, 3 X, the same as before, but then the second loop prints out the array a, which goes a Jones, b X and c Smith. So in that case the indexes of a have not been destroyed, though they have been replaced in b, the first one printed.

Example seven, ex7, uses the other function, asorti, and it creates an array a where the indexes are the strings third, second and first: third is Jones, second is X, first is Smith. Then it calls asorti on a. When that's printed out, you see that what you've got is an ordering of the indexes, but the actual values have been thrown away. So you might wonder what on earth that's for; I think I might be able to explain that in a moment.

Example eight, number four in this group, uses asorti but with the dest argument, so that you don't destroy the original. It's pretty much the same, except that it prints out the result by using one array to index the other. It's sorting the array a into b with asorti, and then it loops through b, so for (i in b), and for each element it prints out the array b indexed by i, and then it prints out the array a indexed by b[i]. The result is first: Smith, second: X, third: Jones. So effectively it has sorted them into the correct order by index without messing up any of the data. Well, the data has been messed up in the array b, but the results in b have been useful in indexing a. I hope that makes sense. I have to say that when I was having to do sorting of this sort of stuff myself, using just basic sorting algorithms in Awk, that was the technique that I used, but it's a little odd until you get the idea of it. That's why I think Awk is changing these arrays: it's assuming you're going to use them as ways of indexing the original data.

The next example, the fifth one in this particular group, ex9, uses the same sort of idea: three elements in an array indexed by a, b and c. It's using asort from a into b, but it's using the how argument, with @val_str_desc, so descending by value, treating them as strings. Then it's using array b in the loop, and it's printing out the index of b and then the contents of b. Element a of a was Jones, element b was X and element c was Smith, and when it comes to print these out you see X, Smith and Jones listed in that order, descending order, with the indexes one, two and three. This is useful, but I would offer that the use is moderately limited.

Okay, I've got a section here entitled "Yet more about arrays", but it's really just to say I'm not going to do any more about arrays just now. There is more to be said.

There is a sort of multi-dimensional array capability in Awk. The original Awk had this, and it's still available in Gnu Awk, and there's a considerable enhancement in that you can have arrays as array elements too in Gnu Awk. But I'm not sure that we're going to be covering these topics in this series. There's loads of information about this in the Gnu Awk manual if you want to dig deeper. However, if we receive any requests to cover this area in more depth then we'll reconsider doing something about it. I must admit I have never used multi-dimensional arrays, nor arrays of arrays, which can be arbitrarily deep. There are some quite reasonable facilities for manipulating them and walking through them, but it's not a thing I've ever used in Awk. If I was to do that, I would use a different scripting language, I have to say.

Time is marching on, so let me go quickly through my real-world Awk example. This is about things I do, and one of the things I do is to process show notes for HPR which are sent in with episodes. Many people send in their show notes as plain text, which is fine, but we need HTML for loading into the HPR database. So what I do with them: I've got a series of scripts with which I check them and pull the notes out of the file that we get from the form, and I edit them to fix any errors, turn them into Markdown and then generate HTML using a tool called pandoc. As part of that process I look at the HTML that's generated locally. I've grabbed this and I'm working on my local workstation, and I make a copy of the HTML in a format which is easy to browse. Pandoc is good at doing this; it turns the Markdown into standalone HTML which I can view in a browser, and it looks pretty much how it will look when it's on the HPR site. So that's the point at which I can say "oops, there's a mistake here" and go and fix it and move on from there.

To make the HTML copy I want for viewing locally, pandoc has recently changed to the extent that you need to provide further information. The further information is a couple of lines of metadata, which has to be in a format known as YAML. YAML is a sort of simple data format which is quite well defined but simple to produce, human readable and so forth. There are alternative ways to do it, but I'm using the YAML option.

So the way this should look is that there should be two lines of metadata, with three hyphens above and three full stops below. The two lines consist of, first, the word title in lowercase, a colon and a space, then the title of the show, which has to be enclosed in quotes, or at least I enclose it in quotes anyway. The second one is author, colon, space, then the name of the host, and I enclose that in quotes too. That's used to generate headers in the final document. This is just for my own benefit.

So I wrote an Awk script to generate this YAML metadata, and I've embedded that in the Bash script that I use to run pandoc. I've included this bit of Awk in the notes here, and it consists of 14 lines. This is part of another script, as I said. The first line is awk, space, -f, then a space and a hyphen character, then the name of a variable, and the output is redirected into an output file, again defined by a variable. The first variable is called $RAWFILE; the second is called $TMP1, the temporary file, to be thrown away afterwards. But the end of the line, and this is where we're digressing a little bit from Awk into some of the areas of Bash, consists of a thing called a here document, or heredoc. A heredoc is the way in which you tell Bash that there is some data that's to be stored in a file or given to a program. In order to do this you need to use two less-than signs followed by a word. The word has to have no spaces in it; I think it can contain other characters, but I usually just make it a series of letters. This particular one I've called ENDAWK, all in capitals. I put it in quotes, and I'll mention this in a moment. Everything from that line up to a line that consists only of ENDAWK, starting in column one, is data to be chewed up by Awk.

The awk command uses -f, which tells Awk where the program file is to come from, the script itself, and the argument to -f is a hyphen. That hyphen means get it from the standard input channel, so it's telling Awk effectively that what follows is the program. It's just a convenient way of including an Awk script in another script immediately after the invocation of Awk. You can put the whole thing in quotes, but if the script itself uses quotes things get really convoluted. This particular case includes both single and double quotes, so using quotes to enclose it would be a real pain. The quotes around the heredoc terminator tell Bash not to interpolate any dollar-sign variables in the data. By default it will actually scan this data, and if it finds dollar-something it will assume it's the name of a Bash variable and will interpolate it. If you put the heredoc terminator in single quotes then it won't do that, and I've got dollars and stuff in this script.

The script itself begins with a BEGIN rule, which simply prints out the three hyphens that we need to start the thing, and it ends, on line 13 (I put line numbers on this one for ease of reference), with an END rule which prints out the three full stops at the end of the metadata.

Then there are two regular expressions in the main script, and these are going to be matched against the input data. The first one is circumflex, Title, colon, with a capital T: /^Title:/. What this is meant to do is to match the string Title: in the input file, where that input file is the one that's come from the HPR server and contains the data that's been fed in by the host submitting the show. One of the items on the form is the title of the show, so we're looking for the result of that.

The rule itself uses the sub function, which we looked at in the previous show. It matches the string circumflex, Title, colon, circumflex being the start of line as you'll remember, and uses backslash s after that, because that means a single whitespace character, I should say, so that's a space or a tab. I think most of these consist of one tab when they're returned, but I'm not quite sure, so I just did this to be safe. The second argument to sub is simply an empty string. What it's saying is: in the line that comes in, the one that begins with Title:, chop off the bit that says Title: followed by whitespace and remove it entirely, so all that's left is the actual title.

Then the next line, line five, is a gsub. gsub, you'll recall, is a means of doing multiple substitutions on a line, and here it's looking for single quotes, and if it finds any it will replace each by two single quotes. That's because if you've enclosed a string in quotes in YAML, and you're wise to do so, then if you want to embed single quotes within it they have to be doubled. So that's what it's doing, ready for YAML.

Then finally, on line six, it prints the string title: in lowercase, followed by one space, followed by the final result of these bits of editing in single quotes, followed by a newline. It's actually printing $0, which is the entire line that was matched by the regular expression. The second rule begins with a regular expression that is looking for the host name, and that's doing the same sort of thing, except looking for the name of the particular host in this file that's come back from the form on the HPR website. It's doing pretty much the same thing, so I won't explain it again because it's pretty much identical.

When that's finished, the result should be that the four lines of metadata are in the file whose name is in the variable TMP1. Then a bit later on in the Bash script there's a long line which calls pandoc to do the necessary thing. I've printed all of its arguments out here in the notes, but I don't really think I should explain them because I'm not sure anybody's interested. Essentially it's given pandoc two data files, called $TMP1 and $EXTRACT, which it's to process, producing results in a file called FULLHTML. Along the way it's told pandoc to include the HPR CSS, which it's grabbed from the website, so it means that the HTML it produces looks identical to the sort of HTML that the HPR website generates itself.

That took a lot of explanation, but it's really not a very complex Awk script. I thought it might be of interest to see the sort of thing that Awk gets used for, at least the way I use it, and it also shows an example of using a Bash heredoc, which people might not be that up to speed with. So that's it, that's the end of my show today. All of the examples I've mentioned are included with the show, and there's an ePub version of the notes.

Okay then, bye bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.