Episode: 2610
Title: HPR2610: Gnu Awk - Part 12
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2610/hpr2610.mp3
Transcribed: 2025-10-19 06:32:25
---
This is HPR Episode 2610 titled Gnu Awk Part 12 and is part of the series Learning Awk. It is hosted by Dave Morris and is about 34 minutes long and carries an explicit flag. The summary is: advanced use of arrays.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello everybody. Welcome to Hacker Public Radio. This is Dave Morris and I'm doing part 12 of the Gnu Awk series, or Learning Awk as we're calling it. I started talking about arrays in episode 10, and I thought I would continue that in this episode by looking at some of the more advanced features of arrays. The stuff I'm talking about today is largely specific, I think, to the GNU version of Awk. That means if you're using an Awk which is not the enhanced GNU version, then some of these features might not be available, so you need to check to be sure.

So I'm talking about arrays, but I'm also going to finish off with an example of using awk to solve a problem that I had. This is not related to arrays; I just thought it was useful to have some real-world examples of using awk, rather than the fairly sterile examples that you tend to find in these episodes otherwise.

I'm going to start by talking about patsplit. I mentioned the split function in the last episode I did on this, episode 10, but there's a more powerful function for splitting strings into array elements. It's called patsplit because it splits according to patterns. It takes a series of arguments.
The first one is a string, which is the thing to be chopped up. It's chopped up according to the third argument, called fieldpat, which defines the way in which the string is to be split, and the pieces are put into an array, which is the second argument. There will be separators between the fields (I think there would have to be, wouldn't there?) and they are put into a further argument, an array denoted by seps in the example. We'll look at this in a bit more detail.

This is very similar to the way that split works, and you can see the examples I gave you there. The main difference from split is that this fieldpat argument, the third one, is a regular expression which defines the field rather than the separator.

So I've got a bunch of examples here, and I've gone for splitting up comma-separated stuff. What I've done is write a script which deals with $0, the input record. I'm just giving each example one record, but this would work with multiple records, of course. Since I'm using $0, you could just as well ignore patsplit and use the standard field-splitting mechanism, but we haven't really covered how you can do that using a regular expression. There is a built-in variable called FPAT, similar to FS, which would do that, but it hasn't been covered yet; we'll be adding that to the series a bit later on.

The examples are called awk12_ex, then a number, then .awk. This one is EX1, and here we're using patsplit to split $0 into an array called a. We're doing it by finding fields which consist of zero or more non-commas. The regular expression is in slashes: open square bracket, then a circumflex, which means "not", then a comma, close square bracket, asterisk. That means zero to any number of characters which are not commas.
So, in other words, anything that consists of things which are not commas, followed by a comma, followed by not-commas and a comma, will fit that. Having split it, there's a loop, for (i in a), remembering that that's the way you walk through an array, and it prints out the value of the array a indexed by i. So if we feed it "An apple a day keeps the doctor away", with commas instead of spaces, then the output is the same sentence with spaces in between. I'm printing the elements without newlines and then putting a newline on the end, so you see an actual string. I've used a similar approach throughout these examples.

In the example in the notes (not the thing you can download, which is the awk script), I've shown the process of setting a Bash variable x to the string "An apple a day keeps the doctor away", then using Bash's editing features to replace all the spaces by commas, and then feeding that to the awk script, which removes them again. It's a silly example, but I hope you get the idea from it.

If you want a more complex regular expression, example two shows that. This example takes the string "A bird in the hand is worth two in the bush", but I turned the word "bird" into "a red bird", separated by commas and enclosed in double quotes. In standard CSV format you can have elements of the comma-separated value list which contain spaces, or indeed commas, enclosed in double quotes, so I've just emulated that. When it's printed out, each of the elements is separated by spaces, and I put angle brackets around each one just to make them stand out more clearly. You can see that the red bird string is one entity.

The regular expression consists of two sub-expressions enclosed in parentheses, with a vertical bar between them, so it's an "or" type expression.
The first one is the same as in the previous example: a series of zero or more not-commas, if you like to put it that way. The second one looks for a double-quoted string containing one or more things which are not double quotes. This technique, of naming the character that encloses a string followed by any number of characters which are not the enclosing character, is one you'll often see in regular expressions. That works fine with the example, as you will see, and that's EX2.

Then in EX3 we've got an example where the pattern is quite simple, but what we're doing here is saving the separators. The patsplit is simply using a series of letters, capital or lowercase, one or more, so any sequence which matches that is the field pattern. We're saving the separators in an array s. The script prints out all of the elements of the array which were captured by splitting (and of course I've called that array a because I've not got much imagination), with no newlines, just spaces between the elements, followed by a newline at the end. Then there's a similar loop to print out the contents of the array s. I say similar, but it's not quite the same, because this time it's a counted loop: when you run patsplit it returns the number of fields that it found, and I captured that in a variable called flds, short for fields. I use that in the loop, setting i equal to one, then adding one to it while it's less than or equal to that number of fields. That prints out all of the separators, and then it puts a newline on the end.

So the result is that you get the words of the sentence fed to it, followed by a line containing all the separators. What I fed to it was the expression "grinning like a Cheshire cat", where each word is separated by a number of hyphens. So the first thing you see is "grinning like a Cheshire cat" separated by spaces, followed by all of the different runs of hyphens separated by spaces. It just so happens that the
separators, the hyphens, are each the same length as the word before them. I wrote a little Bash script to do that, which I've included in the notes, but I've marked it "skip unless you're really interested", so I won't read it out. You can dig into it if you really want to; it's available for download if you want to grab it and mess around with it.

Now, the printing of the array s doesn't begin at 0, it begins at 1, but there is a zeroth element, because patsplit captures any separator prior to the first field. There isn't one in this case, so I didn't bother to print it, but it's worth bearing in mind because it can be of interest.

OK, that's all I'm going to say about patsplit. Let's move on to sorting arrays. Basically there are two main ways to do this. The first is to use an extension in GNU Awk, a built-in array variable called PROCINFO, all in uppercase. The element of the array we want has the index sorted_in; because that's a string it has to be in double quotes, so it's PROCINFO, square bracket, quote, sorted_in, quote, close square bracket. That's the magic variable which can be used to control how arrays get sorted. In the original, non-GNU version of awk, arrays came back in an arbitrary order when you looped through them, so sorting them could be a bit of a pain. I know this because it was one of the things I had to do in my early computing career when I started to use awk: there was no sorting built in.

The thing you put into the PROCINFO element is a predefined string which begins with an at sign and consists of various keywords. The default one is "@unsorted", which means that the array elements come back, as in standard awk, in an arbitrary order. Then there's a bunch of others, and I'm not going to read them all out because there are quite a number; I've put together a little table of them. Take, for example, one that I quite like to use, and one I've used in the
example, which is "@val_str_asc". That stands for: val, the values of the array as opposed to the indices; str, treat them as strings; asc, in ascending order. The notes here say "order by element values in ascending order; scalar values are compared as strings". So whatever values are found in the array elements, be they numbers or strings, will be treated as strings and sorted accordingly. This can be quite useful; I would certainly have been more than delighted to have had this when I had various tasks to do using awk back in my career.

Setting this value determines the sort order, but it has to be done before the loop scanning the array begins; you can't change it while the loop is scanning. What's more important, perhaps, is that whenever you set PROCINFO["sorted_in"] it's effective throughout the entire script. There's no sort of scoping or localization. So if you have an awk script that's printing arrays in several places, they're all going to be sorted in this way. You can change the value between those places, of course, and you can also switch it off by setting it to "@unsorted", but it has a wider effect than might be obvious.

There's a bit more to this than I've mentioned here, and I've just alluded to it, because arrays can be more complicated than we've seen so far. Also, PROCINFO["sorted_in"] can contain the name of a function which compares elements and so controls the sort order for you. It's a function that you have to define, and we haven't looked at user-defined functions yet; I'm not sure whether we'll go into this when we do get to that point. I've pointed to the GNU Awk manual, section 8.1.6, which covers this in a lot of detail, so if you really need to use this then that's the place to go.

There's an example called EX4, another downloadable one. It consists of a BEGIN rule, and in the BEGIN rule PROCINFO["sorted_in"] is set
to "@val_str_asc", the one I mentioned before. Then we just use split to split $0 into an array. It's splitting on a space, which is the same sort of split you would get by default anyway, but if you split stuff in the usual way with awk you can't easily sort it, because it doesn't go into an array. Then the script prints out the elements of the array, and they will come out in sorted order. It uses for (i in a), as we've used before, and prints out the value of i and then the value of the array a indexed by i.

I've fed it the string "An Englishman's home is his castle". Because it's sorted on the value, it comes out as "An Englishman's castle his home is". You'll see the indexes are not in sorted order, but the values are, sorted alphabetically with the capitalized words before the lowercase ones.

I think that's potentially quite a useful thing. I have certainly used sorting capability in the past to capture frequency information from bits of data. Frequencies were often quite important bits of knowledge in the environment I worked in, and having a sorted list of frequencies was often useful. So sorting in this way, or alphabetically sorting the names that you were doing frequency counts on, was often quite a desirable thing for various reasons.

So let's now look at the functions which are available for array sorting. b-yeezi mentioned them when he did his review of string functions in episode 11. The functions we're going to look at are called asort and asorti. The two functions have pretty much the same arguments; I have listed them separately and described each of the arguments separately in the notes. The arguments are: source, which is the array that you're going to be sorting; then the second one, called dest, which is optional and is the place you're going to put the results of
the sort; and the third argument, how, which again is optional and is a way in which you can define the type of sort. The how argument, not too surprisingly, can be any of the strings that we've already seen, "@unsorted", "@val_str_desc", "@val_str_asc" and so forth, that we saw in the context of the PROCINFO stuff.

Anyway, let's look at the examples. I'll whizz through these fairly quickly; there are several of them. The first one is EX5, and I've made them all fairly trivial. I've defined an array and called it a, and the array is indexed by the numbers one, two and three. In it I've just put names: Jones was the first one, then X (I was thinking of "Mr X" when I wrote that, I think) and Smith. Then I used asort on that array and printed out the results. Using asort on that array a, which has the indexes one, two and three, results in nothing very exciting.

I had to do a double take on example five. The array is loaded up with indexes one, two and three holding the strings Jones, X and Smith, and then it's sorted. The sort causes the values to be sorted, so when it's printed out you get Jones, Smith and X in alphabetical order, but the indexes have been changed to be one, two and three against Jones, Smith and X. In other words, the original indexes are potentially completely destroyed and replaced by the numbers one, two, three and so on, whatever's appropriate for the number of elements. It's not very obvious in this example, and I apologize for that. It's an odd thing to do in some respects, because you're taking an array whose indexes one assumes are important, and it's reordering them: one stays as Jones, two, instead of being X, becomes Smith, and three, instead of being Smith, becomes X. So it's a slightly odd thing to do, I guess you'd say. I think the PROCINFO method is better in many ways.

In example six I have done the same thing, Jones, X and Smith in an array a, but instead of using numeric indexes
I've given them the characters a, b and c. What I've done this time is use asort on a, but I've said the destination is to be b. In doing that, asort first of all copies a into b, and then sorts b and does its stuff with b. So there's a loop which goes through the array b and prints out 1 Jones, 2 Smith, 3 X, the same as before, but then the second loop prints out the array a, which goes a Jones, b X and c Smith. So in that case the indexes have not been destroyed, whereas they were messed up in the first example.

Example EX7 uses the other function, asorti. It creates an array a where the indexes are the strings third, second and first: third is Jones, second is X, first is Smith. Then it calls asorti on a, and when that's printed out you see that what you've got is an ordering of the indexes, but the actual values have been thrown away. You might wonder what on earth that's for; I think I might be able to explain that in a moment.

Example eight, number four in this group, uses asorti but with a dest argument, so that you don't destroy the original. It's pretty much the same, except that it prints out the result by using one array to index the other. It uses asorti to sort the array a into b, and then it loops through b, so for (i in b), and for each element it prints out the array b indexed by i, and then the array a indexed by b[i]. So the result is first: Smith, second: X, third: Jones. Effectively it has sorted them into the correct order by index without messing up any of the data. Well, the data has been messed up in the array b, but the results in b have been useful in indexing a. I hope that makes sense.

I have to say that when I was having to do sorting of this sort of stuff myself, using just basic sorting algorithms in awk, that was the technique I used. It's a little odd until you get the idea of it, and that's why, I think, awk is
changing these arrays: it's assuming you're going to use them as ways of indexing the original data.

The next example, the fifth one in this particular group, EX9, uses the same sort of idea: three elements in an array indexed by a, b and c. It's using asort from a into b, but it's also using the how argument, and it's using "@val_str_desc", so descending by value, treating the values as strings. Then it's using array b in the loop, printing out the index of b and then the contents of b. Element a of a was Jones, element b was X and element c was Smith, and when it comes to print these out you see X, Smith and Jones listed in that order, descending order, with the indexes one, two and three. This is useful, but I would offer that the use is moderately limited.

OK, I've got a section here entitled "Yet more about arrays", but it's really just to say I'm not going to do any more about arrays just now. There is more to be said. There is a sort of multi-dimensional array capability in awk; even the original awk had this, and it's still available in GNU Awk. There's also a considerable enhancement in that you can have arrays as array elements in GNU Awk, but I'm not sure that we're going to be covering these topics in this series. There's loads of information about this in the GNU Awk manual if you want to dig deeper. However, if we receive any requests to cover these in more depth then we'll reconsider doing something about it. I must admit I have never used multi-dimensional arrays, nor arrays of arrays, which can be arbitrarily deep. There are some quite reasonable facilities for manipulating them and walking through them, but it's not a thing I've ever used in awk. If I was to do that I would use a different scripting language, I have to say.

Time is marching on, so let me go quickly through my real-world awk example. This is about things I do,
and one of the things I do is to process show notes for HPR which are sent in with episodes. Many people send in their show notes as plain text, which is fine, but we need HTML for loading into the HPR database. So what I do with them is this: I've got a series of scripts with which I check them, pull the notes out of the file that we get from the form, and edit them to fix any errors, turn them into Markdown and then generate HTML using a tool called pandoc.

As part of that process I look at the HTML that's generated locally. I've grabbed the notes and I'm working on my local workstation, and I make a copy of the HTML in a format which is easy to browse. Pandoc is good at doing this: it turns the Markdown into standalone HTML which I can view in a browser, and it looks pretty much how it will look when it's on the HPR site. That's the point at which I can say "oops, there's a mistake here", go and fix it, and move on from there.

To make the HTML copy I want for viewing locally, pandoc has recently changed to the extent that you need to provide further information. The further information is a couple of lines of metadata, which has to be in a format known as YAML. YAML is a fairly simple data format which is quite well defined, simple to produce, human readable and so forth. There are alternative ways to do this, but I'm using the YAML option. The way this should look is that there should be two lines of metadata, with three hyphens above and three full stops below. The first line consists of the word title in lowercase, a colon, a space, then the title of the show, which has to be enclosed in quotes (or at least I enclose it in quotes anyway). The second one is author, colon, space, then the name of the host, and I enclose that in quotes too. That's used to generate headers in the final document. This is just for my own benefit.

So I wrote an awk script to generate this YAML metadata, and I've embedded that in the Bash script that I use to run pandoc. So I've
included this bit of awk in the notes here, and it consists of 14 lines. This is part of another script, as I said. The first line is awk, space, -f, then a space and a hyphen character, then the name of a variable, and the output is redirected into an output file, again defined by a variable. The first variable is called $RAWFILE and the second is called $TMP1; that's a temporary file which will be thrown away afterwards.

The end of the line, and this is where we're digressing a little from awk into some of the areas of Bash, consists of a thing called a here document, or heredoc. A heredoc is the way in which you tell Bash that there is some data to be stored in a file or given to a program. In order to do this you use two less-than signs followed by a word. The word has to have no spaces in it; I think it can contain other characters, but I usually just make it a series of letters. This particular one I've called ENDAWK, all in capitals. I put it in quotes, and I'll mention why in a moment. Everything from that line up to a line that consists only of ENDAWK, starting in column one, is data to be chewed up by awk. The awk command uses -f, which tells awk where the program is to come from, and the argument to -f is a hyphen. That hyphen means "get it from the standard input channel", so it's effectively telling awk that what follows is the program.

It's just a convenient way of including an awk script in another script. You can instead put the whole awk script in quotes immediately after the invocation of awk, but if the script itself uses quotes things get really convoluted. This particular case includes both single and double quotes, so using quotes to enclose it would be a real pain.

The quotes around the heredoc terminator tell Bash not to interpolate any dollar-sign variables in the data. By default it will actually scan this data, and if it finds dollar-something it
will assume it's the name of a Bash variable and will interpolate it. If you put the heredoc terminator in single quotes then it won't do that, and I've got dollars and stuff in this script.

The script itself begins with a BEGIN rule, which simply prints out the three hyphens that we need to start the thing, and on line 13 (I put line numbers on this one for ease of reference) it ends with an END rule which prints out the three full stops at the end of the metadata.

Then there are two regular expressions in the main script, and these are things which are going to be matched against the input data. The first one is circumflex, Title, colon, with a capital T. What this is meant to do is match the string Title in the input file, where that input file is the one that's come from the HPR server and contains the data that's been fed in by the host submitting the show. One of the items on the form is the title of the show, so we're looking for the result of that.

The rule itself uses the sub function, which we looked at in the previous show. It matches the string circumflex, Title, colon (the circumflex being the start of line, as you'll remember) and uses backslash s after that, because that means a whitespace sequence, or a single whitespace character I should say, so that's a space or a tab. I think most of these lines, when they're returned, contain one tab, but I'm not quite sure, so I just did this to be safe. The second argument to sub is simply an empty string. What it's saying is: on the line that comes in, the one that begins with Title, chop off the bit that says Title and its following whitespace, removing it entirely, so all that's left is the actual title.

Then the next line, line five, is a gsub. gsub, you'll recall, is a means of doing multiple substitutions on a line, and here it's looking for single quotes. If it finds any, it will replace them by
two single quotes. That's because, if you've enclosed a string in single quotes for YAML (and you're wise to do so), then any single quotes you want to embed within it have to be doubled. So that's what it's doing, ready for YAML. Finally, on line six, it prints the string title, colon, in lowercase, followed by one space, followed by the final result of these bits of editing in single quotes, followed by a newline. It's actually printing $0, which is the entire line that was matched by the regular expression, as edited.

The second rule begins with a regular expression looking for the host name, and it does the same sort of thing, except it's looking for the name of the particular host in this file that's come back from the form on the HPR website. I won't explain it again because it's pretty much identical.

When that's finished, the result should be that the four lines of metadata are in the file whose name is in the variable TMP1. Then a bit later on in the Bash script there's a long line which calls pandoc to do the necessary thing. I've printed out all its arguments here in the notes, but I don't really think I should explain them because I'm not sure anybody's interested. Essentially, it's given pandoc two data files, $TMP1 and $EXTRACT, which it's to process, producing its results in a file called full HTML. Along the way it's told pandoc to include the HPR CSS, which it's grabbed from the website, so the HTML it produces looks identical to the sort of HTML that the HPR website generates itself.

That took a lot of explanation, but it's really not a very complex awk script. I thought it might be of interest to see the sort of thing that awk gets used for, at least the way I use it, and it also shows an example of using a Bash heredoc, which people might not be that up to speed with. So that's it,
that's the end of my show today. All of the examples I've mentioned are included with the show, and there's an ePub version of the notes. OK then, bye bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.