Episode: 2238
Title: HPR2238: Gnu Awk - Part 6
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2238/hpr2238.mp3
Transcribed: 2025-10-18 23:32:09

---

This is HPR episode 2,238 entitled Gnu Awk Part 6, and is part of the series Learning Awk. It is hosted by Dave Morris, is about 40 minutes long and carries an explicit flag. The summary is: looking more deeply into Awk's regular expressions. This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair at AnHonestHost.com. Hello everybody, welcome to Hacker Public Radio. My name is Dave Morris. Today I'm talking about Gnu Awk, and this is part 6 of the series that we're calling Learning Awk, as the series name, and Be Easy and I are doing this. My episode 6 is somewhat later than episode 5 because of Christmas, New Year, etc, etc. So as I've done in the past, I'm going to start with a short recap of the last episode, episode 5. In that episode the subject of regular expressions was begun. We looked at how you'd use a regular expression in the pattern part of these pattern, curly-bracketed action sequences that we've seen. Remember these sequences are called rules. I think it's useful to hammer on about the names of these things, because it makes it less ambiguous when we talk about them. So we saw some examples like, and here's where people are going to say, how come you're reading a regular expression out loud, but yeah, well, we'll see how we go. There are long, long show notes to go with this where I've detailed everything that I'm talking about, so that in reality you don't need to absorb what I'm saying much if you're just listening on the go, but if you're interested you can go and look at the notes and follow through those.
Anyway, suppose you have something in an Awk script, and this is just a fragment of Awk, which says dollar one, then a tilde sign, and then slash, lowercase p, square bracket, e, l, u, close square bracket, slash, then open curly bracket, print, space, dollar zero, close curly bracket. What this means is: if field one (that was the dollar one) contains the letter p followed by one of e or l or u, then print the whole line. The regular expression in that case applies to the whole of that field. It's not saying anything about where in the field the match is or anything like that. There's another example that Be Easy did, which was dollar two, tilde, slash, lowercase e, curly bracket, two, close curly bracket, slash, and then print dollar zero again. What this means is: if field two contains two instances of the letter e in sequence (that was what the curly bracket two meant, the thing in front of it is to be doubled), then print that whole line. Now it's usual to enclose the regular expression in slashes, and this makes it what's called a regular expression, or regex, constant. The GNU manual goes into some detail about these, and I've linked to it, in particular the section about regular expressions, which you might find useful. In the last episode we had a look at many of the operators used in regular expressions. Unfortunately some small errors crept into the list of operators which were mentioned in that episode. I've listed the ones that were incorrect, and these are backslash A (that was a capital A, I should say), backslash lowercase z, backslash lowercase b and backslash d. The first one means the beginning of a string, the second one means the end of a string, the backslash lowercase b means a word boundary, and backslash lowercase d means a digit. But not in Awk, I'm afraid; those aren't the sequences for Awk. This sort of highlights one of the things about regular expressions. I love regular expressions.
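The two recap rules described above can be tried out from the command line. This is only a minimal sketch: the four lines of sample data are made up for illustration (they are not the series' file1.txt), and the variable names `data`, `rule1` and `rule2` are just labels for this example.

```shell
# Hypothetical two-column sample data (not the file used in the series).
data='apple green
plum purple
apricot orange
banana seedless'

# Rule 1: print the record if field 1 contains "p" followed by e, l or u.
rule1=$(printf '%s\n' "$data" | awk '$1 ~ /p[elu]/ {print $0}')
echo "$rule1"

# Rule 2: print the record if field 2 contains two consecutive "e"s.
# Interval expressions such as {2} need an awk that supports them
# (current gawk does by default).
rule2=$(printf '%s\n' "$data" | awk '$2 ~ /e{2}/ {print $0}')
echo "$rule2"
```

With this data the first rule selects the apple and plum lines, since only those first fields contain "pl"; in gawk the second rule selects the lines whose second field is "green" or "seedless".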
They are the most wonderful things, but there are quite a lot of inconsistencies between the different implementations of them. And if you use several languages it's so easy to get confused between them. I guess this is what happened here, but I'm going to try and make sure that these are corrected here. And I hope I don't make any mistakes along the way, because I could so easily do so. The backslash b thing is what's available in sed. I'm going to mention sed; if you listened to the sed series, or might be interested in sed, it's worth looking at that. That's another case where the regular expressions are just a little bit different. The concepts are the same, but the details are different. I've included what the GNU manual says: they chose not to use backslash b for the word boundary thing, and they explain why. I won't read it out to you because I'm sure you can read it if you're interested. They use backslash y instead in Awk. So this is a prime example of things being problematic between different variants of regular expressions, which is a shame because I think that puts people off. The other thing that Be Easy did for the last episode was to look at the ways you could replace things using regular expressions. So he talked about the built-in functions that use regular expressions to match things and then replace them. These functions are called sub, gsub and gensub. Regular expressions are used in other functions and in other places in Awk, but we'll reserve them for a later time. So I thought it would be useful in this episode to talk a bit more about the regular expression operations, or operators, which I guess is the right way of saying it, and also to look at sub, gsub and gensub in more detail. I think Be Easy even said something to that effect in his episode, expecting me to maybe go into a bit more detail. I'm the guy who drills down and goes into sometimes possibly tedious detail on things. Let's look at the regular expression operators.
Now one of the things you're probably aware of is that some of the characters, the normal characters in English etc., like a full stop or period, have a special meaning in a regular expression. So if you want to switch off that special meaning and actually indicate a real full stop, then you need to precede it with a backslash. So the backslash is a way of saying that what is normally a special character is to become the ordinary version of it. And since a backslash itself is special, two backslashes in sequence mean that you actually want to include a real backslash in your expression. There'll be some examples of its use a bit later on. It's worth noting though, even though I've said that, that there are some Gnu sed and Gnu Awk regular expression operators that use the backslash as part of the operator. So we'll come on to those in a moment. So I've made a table of all of the regular expression operators. Really it's a reiteration of pretty much everything that Be Easy did for the last episode, but I thought putting it all together in this way might be useful. I don't know. Now one thing is that the expression that consists of an open square bracket, a list of characters and a close square bracket has the name of a bracket expression in Gnu Awk. And remember, that whole thing represents just one character. So it's saying anything chosen from this list or, if you start it with a tilde, it means anything not in this list, any single character. Now I don't think it's been said explicitly that if you want to include the characters backslash, close square bracket, hyphen or tilde (sorry, not tilde: circumflex, or caret) in your list, then you need to precede them with a backslash. So did I say tilde before? I didn't mean that. I mean the circumflex or caret character. Sorry about that. We also saw character classes last time, which were these things in square brackets.
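To make the escaping and bracket expression points above concrete, here is a small sketch. The input strings and the variable names are invented for the example; only standard awk features are used.

```shell
# An unescaped dot matches any character; \. matches only a real full stop.
any=$(printf 'a.b\naXb\n' | awk '/a.b/ {n++} END {print n}')
literal=$(printf 'a.b\naXb\n' | awk '/a\.b/ {n++} END {print n}')
echo "unescaped dot matched $any lines, escaped dot matched $literal"

# A bracket expression stands for ONE character chosen from the list;
# a leading circumflex (caret) means one character NOT in the list.
in_list=$(printf 'cat\ncot\ncut\n' | awk '/c[ao]t/ {print}')
not_in_list=$(printf 'cat\ncot\ncut\n' | awk '/c[^ao]t/ {print}')
echo "$in_list"
echo "$not_in_list"
```

The unescaped pattern counts both input lines, while the escaped one counts only the line containing an actual full stop. The first bracket expression picks out cat and cot, and the negated one picks out cut.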
So it was open square bracket, colon, then some name, colon, close square bracket. An example is alnum, which means alphabetic characters and numbers. These can only be used in these bracket expressions, so the whole thing starts off with two open square brackets and ends with two close square brackets. These often map onto, or are the same as, things like lowercase a, hyphen, lowercase z. But they've been provided because the world now uses much broader character sets when dealing with text on computers. There are things like Unicode and so forth, and these are catered for by these character classes, which come out of the POSIX movement, which has developed all of these things. Whatever, they only represent a single character. So then I want to get on to some of these special sequences which are preceded by a backslash. And again, some of these were covered already, but not all. So I thought it was worth just doing the whole list again and making sure that we have a sort of definitive list. So backslash lowercase s means any whitespace character. That's a space or a tab or a newline. And the backslash uppercase S matches anything that is not a newline... sorry, not a whitespace character. Both of these have their equivalents if you use character classes, and I've mentioned them in the table; I won't read them out. Backslash lowercase w matches any word character, where a word character is a letter, a digit or the underscore character. That comes from the days of computer variable names. So that's what word means in this context, because words in English don't usually incorporate numbers, unless it's leet speak or something. The backslash uppercase W matches the reverse, which is any non-word character. Then we've got backslash less-than sign, and that matches the word boundary, or the empty string as it says in the manual, at the beginning of a word. And backslash greater-than matches the empty string at the end of a word.
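The character classes described above can be sketched with a few one-liners. The sample strings and variable names are invented for illustration, and an awk with POSIX character class support is assumed (current gawk and mawk both have it). The classes used for the second and third commands are the bracket-expression equivalents of backslash w and backslash s.

```shell
# [:digit:] is a character class; it must sit inside a bracket expression,
# hence the doubled square brackets.
digits=$(echo 'room 101' | awk '{gsub(/[[:digit:]]/, "#"); print}')
echo "$digits"    # room ###

# [[:alnum:]_] is the class-based equivalent of \w (letters, digits, underscore).
words=$(echo 'foo_bar baz-9' | awk '{gsub(/[[:alnum:]_]+/, "W"); print}')
echo "$words"     # W W-W

# [[:space:]] is the class-based equivalent of \s (space, tab, newline...).
spaces=$(printf 'a \t b\n' | awk '{gsub(/[[:space:]]+/, "_"); print}')
echo "$spaces"    # a_b
```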
So here, if you want to generate a regular expression where you say things like a word beginning with x, or ending with x, then you can do that using these things, whichever you want to do. The backslash y, which I mentioned before, is the same as the backslash less-than and backslash greater-than, but it operates at both ends of the word; it represents both boundaries. More correctly, either boundary. Backslash capital B matches everywhere but on a word boundary. And you think, well, why on earth would you want to do that? Well, I've included an example that shows its use a bit later on. It's effectively the inverse of backslash lowercase y. Then we come to the last ones, and these are hard to read, so as I've done with a few others, I've spelled them out to make it clear what they are. Backslash back quote: that matches the empty string at the beginning of a string. So it's also the beginning of a buffer, or the beginning of the current line, or the start of a field. It's essentially the same as the circumflex or caret operator which you could use in a regular expression; not in the square brackets, but just in the regular expression itself. It's not completely clear to me why the two options for doing the same thing exist, but there you are. Then there's backslash single quote, and that matches the empty string at the end of a buffer or a string. It's essentially the same as the dollar sign operator. Now it's worth saying, but I won't go into detail about it, that Gnu Awk is a superset of the traditional Awk. It includes the POSIX features and the Gnu features which have been added on. But you can switch either or both of those off, and the regular expression operators I've just mentioned behave differently or don't exist depending on how you set these flags. I've included a link to the regular expression section in the Gnu Awk manual, which hopefully explains this. So let's go on to functions.
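The Gnu-specific boundary operators just described can be compared side by side. This sketch assumes Gnu Awk is installed under the name `gawk`; other awk implementations are not expected to understand these sequences. The sample string and the variable names are made up for the example.

```shell
s='cat concat cats'

# \< anchors at the start of a word, \> at the end of a word.
w_start=$(echo "$s" | gawk '{gsub(/\<cat/, "X"); print}')
w_end=$(echo "$s" | gawk '{gsub(/cat\>/, "X"); print}')
# \y matches at either boundary, so \ycat\y is "cat" as a whole word.
w_whole=$(echo "$s" | gawk '{gsub(/\ycat\y/, "X"); print}')
echo "$w_start"; echo "$w_end"; echo "$w_whole"

# \B is the inverse of \y: it matches anywhere that is NOT a word boundary.
inner=$(echo banana | gawk '{gsub(/\Ba\B/, "A"); print}')
echo "$inner"

# \` and \' match the empty string at the start and end of the buffer.
# The '\'' sequence is just shell quoting to get a single quote into the script.
buf_start=$(echo banana | gawk '{sub(/\`/, "["); print}')
buf_end=$(echo banana | gawk '{sub(/\'\''/, "]"); print}')
echo "$buf_start"; echo "$buf_end"
```

The three word-boundary substitutions give "X concat Xs", "X conX cats" and "X concat cats" respectively, the backslash B version gives "bAnAna", and the two buffer anchors give "[banana" and "banana]".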
The functions we've already seen last episode; I'm going to look at those in a bit more detail. Start with sub. The sub function has the format sub, open parenthesis, then the first argument is a regular expression, comma, the second argument is a replacement string, and then there's an optional third argument, so that would be comma and then the target. As I said, the first argument is the regular expression and it's usually enclosed in slashes. Now there are other ways of dealing with this, which I've made a footnote about. There are other things than regular expression constants, or regex constants. But I think probably we don't want to go into that just at the moment; you can delve into the manual if you want to. The replacement argument is a string, and that contains the text that you want to substitute when there's a match with the regular expression, the first argument. If this contains an ampersand character, it refers to the whole text that was matched. So you can use the text that was matched in the replacement. The third argument, being optional, is the target, and that's the string or field that will be changed. Now it has to be a string variable or a field name, since it's actually changed in place. So sub is just called, and when it's run, providing it matches, it will have changed whatever you pointed it at. If you don't use this argument at all, then field dollar zero, which is the whole input buffer, input line, input record, is modified. And one thing to bear in mind is that it searches for the leftmost, longest match using the regular expression argument. So it's one of these regular expression engines which are greedy. It will find the longest match that fulfils its specification. So that's important to bear in mind. I did hammer on about this in the sed series. I haven't gone into that much detail here, but maybe in later episodes we'll use some examples that show this in a bit more detail.
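Two of the points about sub made above, the leftmost-longest match and the optional target argument, can be sketched like this. The input strings and variable names are invented for the example.

```shell
# No third argument: sub() works on $0. The match is the leftmost one and
# as long as possible, so a+ swallows the whole first run of a's.
longest=$(echo 'baaad aaa' | awk '{sub(/a+/, "X"); print}')
echo "$longest"    # bXd aaa

# With a third argument only that field (or variable) is changed, in place.
field2=$(echo 'one two three' | awk '{sub(/t/, "T", $2); print}')
echo "$field2"     # one Two three
```

In the second command the "t" in "three" is untouched because the target was restricted to field two.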
Now since this is a function, it returns a result, which you can throw away if you want to, but it returns the number of changes made. Well, since it can only make zero or one, the result that's returned will be either zero or one. So I've done some examples, and I'm not sure whether it's wise to try and read these out. I'll try the first one and then I'll gloss over the second and third. So what I've got here is an example of a command line command where I echo the string banana through a pipe to Awk. And then in the quoted Awk script on the same command line, I've got open curly bracket, sub, open parenthesis, slash, a, n, slash, comma, and then in double quotes (double quotes being the string delimiter in Awk) I've got two capital X's, close parenthesis, semicolon, space, print, close curly bracket, close quote. So what that's doing is: it's been told to find the first occurrence of the string an, lowercase, in the dollar zero field, the whole record, and replace it with two X's. So the answer you get back is bXXana. The second example shows pretty much the same thing, except that the replacement is two ampersands. So if an an is matched, it's replaced with anan. So you get as a result bananana. I have a very strange sense of humour. I'm sorry. The third one does a little bit more. It does the same as the previous one, except that it captures the result of the sub function call in a variable called n. And then it prints out changes made equals the value of n, and the result, which is the contents of dollar zero after the change has been made. So you get the message changes made equals one, and the result is this banana with an extra an in it. Okay, so nothing very complex there. It can get complex in the regular expression, but we'll get onto more advanced examples in a while. Let's look at gsub.
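The three sub examples read out above look roughly like this on the command line. The first two follow the spoken description closely; the exact wording of the message printed by the third is an approximation, since only its gist is given in the audio. The variable names `r1` to `r3` are just for this sketch.

```shell
# 1. Replace the first "an" in $0 with "XX".
r1=$(echo banana | awk '{sub(/an/, "XX"); print}')
echo "$r1"    # bXXana

# 2. "&" in the replacement stands for the matched text, so "&&" doubles it.
r2=$(echo banana | awk '{sub(/an/, "&&"); print}')
echo "$r2"    # bananana

# 3. Capture sub()'s return value: the number of changes made (0 or 1).
r3=$(echo banana | awk '{n = sub(/an/, "&&"); print "Changes made = " n ", result = " $0}')
echo "$r3"    # Changes made = 1, result = bananana
```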
gsub is similar to sub and has the same format: gsub, regular expression, comma, replacement, and then an optional target. The arguments have exactly the same purpose, but the function differs in that it searches the target string for all matches and replaces them. It says in the manual that the matches must not overlap. I would never have expected that to need saying, but I thought it was probably worth reiterating anyway. It again returns the number of changes made, but it can be any number from zero to whatever. So we've got the echo banana business again, this time with gsub replacing an with two capital X's, and in this case, because there are two instances of an in banana, you get back bXXXXa. Okay, I kept it simple, hopefully just to make a point of what it's doing. Then the second example shows that if the thing you're matching is ana and you replace that with XX, it only matches once, and that's because banana could be said to contain the sequence ana overlapping. So it's ana, and then if you step back one, ana again. I don't know why, maybe I've been working with this stuff too long, but I can't imagine any system that would see that as two patterns to match. But there you go. That was a demonstration that it doesn't replace the overlaps. Then the last example for gsub is a little bit more substantial. In this case I've got a single-line example where I'm processing the file called file1.txt that was introduced earlier on in the series, and I'm using gsub, and the regular expression consists of, in square brackets, a list, and the list is all of the vowels: a, e, i, o, u.
The replacement in the gsub is a question mark, and the target is dollar one, so it's field one; it's only going to apply to field one of this file. The result from gsub is saved in a variable called n. For every line that Awk reads it will do that gsub, and then it will use printf to print out the result. It's using a format string which consists of percent, hyphen, 12, s, so it's saying print out some string or other in a field 12 characters wide, and that's followed by a space and then, in brackets, percent d, so that's saying print out a number in parentheses. And there's a backslash n at the end, because printf will not generate newlines by itself. The arguments to printf are dollar one, which is the field we've just manipulated, and n, which is the number of changes that were made. So the result is you get a list of the fields where every vowel has been replaced by a question mark, and then at the end of the line you're seeing in parentheses the number of changes. So you get two, two, three, two, etc, etc., and four at the end.
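The gsub examples above can be sketched as follows. The first two use the banana string from the episode; for the third, two made-up input lines stand in for file1.txt, whose contents are not given in the audio. The variable names are illustrative only.

```shell
# Every non-overlapping "an" is replaced.
g1=$(echo banana | awk '{gsub(/an/, "XX"); print}')
echo "$g1"    # bXXXXa

# "ana" occurs twice only if overlaps were allowed, so there is one replacement.
g2=$(echo banana | awk '{gsub(/ana/, "XX"); print}')
echo "$g2"    # bXXna

# Replace vowels in field 1 only, and report how many changes gsub() made.
# (Hypothetical stand-in data; the episode uses file1.txt.)
g3=$(printf 'apple red\npineapple yellow\n' |
     awk '{n = gsub(/[aeiou]/, "?", $1); printf "%-12s (%d)\n", $1, n}')
echo "$g3"
```

With the stand-in data the last command prints the two modified fields left-aligned in a 12-character column, followed by (2) for apple and (4) for pineapple.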
Pineapple's got a lot of vowels. Nothing very exciting, but I hope it makes the point. So sub and gsub are pretty straightforward, but gensub, the third one, is somewhat more complex. It was added to Gnu Awk quite a bit later than sub and gsub. I'll just say as a little aside here that I was using Gnu Awk quite a long time ago, and I've still got a manual from work which I printed out on the laser printer there and bound. It's called The GAWK Manual (it tended to be called gawk in those days), dated 1992, version 0.14, and there was no gensub in that one. So I don't know when it came about. I'm sure you could find out if you really wanted to know, but it wasn't there back in the day. So gensub has got a bunch of arguments which are essentially the same as the others, sub and gsub, except that there's an extra one. I'm going to go through them all one by one, and I've documented them individually in the notes. So regexp, the first one: that's a regular expression, which is usually a constant enclosed in slashes, and you can use any of the regular expression operators that you've seen in this episode and the previous one. But one of the particular things that is of interest is that you can use regular expression groups, which are enclosed in parentheses. I keep making references back to the Learning sed series; I did harp on quite a lot about this in that series, so if you were listening to that then maybe this will not be new to you. The second argument is referred to as the replacement, as before, and it's a string which specifies what's going to be substituted, but it can also contain back references to the things that were captured by the parenthesised expressions in the regular expression. A back reference consists of a backslash followed by a number. If the number is zero then it refers to the entire match of the regular expression, and it's the same as the ampersand character that we've already seen. Otherwise it can be one to
nine; you can't have more than nine back references, which I find strange, but there you go. It refers to one of the parenthesised groups, and they are just numbered in sequence across the regular expression. Now here's an oddity of Awk, in the way that Awk processes strings, a string being a bunch of characters enclosed in double quotes: remember you have to double the backslash. So in order to refer to parenthesised component number one, the string must contain backslash, backslash, one. I always found this to be a bit of a pain, but it's an Awk-ism. You'll find the other regular expression environments don't do this; sed doesn't, for example. The third argument is called how, and it's a string which must contain a capital G, a lowercase g or a number. If it's one of the g's it means global, and it means that all the matches should be replaced as specified in the replacement argument. If it's a number then it indicates which particular numbered match should be replaced. Now this is not referring to the parenthesised groups; it's just referring to the matches of the regular expression within the target. And you can't select more than one, which personally I find a bit of a pain, but there you go. You can't do multiple selections: either you do the whole string, replacing everything, or you choose just one, but you can specify which one it is. The fourth argument is the target again, and it's optional, and if you don't provide it then dollar zero, the whole record, is used. Now it can be a variable containing a string, or a field, or it can be a string constant, a thing in double quotes, and that's because the target is not changed in situ as it is with the other ones, sub and gsub. Instead, the function doesn't return a number as the previous ones do; it returns the changed string. So this needs examples, I'm sure. This is quite a powerful feature; to my mind not quite powerful enough, but it's certainly a lot more powerful than just sub and gsub on their own. So my example is the famous banana
example again, and I'm using gensub this time. I'm saying replace the letter a with a capital A, and the how argument, the third argument I should say, is a g. Now the gensub function call is preceded by a print, because gensub is called, an answer is returned, and then print operates on that, so it prints out the result. As I was coming up with an example here I accidentally made it do the same as I'd been doing with the sub and gsub examples, and put the print on the end, and was wondering why it didn't work. Well, I added another print here; a bare print will print dollar zero, which is the target that's been operated on. You'll see that the first thing that gets printed out is bAnAnA, with all the a's turned into capital A's, and the second thing printed out is banana with everything in lowercase, because nothing has been changed in dollar zero. So that just drives home that point. Then I've got a similar example, again with banana, and I've used gensub replacing lowercase a with uppercase A as before, but the how argument is the number one in quotes. What that's asking is for only the first match to be replaced. So gensub is matching the lowercase a in the word banana, and it's been told: when you've done the first one, stop, don't replace any more. So you see what I meant about it having nothing to do with the groups in parentheses, because there aren't any here. Then the next example is a rather weird and clunky one, but it's just to sort of make the point. What I've got here is again the famous banana being changed, but in the gensub the regular expression is slash, backslash uppercase B, then the letter a in lowercase, then another backslash capital B, then the closing slash. I want to replace the matched a with an uppercase A, and I want to do it globally across the whole thing. But what this is saying is: any a which is preceded and followed by a not-word-boundary; in other words,
any a which is not on a word boundary. So it only changes the first and the second letter a, because the last one is on the word boundary at the end of the word. Now it is a way of doing replacements, but it's clunky, isn't it? I couldn't think of any other way in which you could do just a subset of the available replacements. So the next example shows the use of parenthesised sub-expressions, or regular expressions in parentheses, however you like to say it, and it's really a reference back to something I did in the sed series. I echo the string Hacker Public Radio, with capitals on each word, into Awk and use gensub. Inside the gensub, the regular expression is a backslash lowercase w, so that's a single word character, in parentheses. That's followed by backslash lowercase w followed by a plus sign, in parentheses, so that means one or more word characters. Then the last parenthesised expression is backslash uppercase W, so that means any non-word character, and that's followed by an asterisk, so that means zero to however many instances. The intention with that was that it would match the word Hacker, say: the first group would catch the capital H, the second group would catch the acker and the third group would catch the space, which is a non-word character. The replacement consists of back references: it's backslash backslash 2, then back reference 1, which is backslash backslash 1, then the letters ay, then backslash backslash 3. So effectively it's taken the first letter off the word Hacker, put it on the end, added ay to it and then put back the space after it. And the how argument is a g, so it will keep applying this rule throughout the entire string. It's running on dollar zero, the whole record. So what it does is flip around the letters in the way that I did in the sed series, to turn Hacker Public Radio into ackerHay ublicPay adioRay, which is very primitive Pig Latin. It just amused me, that example. I don't think it amused anybody else, but it amused me. Then the
final example for gensub shows pretty much the same thing, but a different variant of it. In this case I have not echoed Hacker Public Radio into an Awk script, but I've enclosed the entire print gensub call in a BEGIN rule. So it's BEGIN, all in capitals, open curly bracket, then the stuff that I wanted to operate on. The gensub does exactly the same operation and the replacement is exactly the same, but the how argument is a number three in double quotes, so it only applies to the third instance. And the string that I want to operate on, Hacker Public Radio, is the target string, and it's actually there as a string literal in the call, just to show that you can do that. Why on earth you would want to do it, of course, is another question, but it is possible to do. The result is Hacker Public, and then the third matching instance is Radio, which is turned into adioRay. So hopefully that helps to demonstrate some of the features. Now I thought I would end off with a slightly more meaty example, and so I've written a simple Awk script. The Awk script is called contacts.awk and it's operating on a data file called contacts.txt. The idea was that you might have a file of contacts where you had things like name and email address and so on, and you might want to do manipulation on it with Awk. I haven't done anything very exciting this time, but we'll maybe do some other stuff later. Now there are links to these files; they're included with the show, so you can grab them and mess around with them if you want to. As an aside, I used a site called Mockaroo, which I'm sure should be said with an Australian accent, which I won't try. I was going to, but I shouldn't. It's a site for generating free test data. At least, if you want to generate just a little bit like this, it's free; if you want to do something much more advanced then you have to pay. But I just used it to generate 10 chunks of data. It generated the data in CSV form, then I used
Vim to turn that CSV into the final format. I used a plugin called csv.vim, which does some quite cool things with CSV format data, and I used the ConvertData function to turn it into a different format. I've given an example of the first eight lines of the file. You can look at the file yourself, but I've included it in the notes in case you don't want to. So what it's done is generate lines that consist of the word Name, colon, space, then the person's name. Then I've actually included the first and the last name, split into fields, so it's First, colon, and the first name, and Last, colon, and the surname, and Email, colon, space, blah blah blah, and Gender. There were other fields you could generate, but I stopped at that. So that block of lines is followed by a blank line, and then so on and so forth. The reason I chose this format is that there are quite a few contact management things, simple ones for command line use, that will work with that sort of format. So I've included a listing of the script in the notes in case you don't want to grab it yourself, and I've included an example of how to invoke it: you would type awk, space, minus lowercase f, space, and the name of the script, contacts.awk, space, then the name of the file, contacts.txt. Now the script, and I won't go into massive detail with this, I'll just skim through it quite quickly. It's not that complex, I'm sure, and it's also commented, so hopefully it's self-documenting, and there's some jibber-jabber in the notes as well, so there's not much point in me going into too much detail. What it does is this: we've got a file which contains blocks of lines, so we've got line, line, line, line, then a blank line. What I wanted to be able to do is to make Awk treat that block as one entity, in effect as one record. Okay, so this is sort of slightly off the track as regards regular expressions, but bear with me, I think you might find it useful. So in the script there's a BEGIN rule, and in the BEGIN rule various separators are defined. So first of all the
field separator is not the usual space but a newline, the record separator is not a newline but two newlines, and then the output record separator, ORS, is a newline, a bunch of hyphens and a newline: four hyphens. What this means is that as Awk reads this data it will keep reading a record until it hits a double newline. So that blank line is effectively a double newline, because there's one at the end of the last line of the record and then an empty line. When it prints it out, if I didn't change the output record separator it would just come back out again as it was read, I should say, but I put a line of hyphens in there just to make it clearer. As for the field separator: each line in the normal sense, the bit between newlines, we're regarding as a field, and so therefore the delimiter is a newline. Okay, this is pretty useful when you're doing weird stuff with a file formatted like this. I used to do this a lot when I was working, because I was managing an LDAP directory and it tends to write stuff in this sort of general format. Anyway, the program itself is just a bunch of statements in curly brackets, and that means it's going to operate on every record; I'm using record explicitly, rather than line. What I wanted to do in here was simply to show you the difference between four of these regular expression operators, so I'm using sub to place markers at the points where four of these regular expression operators tell it to. The first one is that backslash back quote, which means the beginning of the buffer; I put an open square bracket there. Then I put a close square bracket at the end of the buffer, and that is the backslash single quote. Then I did the same with the circumflex, which is beginning of line or beginning of buffer. It's a little bit vague what the difference is between these things, and which terminology is best to use. And I'm using curly brackets here to place into the
output, just to show where these things are. Then the final statement is print, and I print an open parenthesis followed by the record number for the record I'm printing, a slash followed by the number of fields in that record, followed by a close parenthesis, and then followed by that record, dollar zero. The sort of thing that it generates you can see in the notes; I've just given you the first eight lines. So it's open parenthesis, one, slash, five, close parenthesis, space, then an open curly bracket, an open square bracket, Name, colon, Robin Richardson, etc, etc., and at the end of that block is a close square bracket and a close curly bracket, then a line of hyphens, and so on to the next one. So it's showing you that the start of buffer means what you would expect it to mean, but you might not fully understand what that means if you're just thinking in terms of lines separated by newlines. So Awk can deal with data, can gobble data, in different ways than the sort of conventional way we've been looking at so far. I thought it was about time to mention this now. So okay, I put a warning in here just saying watch out if you've been learning sed, if you're following the Learning sed series: the concepts here are pretty similar, in fact they're almost identical, but the way in which regular expressions are written is different. So do be careful between the two, otherwise you're going to get into trouble. I just thought it was worth pointing out. I have now said it several times, so apologies. So we're done, I think, and hopefully you found that useful and now understand a lot more about how you do regular expression stuff in Awk. I've included an ePub version of these notes, which personally I'm not sure is a good thing. Sorry, not the ePub; the ePub one is good, the PDF version is not so good. I'm not keen on the PDF. I didn't get any feedback last time when I asked for it, from anybody saying keep that one, throw that one, or throw them both, or go away, or anything like
that, but feel free to let me know what you think of these things, because I'd quite like to consolidate all of the notes for this series into one format, probably ePub, at some stage when the series is finished. I'll probably do the same for the sed series at some point as well. Okay, that's it then. Thanks a lot. Bye now. You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.
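For reference, the gensub calls walked through in the episode can be sketched as below. This assumes Gnu Awk is installed as `gawk`, since gensub is a Gnu extension; the commands are reconstructed from the spoken descriptions rather than copied from the show notes, and the variable names are just labels for this sketch.

```shell
# gensub() returns the changed string; $0 itself is left untouched,
# which the second, bare print demonstrates.
e1=$(echo banana | gawk '{print gensub(/a/, "A", "g"); print}')
echo "$e1"

# how = "1": replace only the first match.
e2=$(echo banana | gawk '{print gensub(/a/, "A", "1")}')
echo "$e2"

# \B...\B: only an "a" that is not on a word boundary.
e3=$(echo banana | gawk '{print gensub(/\Ba\B/, "A", "g")}')
echo "$e3"

# Back references must be written with doubled backslashes inside the string.
e4=$(echo 'Hacker Public Radio' |
     gawk '{print gensub(/(\w)(\w+)(\W*)/, "\\2\\1ay\\3", "g")}')
echo "$e4"

# how = "3" with a string literal as the target, run from a BEGIN rule.
e5=$(gawk 'BEGIN {print gensub(/(\w)(\w+)(\W*)/, "\\2\\1ay\\3", "3", "Hacker Public Radio")}')
echo "$e5"
```

The first command prints bAnAnA followed by the unchanged banana. The others print bAnana, bAnAna, "ackerHay ublicPay adioRay" and "Hacker Public adioRay" respectively.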