Episode: 2163
Title: HPR2163: Gnu Awk - Part 4
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2163/hpr2163.mp3
Transcribed: 2025-10-18 15:08:36
---
This is HPR episode 2,163 entitled Gnu Awk Part 4 and is part of the series Learning Awk. It is hosted by Dave Morris and is about 31 minutes long. The summary is: recapping the last episode and looking at variables in an Awk program.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hi everyone, this is Dave Morris and this is episode 4 in the Gnu Awk series. b-yeezi and I are progressing with this, and since this is episode 4 it means we now have a series, which we've called Learning Awk, so the episodes are all joined together and you can find them more easily.

Okay, so what I'm going to start with this time is a recap of the previous episode, and then I'm going to go into a bit more detail about variables in Awk.

So in the last episode you saw logical operators, which are also called Boolean operators, if Boolean algebra and that type of thing means anything to you. Boolean algebra has NOT, AND and OR operators. In Awk the double ampersand (&&) means and, and the double vertical bar or pipe symbol (||) means or. One that wasn't covered was the not operator, which is an exclamation mark (!). We can generate some quite complex Boolean expressions with these, but we'll leave that till later, because we want to deal with it in the context of other Awk statements in an Awk program or script. So we'll expand on this a bit later on.

You also saw last episode the next statement, and this, we discovered, is a way of stopping processing on the current input record, so it really does abort everything. No more patterns are tested against it.
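To make the recap concrete, here is a minimal sketch of the three logical operators and the next statement working together. The sample data (a name, a colour and an amount per line) and the wording of the messages are invented for illustration; they are not the files used in the series.

```shell
# Hypothetical sample data: name, colour, amount (not from the episode files).
data='apple red 4
banana yellow 6
grape purple 2
cherry red 9'

result=$(printf '%s\n' "$data" | awk '
    $2 == "red" && $3 > 5  { print "big red:", $1; next }   # AND, then skip the remaining rules
    $2 == "red" || $3 < 3  { print "red or small:", $1 }    # OR
    !($2 == "red")         { print "not red:", $1 }         # NOT
')
printf '%s\n' "$result"
```

The cherry line matches the first rule, and because of next it is never tested against the second rule, even though it would have matched that one as well.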
The pattern that's currently executing in the current rule, or I should say the actions in the current rule, are finished with, stopped at that point. It's a statement in a similar way to things like print, so you can't use it anywhere other than in the action part of a rule, and you can't use it in BEGIN or END rules either, and I'm going to talk about those in a minute.

So BEGIN and END. BEGIN and END are actually patterns. They're in capitals, capital B E G I N and E N D. They're patterns which are special, and they have to work with an action. You can't have either of them without an action. The action is in curly brackets, as you know, and the whole shebang, BEGIN and an action or END and an action, makes up a rule in the same way as we've seen with the pattern and action sequences. The BEGIN rules are run before the main pattern and action rules are processed, that is, before the input file or files are read. END rules are run after everything has been read and processed. You can have more than one BEGIN and more than one END. It doesn't actually matter which order they occur in, in terms of the BEGINs versus the ENDs, but if you have multiple BEGINs then they are executed in the order in which they are encountered, and similarly with END.

In the last episode we also started to look at variables. It's difficult when describing this sort of subject. This is effectively a language we're talking about here, and you can't really start at the beginning and work through to the end, because there isn't really a beginning; it's quite difficult to find a linear path through it. So we're sort of going ahead into areas that haven't really been explained yet, just to demonstrate certain functions and processes and so on. So there was a bunch of things shown last episode that were commented on: variables, arrays, loops and that sort of thing.
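The ordering of BEGIN and END rules recapped above can be seen in a small sketch. The three input lines and the message text are made up for the demonstration; the rules are deliberately written out of order in the script.

```shell
# The rules are scattered on purpose; Awk still runs every BEGIN first
# (in the order written), then the main rule per record, then every END.
result=$(printf 'a\nb\nc\n' | awk '
    END   { print "end 1: records read =", NR }
    BEGIN { print "begin 1" }
          { print "record", NR ": " $0 }
    BEGIN { print "begin 2" }
    END   { print "end 2" }
')
printf '%s\n' "$result"
```

The two BEGIN messages come out first, in the order they appear in the script, followed by one line per input record and then the two END messages, again in script order.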
So we're going to look at all of these in a bit more detail in this episode, so I'm trying to consolidate them all.

Okay, that was a quick recap of where we were from last episode, and I now want to start talking about variables in relation to Awk. We've already seen things like NR (capital N, capital R) and NF (capital N, capital F), which are the record number and the number of fields, in the early part of the series, and in the last episode you saw that you can create your own variables too.

So what's a variable? Well, as you find in most other programming languages, it's a named storage area that can hold a value, and there are certain rules about how you construct the name. In the case of Awk it consists of letters, digits and the underscore, and it mustn't start with a digit. The case of the letters is significant, so lower-case sum, Sum with a capital S, and SUM all in capitals are three variables that you would speak the same, but they're different. The other name for these types of variables, which can just hold a single value, is scalars. I'm mentioning the name because you might see it if you look in the manual.

A variable in Awk can contain a numeric value or a string value. Awk deals with the conversion of one of these to the other as appropriate. Sometimes it might mistake, if you like to put it that way, what it was you intended, and it might need some assistance, but we'll come back to that later.

Now one of the things you learn as somebody learning programming, or did so back in the day when I was learning this sort of stuff, is that when you create a variable in the language you need to initialize it, because there's no definition of what it contains before you use it. But in Awk that's not so. All variables begin as an empty string, and an empty string is the equivalent of zero if you need to use it as a number. So how do you set variables to values?
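Before answering that, here is a minimal sketch of the two points just made: names are case-sensitive, and a variable that has never been set behaves as an empty string or as zero depending on how it is used. The variable name unset_var is purely illustrative.

```shell
result=$(awk 'BEGIN {
    sum = 1; Sum = 2; SUM = 3          # three different variables
    print sum, Sum, SUM
    print "[" unset_var "]"            # never assigned: an empty string
    print unset_var + 0                # the same variable used as a number: 0
}')
printf '%s\n' "$result"
```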
Well, you do it as you do in most languages: you use an assignment. I've given an example here, count = 3. That's an assignment: count is the name of the variable, the equals sign is the assignment operator, and three is the value you're going to put into it.

Last episode we saw an assignment like used += $3. What this actually means is: increment the contents of the variable with the name used by the contents of field three. So used is the variable, plus equals (+=) is this special type of assignment, and dollar three ($3), as you already know, means field three. There is an assumption here that $3 contains a numeric value, but we'll come on to what would happen if it didn't a bit later on. It's a shorthand version of used = used + $3. So what that means is: add the contents of used to the contents of field three, and then save the result back in the variable used. The first time the variable is incremented its contents are taken to be zero. As I've said, if you were writing in C or Fortran or Pascal or one of those sorts of older compiled languages you could not get away with this, but in Awk and many other scripting languages these days it's not a problem.

So we've started down the road of looking at arithmetic operators, so I thought we would stop and look at the whole list. It's a pretty short list, but I'll just go through them briefly. There's a table in the long notes which you can refer to if you need to, but if you've had any experience of programming most of these will be very obvious to you.

One thing to note before we proceed is that all numbers in Awk are floating point numbers, that is, they have a decimal point in them. This can catch you out in some edge cases, because comparing floating point numbers for equality doesn't always give you the result that you would expect, but we'll highlight these as we go along.

What I've done here is to put together a list
based on what's in the GNU Awk User's Guide, and as before there's a reference to it if you want to go and examine it yourself. I've listed them, as they do, in the order of their precedence from highest to lowest.

So the first one is the circumflex character (^), which is exponentiation. So x ^ y means x raised to the power of y, and something like 2 ^ 3, that's two to the power of three, has the value eight. In Awk there is a double asterisk operator (**) which does the same job, but it's not standard. GNU Awk is slightly different from standard Awk, and we're trying to stick pretty much to the mainstream stuff as much as possible, because otherwise you might get caught out if you try to run your Awk script on a different machine or a different system, a BSD system for one, or perhaps a Mac or something. So we're not going to use the double asterisk operator.

A minus sign in front of a variable or a number obviously negates it. A plus sign in front of one is unary plus, and that's actually a way in which you can tell Awk to treat a variable as a number. As I was typing this out I was trying to think of cases where you'd want to do that and I couldn't come up with any, but hopefully some will occur to me as we go along.

The asterisk (*) is multiplication. The forward slash (/) is division, and there's a note here, which is that because all numbers in Awk are floating point, the result is not rounded to an integer. So three divided by four, which would be written as 3 / 4, has the value 0.75, whereas if you did the same thing in Bash, for example, where arithmetic is purely integer, and typed something like echo $((3/4)), you'd get the answer zero, because it's been truncated to an integer, a whole number.

The percent symbol (%) is the remainder after division, so x % y is the remainder after x has been divided by y. So 3 % 4 is three: three can't be divided
by four, and so the remainder is three. 5 % 2 is one, because two goes into five twice, leaving a remainder of one.

The plus sign is also addition, so x + y has the usual meaning, and the hyphen, the minus sign, is subtraction, x - y. So pretty obvious.

You've already seen the plus equals (+=) operator. This is an assignment operator, and these are shorthand forms of more verbose assignments, which we've already looked at in one particular case. So I put together a table, which is a modification of the GNU Awk User's Guide table, showing all of these operators. You might use +=, -=, *=, /=, %= and ^=, and I think you probably get what they mean in all of the cases. Let's just look at the last one, circumflex equals (^=). If you wrote variable ^= power, so you might type x ^= 2, what that means is raise x to the power of two, so x becomes x squared.

I wrote a little script just to demonstrate these things, and it's available if you want it. It's called arithmetic_assignment_operators.awk, and I've listed its contents. It's simply a bunch of statements which use these various operators and print out the result. The whole thing is in a BEGIN rule because we don't want the script to actually do any file processing. It's just doing a little demonstration of its internal computation capabilities.

As I've written it, the first line after the BEGIN, for example, is x = 42; print "x is", x. So there are two statements there. One is the assignment statement, which sets x to 42. The second one is a print, which prints out the string "x is" followed by the contents of x. There's a semicolon between them. If you write two statements on a line then you need semicolons between them. They could have been written on two successive lines, but I just thought it was a little bit neater doing it this way. So you
need semicolon statement separators if there are multiple statements on a line, but you don't need them if there's only one statement per line, so there's no semicolon on the end. If you're used to other languages where this is necessary, Awk doesn't make it so. It doesn't matter if you put a semicolon on the end of the line as well if you want to. There's something to be said for doing that, I guess, but you don't need to.

Okay, so I've got an example here of what happens when you run this, and I'm not reading you that because it's pretty obvious.

So let's talk about type conversion. A variable can contain a numeric value or a string at any point in time, as we've seen. When converting from a number to a string, what you get is a string containing the number. There's a little bit more to it than that, but we'll leave that for another time. Converting from a string to a number, on the other hand, well, there needs to be something that can be interpreted as a number within the string. In other words, it needs to begin with a digit sequence. So my little example here uses the string "9gag.com". I set that into a variable called s, and then I set x equal to s + 1 and print x. The answer is ten, because Awk pulled the nine off the front of this address and simply added one to it. So the nine off the front was converted to a number and then one was added to it. If there's no valid number in a string when you come to do this type of conversion then Awk will treat it as zero.

Awk will handle strings containing all sorts of numbers. It'll handle integer numbers like 42, floating point numbers like 4.2, and also exponential numbers, using the notation which is common in many languages, 1E3. I've used a capital E in this case but it could also be a lower-case one. 1E3 means one times ten to the power of three, so it's a thousand. So I've got a little example of these three strings being fed to a printf statement and printed out, and the printf
uses the g format control letter, which we haven't really looked at. We're going to spend some time on these control letters a bit later on, but the g one is for printing general numbers. So it prints 42 as 42, 4.2 as 4.2, and 1E3 comes out as a thousand.

Also in the last episode b-yeezi used some operators which consisted of two operators together, plus plus (++) I think he used. These are called increment and decrement operators, and they increment or decrement the value of a variable by one. If you've been following my series on Bash, on parameter expansion and the various other expansions, I covered arithmetic expansion, where I talked about these in the Bash context. You can look at episode 1951 if you've forgotten or if you're interested.

So again I've produced a list of the various operators. For example, ++variable means increment the variable, returning the new value as the value of the expression. So ++variable is different from variable++, because the first one, the prefix plus plus, means add one to it and then return the result. variable++, which is called a post-increment, returns the contents of the variable before it's had one added to it, and then adds one to it. There is a similar pair: --variable, which decrements it and then returns that value, and variable--, which returns the value and then subtracts one from it. There are some examples of how these might be used a little bit later on in the notes.

So that's scalar variables, but there's also a whole bunch of other capabilities in the shape of arrays within Awk. Awk provides one-dimensional arrays. Now there's a little note here to the effect that Awk does actually allow you to have multi-dimensional arrays. Traditional Awk offers this by a sort of a hacky solution.
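Going back to the increment and decrement operators for a moment, the difference between the prefix and postfix forms is easiest to see by printing the value of each expression. This is only a small sketch with an arbitrary starting value of 5.

```shell
result=$(awk 'BEGIN {
    x = 5
    print ++x      # pre-increment: x becomes 6, and 6 is printed
    print x++      # post-increment: 6 is printed, then x becomes 7
    print x        # 7
    print --x      # pre-decrement: x becomes 6, and 6 is printed
    print x--      # post-decrement: 6 is printed, then x becomes 5
    print x        # 5
}')
printf '%s\n' "$result"
```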
GNU Awk provides true arrays of arrays, but I'm not sure that we're going to cover that in this particular series because it's pretty much on the edge. If I wanted to do this personally I would not be using Awk to do it, but you may think otherwise, of course. I think it might be better if we simply point you at the manual to go further with this, but I thought it was worth just pointing out that there's quite a lot in GNU Awk.

The thing about arrays in Awk is that they are so-called associative arrays, which are also known as hashes. So let's talk about what an array is. It has a name, and its name has got to conform to the rules we talked about for scalars. You can't have an array called the same thing as a scalar variable. An array can store multiple values, and to get at them you use an index. Since this is a scripting language, and it's different from compiled languages, arrays can be any length and can be expanded and contracted at will.

So given an array, let's call it a, we might store a value in it. So we type a[1] = "HPR". I've put HPR in double quotes; double quotes are the way you define a string in Awk, by the way. So the array name is a, the index is one, and the contents of a[1] is the string HPR. If you're used to using arrays in other languages you might assume that the index is numeric, but it's not, it's a string. All array indices are strings, because Awk arrays are associative: you use a string as the index. So it's an associative array or a hash. They're indexed by arbitrary string values and they make up a sort of lookup table. It's actually quite a powerful capability.

In one of the examples in the last episode we saw this, and it's just an extract from an example: NR != 1, which was a pattern, and then { a[$2]++ }. So we saw that, and here the Awk
script was being used to produce a frequency count of colors, and we were looking through the file file1.txt, which I imagine you already have a copy of. Field two in this file is the name of a color, so what we're doing here is using the color name as an index and simply incrementing that array element.

I've tried to explain it in text, and what I've typed here means: index the array a by the string contents of field two. If the element doesn't exist, create it. So this can be used even before there is an array a, or an array a with that particular element in it. Since Awk is very relaxed about initialization, this array element will be taken to be zero when it's created, and then the plus plus on the end will increment it to one. If the element already exists then its previous value will be incremented.

So if you ran this particular bit of code, which was in the last episode, and it went through all of the rows in the file1.txt file, then if you could look at the insides of the array when it had finished you'd find an index with the string brown, and the contents would be two, meaning that there were two instances of the color brown. So that means there's an element a["brown"], and in that array element there's the number two. I also noted that a[$2]++ is the same as a[$2] += 1. Both mean the same thing. I've no doubt you're already there ahead of me.

We also saw last time the concept of looping through an array to print it out. I'll just read this out quickly without going into a lot of detail: for (b in a) print b, a[b]. So this was a case of sort of rushing ahead into areas that we hadn't really explained yet, but it was necessary to get some of the precursor concepts sorted out. We haven't looked at looping and other such statements in
any detail yet, but we need to look at this one now so we can understand how you would process an entire array.

So briefly, the for statement provides a way to repeat a given set of statements a number of times. We'll have a look at this and other related things like while and do and so forth later on. This particular variant of the for statement allows the processing of arrays, and it consists of the word for followed by, in parentheses, a variable name, followed by the word in, followed by the name of an array. So it's saying "for every element of this array", and then the for statement is followed by one or more statements which are controlled by it. The expression variable in array results in all of the index values in the nominated array being provided one at a time. While the loop runs, the variable is set to the successive index values, and the body part, the dependent statements, is executed. Now the body part can be a single statement or a group. If a group is used then you have to put curly braces around them, but if there's only one statement you don't need any curly braces.

One thing about the way Awk works is that the order in which the array index values are provided is not defined. So they sort of come out in a random order. It's not really random, but it's an arbitrary, undefined order. Different Awk versions will use different orders in the way they process this. Now GNU Awk does have extensions which allow the ordering of the index values to be controlled, but we'll deal with this later.

So let's just look at the example from the last episode. I've made some modifications to it, changed names and that type of thing, and laid it out slightly differently just to demonstrate the concept a little bit more clearly. This particular example is in a file which you can download from the HPR website, and it's called color_count.awk. I've used the American spelling because b-yeezi had used it throughout his
example, and I know it's basically his example that I've stolen and hacked around. So the array has been renamed from a to count, because it holds counts or frequencies of the number of times colors are encountered. The array is indexed by the names of the colors in field two, and when we loop through the array in the END rule we use the variable color to store the latest index. I took out semicolons and curly braces that were not really necessary, just to demonstrate that they could be removed without any problem. I'll not read this one out because you've seen the gist of it last time, but you might want to check it out just to see how different it is in terms of its layout and use of variable names and so on.

When it runs it does the same as the previous version does. It actually runs against the file called file1.csv, and it prints out a CSV list, a comma separated variable list that is, consisting of the headline color,count and then brown,2, purple,2, etc. So it's just the count of the number of occurrences of those colors.

So to finish, I want to just mention the built-in variables that we've seen so far, and we saw another one added to the list in the last episode. This one is called FS, capital F capital S, and it stands for field separator. It's the internal variable within the Awk script that matches the -F option (minus capital F) that you give to the command. So for example -F "," on the command line is the same as assigning FS = "," inside the script. The statement FS = etc. needs to be in a BEGIN rule in order that it can be set early enough in the script. You can't put it in a pattern and action style rule because that happens too late. You've already grabbed the first record by then, most likely, or a record at least, and that was separated out based on whatever the default separator is, which is a
space. So I've just put a little example here, which is awk -F "," 'BEGIN { print "FS is", FS }', and you get the answer "FS is ,". So you can see it there. It's just to demonstrate the point really.

We also saw OFS, the variable which is all in capitals, O F S, which is the output field separator. It controls the format of the output record produced when you use the print statement, and normally it's set to a space. So we're giving an example here where awk is run and there's a BEGIN rule within it, and it simply consists of print followed by "Hello" in double quotes, a comma, and then "World" in double quotes. So there are two arguments to the print command separated by a comma, and the answer produced is Hello, space, World, because whenever you put a comma in one of these things it tells the print statement to output the contents of the OFS variable, which is a space by default.

If you were to do my second example, which is pretty much the same except that instead of there being a comma between Hello and World, the two separate strings, there's nothing, then you get HelloWorld with no space in between them. That's because Awk has seen these two strings as the arguments to print, concatenated them together, and given print one argument, which consists of the string Hello followed by World with no spaces.

The OFS variable can be set to a string if you want to. So I did a rather silly example where, in a BEGIN rule, I set OFS equal to, in double quotes, space, blog, space, followed by a semicolon and then print "Hello", "World" as before. Then instead of a space between Hello and World you get the word blog. It just proves the point. It can be useful sometimes.

Now the OFS variable only affects the behavior of print, not printf. So there's an example here showing OFS being set to a tab character, then we print out Hello World using printf, and it comes out without a tab in it; OFS has no effect. So
with this I'm just reiterating that printf is always followed by at least one argument, and that first argument has got to be the format string, which specifies what it's to print out and how it's to be formatted. It can then be followed by any number of further arguments separated by commas. In this one the first argument is the string "%s %s\n", with the backslash n in order to get a new line at the end of it, then a comma, then the string "Hello", a comma, and then the string "World". So there are three arguments in total.

You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.