Episode: 2114
Title: HPR2114: Gnu Awk - Part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2114/hpr2114.mp3
Transcribed: 2025-10-18 14:30:26
---
This is HPR episode 2,114 entitled Gnu Awk Part 1 and is part of the series Bash Scripting. It is hosted by b-yeezi and is about 23 minutes long. The summary is: an introduction to the awk text parsing tool.
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Welcome, Hacker Public Radio fans, this is b-yeezi once again. This time I'm going to do a series of tutorials, you can call them, working in collaboration with the famous Dave Morriss, which makes me really excited. He's allowing me to do the intro, and he'll intro himself as well as go into a deep dive as we proceed, but we are going to be doing a little tutorial on awk. In particular I'm going to be focusing on gawk, which is very similar to the original Unix version and has some additional features. I don't know if we're going to go into the differences between awk and gawk, but we are going to at least start with some of the basics right now. So without any further ado, here's awk.
So, from its man page: gawk is the GNU Project's implementation of the AWK programming language. It conforms to the definition of the language in the POSIX 1003.1 standard. This version is in turn based on the description of the AWK programming language by Aho, Kernighan, and Weinberger. And gawk provides the additional features found in the current version of Brian Kernighan's awk and a number of GNU-specific extensions. So that's the beginning of the description of gawk in the man page. Awk is a powerful text parsing tool, to be specific, and like the description says, it is its own language.
Now Dave especially, but also myself, we're going to go into how to put gawk code into a text file, a .awk file if you will. But I'm going to start off with some basic commands to get our feet wet with awk, because you can just do it on the command line with simple inline coding. I use this tool all the time, both inline and in files. The good thing about putting it in files is that it's easy to go back to and run the same command over and over again on different files. But it's really handy if you don't want to, or can't, open up a tool like LibreOffice to parse CSV files, or if you just have some really complex stuff that you might be pulling in from a pipe, from like a sed command or a wget command where you're getting stuff off the internet, and you want to parse it in real time and put it into a file, or pass it into another tool that's going to do more processing on it later.
So I'm going to try to see if I can get a file uploaded, but if not, I have example files right in the show notes. So all you have to do really is just copy and paste the example files right from inside the show notes into a text file and you should be on your way.
So the basic syntax of awk is awk, and then some options, and then inside of single quotes a pattern, and, still inside the single quotes, inside of curly brackets, actions, before you end the single quotes, and then the file that you want to do that to, or the group of files that you want to do that to [awk options 'pattern {action}' file]. It kind of sounds hard, but it really is pretty simple to get started. So you're just going to do awk, dash something, a pattern to search for (but the pattern is optional), and then the action that you're going to do, and then file.txt, .tsv, whatever you're working with.
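The general shape just described can be tried directly on the command line. The following is only a minimal sketch, assuming a POSIX shell with awk on the PATH. The two sample lines fed in with printf are invented for illustration and are not from the show notes, and the $1 and $2 field references are explained later in the episode.

```shell
# General form: awk [options] 'pattern { action }' file...
# Here the input comes from a pipe instead of a file; the text is made up.

# Action only, no pattern: the action runs on every line.
all_first=$(printf 'one two\nthree four\n' | awk '{ print $1 }')
echo "$all_first"

# Pattern plus action: the action runs only on lines where the pattern is true.
matched=$(printf 'one two\nthree four\n' | awk '$1 == "three" { print $2 }')
echo "$matched"
```

The single quotes keep the shell from interpreting the dollar signs, so the whole pattern-and-action program reaches awk untouched.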
So for example purposes I created a file called file1.txt and a companion file with all the same data, file1.csv. The difference between the two is that one is white-space delimited and the other one is comma delimited. Delimited means the way that you're going to separate the different fields in the file. So a comma-separated file, CSV, means that your delimiter, the limit of each column, is the comma. In a white-space one, it's going to be separated by any white space, and that's the default in awk: it's going to parse whatever text string it's looking at by the white space, and it's going to put it into columns that way.
So if you look at the file that I have in file1.txt, the first row is the headers: name, color, and amount. Then under name I have a bunch of different fruit: apple, banana, strawberry, grape, apple again, plum, kiwi, potato (I guess that's not a fruit), and pineapple. In the next column over I have different colors: red for apple, yellow for banana, strawberry red, grape purple, and then for that second apple I have green in this column, so we have a green apple and a red apple. Then for plum I have purple, then brown for kiwi, brown for potato, and yellow for pineapple. And then in my third column I have the amount of each one of those items, so I have four apples, six bananas, three strawberries, 10 grapes, eight green apples, two plums, four kiwis, nine potatoes, and five pineapples.
Now this is going to be a cool file because we're going to be able to do a lot of things with it. In later episodes we're going to be able to do arithmetic on these and do some aggregate functions, but for now we're going to do something really simple. We're going to just do the command awk, and then inside of single quotes you also put curly brackets: so single quote, open curly bracket, print dollar sign 2, close curly bracket, and then the second single quote, file1.txt [awk '{print $2}' file1.txt].
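For anyone following along without the show notes, here is a sketch that rebuilds the sample file from the spoken description and runs the command just dictated. The values come from the episode, but the exact column spacing is a guess, so the file in the show notes may look slightly different.

```shell
# Recreate the whitespace-delimited sample file as described in the episode.
# The alignment is an assumption; awk splits on any run of whitespace anyway.
cat > file1.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# No pattern, so the action runs on every line: print the second column.
colors=$(awk '{ print $2 }' file1.txt)
echo "$colors"
```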
So what that is is: awk, print column 2 of file1.txt. Like I said before, the actions go inside the curly braces. Since we didn't have anything before the curly braces, there was no pattern to match, so it's just going to look at the entire file, and it's going to look in that second column. And since I didn't give it any way to delimit the file other than its default, it's going to use white space. In my example file I lined up the white spaces so that it looks nice, but awk doesn't care about that. It will just parse on white space no matter what, so whether it's one space or ten spaces in one column, or three spaces in another column and 25 spaces in another column, it doesn't care. It's going to parse them all the same and put them all into even columns, starting on the first non-white-space character.
So a couple of things that you can see: it's kind of intuitive, it starts with 1, it doesn't start with 0 like other programming languages. So print dollar sign 1 is going to print the first column, and print dollar sign 2 is going to print the second column. So in this file example, if I say print dollar sign 2, I'm going to print out all the colors. It's going to first put out the header row, color, then it's going to say red, yellow, red, purple, green, purple, brown, brown, yellow. One special column number is 0: if you do dollar sign 0, it's going to print all the columns. So that's just something to know.
So going back to our example, I'm going to add to it. I'm going to say awk, and now inside the first single quote I want to say dollar sign 2 equals equals and then, in double quotes, yellow, and then you can put a space or not, start the curly bracket, print dollar sign 1, close the curly bracket, close the single quote, file1.txt [awk '$2 == "yellow" {print $1}' file1.txt]. What this is doing: since we now have something before the curly brackets, before our action, we have our pattern, and our pattern is dollar sign 2 equals equals
yellow: so look in the second column for the word yellow and print column 1 of file1.txt. If you remember the file, we had bananas and pineapples, and I have both of those in there as yellow, so it's going to just print out banana, pineapple. It's going to skip the header row because the header row didn't have the term yellow in it. That's one thing to understand about awk: it's not going to automatically print the headers unless you tell it to, and we'll talk about that a little bit later in another episode.
Now, right now we've been working with this file that is space-separated, which has a lot of uses, especially on the command line when you're going to pipe other commands into it. You might want to do ls -l and then pipe that into awk, and then you can separate by the columns that way. That's fine, but a lot of times when you're working with data you're going to be working with either tab-separated files or comma-separated files. So if you're not using a plain white-space-separated file (I like to do pipe-separated a lot of times, because then you don't have to worry about double quotes around the text fields to get around commas inside of a text), you might want to use a different field separator.
There are different ways to do field separation in awk. I'm going to go over the most apparent, which is using an option, the dash capital F option. The character or characters that follow dash capital F are your separator. So if you just do dash capital F comma [-F,], that's going to tell awk to use commas for the separator. You don't need to put a space between them, and you don't need any other characters: if you just put dash capital F comma, it's going to do that. If you do dash capital F period, it's going to do dot-separated. However, sometimes you might want to do more complicated field separators that are more than one character. In that
case, you want to put your field separator inside of double quotes. You might see that sometimes in other people's examples: when they are just using commas they'll do dash capital F, double quote, comma, double quote, with no spaces in between [-F","]. That's going to do the same thing as dash capital F comma.
So I have a similar file called file1.csv, which is the same exact file but taking out the spaces and putting a comma in between. If we run the same command, this time awk dash capital F and, inside of double quotes, comma, space, inside of single quotes dollar sign 2 equals equals, inside of double quotes, yellow, space, inside of curly brackets print dollar sign 1, end the single quotes, file1.csv [awk -F"," '$2 == "yellow" {print $1}' file1.csv], it's going to give us the same exact output as if we were doing the white-space-delimited one without the dash capital F option, which is banana and pineapple.
Inside of those patterns you can use regular expressions as well. I have an example here that's awk, inside of single quotes, dollar sign 2 and tilde, which on a US keyboard layout is the one right above the tab if you hit shift. So tilde, space, inside of forward slashes. Awk, for regular expressions, kind of like Perl, likes the tilde to say this is going to be a regular expression, and inside of forward slashes goes the expression that you want to evaluate. I'm not going to go into regular expressions, that's a whole other topic, but in this example I'm doing p dot plus p, so I'm looking for a p, any one or more characters in between, and then another p. After that, inside of curly brackets, our action, print dollar sign 0, and then close the single quote, file1.txt [awk '$2 ~ /p.+p/ {print $0}' file1.txt]. So I'm looking in column two for any words that have the pattern of p, anything in between, p. It returns the entire line of grape purple 10, because purple, which is in column two, has two letters in between
the p and the second p, and then also plum, which in the second column is also purple, so it's matching purple in both cases.
Numbers can be evaluated in the pattern as well, and it does this kind of intuitively. In our example we have numbers in our third column, so if I say awk, dollar sign 3 greater than 5, and then inside of our action print dollar sign 1, comma, space, dollar sign 2, close the action, close the single quote, file1.txt [awk '$3 > 5 {print $1, $2}' file1.txt], I'm going to print both the first and the second column if the value in the third column is greater than five. It's a good idea to go look at that example, but it's pretty intuitive: you're going to say, if column three is greater than five, print column one and column two. I'm sure you can see applications for this if you ever have to work with data that you have to manipulate. So continuing along with this, for the output you're going to find banana, grape, apple, and potato, because those are all the ones that had values that were higher than five in our example file.
You could also take that and redirect the output into a file. So I do that same exact thing, and for this example I'm going to use the CSV file, just to show it doesn't matter, because it's still going to print it out space-delimited: awk dash capital F comma, inside of the single quotes dollar sign 3 greater than 5, inside of our action curly braces print dollar sign 1, comma, space, dollar sign 2, end the action, file1.csv, then greater-than sign, output.txt [awk -F, '$3 > 5 {print $1, $2}' file1.csv > output.txt]. It's going to put name color in the first line, then banana yellow, grape purple, apple green, potato brown, in a file called output.txt. So that's a nice way to be able to filter out things that you want from a file and put them into another file.
And here's a cool trick that I learned from one of the references that I give at the end of the episode. If you do this command, awk, and inside of the
single quotes, inside the curly braces, print, greater-than sign, dollar sign 2, and then right next to the dollar sign 2, inside of double quotes, dot txt, close the curly brace, close the single quote, file1.txt [awk '{print > $2".txt"}' file1.txt]. I recommend for any of the episodes that we're going to be doing in this series that, if you really want to follow along and you don't want to just listen to our lovely voices, you get out the show notes, because it's really helpful.
But anyway, in that awk print command we're actually doing a redirect inside of our print statement. That's what that print, greater-than sign means. The target is dollar sign 2 dot txt, so we're looking at column two, and for whatever is in there, all the matching lines are going to go into their own file. I'm not explaining this very well, so I'll do it again. Print, greater-than sign, dollar sign 2, and then in double quotes dot txt, file1.txt, is going to create a group of files: one yellow.txt, one red.txt, one color.txt, one brown.txt, one green.txt, because those are all the different things that you can find in that second column. In my example it's going to print out all the columns that are in there, and the lines are going to go into their own files. So it's a really quick way to take a whole bunch of data that might be all intermingled and separate it into individual files of like information.
If you were going to do this in Excel, you'd have to do a filter, uncheck the boxes that you don't want, pick the only one that you do want, highlight all those, copy it, paste it into another file, and save that file, and then do the same thing for the next option in your filter, and the next option, and the next option. This, in one command, automatically makes all
the different files, a whole series of files, based off of the pattern that you're matching. It's really cool. I mean, at least to me. Maybe I'm just a dork, that's fine. So those are some of the commands that you can do.
Now one other thing I'm going to introduce, but I'm not going to go into right now, is that sometimes with awk you can get really complicated in how you set up how you're going to parse the file. If you want to do some pre-processing, and then do some more processing, and then do some counts and some sums and some division and all that kind of stuff, it's going to get really cumbersome on the command line. So you're going to want to put all that in a file, and a lot of times the convention is that it'll be the file name dot awk. Then to get access to it you'll do awk, dash lowercase f, filename.awk, and then file1.txt [awk -f filename.awk file1.txt]. I'm pretty sure that for the remainder of our episodes we're going to be using files, because as we get more advanced in awk it really does, like I said, get cumbersome to deal with awk on the command line when you have, you know, 15 lines of commands that you want to put in.
So that's the introduction. I'm excited to get into this series with Dave. Hopefully we are able to enlighten some people and teach some new things, and hopefully I'll learn a couple of new things as we go. I've already learned this new technique of separating things into individual files based on the match, so it's pretty cool.
I also have a couple of resources that I found online to help. I don't know if everyone knows about linux.die.net: linux.die.net/man is like the man page for everything in Linux. So linux.die.net/man/1/awk is the man 1 page of awk.
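Returning to the dash lowercase f option described above, here is a minimal sketch of what such a script file might look like. The file name fruit.awk and the choice of rules are hypothetical: the script simply collects two of the one-liners from this episode into one file, with a label added to each print so the two rules can be told apart. The sample data is recreated from the spoken description so the block stands on its own, and its spacing is a guess.

```shell
# Sample data as described in the episode (column alignment is an assumption).
cat > file1.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# fruit.awk is a hypothetical name. Each rule is: pattern { action }.
# Every input line is tested against every rule, in order.
cat > fruit.awk <<'EOF'
# Names of the items whose color (column 2) is yellow.
$2 == "yellow" { print "yellow:", $1 }
# Name and color where column 3 compares greater than 5. The header row
# matches too, because the word "amount" is compared as a string.
$3 > 5 { print "more than 5:", $1, $2 }
EOF

# Run the rules in the file against the data, as in: awk -f filename.awk file1.txt
result=$(awk -f fruit.awk file1.txt)
echo "$result"
```

No single quotes are needed inside the script file, since the shell never sees its contents as command-line arguments.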
Another really cool tutorial, and I'll be doing some of my examples following it, is from www.theunixschool.com, and then some other ones are from Tecmint.
Upcoming in our series, we will be talking about more of the other options besides dash lowercase f and dash capital F. We will also be talking about some of the built-in variables that are in awk, and we will do some arithmetic operations and some fancy text manipulation, as much as we can without going into sed, and going over the awk language and its syntax. Once again, thank you for listening to Hacker Public Radio. This is b-yeezi, signing out.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.