Episode: 2143
Title: HPR2143: Gnu Awk - Part 3
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2143/hpr2143.mp3
Transcribed: 2025-10-18 14:53:19
---
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting
with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello Hacker Public Radio, this is b-yeezi once again, coming in with another episode
about the Unix command line tool awk. Okay, so this time we're going to continue where Dave Morris left off, and I want to continue where I left off from my first episode,
and we're going to talk about some of the more advanced features of awk. Now, as we discussed before, for the most part we're going to be running awk from a file,
which kind of turns it into its own programming language, so you have a whatever.awk file. I've been using this more and more in my work life,
just because, like I said before, if I'm going to have to run a command over and over again, instead of having to look it up in my bash history, it makes sense for it just to be in a file.
And also, when it's in that file, it'll help me remember how to do other commands in the future, so I'll copy it, save as, and reuse it for other files in the future.
One other thing: you might have to excuse me a little bit. I just got out of the dentist and my mouth is still a little numb, so I'm going to try to articulate as best I can.
But here we go, awk Part 3. If you remember, before we had our file, and we're going to be using that file in both CSV and tab-delimited format some more in this episode.
One thing that I discussed before is that the first part of an awk command is a pattern that decides which lines to do actions on.
What you might have done in the past, or what I sometimes do, is use the grep command and then pipe the output of grep into awk.
grep actually gives you a lot more flexibility. There are some advanced features in grep, like the dash capital A number and dash capital B number options, that let you say: I'm going to look for this string of text plus the five lines after it, or the five lines before it.
So if you're going to do something like that, you still want to use grep, but if you're just going to look for specific text, instead of using grep it makes sense just to use awk and not have to worry about piping.
So I just wanted to give a little heads up about that.
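A minimal sketch of that point. The sample rows are an assumption, reconstructed to agree with the values quoted later in the episode rather than copied from the original file, and the `work` directory and variable names are only for illustration.

```shell
# Sample data reconstructed from values quoted in the episode (an assumption).
work=$(mktemp -d)
cat > "$work/file1.txt" <<'EOF'
name color amount
apple red 4
banana yellow 6
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5
EOF

# Two processes: grep filters the lines, awk prints the first column.
with_grep=$(grep 'purple' "$work/file1.txt" | awk '{ print $1 }')

# One process: the awk pattern does the filtering itself.
awk_only=$(awk '/purple/ { print $1 }' "$work/file1.txt")

printf '%s\n' "$awk_only"
rm -rf "$work"
```

Both forms select the same rows here; grep stays the better fit when context lines (`-A`, `-B`) are needed.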
One other thing that I wanted to touch on that wasn't really discussed too much before is logical operators in awk.
When I say logical operators, in particular I mean the AND and the OR symbols.
So what if I want to match two things? Well, one thing you could do is awk something and then pipe that into another awk, or grep it and grep it twice.
Or you could just use a logical operator, and that's what this is for.
So in our example file, we have multiple things that are purple.
So in my awk example, I want to say awk, dash lowercase f, and then my awk file name, which is logical.awk,
and then file1.txt. Inside of that awk file, we have dollar sign 2, equals equals, purple inside of double quotation marks.
What that means is: look in the second column, if you remember, and look for the word purple.
And then after that, I have a space and I say ampersand ampersand, dollar sign 3, less than 5.
And then after that, inside of curly braces, print dollar sign 1.
So the entire command is dollar sign 2, equals equals, purple, ampersand ampersand, dollar sign 3, less than symbol, 5, and inside of curly braces, print dollar sign 1.
And what that means is: if the second column is purple and the third column is less than 5, print the first column.
In this example, we have two things that match purple, which are grape and plum.
Only one of them is less than 5, which is plum.
And so the end result is going to be the word plum.
So ampersand ampersand is the AND operator, and double pipe is the OR operator.
You might have seen those in other programming languages as well; they do the same job as AND and OR in SQL.
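The command described above can be sketched as follows. The sample rows are an assumption reconstructed from the values quoted in the episode, and the OR variant is added only to show the contrast.

```shell
# Sample data reconstructed from values quoted in the episode (an assumption).
work=$(mktemp -d)
cat > "$work/file1.txt" <<'EOF'
name color amount
apple red 4
banana yellow 6
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5
EOF

cat > "$work/logical.awk" <<'EOF'
# Second column is purple AND third column is less than 5.
$2 == "purple" && $3 < 5 { print $1 }
EOF

and_result=$(awk -f "$work/logical.awk" "$work/file1.txt")

# The same two tests joined with || (OR) match when either side is true.
or_result=$(awk '$2 == "purple" || $3 < 5 { print $1 }' "$work/file1.txt")

printf 'AND:\n%s\n' "$and_result"
printf 'OR:\n%s\n' "$or_result"
rm -rf "$work"
```

With this data the AND form prints only plum, while the OR form also picks up every purple row and every row with an amount under 5.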
Moving on, the next command we're going to talk about is the next command.
And I did that kind of on purpose.
We're going to have kind of a complicated one coming up, so I suggest you look at the file that I created.
And this is why it's good to make an awk file. If you're doing something really simple, you might not need to,
but for something as advanced as this, you will.
So say, for instance, we want to take this:
we have this file and we want to put a double asterisk on all rows where the amount is 8 or above,
and a single asterisk if it's above or equal to 5, and anything else we want to leave alone.
Now, you can do this without using the next command,
and you might be able to figure out how to do that. But if you don't use the next command, what you're going to do is evaluate the entire rest of the awk file for that line, even though you've already hit your match criteria.
And so it makes sense to instead use the next command to say: once you've found your match, do what you need to do and then skip to the next line.
So, for instance, something that you just might want to do in general is, if you're going to parse the file and you're going to filter a lot of stuff out,
you might still want to bring the header line along, because the header line is good and you might want to know that information.
So, to do that you use NR, and as we've seen before, NR is the row number: NR equals equals one,
so double equal sign, one. That means the first line. Then inside of curly braces, print dollar sign zero, semicolon; on the next line write the word next, semicolon; and then close the curly brace.
What that's going to say is: when you see the first line, print everything (print dollar zero, which is everything) and then move on to the next line of input.
Now, if you didn't do that, awk would print the first line and then keep testing that same line against every other pattern in the awk file,
and then do the same for the second line, third line, fourth line, fifth line, sixth line. Now, in a file that has seven or eight lines, not a big deal, but when you work with a file that has, you know, one or two hundred thousand lines, you're going to see the performance gain by using the next command.
So, say I want to print that header line. I did that. Now, the next thing I want to do is everything 8 and above. I'm doing this in this order on purpose because it makes the next command work nicely.
So, I want to start from the highest and then work my way down. So, dollar sign three is greater than or equal to: I'm using the greater than sign right next to the equal symbol with no space in between. That's greater than or equal to.
Dollar sign three greater than or equal to eight, open curly brace, printf, and as we saw in Mr. Morris's episodes, with printf, inside of double quotes
I'm going to put what I want to write in my formatted print, and it's percent s, backslash t, which is a tab character,
percent s again, and then backslash n, which is a newline character. So, what that means is: whenever you see a percent sign s, that means put a string here.
If I did a percent sign d, that means put some type of whole number here. So, to continue our printf statement, after we close the double quotes: comma, dollar sign zero, which as you know is everything, comma, and inside of double quotes again,
asterisk asterisk, and then end that line with a semicolon. On the next line, write the word next, semicolon. I like to use another new line for closing the curly brace,
and you'll see that in my example files here. So, if I read that whole printf line again, it's printf, inside of double quotes,
percent s, backslash t, percent s, backslash n, close the double quotes, comma, space, dollar sign zero, comma, and then inside of double quotes again, asterisk asterisk, and then end the line with a semicolon.
What that means is: after the closing quote, whatever the first item is gets inserted at the first percent s, and that's everything, the dollar zero.
And at the second percent s, you're going to add the second item after the comma, which is asterisk asterisk.
So what we're going to do here is write everything out, then a tab character, and then asterisk asterisk.
And then we're not going to go on. We're going to stop and move to the next line. We're not going to continue to evaluate.
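A small sketch of how the printf placeholders get filled in, using literal strings instead of `$0` so it runs without any input file. The values and variable names are illustrative assumptions.

```shell
# %s takes the next argument as a string; \t is a tab and \n a newline.
line=$(awk 'BEGIN { printf "%s\t%s\n", "grape purple 10", "**" }')

# %d takes the next argument as a whole number.
count=$(awk 'BEGIN { printf "%d items\n", 3 }')

printf '%s\n%s\n' "$line" "$count"
```

The arguments after the format string are consumed left to right, one per placeholder.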
So after that one is complete, we have a new block. Now we're going to do something very similar on the next line: dollar sign three greater than or equal to five,
and then open curly brace, and then the same exact printf statement: inside of double quotes, percent s, backslash t, percent s, backslash n, close the double quotes, comma,
same thing with the dollar sign zero, and this time, after the comma, inside of double quotes, a single asterisk.
And that means, just as we explained before, if it's five or above, it's going to add a single asterisk. And then on the next line, next again.
If we don't put those nexts in there, then after awk evaluates the first block, it's going to evaluate the second one too. Now anything greater than or equal to eight gets written out twice, once with two asterisks and again with one.
That's not what we want. That's why we use the next command.
So after that printf statement, on the next line I write next, semicolon again, and then on the next line close the curly brace. And then on the lines after that: dollar three, less than five.
There we're just going to print dollar sign zero. We don't have to do next after that. So going back from the top and saying it in plain language: if it's the first row, print that whole row
and then go next. If the value in column three is greater than or equal to eight, print out everything followed by a tab character and a double asterisk, and then move on to the next line; if it's greater than or equal to five, add a single asterisk; and otherwise print just the text by itself.
So you can see why, in an example like this, putting all that in one big line in one single bash command is kind of annoying. So it makes sense to go awk, dash lowercase f,
and then this one is called next.awk,
and then the file name, which is file1.txt.
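Put together, the script walked through above might look like the sketch below. The awk file is reconstructed from the spoken description, and the sample rows are an assumption consistent with the values quoted in the episode.

```shell
# Sample data reconstructed from values quoted in the episode (an assumption).
work=$(mktemp -d)
cat > "$work/file1.txt" <<'EOF'
name color amount
apple red 4
banana yellow 6
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5
EOF

cat > "$work/next.awk" <<'EOF'
# Header row: print it unchanged and skip the remaining rules.
NR == 1 {
    print $0;
    next;
}
# Highest threshold first, so next can stop at the first match.
$3 >= 8 {
    printf "%s\t%s\n", $0, "**";
    next;
}
$3 >= 5 {
    printf "%s\t%s\n", $0, "*";
    next;
}
# Everything else is printed as-is; no next needed on the last rule.
$3 < 5 {
    print $0;
}
EOF

result=$(awk -f "$work/next.awk" "$work/file1.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

Each input line is printed exactly once: three rows get a tab and a double asterisk, two get a single asterisk, and the rest pass through untouched.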
The next example I'm going to go over is the END command. So we have these other two commands called BEGIN and END, and just to summarize what they mean: BEGIN is all the actions you want to take before executing the awk statement,
and END means: after evaluating everything you did in awk, do this. And we'll see why that's useful. It's mostly useful so that you don't have to leave awk.
Like for the BEGIN command, you could do preprocessing and then pipe the output of the preprocessing into awk, but instead of doing that, you can just stay in awk and say: BEGIN, do this stuff, then do awk, and then be done.
And it's the same thing with the END command. You could, in theory, just do awk and then pipe the output of awk into another command, but why, when awk can do it all?
So for this example, if you're on a Unix-like system, whether that be Linux or Mac, you'll find this example helpful. You can do something similar in Windows, because there's something similar to the df command in Windows; I just don't know it off the top of my head, but I know it's there.
So what I'm going to do is df -l, and if you're not familiar with df, that's going to look at all of your local file systems and basically tell you how much space is used in each.
There's a column for available, there's a column for used, and there's a column for percent. If you just leave it with that syntax, there are some other columns in there too.
But what I want to do, just for this example, is take the output of that, pipe it into awk, and then dash f, my file that I call end.awk,
because this example is the END command.
So in this example, what I want to do is take the sum of all of the used space, because if you just do df -l, there's a whole bunch of stuff.
In my configuration, I have tmpfs file systems that are in the output, and I don't want to look at those.
I want to skip the tmpfs entries, and then I want to look at the used column, which in this example is the third column, and I want to look at the available column, which is number four, but I don't want to see all of that.
All I want to see is the sum of all of the used column, and the sum of all of the available column.
So, what I'm going to do in this example, on the first line of my file, I'm going to say dollar sign one, not equal to, which is the exclamation point followed by the equals sign, and inside of double quotes, tmpfs.
So, T-M-P-F-S, and then inside of curly braces, I'm now declaring a new variable here called used, U-S-E-D, plus equals dollar sign three, semicolon.
And on the next line, available plus equals dollar sign four, semicolon. So, that means I'm making a new variable.
I could do something like say used equals zero, semicolon,
and on the next line say used plus equals dollar sign three, but in awk you don't need to.
I can declare it and start adding stuff to it right away. So, if you're not familiar, plus equals is the same as used equals used plus whatever comes after it.
So, by saying used plus equals dollar sign three, I'm saying: whatever value I have for used right now, add to it whatever is in column three of this row.
And likewise for available. So, after that, I'm going to close that curly bracket, and then on the next line, put in all capital letters, E-N-D, open up a curly brace, and put in this statement.
printf, inside of double quotes, percent d. Remember I said before, percent d is for numbers, if you want to explicitly say it's a number. Percent d, space, GiB,
so capital G, lowercase i, capital B, and then the word used, with a backslash n right next to it without a space, which means new line.
And then percent sign d, space, GiB, available, backslash n, close the double quotes, comma, used, forward slash, which is divided by, two, then the caret symbol, 20.
So, if I read that all the way out, the thing after the comma is used divided by two to the 20th power, and then comma, available divided by two to the 20th power.
And end with a semicolon. Now, if I read the whole thing in plain English, it's printf, which means formatted print: print the number for the first thing that comes after the first comma,
and then GiB used, and then on the new line, write out the value of available divided by two to the 20th power, then GiB available, and then a new line.
And then it closes the curly brace. So what that's going to do, if I do df -l and pipe that into awk -f end.awk, is take that output and give me used and give me available, in gigabytes.
So that's pretty cool, and it shows pretty well how to use this to do a sum. What it really does is, for every row, it takes used and keeps on adding more and more to the value, but it doesn't print any of that.
After that, you do the END command and say: after you've done all that evaluation, do what's inside the curly braces. And in this case, what we're doing is just writing printf, and then what we want inside of our printf.
So you'll see that used often, and the cool thing about putting it in an awk file is that now I can take this and put it on another computer and run df -l, pipe that into awk -f end.awk, and it's going to give me the new values for this new computer.
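A sketch of the end.awk script as described, run against canned input so the result is repeatable. The df-style text, device names, and sizes are made up for illustration, and the GiB label assumes df reports 1K blocks, which is not true on every system.

```shell
work=$(mktemp -d)

cat > "$work/end.awk" <<'EOF'
# Skip tmpfs rows; accumulate the Used ($3) and Available ($4) columns.
$1 != "tmpfs" {
    used += $3;
    available += $4;
}
# Runs once after all input: 1K blocks divided by 2^20 gives GiB.
END {
    printf "%d GiB used\n%d GiB available\n", used/2^20, available/2^20;
}
EOF

# Hypothetical df -l style output in 1K blocks (an assumption, not real data).
cat > "$work/df.txt" <<'EOF'
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda1      104857600  52428800  52428800  50% /
tmpfs            8388608   1048576   7340032  13% /dev/shm
/dev/sdb1      209715200  10485760 199229440   5% /data
EOF

result=$(awk -f "$work/end.awk" "$work/df.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

On a live system the equivalent would be `df -l | awk -f end.awk`. The header row is harmless here because its non-numeric text counts as zero when added.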
Alright, so in a very similar light, I'm going to go over one more cool thing that you can do with END. I'm not going to go into the details about how this works, but I just want to give you a little taste because this is something that I've had to do very often.
And that is, sometimes I want to get a distinct list of items in a file, and you'll see how useful that is.
And you know, if you were going to do that in SQL, it would just be select distinct, so select distinct X from whatever the thing is you're looking at.
In this example, I want to do a similar thing, but what I want to do is use awk. So here's the file.
It goes NR, which is the row number, not equal to one; then inside of the action curly braces, a, inside of square brackets dollar sign 2, then outside of the square brackets plus plus; close the curly bracket; and then END.
So now we're doing the END command: inside of curly brackets, for, inside of parentheses b in a, close the parentheses; inside of action curly brackets again, print b; close the curly bracket; then close the END curly bracket.
So all together, it's: row number not equal to one, and we're creating an array called a: a, dollar sign 2, plus plus.
So do that, make that big array called a with the dollar sign 2 values in it as keys. And then, for every key in a, print b, which is what we're calling the key.
And it's very similar to how you do a foreach loop in, was that C# or Perl? Perl, yeah.
I just now mixed them up. It's like how you do a foreach loop in Perl or a for loop in Python.
It's a for-this-in-that type of thing: print that.
So it's very useful, and you'll find lots of examples. If you just Google awk distinct, it'll give you an example like this,
with a for b in a, for example.
So that's a pretty advanced script. Hopefully, Dave will go into more detail about how it works in his episode.
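A sketch of the distinct-list script described above. The sample rows are an assumption reconstructed from the values quoted in the episode, and the `sort` at the end is an addition: awk does not promise any particular order for a `for (b in a)` loop.

```shell
# Sample data reconstructed from values quoted in the episode (an assumption).
work=$(mktemp -d)
cat > "$work/file1.txt" <<'EOF'
name color amount
apple red 4
banana yellow 6
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5
EOF

# Every value seen in column 2 becomes a key of the array a, so each
# color is stored once no matter how many rows carry it.
result=$(awk 'NR != 1 { a[$2]++ } END { for (b in a) { print b } }' \
    "$work/file1.txt" | sort)

printf '%s\n' "$result"
rm -rf "$work"
```

The array keys are the distinct values; the counts stored under them are simply ignored in this version.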
I'm going to switch focus and go over to the BEGIN command.
Like I said before, the BEGIN command is for a whole bunch of stuff that you might want to do before you evaluate your awk statement.
In this scenario, there's some stuff I want to do before and some stuff after.
In particular, I'm going to take that file1.csv file,
and instead of having to say dash capital F comma in my awk command, I want to explicitly say inside of the awk file that the field separator is a comma.
For that, we use the FS (capital F, capital S) global awk variable.
So I'm going to write BEGIN in all caps, and inside of curly braces,
capital F capital S equals, inside of double quotes, a comma, then end that with a semicolon. On the next line,
I'm going to say OFS equals comma, and OFS is the output field separator.
So FS is field separator and OFS is output field separator.
Usually awk just puts a space between the fields in the output,
but I'm saying explicitly that I want a comma as the separator.
And then a semicolon, and then on the next line, I put print color comma count,
and that's inside of double quotes, and then ended with a semicolon.
So what I'm saying in my BEGIN statement, this is my preprocessing, is: the file's field separator is a comma, my output field separator is also a comma,
and the first thing I want to do is print color comma count.
So it's like I'm telling awk what I want my header to be.
Then after I close my BEGIN curly brackets, I'm going on to what's in my normal awk statement, which is NR not equal to one, which as we've seen before means, you know, don't look at the header row.
Do this: a, and inside of square brackets, dollar sign two, plus equals one, semicolon, and then close that NR block's curly brace.
And then after that, do END.
So now we're using the END command: inside of curly braces, for b in a,
inside of curly braces, print b comma a, inside of square brackets b; end the curly brace, and then end the curly brace again.
So that was the curly brace for the for loop and then the curly brace for the END block.
So what this is doing is giving me a distinct count from our file1.csv.
As we see from the dollar sign 2, I'm looking at the second column, so give me the distinct colors, and as you get the hint from my header line, for every color, give me the count.
Not what's in column 3:
it's just that however many times you find the color, make the count go up by one.
So the header is going to say color comma count.
So with the output of that, you end up having an output that's comma separated, with color comma count on the first line, and then for the different colors it's going to say
red comma 2, yellow comma 2, purple comma 2, green comma 1, brown comma 2.
And then if you wanted to, you could redirect that output into a file, say file2.csv, and now you have the output in its own CSV file.
So that's pretty cool. It's a way that, without having to load this data into a SQL-like database or some other database and then running these commands on it, you can do it straight out of awk.
So this is the type of thing I use awk for all the time.
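A sketch of the count script described above. The script name `count.awk` is hypothetical since the episode does not name it, and the CSV rows are an assumption reconstructed from the counts quoted in the episode.

```shell
# Sample CSV reconstructed from values quoted in the episode (an assumption).
work=$(mktemp -d)
cat > "$work/file1.csv" <<'EOF'
name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5
EOF

cat > "$work/count.awk" <<'EOF'
# Runs before any input: set separators and print the output header.
BEGIN {
    FS = ",";
    OFS = ",";
    print "color,count";
}
# Skip the input header; bump the counter for this row's color.
NR != 1 {
    a[$2] += 1;
}
# Runs after all input: one line per distinct color.
END {
    for (b in a) {
        print b, a[b];
    }
}
EOF

result=$(awk -f "$work/count.awk" "$work/file1.csv")
printf '%s\n' "$result"
rm -rf "$work"
```

The header always comes first because it is printed in BEGIN, but the order of the color rows may differ between awk implementations. Redirecting with `> file2.csv` would save the result as its own CSV file.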
Now with that END command, a little bit more complicated: instead of getting the distinct count, I'm going to do the distinct sum.
For every color, add up all the values in column three for that color and give me what the totals are.
So now instead of having, for instance, red comma 2, because there are two reds, we're going to have red comma 7, because in column three, where there are reds,
the amounts are a four and a three. So four plus three is seven.
So this command looks very similar to the other command, where we do the same thing with our BEGIN statement.
It's still FS equals comma and OFS equals comma, and in our print, this time we're going to say color comma sum. Everything else in our command looks exactly the same except for inside of our NR curly braces.
So NR not equal to one, and inside of the curly braces we're going to say a, inside of square brackets dollar sign two, plus equals dollar sign three.
And that's saying you're still looking in column two for the keys,
but instead of plus-equaling one, which is adding one, you're going to actually add whatever value is in column three.
So if we look at these two commands, they're exactly the same except for that part. In the other one, the count command, it's a, inside of square brackets dollar sign two, plus equals one.
Here it's a, inside of square brackets dollar sign two, plus equals dollar sign three. And then we still end with the same END command, which is for b in a, and inside of curly braces, print b comma a, inside of square brackets b.
Everything else just works the same way. So that is a really cool thing as well. If you were doing that in SQL, you'd be doing select color, sum of amount, group by color, whatever.
So that's that command. Pretty cool that you can do that in awk without any other helpers.
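A sketch of the sum variant, which differs from the count script only in what gets added to the array. The script name `sum.awk` is hypothetical, and the CSV rows are an assumption reconstructed from the values quoted in the episode.

```shell
# Sample CSV reconstructed from values quoted in the episode (an assumption).
work=$(mktemp -d)
cat > "$work/file1.csv" <<'EOF'
name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5
EOF

cat > "$work/sum.awk" <<'EOF'
BEGIN {
    FS = ",";
    OFS = ",";
    print "color,sum";
}
# Add the amount in column 3 to the running total for this color.
NR != 1 {
    a[$2] += $3;
}
END {
    for (b in a) {
        print b, a[b];
    }
}
EOF

result=$(awk -f "$work/sum.awk" "$work/file1.csv")
printf '%s\n' "$result"
rm -rf "$work"
```

This plays the role of a SQL `GROUP BY color` with `SUM(amount)`; for the red rows, 4 plus 3 gives the 7 mentioned above.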
So we've shown you here a couple of different examples of how to use the BEGIN command and the END command, and we went over some of the different things you can set inside of that BEGIN command, like the field separator and the output field separator.
There's a whole bunch of other things you can do there, and I suggest that you look at the awk man pages or just look up awk examples online. Hopefully, between Dave and myself, we'll go over more examples in the future.
So this is b-yeezi once again. Hacker Public Radio, signing out.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.