Episode: 2184
Title: HPR2184: Gnu Awk - Part 5
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2184/hpr2184.mp3
Transcribed: 2025-10-18 15:27:19
---
This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting
with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair, at AnHonestHost.com.
Good morning Hacker Public Radio fans, or good evening or good night, whatever the case may be for you. This is b-yeezi signing in once again, bringing you another episode
of the GNU Awk series. This is episode 5, where we will be discussing, I guess it's just really me, I will be discussing regex in the GNU Awk programming system.
So if you have any experience with regular expressions you can listen lightly and not pay that much attention, because doing regular expressions in GNU Awk is actually not that different from doing them in Perl or in sed or in other languages: Python, even C#.
But if you are new to regular expressions, don't be afraid. This can be a little intro for you to this really powerful tool for manipulating text.
I actually spend a lot of time searching and indexing text at my day job, so it is really interesting to me how to use these tools, and I end up using them a lot of the time to do other things just for home.
So I am not going to go into all the details of regular expressions, because it is a really big topic, but I am going to go into a good amount of detail and bring it all in for the explicit purpose of the Awk tool.
So some things may be different in other languages, but for the most part you will be able to follow this in other places and see it the same.
So to start off with, why would you want to use a regular expression? Well, a regular expression is a way to find a match to a string of text without it having to be an exact match of the text.
So if you want to do something simple, just replace one word or one letter or one group of letters with another group of letters, then you don't need to use regular expressions.
You can just say replace this with that. But if your patterns are more tricky, and you want to be able to match multiple things, then regular expressions are a good thing to use.
And in these examples we are going to be using our file1.txt that we have been using all throughout the regex series, excuse me, the Awk series. But these are just examples, simple fun examples. Sometimes if you see some of these you would say, why wouldn't I just use regular substitution instead, or regular matching instead?
And you are probably right. For simple cases regular substitution is okay, but for what we are doing right now it is important to see that these are just examples, and it can get really complicated if you want it to.
So the syntax for regular expressions in Awk comes in two places. Well, it can come in more, but the two places I am going to focus on right now are, one, matching the rows that you want to pass to the rest of your program, so the filter.
And then also in some Awk functions that you can use for substitution or matching.
So for doing that filtering, the syntax is the word or the column that you are looking for in your data, followed by the tilde sign, which on a US keyboard is the character above the Tab key.
But you have to press Shift to get to it. That character is similar to what it is in Perl: in Perl, if you want to match something, you say equals tilde. Well, in Awk you just use tilde, and then after the tilde you put your regular expression with a forward slash on both sides, which is the key next to the right Shift on an American keyboard.
And that is how you match it. So where in a regular filter you would say word, equals equals, and then the thing, here you are going to say word, tilde, and then inside of forward slashes your regular expression. To do a negation, it is similar to what you would do with a regular filter: instead of doing
exclamation point equals, you're doing exclamation point tilde. So not matching this regular expression will be word, exclamation point, tilde, forward slash, the regular expression, forward slash again, and then a space and then print, or whatever, you know, the rest of what we do inside of
curly brackets. So from our file1.txt file, if we did something that said, for example, dollar sign one, tilde, inside of forward slashes p and then inside of square brackets e l u,
and then the second forward slash, print dollar sign zero, we would get a filtered list of all the items that have a p and then any one of the letters e, l, u.
And in our example data we have apple, grape, plum and pineapple, which have a p followed by either an e, l or u. And so that was an example; we're going to go into some of the details about how that example works.
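Written out, the filter described above might look like the sketch below. The contents of file1.txt are an assumption here: the fruit names are the ones mentioned in the episode, while the colors and amounts are illustrative stand-ins. The variable names are only there to make the results easy to inspect.

```shell
# Assumed stand-in for file1.txt: names are from the episode,
# colors and amounts are illustrative guesses.
cat > file1.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# Rows whose first column contains a p followed by e, l or u.
matched=$(awk '$1 ~ /p[elu]/ {print $0}' file1.txt)
echo "$matched"

# Negation with !~ : rows whose first column does NOT match.
unmatched=$(awk '$1 !~ /p[elu]/ {print $1}' file1.txt)
echo "$unmatched"
```

With this sample data the first command prints the apple, grape, apple, plum and pineapple rows, and the second prints the header plus banana, strawberry, kiwi and potato.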
Another quick example is dollar sign two, tilde, inside of forward slashes e and then inside of curly braces two, and then the second forward slash, print dollar sign zero. We're going to do a similar thing, but now we're going to be looking in column two for anything that has two e's next to each other, and print the whole line.
And we only have one example of that in this file, because only the color green has two e's.
So what does all that mean? What do those squiggly braces and numbers and kind of random-looking letters mean? You'll notice that if you ever look at regular expressions, like a really complex one, you'll just see a whole bunch of stuff: maybe a character every now and then that you recognize, with a whole bunch of special characters all over the place. And if you are
unfamiliar with regular expressions it looks like gobbledygook, but there is actually a lot of meaning in that. So to start with, let's go over some of these characters. There's this term called an anchor, and an anchor is like either a starting-off point or an ending point of your regular expression. So take the caret, or the top hat, or whatever you want to call that
little mark, which is Shift-6 on an American keyboard. That symbol means the beginning of a line. So if you wanted to find in your search a p at the beginning of the line, so only the things where the first letter on the line is a p, you use caret p.
The inverse would be dollar sign, so dollar sign means the end of the line. So if you're looking at p dollar sign, you're looking for the last letter on a line to match p.
And so you can have a whole bunch of other characters before that in your match, but if you're ending with the dollar sign, that means that's the end of the line.
And so if you have a p in the middle of the line it won't be matching that; it'll be matching if you have a p at the end of the line.
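To see the two line anchors side by side, here is a minimal sketch. The word list is made up for illustration and is not the episode's data file.

```shell
# A made-up word list, one word per line.
words=$(printf '%s\n' apple plum potato grape pineapple turnip)

# ^p : the line has to start with p.
starts_with_p=$(echo "$words" | awk '/^p/')
echo "$starts_with_p"    # plum, potato, pineapple

# p$ : the line has to end with p.
ends_with_p=$(echo "$words" | awk '/p$/')
echo "$ends_with_p"      # turnip
```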
The next one: sometimes the first letter you might want to match is the beginning of a string, so skipping the white space, the first letter of the string, and that is backslash capital A.
All right, so backslash capital A, p would say, if the first letter in the string that I'm looking for is a p, then you have a good match.
The inverse of that would be backslash lowercase z. So if I did p, backslash lowercase z, that would mean that the last letter of the string is the p.
Another important anchor is the lowercase b, so backslash lowercase b. You'll see a lot of times that you'll have a backslash and a regular character that denotes what some of these anchors or other special characters mean.
And that's because there's only a finite number of special characters on a keyboard, or in Unicode. I guess really on the keyboard; you could put them in from Unicode if you wanted to, but no one would be able to type them.
So what they do instead is start using regular letters with a backslash in front to give them a new meaning.
So backslash b is a word boundary. I use this one all the time. So if I want to just find the end of the word in my match, then I would do backslash b. Backslash b and backslash z are different in that the end of a string would include, like, a period or some other character like that,
but backslash b would not. I'm pretty sure that's right. We'll just pretend that that's right, and if I'm not right, please feel free to correct me, because I'm doing this kind of off the cuff.
So those are the main anchors. There are others, and I'm going to point you all to some resources at the end which go into a lot more detail, because who can remember all these things the first time they do it, or even the 900th time they have used regular expressions? Even now I still use some of these resources, just so I can either check my regular expressions or find something that I don't have or I'm missing.
So there are some other characters. If I want to match characters, there are some cool ways to do it. Obviously if I just put in the character by itself I'm matching that one character, or if I put a series of characters together, like abc, then only when I find the term abc will I find a match. But I can also put them inside of square braces.
So instead of saying abc or abcd, if I put inside of square braces a and d, that means the character that I'm looking for is either an a or a d. Inside of square braces, for any of the characters that you put in there without any other markings, you're looking for any one of those characters, as we saw in our example above.
When we were looking for p and, inside of square brackets, e l u, it doesn't matter what order e, l and u go in. I just like to use alphabetical order because it makes it easier, but it doesn't matter what the order is: you're going to be matching a p with either an e or an l or a u. If I didn't have the square braces I'd have to match exactly p e l u.
What if, in our example here where I'm showing square brackets a d, I didn't want to match just the characters a and d, but I want to match a, b, c, d? I can either just type abcd inside of the square brackets, or I can use the dash, or the hyphen, whatever you want to call it, between a and d. So a dash d, kind of like how it looks, means
a to d, or a through d, so that would match a, b, c, d. So a lot of times what you'll see is a dash z, which is any letter a through z. On the other hand, if you had capital A dash capital Z you'd be matching uppercase letters. I'm not going to go into case too much, I'm going to save that for, maybe Dave can cover that, but in general
regular expressions are case sensitive, as you would expect, because there are so many variations of strings that if it was case insensitive, one, it wouldn't be very precise, and two, it would take more resources.
But anyway, going back to where I was. So inside of square brackets once again: sometimes you don't want to match something, sometimes you want to not match something. And so if, right after the opening square bracket, I put a caret, that means not, in this context. Why did they not stay consistent and use an exclamation point here? I don't know, but that's what they do. It would be consistent to say
exclamation point, but instead we say caret. So if I say open bracket, caret, a dash d, close bracket, that means not the characters a to d. So I don't want to match those; I want to find any character that is not one of those four.
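The bracket forms above are easy to compare by replacing whatever they match with a marker. A small sketch, using made-up input strings:

```shell
# [ad]   : either a or d
any_of=$(echo "abcdef" | awk '{ gsub(/[ad]/, "X"); print }')
echo "$any_of"     # XbcXef

# [a-d]  : any character from a through d
range=$(echo "abcdef" | awk '{ gsub(/[a-d]/, "X"); print }')
echo "$range"      # XXXXef

# [^a-d] : any character that is NOT a through d
negated=$(echo "abcdef" | awk '{ gsub(/[^a-d]/, "X"); print }')
echo "$negated"    # abcdXX

# Matching is case sensitive: [A-Z] only hits the capitals.
capitals=$(echo "Hello World" | awk '{ gsub(/[A-Z]/, "_"); print }')
echo "$capitals"   # _ello _orld
```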
So there are some other characters that we can match. Backslash w is any word character, so anything that's not a white space character. Backslash s is any white space character, so that includes the tab character, the space, I think it includes, I don't think it includes newline, no.
But any white space character. Backslash d is any digit, so you could do, inside of square brackets, zero dash nine, but a shorter way to do it is just backslash d, and that matches any digit.
So for all of those, backslash w, s and d, if you do the capital version of it, it's a negation. So if I want to match any non-white-space character I do backslash capital S. If I want to match any non-digit then I do backslash capital D.
If I want to match any non-word character then I do backslash capital W, and those kind of make sense. There are some other boundaries and some other special characters. Like I said, I will refer you to the references if you want to get more detail, but, you know, we only have so much time.
So let's go to the next thing. There is a standard, I think it's been talked about on HPR before, there's a standard way of building software. The standard is called POSIX, and there is a
POSIX-compliant, or a POSIX standard, way of referencing a lot of those characters. So whereas grep or PCRE grep or sed or awk might use a different way of doing it, most of them also accept the POSIX standard way of doing it. So I have them here in the show notes. We have
alnum, to match any alphanumeric character. It's similar to backslash capital S, kind of, so any non-white-space. All of these have a square bracket on both sides, and inside the square brackets, both opening and closing, there is a colon, so it goes square bracket, colon, whatever the word is, colon,
square bracket. So alnum is alphanumeric. alpha, like it sounds, is alphabetic. blank is tab and space, so square bracket, colon, blank, colon, close square bracket matches the white space characters space and tab.
cntrl, so square bracket, colon, c n t r l, colon, square bracket, is any control character. digit is another one, that's just like the backslash d. graph is another one, which is for characters that are both
printable and visible. So a space is printable but not visible, whereas an a is both printable and visible, so it's kind of like backslash w.
lower is for matching any lowercase character. Then we have print, which is just printable characters, it doesn't matter if they're visible or not. punct, which is any punctuation character, is a great one to be able to say not punctuation, or if you want to replace punctuation with nothing to get rid of punctuation, that's a great one to use, because a lot of times you just want to get the dot out of there.
space is another one, for space, tab, form feed and some others. You can go to the POSIX standard to look up all these things and get more detail. upper, like it sounds, is any uppercase character, and xdigit matches characters that are part of the hexadecimal system, so one of those 16 characters.
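Two details are worth pinning down before using these in awk. First, a class name such as [:digit:] is only valid inside a bracket expression, so in practice it is written with a second pair of brackets, [[:digit:]]. Second, GNU Awk accepts \s, \S, \w and \W as extensions (with \w meaning letters, digits and underscore), but it has no \d, so [0-9] or [[:digit:]] is the way to match a digit. A minimal sketch with a made-up input line:

```shell
line="Hello, World! 42"

# Replace punctuation with nothing to get rid of it.
no_punct=$(echo "$line" | awk '{ gsub(/[[:punct:]]/, ""); print }')
echo "$no_punct"   # Hello World 42

# One or more digits; the portable stand-in for \d+.
digits=$(echo "$line" | awk '{ gsub(/[[:digit:]]+/, "#"); print }')
echo "$digits"     # Hello, World! #

# Uppercase letters only.
uppers=$(echo "$line" | awk '{ gsub(/[[:upper:]]/, "_"); print }')
echo "$uppers"     # _ello, _orld! 42

# White space (space, tab and friends).
no_space=$(echo "$line" | awk '{ gsub(/[[:space:]]/, ""); print }')
echo "$no_space"   # Hello,World!42
```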
What is that, 0 through 9 and then a, b, c, d, e, f? Yes. So now we can say, okay, I'm matching one of these characters; how can I say I'm matching multiples of these characters? That's where quantifiers come in. So if I want to
match just one d, I just put d, but if I want to say match d one or more times I will use the plus character. So d plus would match a d, a dd, up to like 25 d's. Using the asterisk matches zero or more times, so it doesn't have to find it, but if it does find it, it matches.
So the asterisk is basically a wildcard for anything. The next character is the question mark. Question mark means zero or one time. A lot of times I'll use this one, not necessarily in awk but in other cases: if I'm looking for matching tags in HTML, I'll do, like,
the start of the tag, and then put a forward slash with a question mark next to it, and then div, and then close the tag, and that way it'll find the beginning div and also find the slash div at the end. So that's a good one to know.
Then I have curly brackets. Inside of curly brackets, any number means match this thing that many times. So if I don't want to say match d one or more times, but I want to match only twice, like we saw in the example above when we matched e exactly two times, I put e, curly bracket, two, the other curly bracket.
If we want to match it two or more times instead of one or more times, so maybe one time we don't want to find, but if it's two or more we want to match that letter, we would put inside of curly brackets two, comma, then nothing, and then the closing curly bracket. If I want to say between two and six times, it'll be inside the curly brackets two, comma, six.
So two comma nothing is two or more, two comma six would be two to six, and if I opened the curly brackets and said nothing, comma, six, that would be zero to six times. I know it's a lot, but we're getting through it.
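Here is a short sketch of those quantifiers, anchored with ^ and $ so that each pattern has to account for the whole line. The sample lines are made up for illustration. The gawk manual documents the interval forms {n}, {n,} and {n,m}; writing {0,6} spells out the zero-to-six case explicitly. Some older awk implementations do not support intervals at all.

```shell
samples=$(printf '%s\n' x d dd ddd)

one_or_more=$(echo "$samples" | awk '/^d+$/')       # d, dd, ddd
zero_or_more=$(echo "$samples" | awk '/^xd*$/')     # x (an x followed by zero d's)
exactly_two=$(echo "$samples" | awk '/^d{2}$/')     # dd
two_or_more=$(echo "$samples" | awk '/^d{2,}$/')    # dd, ddd
one_to_two=$(echo "$samples" | awk '/^d{1,2}$/')    # d, dd

# Optional slash: finds both the opening and the closing div tag.
tags=$(printf '%s\n' '<div>' '</div>' '<span>' | awk '/<\/?div>/')

echo "$one_or_more"; echo "$zero_or_more"; echo "$exactly_two"
echo "$two_or_more"; echo "$one_to_two"; echo "$tags"
```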
Like I said at the beginning, regular expressions are a lot of information, but even if you just get the basics down it's really helpful in a lot of things, and as you get the basics down and you start to have use cases for this, you'll find reasons to do more and more.
And some of the resources I'm going to point you to will not only help you build regular expressions but will also test them and show you how they work, so kind of like an explain: it'll explain what the regular expression that you just wrote means.
So you can put a group of items together in a match, and to do that you need to use square braces, I mean round braces, which is parentheses. So if I put the word foo inside of parentheses, that means I want to match the entire word foo.
But if I put a pipe and then put bar after the pipe, so inside of parentheses I put foo, pipe, bar, that means foo or bar. So I'm going to be matching this entire text: every time I see foo or bar, I'm going to be able to match that.
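A quick sketch of grouping with alternation, using a made-up word list. The parentheses matter once anchors are involved: they make the ^ and $ apply to both alternatives.

```shell
words=$(printf '%s\n' foo bar baz foobar)

# Whole line must be exactly foo or exactly bar.
exact=$(echo "$words" | awk '/^(foo|bar)$/')
echo "$exact"      # foo, bar

# Without anchors, foo or bar may appear anywhere in the line.
anywhere=$(echo "$words" | awk '/foo|bar/')
echo "$anywhere"   # foo, bar, foobar
```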
Some other things, and now I don't think awk can really do this, but a lot of times what you'll do in, like, sed or PCRE grep, which is an extension of grep, like a Perl extension of grep, is instead of matching foo or bar, you'll put foo in parentheses and then you'll put bar in parentheses.
And then, say every time I found foo bar next to each other I wanted to say bar foo instead, you would be able to say matching foo, matching bar, and then in the substitution put bar first and put foo later. We'll get to that later on if we have to; I think we might have already done it in sed.
But basically you're going to use the references to what's inside of those parentheses by number, and say, well, that was the first one, that was the second one, that was the third one, to be able to switch them all around or exclude some of them, and that way you can do cool things.
That's where a lot of this stuff comes in, not just when you're matching but when you're replacing, and that leads us right into the next topic, which is replacement.
So sometimes you'll have a document, and every time you see one thing you want to replace it with another, or you want to add text to it to either denote that it's special, or whatever your use case might be where you
might want to replace something in text. The functions for that in awk are sub, gsub and gensub, and I'm not really going to go into gensub, you can read about
it, but sub and gsub are pretty commonly used. sub is just substituting the first match that you find with whatever the replacement
is. gsub is replacing every match that you find in a string. So it's kind of different in awk than it is in sed, where your string might be really, really big, because in sed it might be like an entire document,
but in awk it's really only talking about the column that I'm looking at right now. So if, only in column three,
for some made-up data, there were three matches to your regular expression, sub would only replace the first match it found; gsub would replace all three.
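The difference is easy to see on a single field. In this sketch the input row is made up, and column three contains two matches rather than three:

```shell
row="x y banana"

# sub: only the first match in column 3 is replaced.
first_only=$(echo "$row" | awk '{ sub(/an/, "AN", $3); print $3 }')
echo "$first_only"   # bANana

# gsub: every match in column 3 is replaced.
every=$(echo "$row" | awk '{ gsub(/an/, "AN", $3); print $3 }')
echo "$every"        # bANANa

# Both functions return the number of replacements they made.
count=$(echo "$row" | awk '{ n = gsub(/an/, "AN", $3); print n }')
echo "$count"        # 2
```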
And so another special character that you'll run into in replacement is the ampersand. The ampersand, which is Shift-7, means the matching text.
So that's similar to other languages; I know it's the same in Vim, if you're doing a Vim replacement you can use that as well.
The idea is, if I wanted to say I'm looking for this text, and I don't want to just replace the text but I want to augment the text, I want to put something before the text, I'll put whatever I want to put before the text and then ampersand.
And then there's the other way you do it. Say you have the word foobar and want to replace foobar with bazbar, or whatever: you would match foo,
and then instead of having to say foo again in your substitution, you can just say ampersand, and awk will do what it needs to do.
Hope that makes sense, hope I'm explaining that adequately, but you will have an example in a second to see how that works.
All right, so replacement. Here's one example. So in our file1.txt file, if I did the syntax, so inside of the curly braces, so if I've already done a filter, inside of the curly braces
I'm going to do sub, then open parenthesis. So everything that we're substituting is going to happen inside of parentheses,
similar to how a lot of programming languages do functions, where right after the function name you have an open parenthesis and then you have some arguments that are separated by commas.
And that's what we have here. So for sub and gsub, it is sub, and then the first argument is the regular expression,
the second argument is the replacement, the third argument is what you want to replace in, and there's an optional fourth argument, I think. Let me see.
Yeah, I think there's an optional fourth argument, which is where to start from; that's common in a lot of languages. So let's see, I'm in the official documentation.
No, it's just a target. So for sub it's the regular expression, the replacement, and then the target, and you can have multiple targets, but that's what it is.
And oh, if you didn't put a target, so right now in my example I'm going to put a target, but if you don't put a target it will do it everywhere in your entire file, in all the rows and all the columns.
So I have in my example sub, and inside of forward slashes apple, so every time I find apple, replace it with nut, inside of
double quotes, comma, dollar sign one. So I'm only going to do this in column one: every time I see the word apple, replace it with nut.
And then I put, after the closing parenthesis, semicolon, print dollar sign one, which means print column one. So I'm looking in column one and I'm printing column one.
You could do a substitution in column one and print column three, but that would be kind of a waste of time, because you're never going to see the change in column three.
But you could also do print on column zero, which is all columns, and that way you'd see everything,
but you'd only make the changes in column one. So once again, if I didn't have that dollar sign one inside of my substitution as the last argument, it would look
everywhere in the file and replace apple with nut. And so the output of that is going to be, after name, you're going to have nut, and then you're going to have another nut
later on, and then for the last one you're going to have pinenut, because it replaced pineapple with pinenut. So that's once again, like I said at the beginning, you know, that's a time
where we didn't have to use a regular expression, but we did. It's not really a regular expression; you know, we could have done it a simpler way to match.
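As a sketch, the command just described can be tried on a list of names standing in for column one of file1.txt (the names are the ones mentioned in the episode). One precision on the missing-target case: according to the gawk manual, when the third argument is left out, sub works on $0, the whole current line, and because the rule runs for every line, each line gets its first match replaced.

```shell
# Stand-in for column 1 of file1.txt.
names=$(printf '%s\n' name apple banana strawberry grape apple plum kiwi potato pineapple)

# Replace apple with nut in column 1, then print column 1.
renamed=$(echo "$names" | awk '{ sub(/apple/, "nut", $1); print $1 }')
echo "$renamed"    # name, nut, banana, ..., nut, plum, kiwi, potato, pinenut

# No target: sub works on the whole line ($0), first match only.
whole_line=$(echo "apple pie apple" | awk '{ sub(/apple/, "nut"); print }')
echo "$whole_line" # nut pie apple
```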
Now for this next example that's not the case. So here in this example I do, inside of curly braces, sub, inside of parentheses, and then inside of the forward
slashes I have dot, which I think I covered. Did I cover dot? I didn't cover dot. Let me add it right now. So dot is: match any character. Let me add that to my notes here too.
Match any character, oh yeah, see, match any character with a dot. Then plus, so any character one or more times, and then inside of parentheses
pp, pipe, rr, close parentheses. So I'm looking for any character one or more times, followed by either two p's or two r's, and then I close that
forward slash. So the regular expression is always inside of the forward slashes. So after the regular expression is over: comma, inside of double quotes,
oh, that's my phone, that's funny, inside of double quotes test, dash, ampersand, then close the double quote, comma,
dollar sign one, and then close the parentheses again, semicolon, print dollar sign one. So you can tell right now, going from the end back up,
we're going to only be printing column one, and our target is also column one. We're going to be doing the substitution on any
characters that come before, and then either a pp or an rr, so that's our matching case. We're going to write test, the word t e s t, dash, and then the thing
that was matched, which was that word. So in our example here with file1.txt, if we run this on
file1.txt we'll get test-apple, because apple has pp in it. Moving right along, then there's banana without a test in front of it, but then there's test-strawberry,
because strawberry has rr in it, and then grape, and then test-apple again, and then plum, kiwi, potato, and then test-pineapple,
because once again pineapple has two p's in it. So as you can see, the ampersand here is being turned into the matching text. Now say I wanted to just put a literal test, dash, ampersand.
Well, I have to use the escape character, which is the backslash. So instead of that second argument being test dash ampersand, it would be test dash backslash ampersand,
and that tells awk, or a lot of regular expression readers, that where there is usually a special character, just put the literal character here instead. That works with all of those special characters that are used as anchors or special digits or anything, so if I just wanted to match a period, I'd put backslash period.
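The ampersand example and its escaped variant can be sketched as follows. The name list is a stand-in for column one of file1.txt. Note one assumption about spelling: inside an awk string constant the backslash itself has to be escaped, so the literal-ampersand replacement is written here as "test-\\&", which is the form the gawk manual documents.

```shell
# Stand-in for column 1 of file1.txt.
names=$(printf '%s\n' apple banana strawberry grape plum pineapple)

# & in the replacement stands for the text that matched.
tagged=$(echo "$names" | awk '{ sub(/.+(pp|rr)/, "test-&", $1); print $1 }')
echo "$tagged"   # test-apple, banana, test-strawberry, grape, plum, test-pineapple

# Escaped ampersand: the match ("app") is replaced by a literal "test-&".
literal=$(echo "apple" | awk '{ sub(/.+(pp|rr)/, "test-\\&"); print }')
echo "$literal"  # test-&le

# Escaping works in the regular expression too: \. is a literal period.
dots=$(echo "a.b.c" | awk '{ gsub(/\./, "-"); print }')
echo "$dots"     # a-b-c
```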
All right, so that is all I'm going to go into. We're already over half an hour and I don't want this to go on longer, because, you know, people have other things to do with their lives, right? You guys can't be listening to my voice all day.
So what I'm going to do instead is let Dave take it from here and give you some more details, and also I'm going to point you to some resources that I use all the time when I'm working with regular expressions.
And these are generic ones. There are also some other places if you want specific regular expression details for either Vim or Perl or Python or Java or C#. Some of these have slight variations, so you might
have to go to their specific documentation for it. But the resource I use all the time, number one, is regex101.com. Is it a dot com? Yeah, regex101.com.
It's a way to test your regular expressions, and it also gives a description of what your
expression says. So if you have some example text, you can copy and paste it into the text part, and then you can copy and paste your regex into the top part, or just write it in there.
And you can set your flags, like global, so instead of doing gsub you just put a g at the end, in the last field, and then it'll explain to you what is being matched and how it's being matched.
It's a really great resource, but it's kind of a lot; it has a lot of buttons and a lot of different ways to look at it.
If you want something simpler there is RegExr, which is regexr.com. So it's very similar to regex101.com, but it's more simple.
Another good resource is, I don't even know how to pronounce this, but I think it's Grymoire, grymoire.com, so www.grymoire.com, which is g r y m o i r e dot com, slash Unix slash Awk.html.
But if you just go back a level, up to Unix, he has a lot of cool stuff. A lot of his sed and awk tutorials are, like, the standard;
they're referenced all the time for how to understand these things.
The next one I want to point you to is the official GNU Awk User's Guide, which is at https://www.gnu.org/software/gawk/manual/html_node/Regexp.html.
You'll see it in the show notes. But yeah, the official GNU Awk manual:
a lot of both my notes and, I'm pretty sure, Dave's notes have come directly from here, or at least the topics have come directly from here.
This is a great place to find it. I've also run into a cool article on TechSmith, on Tecmint, about using awk, so you can also check that out.
It doesn't go into too many details on regular expressions, but it's also a good resource.
But that's just about it for me on this episode of Hacker Public Radio. Stay tuned for more. Bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.