Episode: 2184
Title: HPR2184: Gnu Awk - Part 5
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2184/hpr2184.mp3
Transcribed: 2025-10-18 15:27:19
---
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Good morning, Hacker Public Radio fans, or good evening or good night, whatever the case may be for you. This is b-yeezi signing in once again, bringing you another episode of the GNU Awk series. This is episode 5, where we will be discussing, well, I guess it's really just me, I will be discussing regex in the GNU Awk programming system.
So if you have any experience with regular expressions, you can listen lightly and not pay that much attention, because doing regular expressions in GNU Awk is actually not that different from doing them in Perl or in sed or in other languages: Python, even C#.
But if you are new to regular expressions, don't be afraid. This can be a little intro for you to this really powerful tool for manipulating text.
I actually spend a lot of time searching and indexing text at my day job, so it is really interesting to me how to use these tools, and I end up using them a lot of the time to do other things just for home.
So I am not going to go into all the details of regular expressions, because it is a really big topic, but I am going to go into a good amount of detail and bring it to Awk, for the explicit purpose of the Awk tool.
So some things may be different in other languages, but for the most part you will be able to follow this in other places and see it the same.
So to start off with, why would you want to use a regular expression? Well, a regular expression is a way to find a match in a string of text without it having to be an exact match.
So if you want to do something simple, just replace one word or one letter or one group of letters with another group of letters, then you don't need to use regular expressions.
You can just say replace this with that. But if your patterns are more tricky and you want to be able to match multiple things, then regular expressions are a good thing to use.
In these examples we are going to be using our file1.txt that we have been using all throughout the regex series, excuse me, the Awk series. But these are just examples, simple fun examples. Sometimes if you see some of these you would say, why wouldn't I just use regular substitution instead, or regular matching instead?
And you are probably right: for simple cases regular substitution is okay. But for what we are doing right now it is important to see that these are just examples, and it can get really complicated if you want it to.
So the syntax for regular expressions in Awk comes in two places. Well, it can come in more, but the two places I am going to focus on right now are, one, matching the rows that you want to present in the rest of your equation, so the filter.
And then also in some Awk commands, or some Awk functions, that you can use for substitution or matching.
So for doing that filtering, the syntax is the word or the column that you are looking for in your data, then the tilde sign, which on a US keyboard is the character that is above the Tab key.
But you have to press Shift to get to it. That character is similar to what it is in Perl: in Perl, if you want to match something you will say equals tilde; well, in Awk you just use tilde. And then after the tilde you put your regular expression with a forward slash on both sides, which is the key next to the right Shift on an American keyboard.
And then that is how you match it. So where in a regular filter you would say word, equals equals, and then the thing, here you are going to say word, tilde, and inside of forward slashes your regular expression. To do a negation, it is similar to what you would do with a regular filter: instead of doing exclamation point equals, you're doing exclamation point tilde. So not matching this regular expression will be word, exclamation point, tilde, forward slash, the regular expression, forward slash again, and then space, and then print or whatever, you know, the rest of the way that we do it inside of curly brackets.
So from our file1.txt file, if we did something that said, for example, dollar sign one, tilde, inside of forward slashes p and then inside of brackets e l u, and then the second forward slash, print dollar sign zero, we would get a filtered list of all the items that have a p and then any one of the letters e, l, u. And in our example data we have apple, grape, plum and pineapple, which have a p followed by either an e, l or u. So that was an example; we're going to go into some of the details about how that example works.
Another quick example is dollar sign two, tilde, inside of forward slashes e, inside of curly braces two, and then the second forward slash, print dollar sign zero. We're going to do a similar thing, but now we're going to be looking in column two for anything that has two e's next to each other, and print the whole line.
And we only have one example of that in this file, because only the color green has two e's.
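A minimal sketch of the three filters just described. The contents of file1.txt below are an assumption, reconstructed from what the episode says about the file, so treat the data as illustrative rather than as the exact file from the show notes.

```shell
# Sample data assumed to resemble file1.txt from earlier episodes.
cat > file1.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# Column 1 contains a "p" followed by one of e, l or u.
awk '$1 ~ /p[elu]/ {print $0}' file1.txt

# Negation: column 1 does NOT contain that pattern.
awk '$1 !~ /p[elu]/ {print $0}' file1.txt

# Column 2 contains two e's in a row. Interval expressions such as {2}
# need an awk that supports them; as far as I know gawk 4.0 and later
# enable them by default.
awk '$2 ~ /e{2}/ {print $0}' file1.txt
```

With this sample data the first filter prints the apple, grape, plum and pineapple rows, and the last one prints only the green apple row.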
So what does all that mean? What do those squiggly braces and numbers and kind of random-looking letters mean? You'll notice that if you ever look at regular expressions, like a really complex one, you'll just see a whole bunch of characters, maybe one every now and then that you recognize, with a whole bunch of special characters all over the place. And if you are unfamiliar with regular expressions it looks like gobbledygook, but there is actually a lot of meaning in that. So to start with, let's go over some of these characters. There's this term called an anchor. An anchor is like either a starting point or an ending point of your regular expression. So take the caret, or the top hat, or whatever you want to call that.
That little mark, which is Shift-6 on an American keyboard, means the beginning of a line. So if you wanted to find in your search a P at the beginning of the line, so only the things with a P as the first letter on the line, you'd use caret P.
The inverse would be dollar sign: dollar sign means the end of the line. So if you're looking at P dollar sign, you're looking for the last letter on a line to match P.
And so you can have a whole bunch of other characters before that in your match, but if you're ending with the dollar sign, that means that's the end of the line.
And so if you have a P in the middle of the line, it won't be matching that; it'll be matching if you have a P at the end of the line.
Next, sometimes the first letter you might want to match is at the beginning of a string, so skipping the white space, the first letter of the string, and that is backslash capital A.
All right, so backslash capital A P would say: if the first letter in the string that I'm looking for is a P, then you have a good match.
The inverse of that would be backslash lowercase z. So if I did P backslash lowercase z, that would mean that the last letter of the string is the P.
Another important anchor is the lowercase b, so backslash lowercase b. You'll see a lot of times that you'll have a backslash and a regular character that denotes what some of these anchors or other special characters mean.
And that's because there's only a finite number of special characters on a keyboard, or in Unicode. Well, I guess really on the keyboard; you could put them in Unicode if you wanted to, but no one would be able to access them.
So what they do instead is start using regular letters with a backslash in front to give them a new meaning.
So backslash b is a word boundary. I use this one all the time. So if I want to just find the end of the word in my match, then I would do backslash b. Backslash b and backslash z are different in that the end of a string would include, like, a period or some other character like that.
But backslash b would not. I'm pretty sure that's right. We'll just pretend that that's right, and if I'm not right please feel free to correct me, because I'm doing this kind of off the top of my head.
So those are the main anchors. There are others, and I'm going to point you all to some resources at the end which go into a lot more detail, because who can remember all these things the first time they do it, or even the 900th time they have used regular expressions? Even now I still use some of these resources, just so I can either check my regular expressions or find something that I don't have or I'm missing.
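A small sketch of the caret and dollar anchors, using a hypothetical word list rather than the real file1.txt. One caveat worth checking against the gawk manual: as far as I know, gawk itself spells the word boundary as backslash y (backslash b is the backspace character in awk), and uses backslash backtick and backslash single quote for the start and end of a string, so the Perl-style anchors mentioned above may not carry over directly. The sketch therefore sticks to the two anchors that work in any awk.

```shell
# Hypothetical sample words, one per line.
printf '%s\n' plum potato pineapple banana apple > names.txt

# ^ anchors the match at the start: names beginning with "p".
awk '/^p/' names.txt

# $ anchors the match at the end: names ending in "a".
awk '/a$/' names.txt

# Both anchors together match the whole line exactly,
# so "pineapple" is not reported here.
awk '/^apple$/' names.txt
```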
So there are some other characters. If I want to match characters, there are some cool ways to do it. Obviously if I just put in the character by itself, I'm matching that one character, or if I put a series of characters together like abc, then only when I find the term abc will I find a match. But I can also put them inside of square braces.
So instead of saying abcd, if I put inside of square braces a and d, that means the character that I'm looking for is either an a or a d. Inside of square braces, with any of the characters that you put in there without any other markings, you're looking for any one of those characters, as we saw in our example above.
There we were looking for p and, inside of square brackets, e l u. It doesn't matter what order e, l and u go in. I just like to use alphabetical order because it makes it easier, but it doesn't matter what the order is: you're going to be matching a p with either an e or an l or a u. If I didn't have the square braces, I'd have to match exactly p e l u.
In our example down here, where I'm showing square brackets a d: what if I didn't want to match just the characters a and d, but I wanted to match a, b, c, d? I can either just type abcd inside of the square brackets, or I can use the dash, or the hyphen, whatever you want to call it, between a and d. So a dash d, kind of like how it looks, means a to d, or a through d, so that would match a, b, c or d. A lot of times what you'll see is a dash z, which is any letter a through z. On the other hand, if you had a capital A dash capital Z, you'd be matching uppercase letters. I'm not going to go into case too much; I'm going to save that, maybe Dave can cover that. But in general, regular expressions are case sensitive, as you would expect, because there are so many variations of strings that if it was case insensitive, one, it wouldn't be very precise, and two, it would take more resources.
But anyway, going back to where I was, inside of square brackets once again. Sometimes you don't want to match something; sometimes you want to not match something. If I put a caret right after the opening square bracket, that means not in this context. Why did they not stay consistent and use an exclamation point here? I don't know, but that's what they do. It would be consistent to say exclamation point, but instead we say caret. So if I say open bracket, caret, a dash d, close bracket, that means not the characters a to d. So I don't want to match those; I want to find any character that is not one of those four.
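To make the bracket forms concrete, here is a short sketch over a hypothetical word list (not the real file1.txt). It anchors each bracket expression to the first letter so the effect of the list, the range and the negation is easy to see.

```shell
# Hypothetical sample words, one per line.
printf '%s\n' apple banana strawberry grape plum kiwi > names.txt

# [ad]: first letter is an "a" or a "d".
awk '/^[ad]/' names.txt      # apple

# [a-d]: first letter is anything from a through d.
awk '/^[a-d]/' names.txt     # apple, banana

# [^a-d]: first letter is anything except a through d.
awk '/^[^a-d]/' names.txt    # strawberry, grape, plum, kiwi

# Order inside the brackets does not matter: [elu] and [ule] are the same set.
awk '/p[ule]/' names.txt     # apple, grape, plum
```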
So there are some other characters that we can match. Backslash w is any word character, so anything that's not a white space character. Backslash s is any white space character, so that includes the tab character and the space. I think it includes... I don't think it includes newline, no.
But any white space character. Backslash d is any digit. So you could do, inside of square brackets, zero dash nine, but a shorter way to do it is just backslash d, and that matches any digit.
So for all of those, backslash w, s and d, if you do the capital version it's a negation. So if I want to match any non-white-space character I do backslash capital S; if I want to match any non-digit, then I do backslash capital D.
If I want to match any non-word character, then I do backslash capital W, and those kind of make sense. There are some other boundaries and some other special characters. Like I said, I will refer you to the references if you want to get more detail, but you know, we only have so much time.
So let's go to the next thing. There is a standard, I think it's been talked about on HPR before, a standard way of building software; the standard is called POSIX. And there is a POSIX-compliant, or POSIX-standard, way of referencing a lot of those characters. Even though grep or PCRE grep or sed or awk might use a different way of doing it, most of them also accept the POSIX-standard way of doing it. So I have them here in the show notes. We have alnum, to match any alphanumeric character. It's similar to backslash capital S, kind of, so any non-white-space. All of these have a square bracket on both sides, and inside the square brackets, at both the open and the close, there is a colon. So it goes square bracket, colon, whatever the word is, colon, square bracket. So alnum is alphanumeric. alpha, like it sounds, is alphabetic. blank is tab and space, so square bracket, colon, blank, colon, close square bracket matches the white space characters space and tab.
Square bracket, colon, c n t r l, colon, square bracket is any control character. digit is another one, that's just like backslash d. graph is another one, which is for characters that are both printable and visible. So a space is printable but not visible, whereas an a is both printable and visible, so it's kind of like backslash w.
lower is for matching any lowercase character. Then we have print, which is just printable characters; it doesn't matter if they're visible or not. punct is any punctuation character, which is a great one to be able to say not punctuation, or if you want to replace punctuation with nothing to get rid of punctuation. That's a great one to use, because a lot of times you just want to get the dots out of there.
space is another one, to get rid of space, tab, form feed and some others. You can go into the POSIX standards to look up all these things to get more detail. upper, like it sounds, is any uppercase character, and xdigit is to match characters that are part of the hexadecimal system, so one of those 16 characters.
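A brief sketch of the POSIX classes in use. Note that each class name goes inside a bracket expression, so the full spelling has two pairs of brackets, for example [[:digit:]]. As far as I know, the gawk manual lists backslash s, S, w and W as GNU extensions but not backslash d, so the POSIX spellings are the safer choice in Awk. The input strings here are made up for illustration.

```shell
# Strip every punctuation character from a line of text.
echo 'Hello, world. Done!' | awk '{ gsub(/[[:punct:]]/, ""); print }'

# Keep only rows whose second column consists entirely of digits.
printf '%s\n' 'name amount' 'apple 4' 'grape 10' |
  awk '$2 ~ /^[[:digit:]]+$/ {print $1}'

# Print the fields that are valid hexadecimal numbers.
echo 'ff 1a zz' |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[[:xdigit:]]+$/) print $i }'
```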
What are those 16 characters? 0 through 9, and then a, b, c, d, e, f. Yes. So now we can say, okay, I'm matching one of these characters; how can I say I'm matching multiples of these characters? That's where quantifiers come in. If I want to match just one d, I just put d. But if I want to say match d one or more times, I will use the plus character. So d plus would match a d, a dd, a ddd, up to like 25 d's. Using the asterisk matches zero or more times, so it doesn't have to find it, but if it does find it, it matches. So asterisk is basically a wildcard, anything. The next character is the question mark. Question mark means zero or one time. A lot of times I'll use this one, not necessarily in Awk but in other cases. If I'm looking for matching tags in HTML, I'll do, like, the start of the tag, the opening angle bracket, and then put a forward slash with a question mark next to it, and then div, and then close the tag. And that way it'll find the beginning div and also find the slash div at the end. So that's a good one to know.
Next, curly brackets. Inside of curly brackets, any number means match this thing that many times. So if I don't want to say match d one or more times, but I want to match it only twice, like we saw in the example above when we matched e exactly two times, I put e, curly bracket, two, other curly bracket.
If we want to match it two or more times instead of one or more times, so maybe one time we don't want to find, but if it's two or more we want to find the match for that letter, we would put inside of curly brackets two, comma, then nothing, and then the closing curly bracket. If I want to say between two and six times, it'll be, inside the curly brackets, two comma six.
So two comma nothing is two or more; two comma six would be two to six. If I opened the curly brackets and said nothing, comma, six, that would be zero to six times. I know it's a lot, but we're getting through it.
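The quantifiers above can be sketched against a small hypothetical word list. The patterns are anchored with caret and dollar so that each one has to account for the whole line, which makes the counts visible. The brace forms assume an awk with interval-expression support, such as a current gawk.

```shell
# Hypothetical sample lines.
printf '%s\n' a ad add adddd '<div>' '</div>' > words.txt

awk '/^ad+$/' words.txt        # one or more d: ad, add, adddd
awk '/^ad*$/' words.txt        # zero or more d: a, ad, add, adddd
awk '/^ad?$/' words.txt        # zero or one d: a, ad
awk '/^<\/?div>$/' words.txt   # optional slash: <div> and </div>

# Interval expressions (need an awk that supports them).
awk '/^ad{2}$/' words.txt      # exactly two d's: add
awk '/^ad{2,}$/' words.txt     # two or more d's: add, adddd
awk '/^ad{2,6}$/' words.txt    # two to six d's: add, adddd
```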
Like I said at the beginning, regular expressions are a lot of information, but even if you just get the basics down, it's really helpful in a lot of things. And as you get the basics down and you start to have use cases for this, you'll find reasons to do more and more.
And some of the resources I'm going to point you to not only will help you build regular expressions, but will also test them and show you how they work. So kind of like an explain: it'll explain what the regular expression that you just wrote means.
So you can put a group of items together in a match, and to do that you need to use square brace, I mean round braces, which is parentheses. So if I put the word foo inside of parentheses, that means I want to match the entire word foo.
But if I put a pipe and then put bar after the pipe, so inside of parentheses I put foo pipe bar, that means foo or bar. So every time I see foo or bar, I'm going to be able to match that.
There are some other things that you can do in other places. Now, I don't think Awk can really do it, but a lot of times what you'll do in, like, sed or PCRE grep, which is an extension of grep, like a Perl-style extension of grep, is this: instead of matching foo or bar, you'll put foo in parentheses and then you'll put bar in parentheses.
And then, say every time I found foo bar next to each other I wanted to say bar foo instead, you would be able to say matching foo, matching bar, and then in the substitution put bar first and put foo later. We'll get to that later on if we have to; I think we might have already done it in sed.
But basically you're going to use references to what's inside of those parentheses by number and say, well, that was the first one, that was the second one, that was the third one, to be able to switch them all around or exclude some of them. And that way you can do cool things. That's where a lot of this stuff gets useful, not just for matching but when you're replacing, and that leads us right into the next topic, which is replacement.
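A short sketch of grouping with alternation, on made-up input. The last command is guarded because it is gawk-specific: as far as I know, gawk's gensub() is the function that accepts numbered references to parenthesized groups, while sub() and gsub() do not.

```shell
# Alternation inside a group: lines that are exactly kiwi or plum.
printf '%s\n' plum kiwi potato | awk '/^(kiwi|plum)$/'

# gsub() returns how many replacements it made; "baz" is left alone.
echo 'foo bar baz' | awk '{ n = gsub(/(foo|bar)/, "X"); print n, $0 }'

# gawk only: swap two captured groups with gensub().
if command -v gawk >/dev/null 2>&1; then
  echo 'foo bar' | gawk '{ print gensub(/(foo) (bar)/, "\\2 \\1", "g") }'
fi
```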
So sometimes you'll have a document, and every time you see one thing you want to replace it with another, or you want to add text to it, to either denote that it's special or whatever your use case might be for wanting to replace something in text. The functions for that in Awk are sub, gsub and gensub. I'm not really going to go into gensub, you can read about it, but sub and gsub are pretty commonly used. sub is just substituting the first match that you find with whatever the replacement is. gsub is replacing every match that you find in a string. It's kind of different in Awk than it is in sed, where your string might be really, really big, because in sed it might be like the entire document. But in Awk it's really only talking about this column that I'm looking at right now. So if, only in column three for some made-up data, there were three matches to your regular expression, sub would only replace the first time it found that match; gsub would replace all three times.
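The difference between sub and gsub can be sketched with one made-up line whose third column contains three matches of the pattern:

```shell
# sub() replaces only the first match in the target (here column 3).
echo 'x y a-b-c-d' | awk '{ sub(/-/, "+", $3); print $3 }'    # a+b-c-d

# gsub() replaces every match in the target.
echo 'x y a-b-c-d' | awk '{ gsub(/-/, "+", $3); print $3 }'   # a+b+c+d

# Both functions return the number of replacements they made.
echo 'x y a-b-c-d' | awk '{ n = gsub(/-/, "+", $3); print n }'  # 3
```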
And another special character that you will run into in replacement is the ampersand. The ampersand, which is Shift-7, means the matching text.
So that's similar to other languages. I know it's the same in Vim: if you're doing a Vim replacement you can use that as well.
The idea is, if I'm looking for this text and I don't want to just replace the text but I want to augment the text, say put something before the text, I'll put whatever I want to put before the text and then ampersand.
And then there's the other way you can do it. Say you have the word foo bar and you want to replace foo bar with baz bar. You would match foo and then, whatever, and then instead of having to say foo again in your substitution, you can just say ampersand, and it's that easy; Awk will do what it needs to do.
Hope that makes sense. I hope I'm explaining that adequately, but you will have an example in a second to see how that works.
All right, so the replacement. Here's one example. In our file1.txt file, the syntax goes inside of the curly braces, so if I've already done a filter, it's inside of the curly braces.
I'm going to do sub, then open parenthesis, so everything that we're substituting is going to happen inside of parentheses.
It's similar to how a lot of programming languages do functions, where right after the function name you have an open parenthesis and then you will have some arguments that are separated by commas.
And that's what we have here. So for sub and gsub, it is sub, and then the first argument is the regular expression.
The second argument is the replacement, the third argument is what you want to replace in, and there's an optional fourth argument, I think. Let me see.
Yeah, I think there's an optional fourth argument, which is starting from where; that's common in a lot of languages. So let's see, I'm in the official documentation.
No, it's just a target. So for sub it's regular expression, replacement, and then the target. And you can have multiple targets, but that's what it is.
Oh, and if you didn't put a target, like right now in my example I want to put a target, but if you don't put a target, it will do it everywhere in your entire file, in all the rows and all the columns.
So I have in my example sub, and inside of forward slashes apple, so every time I find apple, replace it with nut, inside of double quotes, comma, dollar sign one. So only in column one, every time I see the word apple, replace it with nut.
And then after the closed parenthesis I put semicolon, print dollar sign one, which means print column one. So I'm looking in column one and I'm printing column one.
You could do a substitution on column one and print column three, but that would be kind of a waste of time because you're never going to see the change.
But you could also print dollar sign zero, which is all columns, and that way you'd see everything, but you'd only make the changes in column one. So once again, if I didn't have that dollar sign one inside of my substitution as the last argument, it would look everywhere in the file and replace apple with nut. And so the output of that is going to be: after name you're going to have nut, and then you're going to have another nut later on, and then for the last one you're going to have pinenut, because it replaced pineapple with pinenut. So once again, like I said at the beginning, that's a time where we didn't have to use a regular expression. It's not really a regular expression; we could have done a simple match.
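A minimal sketch of the apple-to-nut substitution, again assuming file1.txt looks like the sample below (the data is a reconstruction, not copied from the show notes). One detail worth checking in the gawk manual: as far as I know, when the third argument is left out, sub and gsub work on dollar sign zero, the current record, one record at a time. The last command illustrates that with a made-up line.

```shell
# Sample data assumed to resemble file1.txt from earlier episodes.
cat > file1.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# Replace "apple" with "nut" in column 1 only, then print column 1.
awk '{ sub(/apple/, "nut", $1); print $1 }' file1.txt

# No target given: sub() works on the whole current record ($0)
# and still replaces only the first match in it.
echo 'apple red apple' | awk '{ sub(/apple/, "nut"); print }'   # nut red apple
```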
Now in this next example that's not the case. Here I do, inside of curly braces, sub, inside of parentheses, and then inside of the forward slashes I have dot. Which I think I covered... did I cover dot? I didn't cover dot. Let me add it right now. So dot is match any character. Let me add that to my notes here too.
I'm here. Match any character. Oh yeah, see: match any character with a dot. Then plus, so any character one or more times, and then inside of parentheses p p, pipe, r r, close parentheses. So I'm looking for any character one or more times, followed by either two p's or two r's, and then close that with the forward slash. So that's the regular expression, always inside of the forward slashes. After the regular expression is over: comma, and inside of double quotes... oh, that's my phone, that's funny... inside of double quotes test, dash, ampersand, then close the double quote, comma, dollar sign one, and then close the parentheses again, semicolon, print dollar sign one. So you can tell right now, going from the end up, we're only going to be printing column one, and our target is also column one. We're going to be doing the substitution on any characters that come before either a pp or an rr, so that's our matching case. We're going to write the word test, dash, the thing that was matched, which was that word. So in our example here, if we run this on file1.txt we'll get test-apple, because apple has pp in it. Moving right along, then there's banana without a test in front of it, but then there's test-strawberry, because strawberry has rr in it, and then grape, and then test-apple again, and then plum, kiwi, potato, and then test-pineapple, because once again pineapple has two p's in it. So as you can see, the ampersand here is being turned into the matching text.
Now say I wanted to just put test, dash, ampersand, literally. Well, I have to use an escape character, which is the backslash. So instead of that second argument being test dash ampersand, it would be test dash backslash ampersand.
And that tells Awk, or a lot of regular expression readers, that the character that is usually a special character should right here just be the literal character instead. That works with all of those special characters that are used as anchors or special digits or anything. So if I just wanted to put a period, I'd put backslash period.
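Here is a sketch of the ampersand example and its escaped form, using the same assumed file1.txt contents as a stand-in for the real file. In the escaped version the replacement is written with two backslashes inside the Awk string, which is the spelling I understand the gawk manual to document for a literal ampersand; the string itself consumes one backslash before sub sees it.

```shell
# Sample data assumed to resemble file1.txt from earlier episodes.
cat > file1.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# "&" in the replacement stands for the text that matched.
awk '{ sub(/.+(pp|rr)/, "test-&", $1); print $1 }' file1.txt

# Escaped: "\\&" in the string yields a literal ampersand, so the
# matched "app" is replaced by "test-&" and "le" is left over.
printf 'apple\n' | awk '{ sub(/.+(pp|rr)/, "test-\\&", $1); print $1 }'   # test-&le

# The same idea in a pattern: "\." matches a literal period only.
printf 'a.b\n' | awk '{ gsub(/\./, "-"); print }'   # a-b
```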
All right, so that is all I'm going to go into. We're already here over half an hour, and I don't want this to go any longer, because, you know, people have other things to do with their lives, right? You guys can't be listening to my voice all day.
So what I'm going to do instead is let Dave take it from here and give you some more details, and also I'm going to point you to some resources that I use all the time when I'm working with regular expressions.
And these are generic ones. There are also some other places if you want specific regular expression details for either Vim or Perl or Python or Java or C#. Some of these have slight variations, so you might have to go to their specific documentation for it. But the resource I use all the time, number one, is regex101.com. Yeah, regex101.com.
It's a way to test your regular expressions, and it also gives a description of what your expression says. So if you have some example text, you can copy and paste it into the text part, and then you can copy and paste your regex into the top part, or just write it in there.
And you can set your flags, like global: so instead of doing gsub, you just put a g at the end, in the last field, and then it'll explain to you what is being matched and how it's being matched.
It's a really great resource, but it's kind of a lot: it has a lot of buttons and a lot of different ways to look at it.
If you want something simpler there is RegExr, which is regexr.com. It's very similar to regex101.com, but it's simpler.
Another good resource is, I don't even know how to pronounce this, but I think it's Grymoire, so www.grymoire.com, which is g r y m o i r e dot com, slash Unix slash Awk.html.
But if you just go back a level, up to Unix, he has a lot of cool stuff. His sed and awk tutorials are like the standard; they're referenced all the time for how to understand these things.
The next one I want to point you to is the official GNU Awk User's Guide, which is at https://www.gnu.org/software/gawk/manual/html_node/Regexp.html. You'll see it in the show notes. But yeah, the official GNU Awk guide.
A lot of both my notes and, I'm pretty sure, Dave's notes have come directly from here, or at least the topics have come directly from here.
This is a great place to find it. I've also run into a cool article on Tecmint about using Awk, so you can also check that out.
It doesn't go into too many details on regular expressions, but it's also a good resource.
But that's just about it for me on this episode of Hacker Public Radio. Stay tuned for more. Bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.