Episode: 2679
Title: HPR2679: Extra ancillary Bash tips - 13
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2679/hpr2679.mp3
Transcribed: 2025-10-19 07:25:27
---
This is HPR episode 2679 entitled Extra ancillary Bash tips - 13, and is part of the series Bash Scripting. It is hosted by Dave Morris, is about 37 minutes long, and carries an explicit flag. The summary is: Making decisions in Bash, part 5. This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello everybody. This is Dave Morris. Welcome to Hacker Public Radio. Today I'm doing a show about Bash. This is the 13th episode in the Bash Tips sub-series, and it's the fifth of a group of shows that I'm doing about making decisions in Bash. This is the last one in that group. If you've just joined me here, I would recommend going back to check out the previous four episodes, which are all listed in the links section of the notes. There are two sorts of notes here, by the way. The short notes are just a preamble and then a link to the long notes, which are quite detailed, so that you can read them independently or read as you listen. Reading as you listen might be best, but it depends where you are and what you're doing.
So in the last four episodes we saw the types of tests that Bash provides and some of the commands that use these tests. We looked at conditional expressions and all of the operators that Bash provides for making decisions. We concentrated particularly on string comparisons, which use glob and extended glob patterns, and we devoted the last episode to looking at Bash's regular expressions. Now we want to look at the final topic within regular expressions, which is capture groups.
Now, if you followed the series on sed that I did some time ago, or indeed the one covering the awk language that I'm doing in conjunction with b-yeezi, then we've talked about regular expressions and capture groups there, so this won't be a particular surprise to you. It's a way in which you can group the elements of a regular expression using parentheses, and you can thereby denote a component of the string that you're comparing with the regular expression.
So for example you might want to look for three-word sentences, and there's an example in the long notes that shows this. There are three groups, and each consists of, in parentheses, a character class: one of these things where you put a list of characters in square brackets, which you can also write as ranges. So it uses lower case a to lower case z as a range, and upper case A to upper case Z. Then after the closed square bracket there's a plus sign, which means one or more alphabetic characters. So that's all in parentheses. The first and second of the words so defined are each followed by a space and a plus sign, which means one or more spaces. The third is followed by zero or more spaces and an optional full stop. So your sentence might end with a full stop. You might put spaces before the full stop, which would be wrong, but you might do that. The entire regular expression with these three groups in it is anchored to the start of the string. The way this thing is written, only the words themselves are being captured by being in groups, not the intervening spaces or the full stop at the end or anything else. We're going to look at a script that uses this regular expression soon.
Bash uses an internal read-only array called, in capitals, BASH_REMATCH, and that holds what is matched by a regular expression.
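A minimal sketch of the three-word regular expression just described. The pattern is reconstructed from the spoken description rather than copied from the long notes, and the lowercase variable name re and the helper is_three_words are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Three capture groups, each one or more letters. The first two words are
# followed by one or more spaces, the third by optional spaces and an
# optional full stop. The expression is anchored to the start of the string.
re='^([a-zA-Z]+) +([a-zA-Z]+) +([a-zA-Z]+) *\.?'

# Hypothetical helper: succeeds if its argument matches the pattern.
is_three_words() {
    [[ $1 =~ $re ]]
}

if is_three_words 'aardvarks eat ants.'; then
    echo "Matched"
fi
```

Note that the pattern is stored in a variable and used unquoted on the right of =~; quoting it there would make Bash treat it as a literal string rather than a regular expression.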
The zero element of this array holds what the entire regular expression matched, and the rest hold what was matched by any capture groups in the regular expression. Just like other regular expression systems, each capture group is numbered in order of occurrence. So element one of BASH_REMATCH contains the first word, or first match, which is perhaps a better way of putting it, element two contains the second, and so forth.
Now in sed it's possible to refer to a capture group within the expression itself. You can do that with backslash one, and that allows you to write a regular expression that repeats stuff. So for example, in the sed syntax, where the parentheses have to be preceded by a backslash, then backslash open parenthesis, cat, backslash close parenthesis, then backslash one, which refers to the capture group, means that you want the word cat to be repeated. I've shown a little example in the notes where we echo the sequence catcat into sed, and the sed script, I think that's a better way of putting it, uses the s command to replace the word cat, which is in a capture group, followed by the same thing again, which is referred to as backslash one, with the word match. In that particular case it would come back with the word match, because it matches.
The reason I'm mentioning this is because it's apparently not available in Bash. There is nothing documented that I can find in any of the official Bash scripting documentation, the GNU Bash manual and so forth. There are references to a partial implementation that some people have, I don't know, discovered; maybe they've looked at the source or something, or maybe this is a known thing that's in development. It doesn't seem to be a thing to rely on, but I've still spent a little time doing an example, example two, later on in this episode, to experiment with it.
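The sed back reference described above can be tried with a one-liner along these lines. This is a sketch reconstructed from the spoken description; the variable name result is only there to hold the output for illustration.

```shell
#!/usr/bin/env bash
# In sed's basic regular expression syntax \(cat\) is capture group 1 and
# \1 refers back to whatever that group matched, so the pattern only
# matches the sequence "catcat".
result=$(echo 'catcat' | sed -e 's/\(cat\)\1/match/')
echo "$result"
```

A single "cat" on its own would pass through unchanged, because the back reference requires the captured text to appear a second time.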
So here comes an example of a script that uses the three-word expression and looks at the BASH_REMATCH array. It's a downloadable example that you can grab and play with if you want to. It's called bash13_ex1.sh. The regular expression in here is stored in a variable called RE, and it's the same expression that we already looked at. The script checks to see that there's an argument, because that's the way you're going to give it a sentence when you run it. The script echoes the sentence preceded by the word "Sentence", just to confirm that what you typed is what the script has seen. Then it compares, in an extended test, dollar one, which is that argument, with the regular expression in the variable RE, and if it matches then you get the message "Matched". Then there's a for loop which iterates a variable i through the values zero to three using a brace expansion. For each iteration it uses printf to print out the number in i and then the contents of BASH_REMATCH indexed by that number.
So I ran it with the sentence "aardvarks eat ants" and a full stop, in quotes, because otherwise Bash would split it up into three arguments. It reports back "Sentence" and the sentence, which was "aardvarks eat ants", which they do; it's an anteater. Well, they're one of the things they eat. Then it reports "Matched", and then it goes through the elements: zero is the entire string, aardvarks eat ants; one is aardvarks, two is eat and three is ants, without the full stop. So those are the three capture groups.
Now, if you've listened to the previous episodes talking about this sort of stuff, well, the last episode in fact, then you might expect that you could put the capture group for a word, which is that square bracket expression in parentheses, and then follow that with a space and an asterisk, meaning zero or more spaces, and put that all in parentheses and follow it with a number in curly brackets.
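The first script walked through above, the one that loops over BASH_REMATCH, can be sketched roughly as follows. This is not the downloadable bash13_ex1.sh itself: the pattern is reconstructed from the description, and the function show_groups and the lowercase names are hypothetical, used so the steps can be called more than once.

```shell
#!/usr/bin/env bash
re='^([a-zA-Z]+) +([a-zA-Z]+) +([a-zA-Z]+) *\.?'

# Hypothetical wrapper around the steps described for the script:
# echo the sentence, test it against the pattern, then list the
# contents of BASH_REMATCH for indices 0 to 3.
show_groups() {
    local sentence=$1 i
    echo "Sentence: $sentence"
    if [[ $sentence =~ $re ]]; then
        echo "Matched"
        for i in {0..3}; do
            printf '%2d: %s\n' "$i" "${BASH_REMATCH[$i]}"
        done
    fi
}

# Quoted so that Bash passes the sentence as a single argument
show_groups 'aardvarks eat ants.'
```

With this reconstruction of the pattern, elements 1 to 3 hold only the words, since the spaces and the optional full stop sit outside the parentheses.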
So you put a three there and, as it's written in the notes, follow it with backslash, full stop, question mark. Then you might expect, first of all, that that would match aardvarks eat ants. Well, it does, but there are only two capture groups here: there's the inner one, which is for a word, and the outer one, which is the thing that's just used to hold it all together and repeat it. You don't get the three elements of the sentence. You would get aardvarks eat ants in element zero. Then element one would contain ants; that's from the first capture group, which is the outer one. Element two will also contain ants, because the result of the three iterations is, I guess, the last word in the sentence. So it's not useful if you want to capture the words. The thing to bear in mind is that a repeat count is a useful thing to be able to use when you just want to shorten a complex regular expression, but it doesn't let you pick out the bits that capture groups will let you pick out, which is probably why you use them in the first place.
So the rest of this show is just a bunch of examples. Example one is a revisiting of example four from the last episode, where I've written a script which takes an IP address, an IP version four address, and checks it for validity, but I've made it a little more clever and I've used capture groups to do it. This one is downloadable and it's called bash13_ex2.sh. In the previous example we had four groups of up to three digits, where the numbers that they form have got to be in the range 0 to 255, and they're separated by dots. The regular expression that we have here does this just as the previous one did, although in the previous one we used a repeat count in curly brackets for the repeated expression. This time we're spelling them out: we're putting them one after the other, but we're putting them in parentheses, and we're anchoring the whole thing at the front and the end of the string.
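The repeated-group behaviour described earlier in this passage can be checked with a short sketch like the one below. The pattern is reconstructed from the spoken description, and the variable names whole, outer, inner and ngroups are hypothetical.

```shell
#!/usr/bin/env bash
# Inner group: one word. Outer group: the word plus optional spaces,
# repeated three times by {3}. Then an optional full stop.
re='^(([a-zA-Z]+) *){3}\.?'

if [[ 'aardvarks eat ants.' =~ $re ]]; then
    whole=${BASH_REMATCH[0]}
    outer=${BASH_REMATCH[1]}
    inner=${BASH_REMATCH[2]}
    # Element 0 is the whole match, so the group count is one less
    ngroups=$(( ${#BASH_REMATCH[@]} - 1 ))
    printf '0: %s\n1: %s\n2: %s\n' "$whole" "$outer" "$inner"
    echo "Capture groups available: $ngroups"
fi
```

However many times the outer group repeats, there are still only two sets of parentheses, so only two elements beyond element zero are filled, and each holds what its group matched on the final repetition.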
So the script checks to see that there is an argument, which is going to be the IP address, and produces an error message if it's been forgotten. Then it compares the argument, dollar one, against the regular expression. That's pretty much what we did last time, but in this particular case we're going to be a little bit more detailed about it.
Inside the if statement we have a variable ERRS, which is short for errors, being set to zero, and another one called problems being set to nothing; it's just "problems equals", which sets a variable to a null string. Then there's a for loop that iterates through the values 1 to 4, that's for the four groups of numbers, and it uses the variable i to do that. We set a variable d to the element of the BASH_REMATCH array indexed by the variable i. Then there's an if statement that says if dollar d is less than zero or dollar d is greater than 255, which is obviously checking to see whether it's in the range 0 to 255. That's not a regular expression at all, of course; we're using an extended test here, and the "or" is the two vertical bars. If that turns out to be true then we increment the variable ERRS. I always tend to type this name, but I have to pronounce it "errors". Anyway, we increment it, because that's counting the number of errors, and we also add to the problems variable, which is just a string, the value of the dollar d variable. That's a number, but it's going to be treated as a string, and we follow it with a space. So if there are multiple errors we're going to end up with a bunch of numbers separated by spaces. After that loop has gone through four times there are either going to be no errors, or there are going to be one to four errors and one to four numbers in the problems string.
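A sketch of the range-checking loop just described, under some stated assumptions: the regular expression is reconstructed from the description (four groups of one to three digits separated by dots, anchored at both ends), the function name check_octets is hypothetical, and the variable names are lowercase by choice. The 10# prefix is an addition not mentioned in the episode; it forces base 10 so that a group such as 08 is not rejected as an invalid octal number.

```shell
#!/usr/bin/env bash
re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'

# Hypothetical helper: sets the globals errs and problems for the address
# in $1. Returns 1 if the address does not match the pattern at all.
check_octets() {
    errs=0
    problems=
    [[ $1 =~ $re ]] || return 1
    local i d
    for i in {1..4}; do
        d=${BASH_REMATCH[$i]}
        # Operands of -lt and -gt inside [[ ]] are arithmetic expressions,
        # so 10#$d is read as a base 10 number
        if [[ 10#$d -lt 0 || 10#$d -gt 255 ]]; then
            ((errs += 1))
            problems+="$d "
        fi
    done
    return 0
}

check_octets '192.168.500.256'
echo "errs=$errs problems='$problems'"
```

Because the groups can only contain digits, the less-than-zero half of the test can never be true here; it is kept only to mirror the description.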
Then there's another if which checks to see if ERRS is greater than zero, and if it is then we want to say there were problems: it's not a valid IP address, and here are the problems. There are many ways you could do this. The way I chose was, using the variable problems, to set it to itself using the parameter expansion capability where you can extract a piece out of a string. So I extract problems, colon, zero, that's starting at position zero, it's what's called zero-relative, colon, minus one, and that's in curly brackets with the dollar on the front. What that means is from position zero, the start of the string, up to but not including the last character. That's because we were adding a number and a space to it each time we found a problem, and we just want to take that last space off the end.
Then we echo the error report: dollar one "is not a valid IP address; contains", and then we want to report the problems string. But I've got fancy with it: I've put that in curly brackets with a dollar on the front, and inside there's another parameter substitution expression using double slash, space, and then a single slash, comma, space. What that does is find every space in this variable and replace it with comma space. So you'll get back a list with a comma and a space between the numbers. Then, since there were errors, it exits with the value one. If there were no errors, then this branch of the if, the one which validates the address against the regular expression, will report that the address is valid. If the regular expression didn't match at all then it will say that it's not a valid IP address.
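The two parameter expansions described above can be sketched in isolation like this. The helper name format_problems and the sample values are hypothetical; the negative length in the substring expansion needs a reasonably recent Bash (4.2 or later, as far as I'm aware).

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a string such as "500 256 " into "500, 256".
# It assumes the string is non-empty and ends with a single space, which
# is how the loop described above builds it.
format_problems() {
    local problems=$1
    problems="${problems:0:-1}"   # from position 0 up to, not including, the last character
    echo "${problems// /, }"      # replace every remaining space with comma space
}

addr='192.168.500.256'
echo "$addr is not a valid IP address; contains $(format_problems '500 256 ')"
```

The double slash in the second expansion is what makes it replace every occurrence; with a single slash only the first space would be changed.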
After this, in the notes, I've drilled down into the regular expression a little bit more. I think I've already said what this says, so there's no point going over it; yes, I've explained this already without reading it out here. So hopefully you'll find that both what I've just said and the notes help you to understand it if you're having difficulties.
There's a bunch of examples of running it with different IP addresses, some invalid. So 192.168.0. with a trailing dot is not valid, because there are only three numbers in it. 192.168.0.5 is perfectly valid. The last one is 192.168.500.256; well, 500 and 256 are out of range, so it says it's not valid and that it contains 500, 256. You might type that in error because you're a terrible typist like I am. The point is that we've got regular expressions here to do these sorts of checks, plus a few other bells and whistles.
So example two is the one where I played around with the idea of back references. As I said, I couldn't find any official documentation about them, but it does seem to be something that's available, well, at least in the version that I'm using. I haven't tested any others; I could have done, I guess, on some of the Raspberry Pis I have, which will have various versions. A back reference consists of a backslash and a number. The number refers to the capture group, counting from the left of the regular expression. But it looks as if you can only have a single digit after the backslash, so it means you can only refer to capture groups 1 to 9. If you've got 10 or more then there's no way of referring to them with backslash number. Some regular expression systems allow you to follow the backslash with curly brackets and a multi-digit number. I can't remember if sed does, but there are some that do. Perl's is the one I know most, but anything which is derived from the Perl regular expressions, which is a thing called PCRE, Perl Compatible
regular expressions, will have that capability. I think Python uses PCRE, and so does PHP, and a number of other languages too.
So what I've done to mess around with this one is to create a regular expression which consists of, in parentheses, one of those regular expression metacharacters, which is a backslash and a less-than, which means start of word, then a full stop followed by, in curly brackets, 1 comma 10, so that means 1 to 10 characters, any characters. We follow that with backslash greater-than, meaning end of word, and a close parenthesis. So that's a word of between 1 and 10 characters. Then a space and backslash 1. So that's my regular expression. This is a very simple script, not really a production thing, just a messing-around thing. I compare dollar one, on the assumption that there's an argument, with that regular expression in the usual way. If it matched then I put out the message "Matched", otherwise I put out the message "No match". My simple example is running this script, which again you can download; it's bash13_ex3.sh. I gave it a two-word string in single quotes, turnip turnip, so the word is less than 10 characters long and there are two instances of it, and the script says it matched. So it's actually working, but I don't think I would recommend using it. It's maybe something to keep an eye on; it's maybe something that's going to be available in an upcoming version of Bash. That would be good, as it's a really useful feature to have, but it's not officially there yet.
So my last example is quite a complex one, and in this one I'm trying to parse a file of email addresses, the sort of thing that you might have put together from an address book or something. The format of email addresses is quite complicated, and there is an RFC, several RFCs, that define it. It gets a lot more complicated than you would imagine it would, because email goes way back in time. Nobody uses
a lot of the features of it now, of the email address I mean. The script doesn't try to be fully comprehensive about this. Writing a Bash script is not the best way to validate email addresses; there are libraries that will do it in various languages. But this is hopefully a useful example to mess around with.
The two formats that the script is catering for are as follows. The first is where an address consists of a so-called local part followed by an at sign followed by a domain, so my example is something like vim@vim.org. That's probably the simplest format of email address that you can have. There's another one that you'll often see, which consists of a name, where I think there must be some restrictions on what characters it can contain, but let's say it's letters and numbers, possibly parentheses, and that's followed by a local part at domain address, but in less-than and greater-than signs, the sort of diamond brackets, as I think they used to be called. An example of that would be the thing you might see when you look at email from the HPR mailing list. It's often titled HPR, space, list, space, less-than sign, hpr@hackerpublicradio.org, and then the greater-than sign. So those are the two generic formats that we're going to be playing with.
There are two files here. There's bash13_ex4.sh, which is the downloadable script, and it's reading data from a file, which is bash13_ex4.txt, which you can also grab and mess around with if you want. So let's talk about this script. It's not hugely long, but it's fairly complex, and it's going to take me a while to talk about. The script contains a variable called data which refers to the text file of addresses. I haven't put a path in there, assuming it's in the same directory as the script; you could get fancy with that if you wanted to. There's a check to see whether that file actually exists, and if it doesn't then it's reported as missing and the script will exit with a false value. I won't
go into details about how that's done; you can see it's similar to the things I've used before.
Now, the regular expression is complex. It consists of two components, which I have written out as two individual variables called part1 and part2, and then I've concatenated them together inside a string and put them into a variable RE. The two parts are enclosed in parentheses in that final statement, the third statement. In front of the open parenthesis is a circumflex and after the close parenthesis is a dollar, so it's anchored to the line we're reading from the file. It's not really necessary to do that, but putting them in parentheses is definitely recommended. The two variables are being substituted in a double-quoted string and they're separated by a vertical bar, so we have a regular expression which has two alternative parts.
If we look at the expressions themselves, part1 is the one to match the simpler form of the email address. It consists of, in square brackets, lower case a to z, upper case A to Z, 0 to 9 and underscore. That's the first character, and that's because there are some restrictions on how an email address can start. It can't begin with a dot, for example. It can begin with a number, it can begin with a letter, and indeed it can begin with an underscore. After that we've got more or less the same thing, except that inside the square brackets there's a dot as well, and that's followed by a plus after the closed square bracket, so that means any of those characters one or more times. That's followed by an at sign, and then the square-bracketed list that follows includes a hyphen instead of an underscore. That hyphen has to be at the end because the hyphen can be used to separate ranges of things, so you have to put it at the end so it doesn't look like it might be a range. I think you can put it as the first thing as well; in fact I know you can. After the
closed square bracket we've got a plus, so that's the domain part, which can be any letter or number and can be a dot; you have to have a dot in between the components. You can have hyphens in the name. There are probably other things that could go in there, but I didn't want to get too complex.
part2 is pretty much the same, except that the expression we just looked at is enclosed in a less-than and a greater-than, and prior to it there's another parenthesized expression which consists of open square bracket, circumflex, less-than, closed square bracket, plus, which means one or more characters that are not less-than signs. In other words, the characters from the start of the string to the less-than sign that begins the address part can be anything as long as they're not less-than signs. I think I've mentioned this before, and it's a common thing that you see in regular expressions. So this is quite a complex regular expression. It could be simplified yet further, I think, or indeed you could do some clever things with string substitution in Bash, but I didn't want to get into that because it would just make life too complex for anybody reading it.
The main part of the script is a loop, and it's one of these while loops which on each iteration reads a value from somewhere or other into a variable called line. The way that's done is that the final line of the while loop, the done line, is followed by a less-than sign and the name of the data file in double quotes. So the loop will attach itself to that file, if you like, and whenever the read command wants to read something it will read from that file. Again, we've seen this before; it's a common trope in Bash scripts. Inside this loop we first have an if statement which simply compares the line with the regular expression. If it matches then we want to check things about the contents of BASH_REMATCH, and here's where things get a little complicated, so let me just
digress for a bit. There's a fair chance that my notes are a bit more comprehensive than what I've just talked about; I wasn't reading, I was doing it off the top of my head really.
Because we have capture groups within capture groups, and we've got two sets of them, and in one case one set will be triggered and in another case the other set will be triggered, we need to do some quite clever checking on what comes back to determine what type of address we've got. The way that I worked this out was by using what you see commented out in the script just after the regular expression match. There's a statement: declare, space, hyphen p, space, BASH_REMATCH, without any dollars or anything, just the name of the array. declare is a command within Bash that allows you to create various variables, and amongst them you can create arrays this way. When you use the hyphen p option it means print out its contents and its attributes. So we're asking it to print out the attributes and contents of BASH_REMATCH. It's commented out, so it's not actually being run at the moment, but what I did was to run this script using that declare in order to see what BASH_REMATCH would hold.
A bit further down the page it shows what is produced for two addresses, one of each of the two types. These are all dummy addresses, by the way, which I generated on a site I found that will generate batches of dummy addresses for you. A few real ones I chucked in, things like vim@vim.org and stuff like that. Anyway, the first one is kawasaki@me.com, and when this is checked through the regular expression comparison then BASH_REMATCH element zero, not surprisingly, gets the whole address. Element one also gets the whole address, and element two similarly. Elements three and four get nothing. Element zero always gets, not the whole regular expression, but everything that the regular expression
matches. Since we put parentheses around the whole thing in order to contain the two alternatives, we get the same in element one of BASH_REMATCH as well. It's also matching everything, because both of the two alternative regular expressions are inside this group. So you can ignore element zero and element one. If the address matches the first sub-expression it will get stored in BASH_REMATCH element two. If it matches the second sub-expression, because the third and fourth capture groups exist in that sub-expression, then you'll see results in those. But in this particular case there's nothing being matched there, so they're empty. So in the case where BASH_REMATCH element two contains something, we have an email address of the first type.
Further down my notes you'll see it matching with another address, which consists of S, space, Mayor, space, and then in less-than and greater-than signs smayor@yahoo.com. Ignoring element zero and element one, we find that element two of BASH_REMATCH is empty, because that's filled when it matches the first case, the first sub-expression, while elements three and four have, first of all, the name portion and, secondly, the address portion of this email address. So the script uses the fact that element two of BASH_REMATCH is zero length in order to determine which type of address was matched.
Going back to the actual script itself, you'll see there is an if with an extended test: hyphen lower case z, then BASH_REMATCH element two. So if BASH_REMATCH element two is zero length then we know that it's the type two mail address, which is the name and then the local part and domain inside so-called diamond brackets. I've commented this; there's room for confusion here. In fact, as I was writing this I got confused, which is why I added comments. In that particular case, in a variable name we store BASH_REMATCH element
three, and in a variable email we store BASH_REMATCH element four. If BASH_REMATCH element two is not zero length then it's a type one address, just the simple local part and domain format. So we set name to nothing, because there was no name portion there, and we set email to BASH_REMATCH element two. That's the end of that if, the one that checks which version it is. Then we simply print out the contents of name and the contents of email. name is going to be blank in some cases, but that seemed to me to be reasonable. You could omit printing out name when there is no name, but I thought it was more useful to report explicitly that there was no name. The other branch of the outer if, the one which was testing the regular expression, is for when it doesn't match at all; then we say "Not recognised" and report the line that failed. There's an empty echo towards the end of the loop which just causes a blank line to be produced between each iteration through the file.
If we look further down the page of everything relating to this example, I included an excerpt from what was produced when the script is run. This is the one with the declare command commented out, as it will be if you download it. The first email address that it finds is a failed spa at yahoo.ca, which has a name; that was the second type of mail address, so the name field is populated and the email field is populated. The second one is somebody called mcraw4@live.com, where there was no name. The third instance is one with an email address of .42@unknown.mars; this is one I added because it's illegal. It's illegal because it begins with a full stop, so the script says "Not recognised"; it doesn't match anything. So that's what it does. If you feel interested enough to investigate further you can download all of that stuff and run it yourself and see what it does, and indeed add more and see how it behaves. You could also
enhance the regular expression to make it more generic, but I'm not sure, as I say, that Bash is the correct vehicle for this. It makes the point, though, about regular expressions and BASH_REMATCH and capture groups, etc., so hopefully you'll find it useful.
Well, that's the end of that, and the end of this group, and I hope you found this to be useful. There are some powerful features within Bash, as I'm sure you're well aware by now, so I hope this helped to reveal this particular set of features to you. Okay then, bye bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.