Episode: 2679
Title: HPR2679: Extra ancillary Bash tips - 13
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2679/hpr2679.mp3
Transcribed: 2025-10-19 07:25:27

---

This is HPR episode 2679 entitled Extra Ancillary Bash Tips 13, and is part of the series Bash Scripting.
It is hosted by Dave Morris, is about 37 minutes long, and carries an explicit flag.
The summary is: making decisions in Bash, part 5.

This episode of HPR is brought to you by AnHonestHost.com.
Get a 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.

Hello everybody. This is Dave Morris. Welcome to Hacker Public Radio.

Today I'm doing a show about Bash. This is the 13th episode in the Bash Tips sub-series, and it's the fifth of a group of shows that I'm doing about making decisions in Bash. This is the last one in that group.

If you've just joined me here, I would recommend going back to check out the previous four episodes, which are all listed in the links section of the notes. There are two sorts of notes here, by the way. The short notes are just a preamble and then a link to the long notes, which are quite detailed, so that you can read them independently or read as you listen. Reading as you listen might be best, but it depends where you are and what you're doing.

So in the last four episodes we saw the types of tests that Bash provides and some of the commands that use these tests. We looked at conditional expressions and all of the operators that Bash provides for making decisions. We concentrated particularly on string comparisons, which use glob and extended glob patterns, and we devoted the last episode to looking at Bash's regular expressions. Now we want to look at the final topic within regular expressions, which is capture groups.

If you followed the series on sed that I did some time ago, or indeed the one covering the awk language that I'm doing in conjunction with b-yeezi, then we've talked about regular expressions and capture groups there, so this won't be a particular surprise to you.

It's a way in which you can group the elements of a regular expression using parentheses, and you can thereby denote a component of the string that you're comparing with the regular expression.

So, for example, you might want to look for three-word sentences, and there's an example in the long notes that shows this. There are three groups, and they each consist of, in parentheses, a character class: one of these things where you put in square brackets a list of characters, which you can also write as ranges. It uses lowercase a to lowercase z as a range, and uppercase A to uppercase Z. Then after the closing square bracket there's a plus sign, which means one or more alphabetic characters. That's all in parentheses.

Then the words so defined are followed by a space and a plus sign, which means one or more spaces; not each of them, but the first and the second. The third is followed by zero or more spaces and an optional full stop. So your sentence might end with a full stop. You might put spaces before the full stop, which would be wrong, but you might do that.

The entire regular expression with these three groups in it is anchored to the start of the string. The way this thing is written, only the words themselves are being captured by being in groups, not the intervening spaces or the full stop at the end.

We're going to look at a script that uses this regular expression soon.
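
A minimal sketch of the three-word expression described above. It is reconstructed from the spoken description rather than copied from the show notes, so the exact spelling of the pattern, the variable name `re` and the helper `is_three_words` are assumptions.

```bash
#!/usr/bin/env bash
# Three groups of letters; one or more spaces after the first two words,
# then optional spaces and an optional full stop after the third.
# Anchored at the start of the string only, as described.
re='^([a-zA-Z]+)[ ]+([a-zA-Z]+)[ ]+([a-zA-Z]+)[ ]*\.?'

# Hypothetical helper: succeeds if the argument starts with three words.
is_three_words() {
    [[ $1 =~ $re ]]
}

is_three_words 'Aardvarks eat ants.' && echo "three words"
is_three_words 'Aardvarks eat' || echo "not three words"
```

Keeping the pattern in a variable and leaving `$re` unquoted on the right of `=~` is what makes Bash treat it as a regular expression rather than a literal string.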

Bash uses an internal read-only array called, in capitals, BASH_REMATCH, and that holds what is matched by a regular expression. Element zero of this array holds what the entire regular expression matched, and the rest hold what was matched by any capture groups in the regular expression. Just like other regular expression systems, each capture group is numbered in order of occurrence. So element one of BASH_REMATCH contains the first word, or first match, which is perhaps a better way of putting it, element two contains the second, and so forth.

Now in sed it's possible to refer back to a capture group within the regular expression. You can do that with backslash one, and that allows you to write a regular expression that repeats stuff. So for example, in the sed syntax, where the parentheses have to be preceded by a backslash, backslash open parenthesis, cat, backslash close parenthesis, then backslash one, which refers to the capture group, means that you want the word cat to be repeated.

I've shown a little example in the notes where we echo the sequence cat cat into sed. The sed script, I think that's a better way of putting it, uses the s command to replace the word cat, which is in a capture group, followed by the same thing again, which is referred to as backslash one, with the word match. In that particular case it would come back with the word match, because it matches.
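
The sed back reference just described might look like the following sketch. The single space between the two words is an assumption based on how the example was read out, and the variable `result` is only there for illustration.

```bash
#!/usr/bin/env bash
# \(cat\) is capture group 1 in sed's basic regular expression syntax;
# \1 then requires the same text to appear again.
# The space between the group and \1 is an assumption.
result=$(echo 'cat cat' | sed -e 's/\(cat\) \1/match/')
echo "$result"
```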

The reason I'm mentioning this is that it's apparently not available in Bash. There is nothing documented that I can find in any of the official Bash scripting documentation, the GNU Bash manual and so forth. There are references to a partial implementation that some people have, I don't know, discovered; maybe they've looked at the source, or maybe this is a known thing that's in development. It doesn't seem to be a thing to rely on, but I've still spent a little time doing an example, example two later on in this episode, to experiment with it.

So here comes an example of a script that uses the three-word expression and looks at the BASH_REMATCH array. It's a downloadable example that you can grab and play with if you want to. It's called bash13_ex1.sh.

The regular expression in here is stored in a variable called RE, and it's the same expression that we already looked at. The script checks to see that there's an argument, because that's the way you're going to give it a sentence when you run it. The script echoes the sentence preceded by the word sentence, just to confirm that what you typed is what the script has seen. Then it compares, in an extended test, dollar one, which is that argument, with the regular expression in the variable RE, and if it matches then you get the message matched.

Then there's a for loop which iterates a variable i through the values zero to three using a brace expansion. For each iteration it uses printf to print out the number in i and then the contents of BASH_REMATCH indexed by that number.

So, running it with the sentence aardvarks eat ants, full stop, in quotes, because otherwise Bash would split it up into three arguments, it reports back sentence, and the sentence was aardvarks eat ants, which they do; it's an anteater. Well, they'll eat anything. Then it reports matched, and then it goes through the elements: element zero is the entire string, aardvarks eat ants. One is aardvarks, two is eat and three is ants, without the full stop. So those are the three capture groups.
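
The walkthrough above can be sketched roughly as follows. This is not the downloadable bash13_ex1.sh: the body is wrapped in a hypothetical function `show_words` so it can be called repeatedly, and the pattern, the lowercase variable name `re` and the printf format are assumptions.

```bash
#!/usr/bin/env bash
re='^([a-zA-Z]+)[ ]+([a-zA-Z]+)[ ]+([a-zA-Z]+)[ ]*\.?'

# Hypothetical function standing in for the body of the script;
# the script described takes the sentence as its first argument.
show_words() {
    local sentence=$1 i
    echo "Sentence: $sentence"
    if [[ $sentence =~ $re ]]; then
        echo "Matched"
        # Element 0 is the whole match, 1 to 3 are the capture groups.
        for i in {0..3}; do
            printf '%2d: %s\n' "$i" "${BASH_REMATCH[$i]}"
        done
    fi
}

show_words 'Aardvarks eat ants.'
```

Quoting the sentence at the call site matters for the reason given in the episode: unquoted, it would arrive as three separate arguments.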

Now, if you've listened to the previous episodes talking about this sort of stuff, the last episode in fact, then you might expect that you could put the capture group for a word, which is that square-bracketed expression in parentheses, follow that with a space and an asterisk, meaning zero or more spaces, put that all in parentheses, and follow it with a number in curly brackets. So you'd put a three there and then, as it's written in the notes, backslash, full stop, question mark.

You might expect, first of all, that that would match aardvarks eat ants. Well, it does, but there are only two capture groups here: there's the inner one, which is for a word, and the outer one, which is just used to hold it all together and repeat it. You don't get the three elements of the sentence. You would get, in element zero, aardvarks eat ants. Then element one would contain ants, that's from the first capture group, which is the outer one, and element two will also contain ants, because it's the result of the last of the three iterations, I guess, the last word in the sentence. So it's not useful if you want to capture the words.

So the thing to bear in mind is that it's a useful thing to be able to do when you just want to shorten a complex regular expression, but it doesn't let you pick out the bits that capture groups will let you pick out, which is probably why you use them in the first place.
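
A small sketch of the repeated-group form discussed above, again reconstructed from the description rather than taken from the notes. The variable names `whole`, `outer` and `inner` are illustrative only.

```bash
#!/usr/bin/env bash
# The inner group matches a word; the outer group adds optional spaces
# and is repeated three times with {3}.
re='^(([a-zA-Z]+)[ ]*){3}\.?'

if [[ 'Aardvarks eat ants.' =~ $re ]]; then
    whole=${BASH_REMATCH[0]}   # everything the expression matched
    outer=${BASH_REMATCH[1]}   # outer group, as left by its final repetition
    inner=${BASH_REMATCH[2]}   # inner group, as left by its final repetition
    printf '0: %s\n1: %s\n2: %s\n' "$whole" "$outer" "$inner"
fi
```

As the episode says, only the last repetition survives in elements one and two, so the first two words are not available individually.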

So the rest of this show is just a bunch of examples.

Example one is revisiting example four from the last episode, where I'd written a script which takes an IP address, an IP version 4 address, and checks it for validity, but I've made it a little more clever and I've also used capture groups to do it. This one is downloadable and it's called bash13_ex2.sh.

In the previous example we had four groups of up to three digits, the numbers that they form have got to be in the range 0 to 255, and they're separated by dots. The regular expression that we have here does this just as the previous one did, although in the previous one we used a repeat count in curly brackets for the repeated expression. This time we're spelling them out, putting them one after the other, but we're putting them in parentheses and we're anchoring the whole thing at the front and the end of the string.

The script checks to see that there is an argument, which is going to be the IP address, and produces an error message if it's been forgotten, and then it compares the argument, dollar one, against the regular expression. That's pretty much what we did last time, but in this particular case we're going to be a little bit more detailed about it.

Inside the if statement we have a variable ERRS, which is short for errors, being set to zero, and another one called problems being set to nothing; it's just problems equals, which sets the variable to a null string. Then there's a for loop that iterates through the values 1 to 4, that's for the four groups of numbers, and it uses the variable i to do that. We set a variable d to the element of the BASH_REMATCH array indexed by the variable i. Then there's an if statement that says: if dollar d is less than zero or dollar d is greater than 255; that's obviously checking to see whether it's in the range 0 to 255. What we're using here is not a regular expression at all, of course; we're using an extended test, and the or is the two vertical bars.

If that turns out to be true then we increment the variable ERRS. I only ever tend to type this name, but now that I've had to pronounce it, it's errors. Anyway, we increment it because that's counting the number of errors, and we also add to the problems variable, which is just a string, the value of the dollar d variable. That's a number, but it's going to be treated as a string, and we follow that with a space. So if there are multiple errors we're going to end up with a bunch of numbers separated by spaces.

After that loop has gone through four times there's either going to be no errors or there's going to be one to four errors, and one to four numbers in the problems string. Then there's another if which checks to see if ERRS is greater than zero, and if it is then we want to say there were problems, it's not a valid IP address, and here are the problems.

There are many ways you could do this. The way I chose was, using the variable problems, I set it to itself using the parameter expansion capability where you can extract a piece out of a string. So I extract problems, colon, zero, starting at position zero, it's zero-relative, colon, minus one; that's in curly brackets with the dollar on the front. What that means is from position zero, the start of the string, up to but not including the last character. That's because we were adding a number and a space to it each time we found a problem; we just want to take that last space off, and that's what it's doing.

Then we echo the error report: dollar one is not a valid IP address, semicolon, contains, and then we want to report the problems string. But I've got fancy with it and I've put that in curly brackets with a dollar on the front, and inside there's another parameter substitution expression using double slash, space, and then a single slash, comma, space. What that does is to go and find every space in this variable and replace it by comma space. So you'll get back a list with a comma and a space in between the numbers. Then, since there were errors, it exits with the value one.

If there were no errors, then this branch of the if, the one which validates it against the regular expression, will report that the address is valid. If the regular expression didn't match at all then it will say that it's not a valid IP address.

I've drilled down in the notes after this into the regular expression a little bit more. I think I've already said what it says, so there's no point in going over it; I've explained this already without reading it out here. Hopefully you'll find that both what I've just said and the notes help you to understand if you're having difficulties.

There's a bunch of examples of running it with different IP addresses, some invalid. So 192.168.0 with a trailing dot is not valid because there are only three numbers. 192.168.0.5 is perfectly valid. The last one is 192.168.500.256; well, 500 and 256 are out of range, so it says it's not valid and it contains 500, 256. You might type that in in error because you're a terrible typist like I am. The point is that we've got regular expressions here to do these sorts of checks, and a few other bells and whistles as well.
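
A sketch of the validation logic walked through above. It is not the downloadable bash13_ex2.sh: the body is wrapped in a hypothetical function `check_ip` that returns instead of exiting, the message wording for the valid and non-matching cases is assumed, and the `10#` base prefix is an addition so that a group such as 08 is not read as octal.

```bash
#!/usr/bin/env bash
# Four groups of one to three digits separated by dots, anchored at both ends.
re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'

# Hypothetical function; the script described takes the address as $1.
check_ip() {
    local ip=$1 errs=0 problems='' i d
    if [[ $ip =~ $re ]]; then
        for i in {1..4}; do
            # 10# forces base 10 (an addition, not mentioned in the episode).
            d=$((10#${BASH_REMATCH[$i]}))
            if [[ $d -lt 0 || $d -gt 255 ]]; then
                errs=$((errs + 1))
                problems+="$d "
            fi
        done
        if [[ $errs -gt 0 ]]; then
            # Drop the trailing space (negative length needs Bash 4.2 or later).
            problems=${problems:0:-1}
            # Replace every remaining space with ", " for the report.
            echo "$ip is not a valid IP address; contains ${problems// /, }"
            return 1
        fi
        echo "$ip is a valid IP address"
    else
        echo "$ip is not a valid IP address"
        return 1
    fi
}

check_ip 192.168.0.5
check_ip 192.168.500.256
check_ip 192.168.0.
```

The regular expression only checks the shape of the address; the numeric range check has to be done separately because a pattern like one to three digits also accepts 500.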

Example two is the one where I played around with the idea of back references. As I said, I couldn't find any official documentation about them, but it does seem to be something that's available, at least in the version that I'm using. I haven't tested any others; I could have done, I guess, on some of the Raspberry Pis I have, which will have various versions.

A back reference consists of a backslash and a number. The number refers to the capture group, counting from the left of the regular expression. But it looks as if you can only have a single digit after the backslash, so it means you can only refer to capture groups 1 to 9. If you've got 10 or more then there's no way of referring to them with backslash number. Some regular expression systems allow you to follow the backslash with curly brackets and a multi-digit number. I can't remember if sed does. Perl's is the one I know most, but anything which is derived from the Perl regular expressions, which is a thing called PCRE, Perl Compatible Regular Expressions, will have that capability. I think Python uses PCRE, so does PHP, and a number of other languages too.

So what I've done to mess around with this is to create a regular expression which consists of, in parentheses, one of those regular expression metacharacters, which is a backslash and a less than, which means start of word; then a full stop followed by, in curly brackets, 1 comma 10, so that means 1 to 10 characters, any characters; we follow that with backslash greater than, meaning end of word; close parenthesis. So that's a word of between 1 and 10 characters. Then a space and backslash 1. That's my regular expression.

This is a very simple script, not really a production thing, just a messing-around thing. I compare dollar one, on the assumption that there's an argument, with that regular expression in the usual way. If it matched then I put out the message matched, otherwise I put out the message no match.

My simple example is running this script, which again you can download, bash13_ex3.sh. I gave it a two-word string in single quotes, turnip turnip, so it's less than 10 characters long and there are two instances of it, and the script says it matched. So it's actually working.
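
The experiment described above might be sketched like this. The function name `try_backref` is hypothetical and the pattern is reconstructed from the spoken description. Bash hands the pattern to the system's regular expression library, and both the word-boundary markers and a back reference inside an extended regular expression are extensions of that library, so the result can differ from one system to another.

```bash
#!/usr/bin/env bash
# Group 1 is a "word" of 1 to 10 characters between \< and \>;
# after a space, \1 demands the same text again.
re='(\<.{1,10}\>) \1'

# Hypothetical function standing in for the script body.
try_backref() {
    if [[ $1 =~ $re ]]; then
        echo "Matched"
    else
        echo "No match"
    fi
}

try_backref 'turnip turnip'    # reported as a match in the episode
try_backref 'turnip parsnip'
```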

But I don't think I would recommend using it. It's maybe something to keep an eye on; it's maybe something that's going to be available in an upcoming version of Bash. That would be good, it's a really useful feature to have, but it's not officially there yet.

My last example is quite a complex one, and in this one I'm trying to parse a file of email addresses, the sort of thing that you might have put together from an address book or something. The format of email addresses is quite complicated, and there are several RFCs that define it. It gets a lot more complicated than you would imagine because email goes way back in time. Nobody uses a lot of the features of it now, of the email address I mean. The script doesn't try to be fully comprehensive about this. Writing a Bash script is not the best way to validate email addresses; there are libraries that will do it in various languages. But this is hopefully a useful example to mess around with.

The script is catering for two formats. The first is where an address consists of a so-called local part followed by an at sign followed by a domain. My example is something like vim at vim.org. That's probably the simplest format of email address that you can have. But there's another one that you'll often see, which consists of a name, where I think there must be some restrictions on what characters it can contain, but let's say it's letters and numbers, possibly parentheses, and that's followed by a local part at domain address, but it's in less than and greater than signs, diamond brackets as I think they used to be called. An example of that would be the thing you might see when you look at email from the HPR mailing list. It's often titled HPR, space, list, space, less than sign, hpr at hackerpublicradio.org, and then the greater than sign. So those are the two generic formats that we're going to be playing with.

There are two files here. There's bash13_ex4.sh, which is the downloadable script, but it's reading data from a file, which is bash13_ex4.txt, which you can also grab and mess around with if you want.

So let's talk about this script. It's not hugely long, but it's fairly complex, so it's going to take me a while to talk about.

The script contains a variable called data, which refers to the text file of addresses. I haven't put a path in there, assuming it's in the same directory as the script; you could get fancy with that if you wanted to. There's a check to see whether that file actually exists, and if it doesn't then it's reported as missing and the script will exit with a false value. I won't go into details about how that's done; you can see it's similar to the things I've used before.

Now, the regular expression is complex. It consists of two components, which I have written out as two individual variables called part one and part two, and then I've concatenated them together inside a string and put them into a variable RE. The two parts are enclosed in parentheses in that final statement, the third statement; in front of the open parenthesis is a circumflex and after the close parenthesis is a dollar, so it's anchored to the line we're reading from the file. It's not really necessary to do that, but putting them in parentheses is definitely recommended. The two variables are being substituted in a double-quoted string and they're separated by a vertical bar, so we have a regular expression which has two alternative parts.
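
The assembly just described can be sketched as follows. The contents of the two parts are reconstructed from the description given later in the episode, and the variable names `part1`, `part2` and `re` are assumptions about how the script spells them.

```bash
#!/usr/bin/env bash
# part1: local-part@domain           (capture group 2 of the final expression)
part1='([A-Za-z0-9_][A-Za-z0-9_.]+@[A-Za-z0-9.-]+)'
# part2: Name <local-part@domain>    (capture groups 3 and 4)
part2='([^<]+)<([A-Za-z0-9_][A-Za-z0-9_.]+@[A-Za-z0-9.-]+)>'
# The outer parentheses (group 1) hold the alternation; ^ and $ anchor it.
re="^($part1|$part2)$"

for addr in 'vim@vim.org' 'HPR List <hpr@hackerpublicradio.org>' '.42@unknown.mars'; do
    if [[ $addr =~ $re ]]; then
        echo "recognised: $addr"
    else
        echo "not recognised: $addr"
    fi
done
```

Without the outer parentheses the anchors would bind to only one alternative each: the circumflex to the first and the dollar to the second.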

If we look at the expressions themselves, part one is the one to match the simpler form of the email address. It starts with, in square brackets, lowercase a to z, uppercase A to Z, 0 to 9 and underscore. That's the first character; that's because there are some restrictions on how an email address can start. It can't begin with a dot, for example. It can begin with a number, it can begin with a letter, and indeed it can begin with an underscore.

Then after that we've got more or less the same thing, except that inside the square brackets there's a dot as well, and that's followed by a plus after the closing square bracket, so that means any of those characters one or more times. That's followed by an at sign, and then the square-bracketed list follows again, but instead of an underscore it includes a hyphen. That hyphen has to be at the end, because the hyphen is used to separate ranges of things, so you have to put it at the end so it doesn't look like it might be a range. I think you can put it as the first thing as well; in fact I know you can. After the closing square bracket we've got a plus. So that's the domain part, which can be any letter or number, or a dot; you have to have a dot in between the components, and you can have hyphens in the name. There are probably other things that could go in there; I didn't want to get too complex.

Part two is pretty much the same, except that the expression we just looked at is enclosed in a less than and a greater than, and prior to it there's another parenthesized expression which consists of open square bracket, circumflex, less than, close square bracket, plus, which means one or more characters that are not less than signs. In other words, the characters from the start of the string to the less than sign that begins the address part can be anything, as long as they're not less than signs. I think I've mentioned this before, and it's a common thing that you see in regular expressions. So this is quite a complex regular expression. It could be simplified further, I think, or indeed you could do some clever things with string substitution in Bash, but I didn't want to get into that because it would just make life too complex for anybody reading this.

The main script is a loop, and it's one of these while loops which on each iteration is reading a value from somewhere or other into a variable called line. The way that's done is that in the final line of the while loop, done is followed by a less than sign and the name of the data file variable in double quotes. So the loop will attach itself to that file, if you like, and whenever read wants to read something it will read from that file. Again, we've seen this before; it's a common trope in Bash scripts.

Inside this loop we first have an if statement which is simply comparing the line with the regular expression. If it matches then we want to check things about the contents of BASH_REMATCH, and here's where things get a little complicated. Let me just digress for a bit. I fear that my notes are possibly a bit more comprehensive than what I've just talked about; I wasn't reading, I was doing it off the top of my head, really.

Because we have capture groups within capture groups, and we've got two sets of them, and in one case one set will be triggered and in another case another set will be triggered, we need to do some quite clever checking on what comes back to determine what type of address we've got. The way that I worked this out was by using what you see commented out in the script: just after the regular expression match there's a statement declare, space, hyphen p, space, BASH_REMATCH, without any dollars or anything, just the name of the array. declare is a command within Bash that allows you to create various variables, and amongst them you can create arrays this way. When you use the hyphen p option it means print out its contents and its attributes. So we're asking it to print out the attributes and contents of BASH_REMATCH.
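
A sketch of that debugging step, using the same reconstructed expression. The helper `dump_match` is hypothetical; in the script described, the declare line simply sits, commented out, after the match.

```bash
#!/usr/bin/env bash
part1='([A-Za-z0-9_][A-Za-z0-9_.]+@[A-Za-z0-9.-]+)'
part2='([^<]+)<([A-Za-z0-9_][A-Za-z0-9_.]+@[A-Za-z0-9.-]+)>'
re="^($part1|$part2)$"

# Hypothetical helper: match one line and dump the whole array,
# showing every element with its index as well as the array's attributes.
dump_match() {
    if [[ $1 =~ $re ]]; then
        declare -p BASH_REMATCH
    fi
}

dump_match 'kawasaki@me.com'
dump_match 'HPR List <hpr@hackerpublicradio.org>'
```

The exact layout of the output depends on the Bash version, so it is worth running this locally and comparing the two dumps element by element.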

It's commented out, so it's not actually being run at the moment, but what I did was run the script using that declare in order to see what BASH_REMATCH would hold. A bit further down the page it shows what is produced for two addresses of the two different types.

These are all dummy addresses, by the way, which I generated on a site I found that will generate batches of dummy addresses for you. There are a few real ones I chucked in, things like vim at vim.org and stuff like that. Anyway, the first one is kawasaki@me.com. When this is checked through the regular expression comparison, BASH_REMATCH element zero, not surprisingly, gets the whole address. Element one also gets the whole address, and element two similarly. Elements three and four get nothing.

Element zero always gets, not the whole regular expression, but everything that the regular expression matches. Since we put parentheses around the whole thing in order to contain the two alternatives, we get the same in element one of BASH_REMATCH as well. It's also matching everything, because both of the alternative regular expressions are inside it. So you can ignore element zero and element one. If the address matches the first sub-expression it will get stored in BASH_REMATCH element two. If it matches the second sub-expression, because the third and fourth capture groups are in that sub-expression, you'll see results in elements three and four, but in this particular case nothing is being matched there, so they're empty.

In the case where BASH_REMATCH element two contains something, then, we have an email address of the first type. Further down my notes you'll see it matching another address, which consists of S, space, Mayor, space, and then in less than and greater than signs, smayor at yahoo.com. Ignoring element zero and element one, we find that element two of BASH_REMATCH is empty, because that's only filled when it matches the first sub-expression, while elements three and four hold, first of all, the name portion and secondly the address portion of this email address. So the script uses the fact that element two of BASH_REMATCH is zero length in order to determine which type of address was matched.

Going back to the actual script itself, you'll see there is an if, then an extended test with hyphen lowercase z, then BASH_REMATCH element two. If BASH_REMATCH element two is zero length then we know that it's the type two email address, which is the name and then the local part and domain inside so-called diamond brackets. There's room for confusion here; in fact, as I was writing this I got confused, which is why I added comments. In that particular case we store BASH_REMATCH element three in a variable name, and in a variable email we store BASH_REMATCH element four. If BASH_REMATCH element two is not zero length then it's the type one address, just the simple local part at domain format. So we set name to nothing, because there was no name portion there, and we set email to BASH_REMATCH element two. That's the end of the if that checks which version it is.

Then we simply print out the contents of name and the contents of email. Name is going to be blank in some cases, but that seemed to me to be reasonable. You could omit printing out name when there is no name, but I thought it was more useful to report explicitly that there was no name. In the other branch of the if which was testing the regular expression, if it doesn't match at all then we say not recognized and report the line that failed. There's an empty echo towards the end of the loop which just causes a blank line to be produced between each iteration through the file.

If we look further down the page at everything relating to this example, I included an excerpt from what was produced when the script is run, and this is the one with the declare command commented out, as it will be if you download it. The first email address that it finds is one at yahoo.ca which has a name; that was the second type of email address, so the name field is populated and the email field is populated. The second one is somebody called mcraw4@live.com, where there was no name. The third instance is one where the email address is .42@unknown.mars. This is one I added, and it's illegal because it begins with a full stop, so it says not recognized; it doesn't match anything.

That's what it does. If you feel interested enough to investigate further you can download all of that stuff, run it yourself and see what it does.
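
Putting the pieces of this last example together, a sketch of the loop might look like the following. It is not the downloadable bash13_ex4.sh: the loop is wrapped in a hypothetical function `parse_addresses` that takes the file name as an argument, the output labels are assumed, and the sample addresses are written as they were heard in the recording, so their spelling is uncertain.

```bash
#!/usr/bin/env bash
part1='([A-Za-z0-9_][A-Za-z0-9_.]+@[A-Za-z0-9.-]+)'
part2='([^<]+)<([A-Za-z0-9_][A-Za-z0-9_.]+@[A-Za-z0-9.-]+)>'
re="^($part1|$part2)$"

# Hypothetical function wrapping the loop; the script described reads a
# fixed file named in a variable called data and exits if it is missing.
parse_addresses() {
    local data=$1 line name email
    if [[ ! -e $data ]]; then
        echo "Missing file: $data" >&2
        return 1
    fi
    while read -r line; do
        if [[ $line =~ $re ]]; then
            # declare -p BASH_REMATCH   # uncomment to inspect the array
            if [[ -z ${BASH_REMATCH[2]} ]]; then
                # Type 2: Name <local-part@domain>
                name=${BASH_REMATCH[3]}
                email=${BASH_REMATCH[4]}
            else
                # Type 1: local-part@domain
                name=''
                email=${BASH_REMATCH[2]}
            fi
            echo "Name:  $name"
            echo "Email: $email"
        else
            echo "Not recognised: $line"
        fi
        echo    # blank line between records
    done < "$data"
}

# Demonstration with a temporary file of sample lines.
tmp=$(mktemp)
printf '%s\n' 'S Mayor <smayor@yahoo.com>' 'mcraw4@live.com' '.42@unknown.mars' > "$tmp"
parse_addresses "$tmp"
rm -f "$tmp"
```

Note that the name group captures everything up to the less than sign, so it keeps the trailing space; trimming it is left out to stay close to what was described.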

You could indeed add more addresses and see how it behaves. You could also enhance the regular expression to make it more generic, but I'm not sure, as I say, that Bash is the correct vehicle for this. But it makes the point about regular expressions and BASH_REMATCH and capture groups, et cetera, so hopefully you'll find it useful.

Well, that's the end of that, the end of this group, and I hope you found it to be useful. There are some powerful features within Bash, as I'm sure you're well aware by now, so I hope this helped to reveal this particular set of features to you. Okay then, bye bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.