Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
310
hpr_transcripts/hpr2610.txt
Normal file
@@ -0,0 +1,310 @@
Episode: 2610
Title: HPR2610: Gnu Awk - Part 12
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2610/hpr2610.mp3
Transcribed: 2025-10-19 06:32:25

---

This is HPR episode 2610, titled Gnu Awk - Part 12, and is part of the series Learning Awk.
It is hosted by Dave Morris, is about 34 minutes long, and carries an explicit flag.
The summary is: advanced use of arrays.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello everybody. Welcome to Hacker Public Radio.
This is Dave Morris and I'm doing part 12 of the Gnu Awk series, or Learning Awk as we're
calling it. I started talking about arrays in episode 10 and I thought I would
continue that in this episode, looking at some of the advanced elements of arrays.
Now the stuff I'm talking about today is specific, largely I think, to the Gnu version of Awk.
That means if you're using an Awk which is not the enhanced Gnu version then some of these
might not be available, so you need to check to be sure. So I'm talking about arrays,
but I'm also going to finish off with an example of using Awk to solve a problem that I had.
This is not related to arrays, but I thought it would be useful if you had some real-world
examples of using Awk rather than the fairly sterile examples that you tend to find in these
episodes otherwise. So I'm going to start by talking about patsplit. I mentioned the split function
in the last episode I did on this, episode 10, but there's a more powerful function for splitting
strings into array elements, and it's called patsplit because it splits according to patterns.
It takes a series of arguments. The first one is a string, which is the thing to be chopped up,
and it's going to be chopped up according to the third argument, which is called fieldpat, which
defines the way in which the string is to be split, and the pieces are put into
an array, which is the second argument. There will be separators between each of the fields,
or there may be anyway, and they... yeah, I think there would have to be, wouldn't they?
And they are put into a further argument, which is an array, which is denoted by seps in the
example. We'll look at this in a bit more detail. This is very similar to the way that split works,
and you can see the examples I gave you there. But the main difference from split is that
this fieldpat argument, the third one, is a regular expression which defines the field rather
than the separator. So I've got a bunch of examples here and I've gone for splitting up
comma-separated stuff. Now what I've done here is to write a script which deals with $0,
the input record. I'm just giving the example one record, but this would work with multiple records, of
course. I'm using $0, and you could just as well have ignored patsplit and used the standard
field-splitting mechanism, but we haven't really covered how you can do that using a regular expression.
There is a built-in variable called FPAT, which is similar to FS, which would do that,
but it hasn't been covered yet. We'll be adding that into the series a bit later on.
So I've got a bunch of examples which I've called awk12_ex and then a number, .awk.
This one is ex1, and what we're doing here is using patsplit to split $0 into an array called
a. We're doing it by finding fields which consist of zero to any number of non-commas. So
the regular expression is in slashes, and it's open square bracket, then a circumflex, which means
not, then a comma, close square bracket, asterisk. That means zero to any number of characters which
are not commas. So in other words, anything that consists of things which are not commas, followed
by a comma, followed by not-commas and a comma, will fit that. Then, having split it, there's a loop
which goes for (i in a), remembering that that's the way that you walk through an array, and it
prints out the value of the array a indexed by i. So if we feed it "An apple a day keeps the
doctor away" with commas instead of spaces, then the output is the same sentence with spaces
in between. I'm printing them out without newlines and then putting a newline on the end, so
you see an actual string. I've used a similar sort of approach throughout these examples.
In the example, not the thing that you can download, which is the awk script, but in the actual
example in the notes, I've shown the process of making a Bash variable x into the
string "An apple a day keeps the doctor away", then using Bash's editing features to replace all
the spaces by commas, and then feeding that to the awk script, which then removes them again.
It's a silly example, but I hope you get the idea from that. Now if you wanted to do a
more complex regular expression, example 2 shows that. This example takes the
string "A bird in the hand is worth two in the bush", but I turned the word "bird" into
"red bird", separated by commas and enclosed in double quotes. In standard CSV format you can have
elements of the comma-separated value list which contain spaces, or commas indeed, enclosed in
double quotes, so I've just emulated that. Then when it's printed out, it's printed with each of
these elements separated by spaces, and I put angle brackets around each one just to make them
stand out more clearly, and you can see that the "red bird" string is one entity.
The regular expression consists of two subexpressions enclosed in parentheses with a vertical bar
in between them, so it's an "or" type expression. The first one is the same as in the previous
example, a series of zero or more not-commas, if you like to put it that way. The second one
looks for a double-quoted string containing one or more things which are not double quotes. This
technique of saying "the thing that encloses a string, followed by any number of characters which
are not the enclosing character" is a technique you'll often see in regular expressions, so
that works fine with the example, as you will see, and that's ex2. Then in ex3 we've got an
example where the pattern is quite simple, but what we're doing here is saving the separators.
So the patsplit is simply using a series of letters, capital or lowercase, one or more,
so any sequence which matches that is the field definition, the field pattern. We're saving the
separators in an array seps. The script prints out all of the elements of the array which were captured
by splitting, and of course I've called the array a because I've not got much imagination: no newlines,
just spaces in between the elements, followed by a newline at the end. Then there's a similar loop to
print out the contents of the array seps. I say similar, but it's not quite the same, because
this time it's a counted loop: when you run patsplit it returns the number of fields
that it found, and I captured that in a variable called flds, short for fields. I use that in the
loop, setting i equal to one, then adding one to it while it's less than or equal to that
number of fields. That prints out all of the separators and then it puts a newline on the end.
So the result is you get the words in the sentence fed to it, followed by a line containing
all the separators. What I fed to it was the expression "Grinning like a Cheshire cat", where each
word is separated by a number of hyphens, so the first thing you see is "Grinning like a Cheshire cat"
separated by spaces, followed by all of the different hyphens separated by spaces. It just so happens
that each sequence of hyphens is the same length as the word
before it, and I just wrote a little Bash script to do that, which I've included in the notes here, but
I've marked it "skip unless you're really interested", so I won't read this one out. You can dig into it
if you really want to. It is available for download if you want to grab it and mess around with it.
Now, the printing of the array seps doesn't begin at 0, it begins at 1, but there is a zeroth element,
because patsplit captures the separators prior to the first field. Well, there aren't
any in this case so I didn't bother to print it, but it's worth bearing in mind because it can be
of interest. Okay, that's all I'm going to say about patsplit. Let's move on to sorting arrays.
Basically there are two main ways to do this. The first one is to use an extension in Gnu Awk,
which is a built-in array variable called PROCINFO, all in uppercase. The element of the array
has the index sorted_in, and because that's a string it has to be in double quotes,
so PROCINFO, square bracket, quote, sorted_in, quote, close square bracket. That's the magic
variable which can be used to control how arrays get sorted. In the original version of Awk, the
non-Gnu version, arrays came back in an arbitrary order when you looped through them,
so sorting them could be a bit of a pain, and I know this because that was one of the things I had
to do in my early computing career. When I started to use Awk there was no sorting built in.
The thing you put into the PROCINFO element is a predefined string which begins with an
at sign and consists of various keywords. The default one is @unsorted, which means that
the array elements come back, as in standard Awk, in an arbitrary order. Then there's a bunch of others, and
I'm not going to read them all out because there's quite a number. It's a little table I put together of
them. Take for example one that I quite like to use, and one I've used in the example, which is
@val_str_asc. That stands for val, the values of the array as opposed to
the indices; str, treat them as strings; asc, in ascending order. The notes here say "order by element
values in ascending order; scalar values are compared as strings". So whatever values are
found in the array elements, be they numbers or strings, will be treated as strings and sorted
accordingly. This can be quite useful. I certainly would have been more than delighted to
have had this when I had various tasks to do using Awk back in my career. Setting this value
determines the sort order, but it has to be done before the loop scanning the array begins. You can't change it
while the loop is scanning. And what's more important, perhaps, is that whenever you set
this value, PROCINFO sorted_in, then it's effective throughout the entire script. There's no sort of
scoping or localization, so if you have an awk script that's printing
arrays in several places, they're all going to be sorted in this way. You can change the value
between instances, of course, and you can also switch it off by setting
it to @unsorted, but it has a wider effect than might be obvious. There's a bit more to
this thing than I've mentioned here, and I've just alluded to it, because arrays can be more complicated
than we've seen so far. Plus, this PROCINFO sorted_in can also contain the name of a function
which will perform sorting on the array for you. It's a function that you have to define.
We haven't looked at user-defined functions yet, and I'm not sure whether we will go into
this when we do get to that point. I've pointed to the Gnu Awk manual, section 8.1.6, which
covers this in a lot of detail, so if you really need to use this then that's the place to go.
So there's an example which is called ex4, another downloadable one, and it consists of a
BEGIN rule, and in the BEGIN rule PROCINFO sorted_in is set to @val_str_asc,
the one I mentioned before. Then we just use split to split $0 into an array.
It's doing the split on spaces, which is the same sort of split you would get by default
anyway, but if you split stuff in the usual way with Awk you can't easily sort it,
as it doesn't go into an array. Then the script prints out the elements of the array, and they will
come out in sorted order. It uses for (i in a), as we've used before, and prints out
the value of i and then the value of the array a indexed by i. I've fed it the string
"An Englishman's home is his castle". Because it's sorted on the value, it comes out as "An
Englishman's castle his home is", and you'll see the indexes are not in sorted order but the values
are in sorted order, alphabetically sorted with the capitalized words before the lowercase ones.
I think that's quite a potentially useful thing. I have certainly used this kind of
sorting capability in the past to capture frequency information from bits of data, and
frequencies were often quite important bits of knowledge in the environment I
worked in, and having a sorted list of frequencies was often a useful thing to have. So sorting
in this way, or alphabetically sorting the names that you were doing frequency
counts on, or something like that, was often quite a desirable thing for various reasons. So let's
now look at the functions which are available for array sorting. b-yeezi mentioned them in
his review of string functions in episode 11. The functions we're going to look at are called
asort and asorti. Now the two functions have pretty much the same arguments. I have listed them
separately and described each of the arguments separately in the notes. The arguments are the source,
which is the array that you're going to be sorting; the second one is called
dest, which is an optional one, which is the place you're going to put the results of the
sort; and the third argument, which again is optional, how, is a way in which you can define the type of
sort. The how argument, not too surprisingly, can be any of the strings that we've already seen,
these @unsorted and @val_str_desc and asc and so forth that we saw in the context of the
PROCINFO stuff. Anyway, let's look at the examples. I'll whizz through these fairly quickly. There are three
of them. The first one is ex5, and I've made them all fairly trivial. I've defined an array
and called it a, and the array is indexed by the numbers one, two and three, and in it I've just put
names: Jones was the first one, then X, I was thinking of Mr X when I wrote that I think, and Smith. Then I
used asort on that array and printed out the results. Using asort on that array a, which
has got the indexes one, two and three, results in nothing very exciting.
I had to do a double take on this one, example 5, because the array is being loaded up with indexes
one, two and three with the strings Jones, X and Smith. Then it's being sorted. The sort will cause
the values to be sorted, so when it's printed out you get Jones, Smith and X in alphabetical order.
But the indexes have been changed to be one, two and three against Jones, Smith and X. So in
other words, potentially the indexes are completely destroyed and are replaced by the numbers one,
two and three, whatever's appropriate for the number of elements. It's not very
obvious in the example, I apologize for that, but it's an odd thing to do in some respects, because
you're taking an array which has got indexes which one assumes are important, and it's reordering
the indexes, so one stays as Jones, two instead of being X becomes Smith, and three instead of being
Smith becomes X. So it's a slightly odd thing to do, I guess you'd say. I think the
PROCINFO method is better in many ways. In example 6 I have done the same thing, Jones, X and Smith
in an array a, but instead of using numeric indexes I've given them the characters a, b and c.
What I've done this time is used asort on a, but I've said the destination is to be b. So in doing
that, asort first of all copies a into b and then sorts b and does its stuff with b. So
there's a loop which loops through the array b and it prints out 1 Jones, 2 Smith, 3 X,
the same as before, but then the second loop prints out the array a, which goes a Jones, b X and
c Smith. So in that case the indexes have not been destroyed, but they were messed up in the
first one. Example ex7 uses the other function, asorti, and it creates an array a
where the indexes are strings: third, second and first. So third is Jones, second is X, first
is Smith. Then asorti(a): when that's printed out you see that what you've got is an ordering
of the indexes, but the actual values have been thrown away, so you might wonder what on earth that's for.
I think I might be able to explain that in a moment. Example 8, number four in this group,
uses asorti but with a dest argument so that you don't destroy the original. It's pretty
much the same except that it prints out the result by using one array to index the other.
So it's doing asorti of the array a into b, and then it loops through b,
so for (i in b), and then for each element it prints out the array b indexed by i and then it
prints out the array a indexed by b[i]. So the result is first: Smith, second: X,
third: Jones. So effectively it's sorted them into the correct order by index without messing
up any of the data. Well, the data has been messed up in the array b, but the results in b
have been useful in indexing a. I hope that makes sense. I have to say that when I was having to
do sorting of this sort of stuff myself using just basic sorting algorithms in Awk, that
was the technique that I used, but it's a little odd until you get the
idea of it. And that's why, I think, Awk is changing these arrays, because it's assuming
you're going to use them as ways of indexing the original data. The next example, the fifth one
in this particular group, ex9, uses the same sort of idea, the three elements in the array indexed
by a, b and c. It's using asort from a into b, but it's using the how argument, and it's
using @val_str_desc, so descending by value, treating them as strings.
Then it's using array b in the loop, and it's printing out the index of b and then
the contents of b. Element a of a was Jones, element b was X, element c was Smith, and when it
comes to print these out you see X, Smith and Jones listed out in that order, in descending order,
with the indexes one, two and three. This is useful, but I would offer that the
use is moderately limited. Okay, I've got a section here entitled "Yet more about arrays", but
it's really just to say I'm not going to do any more about arrays just now. There is more to be said.
There is a sort of multi-dimensional array capability in Awk; right from the original Awk it had this
and it's still available in Gnu Awk, and there's a considerable enhancement in that you can have
arrays as array elements too in Gnu Awk, but I'm not sure that we're going to be covering
these topics in this series. There's loads of information about this in the Gnu Awk manual if
you want to dig deeper. However, if we receive any requests to cover this
area in more depth then we'll reconsider doing something about it. I must admit I have never
used multi-dimensional arrays, nor arrays of arrays, which can be arbitrarily deep, and
there are some quite reasonable facilities for manipulating them and walking through them,
but it's not a thing I've ever used in Awk. If I was to do that I would use a different
scripting language, I have to say. Time is marching on, so let me go quickly through my real-world Awk
example. This is about things I do, and one of the things I do is to process show notes for
HPR which are sent in with episodes. Many people send in their show notes as plain text, which is fine,
but we need HTML for loading into the HPR database. So what I do with them: I've got a series of
scripts with which I check them, pull the notes out of the file that we get from the form,
and I edit them to fix any errors, turn them into Markdown and then generate HTML using
a tool called pandoc. As part of that process I look at the HTML that's generated locally.
I've grabbed this and I'm working on my local workstation, and I make a copy of the HTML
in a format which is easy to browse, and pandoc is good at doing this. It turns the Markdown
into standalone HTML which I can view in a browser, and it looks pretty much how it will
look when it's on the HPR site. So that's the point at which I can say "oops, there's a mistake here"
and go and fix it and move on from there. To make the HTML copy I want for viewing locally,
pandoc has recently changed to the extent that you need to provide further information. The further
information is a couple of lines of metadata which has to be in a format known as YAML. YAML is
a sort of simplistic data format which is quite well defined but simple to produce
and human-readable and so forth. There are alternative ways to do it, but I'm using the YAML option.
So the way this should look is there should be two lines of metadata with three hyphens above and
three full stops below. The two lines consist of, first, the word title in lowercase, a colon,
a space, then the title of the show, which has to be enclosed in quotes, or at least I
enclose it in quotes anyway. The second one is author, colon, space, then the name of the host, and
I enclose that in quotes too, and that's used to generate headers in the final document.
This is just for my own benefit. So I wrote an awk script to generate this YAML metadata and I've
embedded that in the Bash script that I use to run pandoc. I've included this bit of Awk in the
notes here, and it consists of 14 lines. This is part of another script, as I said. The first line
is awk, space, -f, then a space and a hyphen character, then the name of a variable, the input file,
and the output is redirected into an output file, again defined by a variable. The first
variable is called $RAWFILE, the second variable is called $TMP1, a temporary file which will
be thrown away afterwards. But the end of the line, and this is where we're digressing a little bit
from Awk into some of the areas of Bash, consists of a thing called a here doc. A here doc is the
way in which you tell Bash there is some data that's to be stored in a
file or given to a program, and in order to do this you need to use two less-than signs followed
by a word. The word has to have no spaces in it; I think it can contain other
characters. I usually just make it a series of letters. This particular one I've called ENDAWK,
all in capitals. I put it in quotes, and I'll mention this in a moment. Everything from that line
up to a line that only consists of ENDAWK, starting in column one, is data to be chewed up by Awk.
And because the awk command uses -f, which is telling Awk where the program file is to come from,
the script itself, and the argument to -f is a hyphen, that hyphen means get it from
the standard input channel. So it's telling Awk effectively that what follows is
the program. It's just a convenient way of including an awk script in another script.
Immediately after the invocation of awk you can put the whole thing in quotes, but if the script
itself uses quotes things get really convoluted. This particular case includes both single and
double quotes, so using quotes to enclose it would be a real pain. The quotes around the
here doc terminator tell Bash not to interpolate any dollar-sign variables in the data.
By default it will actually scan this data, and if it finds dollar-something it will assume it's
the name of a Bash variable and it will interpolate it. If you put the here doc terminator
in single quotes then it won't do that, and I've got dollars and stuff in this script. The script
itself begins with a BEGIN rule, and the BEGIN rule simply prints out the three hyphens that we
need to start the thing, and it ends, on line 13 (I put line numbers on this one for ease of
reference), with an END rule which prints out the three full stops at the end of the metadata.
Then there are two regular expressions in the main script, and these are things which are going
to be matched against the input data. The first one is circumflex, Title, colon, with a capital T,
and what this is meant to do is to match the string Title which is in the input file,
where that input file is the one that's come from the HPR server and contains the data that's
been fed in by the host submitting the show and has been turned into this file. So one of
the items on the form is the title of the show, so we're looking for the result
of that. The rule itself uses the sub function, which we looked at in the previous show,
which matches the string circumflex, Title, colon, circumflex being the start of line as you'll
remember, and uses backslash s after that, because that means a universal white space sequence, or a
single white space character I should say, so that's a space or a tab. I think most of these when they're
returned consist of one tab, but I'm not quite sure, so I just did this to be safe. The second
argument to sub is simply an empty string. What it's saying is: for the line that
comes in, the one that begins Title, chop the bit off that says Title and is followed by white
space, remove it entirely, so all that's left is the actual title. Then the next line, line five, is a gsub.
gsub, you'll recall, is a means of doing multiple substitutions on a line, and here it's looking for
single quotes, and if it finds any it will replace them by two single quotes. That's because
with YAML, if you've enclosed a string in quotes, and you're wise to do so, then if you want to embed
single quotes within it they have to be doubled. So that's what it's doing, ready for YAML.
Then finally, on line six, it prints the string title, colon, in lowercase, followed by one
space, followed by the final result of these bits of editing in single quotes, followed by a newline.
It's actually printing $0, which is the entire line that's been matched by
the regular expression. The second rule begins with a regular
expression looking for the host name, and that's doing the same sort of thing except looking for
the name of the particular host in this file that's come back from the form on
the HPR website, and it's doing pretty much the same thing. I won't explain it again because
it's pretty much identical. When that's finished, the result should be that the four lines of
metadata are in the file whose name is in the variable TMP1. Then a bit later on in the
Bash script there's a long line which calls pandoc to do the necessary thing. As for its
arguments, I've printed them all out here in the notes, but I don't really think I should explain
them because I'm not sure anybody's interested. Essentially it's giving pandoc two data files,
called $TMP1 and $EXTRACT, which it's to process, producing some results in a file
called full HTML. Along the way it's told pandoc to include the HPR CSS, which it's grabbed from
the website, so it means that the HTML it produces looks identical to the sort of HTML that
the HPR website generates itself. That took a lot of explanation, but it's really not a very
complex awk script. I thought it might be of interest to see the sort of thing that Awk gets used for,
at least the way I use it, and it also shows an example of using a Bash here doc, which people might
not be that up to speed with. So that's it, that's the end of my show today. All of the
examples I've mentioned are included in the show notes, and there's an ePub version of the notes.
Okay then, bye bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it
really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club,
and it's part of the binary revolution at binrev.com. If you have comments on today's show,
please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under a Creative Commons
Attribution-ShareAlike 3.0 license.