Episode: 2824
Title: HPR2824: Gnu Awk - Part 15
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2824/hpr2824.mp3
Transcribed: 2025-10-19 17:21:30
---
This is HPR episode 2,824 entitled "Gnu Awk - Part 15", and is part of the series Learning Awk. It is hosted by Dave Morris, is about 32 minutes long, and carries an explicit flag. The summary is: Redirection of input and output, part 2.
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello everybody, this is Dave Morris for Hacker Public Radio. It's a nice day and I've got the door open, so you might hear background noises from the birds and stuff. Hopefully nobody in the vicinity is going to start up a lawnmower. Let's see.
So this is Gnu Awk part 15, in the series that b-yeezi and myself are doing. This is the second of a pair of episodes looking at redirection in awk scripts. This one is going to be primarily about the getline command, which is used for explicit input, as opposed to the usual implicit sort that we've seen up to now, and it can include redirection.
Now, the getline command and its uses is quite a complex subject. This show is going to be a bit longer than usual, but there's no way it's going to cover all of the ins and outs of the subject, so I've directed you to the GNU Awk User's Guide for the full details. There are links in the show notes; there are long notes for this particular episode.
So let's start off with a reminder of how awk processes its rules. I think we alluded to this, but maybe didn't go into enough detail about it as we've been going through the series. We're looking today at how you can change the default methods, but I thought it was worthwhile just to look at the standard approach first.
When the awk script reads a line from a file or from standard input, that causes it to go through all of the rules, except for the ones which have BEGIN or END in front of them. The rules are the things that make up the script: there's some sort of a test followed by bits of awk inside curly brackets. If a rule matches then it's going to be run, and that process continues until all of the rules have been checked. So it's entirely possible that multiple rules will match, and if so they will all be executed in the sequence in which they're encountered. It's important to bear in mind that they are actually run in that sequence.
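
The rule cycle described above can be sketched with three rules that have no pattern in front of them, so all of them fire for every record. This is a minimal illustration rather than the script supplied with the show; the input words (alpha, beta, gamma) and the function name are placeholders chosen for this example.

```shell
# Three pattern-less rules: every rule runs, in order, for every input line.
# The input words are hypothetical placeholders.
show_rule_order() {
    printf '%s\n' alpha beta gamma |
    awk '
        { print "R1 ---" }      # rule 1: ignores the data but still runs
        { print "R2 " $0 }      # rule 2: prints the current record
        { print "R3 " $0 }      # rule 3: sees the same record as rule 2
    '
}

show_rule_order
```

Each input line should produce an R1, R2 and R3 line in that order before the next record is read.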
So what I've done for this show is to prepare a very, very simple data file with three lines in it, and a very simple script which runs against it. Just as an aside, there's a thing called lorem ipsum, which is a sort of fake Latin that tends to be used to fill out forms or as placeholder text in blogs and the like. I'm using a command called lorem, and I've mentioned how you can get hold of it if you want to use it. I've noted the shell command line that I used: printf, space, then in double quotes percent s backslash n, space, then a command substitution: dollar, open parenthesis, lorem -w 3, close parenthesis. What that does is run this lorem command and ask it to generate three words. Then there's a redirection, a greater-than sign, to a file called awk15_testdata1, and I've provided this particular file with the show.
The script, which I've shown here and is again downloadable if you want to play with it, is a standalone awk script containing three rules. None of the rules has any pattern in front of it, so there are no tests to be carried out. They're just three rules that will always be obeyed. The first one simply prints the string R1, for rule one, followed by three hyphens, just as a sort of delimiter. The second rule prints R2 followed by the contents of dollar zero. I won't read these out in minute detail because I think you should know how to do this by now. The third rule prints R3 followed by the contents of dollar zero again. So it's really the same as R2 except that it's got a different rule number.
The data file contains three nonsense Latin words. I think they're nonsense. Some of them are not, actually, but anyway it doesn't make a lot of sense. I learned Latin at school but I've erased it all from my head since then.
So when you run it, it's not very exciting. It simply prints R1 and three hyphens, then R2 and the first word, voluptatibus, then R3 prints the same word. Then R1 again for the second word, and, living on the edge by trying to pronounce these, quaerat; and the third time round, three hyphens, then sunt. So, wow, gosh.
Basically what it's showing is that each rule is run for each line read from the data file. The first rule doesn't do anything at all with the data, but it's still going to be triggered because there are no criteria for triggering it. It's going to happen whatever's been read in, and it's going to happen for every line that comes in. There's nothing to stop any of these rules from running. So that's how the basic thing works. I think you probably knew this, but if you'd asked me before I started writing this particular episode how this works, I'd have probably scratched my head a bit. So I just thought it was worth making it entirely clear.
The getline command, then, is a way of changing how awk reads lines. Normally they're all read one at a time from whatever the data source is, and there's all that stuff about matching patterns, invoking rules, etc. This is different from the way that other programming languages handle input, though some can be coerced into doing things in a similar way. But the way that awk reads its data and processes it is one of its great strengths, I think.
Now, the getline command can be used to read lines explicitly, outside the usual read, pattern match, action cycle. So here's its use in the simplest way. If it's used on its own with no arguments, it just reads in the next line and splits it up into fields in the normal way. If you use it on the normal input, it affects how the data is read and how rules are executed. getline returns a flag: if it finds a record it returns one, and if it encounters the end of file it returns zero.
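
A bare getline inside a rule, together with the flag it returns, might look like the following. It is a sketch under assumptions, not one of the scripts supplied with the show: the input words and the variable name rc are invented for the illustration.

```shell
# A bare getline in rule 2 replaces the current record before rule 3 runs.
# Input words and the variable name rc are hypothetical.
getline_in_rule() {
    printf '%s\n' alpha beta gamma |
    awk '
        { print "R1 ---" }
        {
            print "R2 " $0
            rc = getline                    # 1 = record read, 0 = end of input
            print "getline returned " rc
        }
        { print "R3 " $0 }                  # prints whatever is in the record now
    '
}

getline_in_rule
```

With three input lines, the second pass through the rules reaches the end of the input, so getline returns 0 and the record is left as it was.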
If there's an error while it's reading it returns minus one, and sets a variable called ERRNO, in gawk, which contains a description of what went wrong.
So I've given you another script which is basically the same as the first one. It's called awk15_ex2.awk, and it has the same three rules; the only difference is that rule two also contains a getline. If we run that script against the same data we've already used, you get a different output. For the first line R1 is triggered, so you get the three hyphens. R2 is triggered and you see the first word of the file, which is this voluptatibus. But then the getline is invoked, and that goes and gets the second line out of the file. R3 is then triggered because it's the next one in sequence, and it simply prints out that line. So the getline has caused the normal sequence of reading to change. Then on the next iteration: R1, three hyphens; R2 shows the last line of the file, sunt; and the getline will not get anything back. So dollar zero, which is printed by R3, is no different from what it was for R2, and the script simply prints the same line again. Hopefully that helps to clarify the effects of getline as against the normal way that awk works.
So I've written a slightly more useful script, or perhaps it's not all that useful, but a script anyway, which demonstrates a thing that might be more useful, though it needs work to make it generic. What we've got here is a file of text, another one of these files of lorem text, where I've simply written out a number of lines, then split the lines and put a hyphen at the end of the first one. So there are actually six lines in the file and they're in pairs, the first of which has a hyphen as the last character. What this is meant to signify is that it's a continued line, and you want the script to stick the pieces together. The script detects that a line finishes with a hyphen and then it concatenates them.
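
One way such a joining script could be written is sketched below. It assumes the hyphen is a separate last field, with a space in front of it; the sample text, the function name and the handling of a missing partner line are assumptions made for this example, so it is not necessarily how the script supplied with the show does it.

```shell
# Join a line ending in a lone hyphen with the line that follows it.
# Sample text is made up; the hyphen must be a separate last field.
join_continued() {
    printf '%s\n' 'lorem ipsum -' 'dolor sit' 'amet -' 'consectetur' 'plain line' |
    awk '
        $NF == "-" {
            sub(/ *-$/, "")             # drop the trailing hyphen
            line = $0                   # remember the first half
            if ((getline) > 0)          # refill $0 with the following line
                print line " " $0
            else
                print line              # no partner line: print what we have
            next
        }
        { print }                       # ordinary lines pass through unchanged
    '
}

join_continued
```

A hyphen attached directly to the preceding word would not be detected by the `$NF == "-"` test, which is the limitation of this simple approach.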
You can see from running it what it's produced. I won't go into the detail of what's in here, but in general, if the last field of a line is a hyphen then that hyphen is deleted and the line is saved in a variable called line. A getline call then refills dollar zero, and that is printed preceded by the saved line. That's how you join two lines together. If there was a line without a hyphen on the end, which is entirely possible, then it would just be printed. I didn't actually put one in this example. I should have done, but I'll let you play with that. Like I said, this is a very simplistic script. It doesn't cater for errors in the way the data is laid out: if you put a hyphen on as the last element but have not left a space in front of it, so it's attached to the previous word, then this algorithm won't spot it, and really it should if you were trying to make it into something actually useful. There's quite a sophisticated example in the GNU Awk User's Guide, and I've given a link to it, section 4.10.1, where something vaguely similar is being done in a more elegant and resilient way.
getline can be followed by the name of a variable, in which case the record is read from the main input stream into that variable. The record is not split into fields under these circumstances, and variables like NF, the number of fields, are not changed, because the field splitting process has not been invoked. However, since the main input stream is being read, things like NR, the number of records, will be changed because these are being counted by awk. I haven't gone into great detail about the side effects of this. You can find more about it in the manual.
There's also the possibility of reading from a file, not too dissimilar from the way print and printf work, as we saw in the last episode. You would write getline, then a less-than sign and the name of a file. The name of the file has to be a string expression or a variable, and the expression representing the file can also be used to close that file. So there's a little snippet here which sets a variable input to some variable path, a slash in double quotes, and a variable filename. The assumption is that path and filename are the two bits that get you to a particular file, and you put a slash between them if you're on a Unix system. Then getline, less-than sign, input will open that file and read from it, and once that's happened you can type close with input in parentheses and it will close that very file. Using variables for this is extremely wise, because otherwise you'd have to rely on your ability to type exactly the same string twice. Sorry about the noises off.
Okay, so you can also of course read from a file into a variable. It's reading one line at a time, as we said, so you can read from that file into a variable.
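
The two file-reading forms can be sketched side by side. This assumes a throwaway two-line file created just for the demonstration; the file contents, the temporary directory and the function name are all invented here, while the variable names input, path and filename follow the snippet described above.

```shell
# getline < file fills $0 and NF; getline var < file fills only the variable.
# The temporary file and its contents are hypothetical.
read_from_file() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' 'apple red' 'banana yellow' > "$dir/fruit.txt"
    awk -v path="$dir" -v filename="fruit.txt" '
        BEGIN {
            input = path "/" filename
            if ((getline < input) > 0)          # sets $0 and NF, not NR
                print "fields: " NF ", first: " $1
            if ((getline line < input) > 0)     # sets only the variable
                print "line: " line
            close(input)                        # same expression closes the file
        }
    '
    rm -r "$dir"
}

read_from_file
```

Holding the name in one variable means the open and the close are guaranteed to refer to the same file.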
I've got an example, awk15_ex4.awk, which consists of a script that reads from fruit_names, the file that we created in the previous episode. These two episodes were actually one originally, so they refer to one another a bit. It's all done in the BEGIN rule; there are no other rules in this script, and what it's doing is just reading in the file and printing it out. I did add another fiddly bit into it. If you look at it, it's checking a variable called ARGC, all in capitals. We need that to be two, because the count includes the name of the command as the first element, and when the script is invoked we need it to have an argument referring to the file you want to process. So it checks to see if it is two, and if it's not it prints "needs a file name argument", sends it to stderr, the standard error output, and exits. I just put that in because I thought it would be useful to show how you can do that type of thing. Then the actual data file is picked up from the array ARGV, in capitals, square brackets, one, so that's the first argument. Did I say? Yeah, there need to be two elements in it, but they're addressed as zero and one. I think I didn't make that clear enough.
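
The argument check can be sketched on its own as follows. This is an illustration rather than the script from the show: the message wording, the exit status of 1, the shell variable prog and the function check_args are assumptions, and the sketch only reports the file name instead of reading it.

```shell
# Insist on exactly one file name argument before doing anything else.
# prog and check_args are hypothetical names for this sketch.
prog='
BEGIN {
    # ARGV[0] is the command name, so one file argument makes ARGC equal 2
    if (ARGC != 2) {
        print "Needs a file name argument" > "/dev/stderr"
        exit 1
    }
    data = ARGV[1]
    print "would read from " data
}'

check_args() { awk "$prog" "$@"; }

check_args fruit_names                          # argument present
check_args || echo "exit status was $?"         # argument missing
```

Because the program consists only of a BEGIN rule, awk does not try to open the named file itself, so the file need not exist for this check to run.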
So we have a while loop, and in the while loop's parentheses we have getline line, less-than, data. data has got the name of the file, so it's going to be reading from that file. After the parenthesised getline with its various arguments we have greater than zero, so we're looking to see whether the value that comes back from getline is one or zero, because when it's zero there's no more data; it's reached the end of the file. The loop has just one command that it invokes, which is print line. So getline reads into a variable called line and it's simply printed out. After that while loop there's a close command which, in parentheses, uses the variable data, so it closes that file. So it's very, very trivial; it simply reads the file and prints it. As a seasoned awk user you will be aware that you could simply have written this on the command line as awk, single quote, open curly bracket, print, close curly bracket, close single quote, space, fruit_names, and it would have done exactly the same thing without anywhere near as much fuss. But this was for demonstration purposes.
Next, the getline facility gets a bit more sophisticated, and you can read from a pipe in awk. The way you do this is to provide a command, a vertical bar and getline, or command, vertical bar, getline and then the name of a variable. What happens is that awk runs the command as a subprocess and getline gets lines from that command, and either does the usual field splitting or stores the line in a variable.
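
Both pipe forms can be tried with a harmless command. In this sketch the command is a pair of echo calls standing in for whatever program you actually want to run; the command text and the function name are assumptions for the example.

```shell
# cmd | getline splits the line into fields; cmd | getline var does not.
# The echo command is a hypothetical stand-in for a real program.
read_from_pipe() {
    awk '
        BEGIN {
            cmd = "echo alpha beta; echo gamma delta"
            if ((cmd | getline) > 0)            # first output line goes to $0 and is split
                print "fields: " NF ", last: " $NF
            if ((cmd | getline line) > 0)       # next line goes to a variable, unsplit
                print "line: " line
            close(cmd)                          # the command string identifies the pipe
        }
    '
}

read_from_pipe
```

The subprocess is started by the first getline and keeps supplying lines to later calls until it is closed or runs out of output.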
So awk15_ex5.awk is a simple awk script, and it's all within a BEGIN rule. The command it runs is stored in a variable called cmd, and it's a wget command, so you need to have wget installed on your Linux system, or indeed a BSD system if you wish. It's wget, space, hyphen lower-case q, then a URL, which is the Hacker Public Radio stats page (I won't read it out; it's here in the notes), then hyphen capital O, which means output to, followed by a file name which is simply a hyphen, in which case it means output to the standard output channel. That's all in double quotes, so it's a string for awk. Then there's a while loop which does a similar thing to before. Inside the parentheses of the test that's done on each iteration of the loop, it's got, in parentheses, cmd, vertical bar, getline, and then we compare the result of that to zero. We want it to be greater than zero, because once the output ends getline will return zero, which means stop. Inside the loop, which has a body with curly brackets enclosing it because it's a bit more complicated than the previous while loop, we've got an if statement. It's testing dollar zero, then a tilde, meaning compare this with a regular expression, and the regular expression is caret, that up-arrow thing, then "Shows in queue:". So we're looking for a line that begins with "Shows in queue:". If that matches then we printf (we're using printf, did I say that?) "Queued shows on HPR: percent d backslash n", and we print out field number four. Once the loop has completed we close the pipe, which we do by giving close the command that we set up earlier in the variable cmd. So the statistics page gives you a number of lines which contain various attributes of the current state of HPR. One of them is the number of shows in the queue, and what this does is pick out just that particular piece of text. So when you run it, and I've just run it in real time, it comes back and says "Queued shows on HPR: 27", because there are 27 in the queue just at this precise moment, which is the 23rd of April.
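
The loop described above can be tried without a network connection by swapping the wget command for something local. In this sketch the command is a few echo calls producing made-up lines; only the "Shows in queue:" prefix and the use of field four come from the description, and everything else, including the other lines and the function name, is assumed.

```shell
# Scan the output of a command for one line and print its fourth field.
# The echo calls are a hypothetical stand-in for the wget command.
queued_shows() {
    awk '
        BEGIN {
            cmd = "echo Some other line: 1; echo Shows in queue: 27; echo Last line: 2"
            while ((cmd | getline) > 0) {       # stops when getline returns 0
                if ($0 ~ /^Shows in queue:/)
                    printf "Queued shows on HPR: %d\n", $4
            }
            close(cmd)
        }
    '
}

queued_shows
```

To run it against a real page, cmd would be set to a wget command of the form described, with the URL taken from the show notes.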
I did another example which is essentially the same but uses a slightly different approach. This is awk15_ex6, where we're using getline with a named variable to store the line. The command is the same. There's a while loop, and what the while loop does is simply get lines from the server, and it doesn't do anything at all with them. It simply gets them one at a time until they've all been collected and then the connection is shut down. But what that means is that the last line that came back is still stored in the variable line. So we use split to chop that up into an array called fields, using a comma as the delimiter. Then we can print, with "Queued shows on HPR: percent d backslash n" as the format spec for printf, fields square bracket 10. The tenth element of this last line, which is a comma-separated line, contains the number of shows that are in the queue, so you get back the same answer, 27. It's just to demonstrate that there's a different way of doing it.
The last thing I want to say about getline is that awk, or rather GNU Awk, since some of the other awk variants don't offer this, provides the capability of accessing a co-process. A co-process is a subprocess, but one that can be both written to and read from. In the context of the print and printf commands, we can send data to the co-process with the sequence vertical bar ampersand as an operator; not just a plain pipe, but with an ampersand after it. I already mentioned this in the last show, number 14.
Not too surprisingly, you can use getline to read data back using the same operator. You can bring it back into the fields or you can put it in a variable. I'm not going to go into a lot of depth; this is quite advanced, and there's a lot of information about it in the GNU Awk User's Guide. There's a section on getline and co-processes, and there's the whole subject of two-way I/O. You can write some quite sophisticated stuff using this.
So I've written a simple thing which I've called awk15_ex7, and it demonstrates something you could do with this feature. In this particular example I've got an SQLite database, which I haven't provided for download. It's a copy of one that I used to keep track of the HPR episodes on the Internet Archive. This is going to be added to the next database design, but it was a standalone database, and for the purposes of this example it's called awktest.db. The way that you talk to the database is by sending it commands in Structured Query Language. I have mentioned this in other shows, so you might be aware of it. The essence of what we're going to do here is to send it a command which consists of select, which is the verb used in SQL that lets you get data out of a database. It's select, space, then id, comma, title, two fields of the database that I have defined: id is the show number, title is the show title. from is the next part of the sequence, and episodes is the name of the table. Then follow that with where id equals, and then some placeholder, and a semicolon. We don't actually type the value in this particular case; we're going to use a printf to generate it. So whatever goes in that placeholder, you'll get back the answer in the form of the show number and the title for a given HPR show.
What we have in this script is two rules. There's a BEGIN rule where we're declaring things. We're declaring a variable called db, which is set to awktest.db, the name of the file holding the little database. Then a variable cmd holds the command, which is sqlite3. That's the command you use on the command line, which must be followed by the name of a database, which you can then either use interactively or feed commands to through that route. The third variable is called querytpl. I tend to use tpl to mean template, and it's a string; it's actually a format template for printf. That select id, title from episodes where id equals thing I mentioned before is in it, and the placeholder is percent d, followed by a semicolon and backslash n. So that's the BEGIN rule, and it sets these variables up.
Then the script wants to read numbers, and these numbers will be the show numbers that it's to interrogate the database for. So the test that we're using for this rule is that dollar zero, the entire line, matches a regular expression which consists of the digits nought to nine, one or more times, with nothing else on the line: it starts at the start of the line and the line ends after the last digit. It could have been more sophisticated and allowed spaces around the number, but I didn't think it was worth the trouble for this demo. This particular rule then uses printf with the format that we already declared, called querytpl, and we feed it dollar zero as the value that's going to go into that SQL command. We send that to the command in the variable cmd, which is running as a co-process, and we do it through a vertical bar and ampersand. The first time it's invoked it will cause the co-process to start up. The co-process will be running SQLite on the database, expecting individual commands to come in, and the first command it gets will be the one generated by this printf. The next line uses the command on the left-hand side, then a vertical bar and an ampersand with getline following it, and getline is followed by the name of a variable, which is result. So: cmd, vertical bar, ampersand, getline, space, result. What that does is talk to the co-process and pull back anything that is produced by that query on the database into the variable result.
The last line is print, space, result, which prints its contents.
There are many ways that this could be run. The simplest one for demonstration purposes is to feed it some numbers in a file, which is what I did. I called it awk15_ex5 data, but I haven't included it in the show because there's no point; it's just a file with three lines in it, and I've included the numbers here, one per line: 2761, 2789 and 2773. So when you run it with this data file it simply returns 2761, HPR Community News for February 2019; 2789, Pacing In Storytelling; 2773, Lead-Acid Battery Maintenance and Calcium Charge Voltage. That's all there is to it. I mean, it looks pretty simple.
The co-process will just keep running until the awk script runs out of data. When the awk script runs out of data it will simply exit, and when it exits the co-process will be killed off by awk. You could, if you wished, do an explicit close on that co-process and that would make it go away. I didn't do that here because it didn't seem to be entirely necessary. So you get some sort of idea of how you could be running a co-process which is just sitting there waiting for stuff to be thrown at it and coming back with answers.
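
The request-and-reply pattern can be sketched without a database. Here the sqlite3 command is replaced by a small shell loop that answers each number it is sent with one line, so the example runs anywhere gawk is installed; the loop, the reply text, the sample input and the function name are all assumptions, and the request is a bare number rather than an SQL query.

```shell
# Converse with a co-process: write one request, read one reply, repeat.
# Co-processes are a gawk extension. The shell loop is a hypothetical
# stand-in for the sqlite3 command, and the reply text is made up.
lookup_shows() {
    printf '%s\n' 2761 abc 2789 |
    gawk '
        BEGIN {
            cmd = "while read -r id; do echo \"$id|title of show $id\"; done"
        }
        /^[0-9]+$/ {                        # only lines that are purely digits
            printf "%d\n", $0 |& cmd        # send the request to the co-process
            cmd |& getline result           # read one line of reply
            print result
        }
        END { close(cmd) }                  # explicit close; optional as noted above
    '
}

if command -v gawk >/dev/null 2>&1; then
    lookup_shows
fi
```

This only works when the co-process replies promptly to every request; a program that buffers its output could leave the getline waiting.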
And you can write a script which will converse with it. Okay, that's all I'm going to say about getline in this particular show.
I'm going to finish off with a finale, which is pretty much an announcement. There's a lot more that could be said about this subject of redirection of input and output, as well as about co-processes as we said, and there are many more subjects within GNU Awk that could be examined, but we feel that now's the time to bring this series to an end. b-yeezi and I feel that the areas of GNU Awk that we've not covered in this series might be left for you to investigate further if you have the need. We both feel that awk is a very useful tool in many respects, but doesn't stand comparison with more advanced scripting languages such as Python, Ruby and Perl. Perl in particular borrowed many ideas from awk and has extended them considerably over the years. Ruby was designed with Perl in mind, and it's probably done some things as a language better than Perl. And Python, which came at the subject from a different angle, has innovated enormously and is an extremely widely used language. There are others which I won't go into; that's just to give you a flavour of the fact that there are many other languages which are good for text processing other than awk. Although GNU Awk has advanced considerably since it was created, I think it shows its age quite a lot, and its usefulness is a bit limited now. There are cases where quite complex scripts might be written in awk, but most people tend to use it as part of a pipeline or inside shell scripts of various sorts, whereas you might write a complex script in Perl, Python or Ruby, for example. Taking on a large project solely in awk seems like a pretty bad choice today.
So before we wind up this series it's planned to produce one more episode, number 16, and in it b-yeezi and I will record a show together. Exactly how I'm not sure; Mumble perhaps, or something more sophisticated. At the time of recording there's no timescale, though we don't want to let it sit for too long, and we'll endeavour to do this as soon as our schedules allow. We really wanted to review what has got us here and give a bit more information about why we feel it's not worth carrying on any further with the series, and just give you our two different views on what we've been doing over these years. We've been doing this for a couple of years, a bit more; I haven't got the dates to hand. But anyway, that's the plan. So I hope you've enjoyed the series as a whole and have found it useful. Okay, that's it. Bye bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.