Episode: 2526
Title: HPR2526: Gnu Awk - Part 10
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2526/hpr2526.mp3
Transcribed: 2025-10-19 04:48:27

---

This is HPR episode 2,526, entitled Gnu Awk Part 10, and is part of the series Learning Awk.
It is hosted by Dave Morris, is about 42 minutes long, and carries an explicit flag.
The summary is: more about arrays in Gnu Awk.
This episode of HPR is brought to you by archive.org.
Support universal access to all knowledge by heading over to archive.org/donate.
Hello everybody. This is Dave Morris. Welcome to Hacker Public Radio. Today I'm going
to talk about Awk, Gnu Awk. This is show number 10 in the series, which b-yeezi and
myself have been producing. In this episode I want to talk more about the use of arrays in Awk,
because arrays are pretty cool and important to understand. I also want to talk about some
real-world examples of the use of Awk. So far we've been looking at trivial examples
that demonstrate the point, but they're not really related to how you'd really use
it. So I really wanted to see if I could add a few of these from time to time, just so
you can see how it can be used. So let's just quickly recap the whole array business.
We know that arrays in Awk are associative; that is, the index used to refer to an element
is a string. The contents as well can be a string or a number or whatever. It can indeed
be nothing. You can have empty array elements, which I guess you would regard as containing
empty strings. Remember that an associative array is also called a hash. You might see the
same term used in other contexts. The array index, the thing you use to index a specific element,
is also referred to as a subscript. I guess that's more mathematical. We know that array
elements are referred to by the name of the array and then the index expression in square
brackets. So for example, if you had an array called fruit and there was an element indexed
by apple, you would type fruit, lowercase, and then in square brackets the string apple.
Remember a string in Awk is enclosed in double quotes. So that means the element of the array
fruit which is indexed by the string apple. Now the index value is actually an expression,
so it can be of any level of complexity. So I've got a little example in the notes: two variables,
in1 and in2. The first one is set to the string A-P-P. The second one is set to the string
L-E. And if you did print fruit, square bracket, in1, then a space, in2, close square bracket,
the two strings are concatenated. You just have to put string variables next to one another
and they get concatenated; they make the string apple. So it would index and print the contents.
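
As a minimal sketch of the indexing just described (not the listing from the show notes): the array contents are invented for illustration, and the variable names in1 and in2 are written as heard in the recording, so their exact spelling is an assumption.

```shell
# fruit["apple"] holds an invented value; in1/in2 are spelled as heard in the audio
out=$(awk 'BEGIN {
    fruit["apple"] = "green"    # element indexed by the string "apple"
    in1 = "app"
    in2 = "le"
    # Adjacent strings concatenate, so the index expression evaluates to "apple"
    print fruit[in1 in2]
}')
printf '%s\n' "$out"
```

Because the subscript is an ordinary expression, any string expression that evaluates to "apple" reaches the same element.
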
We saw earlier in the series as well that you can check whether an array element
exists by asking whether the index exists. So the expression is the index, whatever that happens to
be, whether it's a constant or a variable, then the word in, and then the name of the array. So for
example you might type if, open parenthesis, then in double quotes apple (that's a
constant string), space, in, space, fruit, close parenthesis, and then the thing that depends on
that if would be print fruit with an index of apple, or perhaps you might say yes, it exists, or
whatever. You'll also remember that looping through the elements of an array is achieved with a special
version of the for statement, where you type for, space, open parenthesis, n, or whatever the index name is,
in fruit, for example. So it's another case of the in expression. So n in fruit,
close parenthesis, and then you could do print fruit, square bracket, n, where n is a
variable name which is being given successive values of the indexes. Now the order in which you get
these is arbitrary. I think it's constant every time you run it for a given Awk, but if you run
Awk on one system, then on another, particularly if they're different versions of Linux or Unix and
stuff, then there's no guarantee that you're going to get the same thing. There's no guarantee, full
stop, that you're going to get these things in order. We'll look at this in another episode,
because Gnu Awk gives you better capabilities in this area, but that's not a thing for today.
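
The membership test and the array-scanning loop can be sketched as follows. The element names and values are made up for illustration; the loop only counts elements, so the result does not depend on the arbitrary visiting order.

```shell
out=$(awk 'BEGIN {
    fruit["apple"] = 1; fruit["pear"] = 2; fruit["plum"] = 3

    # The in test asks whether an index exists without creating it
    if ("apple" in fruit) print "apple exists"
    if (!("kiwi" in fruit)) print "kiwi does not exist"

    # for (n in fruit) gives n each index once, in no guaranteed order
    for (n in fruit) count++
    print count " elements visited"
}')
printf '%s\n' "$out"
```
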
So quite often you might want to use a number as the index of an array. Remember, the subscripts
are always strings in Awk. If you typed a number as an index,
then it will be automatically converted into a string. So suppose you typed something like the
array name data, square brackets, 42, close square brackets, equals some number. I quite like 8388607,
which is 2 to the 23 minus 1. I used to work on 24-bit machines years ago and somehow that number
stuck in my head. So the integer number 42, in this case, is converted to the string 42. Surprise,
surprise. And it just works. You could either type it with quotes around it or not. But
Awk can handle other number bases. So it will handle octal, following the standard Unix convention,
which personally I've never fully understood, where a leading zero denotes that the number is an
octal number. So if you typed data, square brackets, 052, then it means the same as data,
square brackets, 42, because decimal 42 is octal 52, or five two would be a better way of saying it.
So that could be confusing. You need to be careful about that, which is why I'm telling you this.
So you don't accidentally trip over things. You can also type hexadecimal numbers. So
the array data, square brackets, 0x2A (it doesn't matter what case, but I've used uppercase for the A),
that is the hexadecimal equivalent of 42. So you would get the same element. Now the way in which
numbers are converted into strings in Awk is important to understand. There is a built-in variable
called CONVFMT, C-O-N-V-F-M-T, all in uppercase, and it defines the conversion for floating point numbers.
Integers are just converted as they are. Behind the scenes, there's a function called sprintf,
which is used to convert numbers into strings. sprintf is like printf, unsurprisingly, but instead of
printing anything, it converts a number into a string using a formatting specification. You'll
remember we saw printf in episode nine. So by default, this, I'm going to call it conv-format,
is the string percent dot six g. What this means is numbers
would be printed using the g format, which means, if you look at the manual, print a number in either
scientific notation or in floating point notation, whichever uses fewer characters. So that's fun.
The number six here is the precision: it aims to format the number with six significant digits, plus the decimal point.
You can change the setting of this conv-format thing in your own script if you wish, so
you can change the way this behaves. I'll come on to this in a moment. So knowing that this is what's
happening, I've got a little example script, a whole script you could type on the command line,
where there's a BEGIN rule, and in it a variable called x is set to the expression 100 divided by
three, and then that variable is used to index an array called data. We set data, square brackets,
x to a value. I don't know what made me think of custard when I was typing this, but that's what I did.
Then we print x, and then we print the contents of data, square bracket, x. So what we
get from dividing our 100 by three is 33.3333. So that's seven characters long, and we get
the string custard from that print statement. So what that means is the index to the data array
is the string 33.3333. Things get more weird if you change that example, and instead of
dividing 100 by three, you divide, well, that'd be a thousand million, what is that called?
It has a special name, I can't remember, but it's one with nine zeros after it. Divide that by
three. I chose three because it produces a number full of threes. Set data x to some value, and then
do the same again: print x, and print the contents of the array element indexed by x. Well, what we get
for x is 3.33333e+08. So that's scientific notation, which means 3.33333 times 10 to
the eight. So that's a pretty weird index to an array. It works fine, but it is rather weird.
And there's nothing wrong with that. But if you're in the habit of fiddling around with this
format spec, CONVFMT, then if it's set to one thing when you store something and it's set to
something else when you retrieve it, you're going to have problems retrieving it, or could have.
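
A small sketch of the behaviour described above, not the listing from the show notes: the value stored under 42 and the replacement format "%.2f" are arbitrary choices made for illustration.

```shell
out=$(awk 'BEGIN {
    data[42] = "answer"
    print data["42"]          # 42 and "42" name the same element

    x = 100 / 3
    data[x] = "custard"       # index is x converted with CONVFMT ("%.6g")
    s = x ""                  # the same conversion, made visible
    print s, data[s]

    CONVFMT = "%.2f"          # x now converts to a different string
    if (x in data)
        print "found"
    else
        print "not found"
}')
printf '%s\n' "$out"
```

The last test fails to find the element because x is converted afresh with the new CONVFMT, giving a string that was never used as a subscript.
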
So being aware of that is important, I think. So next thing then, what about if the variable
you're using as a subscript is not initialized? Remember, an uninitialized variable in Awk,
when you treat it as a number, is regarded as being zero. But if you're treating it as a string,
then it's regarded as being the null string, an empty string. So I've made a little example,
which is downloadable from the show, and which I've also listed in the notes, the long notes.
What this does is slurp some data from standard in into an array, and then print it back out
again. I made this an executable script by the way, so it starts
with hash, bang, /usr/bin/awk -f. We've seen that before. I'm just echoing a string into it,
so it can work on it. But I actually wanted to give it three lines. So I used echo,
hyphen e, and then a string of capital A, newline, capital B, newline, capital C, and the hyphen
e form of echo lets you put these escape sequences in there, so I've used backslash n.
So I'm echoing that into this particular script that's being invoked. Now, as the script
reads stuff, it stores it, I should say, using a variable l into an array called a.
I wasn't very original with names, but it just saved some typing. So a, square bracket, l, equals
$0 is the first line of the main rule. Now l will be uninitialized the first time it's used. So
l is going to be converted to a string, and since it's not being treated as a number,
it will produce a null string. That's a perfectly valid array index. Then the next line
is l plus plus, so l is incremented. So now it takes a numeric value, and in that context
it will be set to 1, because the thing that's computing the addition will treat its
uninitialized state as 0 and add 1 to it. The third line is the print statement,
which prints out NR. Remember, NR is the built-in variable that counts the number of records
that your Awk script is reading. It prints that out, followed by a space, followed by the contents
of that line. So you'll see in the example I was mentioning earlier that you see lines 1, 2
and 3, and the contents are A, B and C. Now the END rule for this little script consists of
a print statement, which prints numeric subscripts as a heading. Then it uses a for loop, where
i is the control variable of the for loop. It's set to l, which will be the maximum number that
was counted as the main rule was looping, minus 1, because it would have been added to after
the last line was read, and it's counting backwards to 0. So it prints out the value of i, a colon,
a space, and the contents of the array a with the index i. So when you look at what that produces,
it produces, under the title numeric subscripts: element 2 contains C,
element 1 contains B, element 0 contains nothing. We can't see A; A's vanished. I think you're already
ahead of me; you probably understand why. Then in the END rule there's another bit of code, which
consists of a print statement which reports actual subscripts as a heading. And that's followed by
another for loop using the index in array format. So it's for, space, open parenthesis, i in a,
close parenthesis. Then the thing that depends on that is print i, the contents of i, and a colon
and a space, and then a, square brackets, i. So when that's run, it's going to walk
through all of the subscripts of this array. The first thing we see is a line which consists
simply of a colon, a space, and the letter A. So in other words, the first subscript, the first index
that we get back, is a null string, and that's the element that contains the very value A that was
read in. Then we get zero, colon, followed by nothing. Then we get one, colon, B; two, colon, C. Why did we
have an element zero? Well, the previous loop, when it was looking for it, the
process of looking actually created it, and created it empty. So there's a lot of
potential confusion, which is why I'm going through this explanation. This is also covered in the
Gnu Awk manual. Now the two lines that consist of a, square brackets, l, equals dollar zero,
followed by l plus plus: if that had been replaced by a, square brackets, l plus plus,
close square brackets, equals dollar zero, we wouldn't have had this problem. That's because
l is uninitialized, but l plus plus means return the current value of l, and then increment it.
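
The two variants just described can be compared side by side. This is a sketch rather than the downloadable script: the array and counter are written lowercase as a and l (an assumption about the original spelling), the END rules are reduced to a few checks, and printf stands in for echo -e to feed in the three lines.

```shell
# Variant 1: a[l] = $0; l++   (the first line lands under the empty string)
bad=$(printf 'A\nB\nC\n' | awk '
    { a[l] = $0; l++ }
    END {
        if ("" in a) print "empty-string index holds " a[""]
        if (!(0 in a)) print "no element 0"
    }')

# Variant 2: a[l++] = $0      (post-increment yields a numeric 0 first)
good=$(printf 'A\nB\nC\n' | awk '
    { a[l++] = $0 }
    END { print a[0] a[1] a[2] }')

printf '%s\n%s\n' "$bad" "$good"
```
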
So it's a post-increment, it's called. Because this is being done in an arithmetic context,
it forces the first value returned to be zero, and then it increments it to one. So you would find
that the letter capital A would have been stored in element zero, B in one, C in two, as would have
been intended, as opposed to what we got in the previous example, where things got a little bit
confusing. Okay, now another couple of things about arrays. First of all, you can delete array elements.
There's actually a statement, delete, which is followed by the name of the array, and then in
square brackets, an index. So you're deleting an individual element of an array. You could have
tidied up the messy example before by simply typing delete, space, a, square bracket, zero. It would
have got rid of the empty element, should you have wished to. You can also delete an entire array
with the statement delete and then the name of the array. The array doesn't vanish, it just
becomes empty. So that name is still declared, defined, in your script. So if you then went and
tried to make a simple variable, an ordinary variable, a scalar variable, so say
your array's called a and you typed a equals 42, then your script would fail with an
error message, because a already exists as an empty array. The other thing I thought would be useful
to know about at this stage was the ability to take a string and chop it up and put the bits in an array.
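
Before moving on to string splitting, here is a brief sketch of the two forms of delete described above. The array contents mimic the leftover state of the earlier example and are otherwise arbitrary.

```shell
out=$(awk 'BEGIN {
    a[0] = ""; a[1] = "B"; a[2] = "C"

    delete a[0]                  # remove a single element
    n = 0; for (i in a) n++
    print n " elements after delete a[0]"

    delete a                     # empty the whole array
    n = 0; for (i in a) n++
    print n " elements after delete a"
    # a is still an array at this point, so a scalar
    # assignment such as a = 42 would be rejected
}')
printf '%s\n' "$out"
```
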
I haven't got an example of this being done, but I will come up with one in a future
episode when I talk a bit more about this, I think. There are actually two functions in Awk
which can generate arrays from strings by chopping. The functions are called split and patsplit.
We're going to just look at split in this episode. We'll come back and look at arrays later
in a further episode. So the general format of the split function is split, that's the name of the function,
open parenthesis, then there's a string, comma, and then the name of an array. By the way, the string
can be a constant string, a thing in quotes, or it can be the name of a variable holding a string.
The array is the name of an array. Then optionally after that will be an argument that's referred to as
fieldsep, comma fieldsep. That is a regular expression that's used
as the thing that is to be used for chopping the string up. Then there's a further optional
argument, which is denoted as seps. It's another array, which is to hold the separators that
are found as split is chopping the string up. The seps bit, the fourth argument, is a
Gnu Awk extension. Other Awk versions will not have it. So when this splitting is done, successive
pieces are placed in the nominated array in elements 1, 2, and so forth. It's important to be
aware that the array is emptied before the splitting begins, so you can't use split to add to an array.
If you give the name of the array, it's always initialized to empty and then is filled up.
If you don't provide the separator, then the built-in variable FS is used. We've talked a lot about
the field separator variable. So it's doing something very similar to what Awk does all by itself
when reading data from a file or from standard in. If you provide it, it's a regular expression,
which FS is as well, and it can be a regular expression constant, or it can be a variable containing
a regular expression. There's an example coming up in a moment. The way that the separators are stored
is interesting. If you use a field separator of a single space, then split takes any leading
spaces and puts them into the separators array element 0, and any trailing white space on the line
is put into separator array element n, where n is the number of elements in the array which is capturing
the chopped-up bits. When split is run, it returns the number of pieces placed in the array.
That's quite a powerful thing. It's why I thought it was worth talking about it here.
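
A minimal sketch of the three-argument form described above. The input string and the array name flds are chosen for illustration; the Gnu Awk-only fourth argument is left out so that the sketch should also run on other Awk versions.

```shell
out=$(awk 'BEGIN {
    # Separator: a comma with any number of spaces on either side.
    # In Gnu Awk a fourth argument (for example seps) would also
    # receive the separator strings that were found.
    n = split("D,   E   ,F", flds, / *, */)
    print n " pieces"
    for (j = 1; j <= n; j++)
        printf "|%s|\n", flds[j]
}')
printf '%s\n' "$out"
```

The vertical bars show that the spaces were consumed as part of the separator rather than left on the pieces.
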
So I've provided a simple example, which again you can download, and it's an executable script.
What this one does is pretty simple. You've seen this sort of thing before.
It will be fed a series of lines, and it will simply put each line into an array.
So the main rule consists of one statement.
The array that we're using is called lines, and we're using NR as the index.
So lines, square brackets, NR, capital N-R (you remember NR is the record number of the current
record), and we set that to dollar zero. So the array gets filled up with the lines that are
presented to the script. Then the END rule consists of a loop within a loop.
So the first loop is an array-scanning loop, which is for, space, open parenthesis, i in lines,
close parenthesis, and then an open curly brace, because the body of this loop is going to be
multi-line. So we've got the line in, we're going to split it, and the split statement says
split. Now the thing we're going to split is lines, square brackets, i, so the
current element that we're looking at in the array, which is one of the lines that came in
in the first instance. And we're going to store it in an array called fields, which
I've done with no vowels, F-L-D-S, and then that's followed by a regular expression constant,
which is one of those things enclosed in slashes, and I've got space, asterisk, comma, space, asterisk.
So what that means is a comma with any number of spaces before it and after it is the separator.
We have another argument, which is comma seps. I've got an array there which I've called
seps, with no originality at all, into which the separators are going to be put, and then we close
the parenthesis. So at that point in the loop we've got the line, we've chopped it up,
and it's in this array called flds. So the next thing is another loop, and this one is for,
open parenthesis, j in flds, close parenthesis. This one's just got the one line depending on it.
So that line consists of a printf, and the format for the printf is a string
which contains a percent s with a vertical bar either side of it. It's just to demonstrate that
there are no spaces before or after the thing. Then there's a space, and then in parentheses,
so in the format thing here, we've got another percent s. So there are two strings to be printed out,
one with vertical bars either side and one with parentheses around it. Backslash n for a newline,
because you need that in printf, comma, flds, square brackets, j, close square bracket. So that prints
out the currently indexed element of the flds array. seps is also printed out. That's the second
string, and that's also indexed by j. I'm not indexing the zeroth element in this case,
because, well, it's probably going a bit too far. So there's an example here of what happens
when it's run. We have an echo minus e, hyphen e, I should say, and we're doing another ABC thing,
so using the echo thing that can actually put newlines in a string. We've got A, comma, B, comma, C,
no spaces, backslash n, D, comma, lots of spaces, E, lots of spaces, comma, F, close the string,
and then that's fed into this script. I'm not reading the names of these scripts, because they're
not really very meaningful. They're just there as a handle for you to download, if you're so
interested. So what you see is, we've just got letters A to F, and they are printed one per line,
no spaces around them, because we were using spaces as part of the delimiter. The first one's
separator is a single comma, the second is a single comma, and for the third one, because it's just
before a newline, there's no separator at all. Then for the D, we've got a comma and a bunch of
spaces; for the E, we've got a bunch of spaces and a comma; and then for the F, we have no separator,
because that's the newline at the end of the string. So hopefully you can see what split has done,
and the potential uses of split from that. So okay, I'm leaving the array stuff alone for the
moment then. So I'm going to make this episode a bit longer than some of the previous ones,
though we have been as long as 40 minutes in this series, but I wanted to talk
about a couple of real-world examples using Awk. These aren't specifically about arrays,
just some uses for Awk and why you should use it yourself. So the first one is scanning a log
file. I have a script that I wrote, not in Awk, which adds tags and summaries to HPR episodes
that don't have any. I mention this whole project every month on the community news show,
probably boring everybody to death with it, but still. The script gets its updates as email messages,
and as it processes them it keeps a log, and I've put an example of a couple of lines from the
log in the notes here. Each log line begins with the date, followed by the time, and these are
dates and times on the server, which is in California. It's followed by a message in square
brackets, which is the type of record in the log, the relative importance, and then some
text relating to the event that is being logged. One of these is the show that's been worked on,
and whether tags and/or a summary have been added. So what I wanted to do was to
report on the number of tags and summaries that have been processed in the previous month when I
do the notes for the community news show. So what I want to do is scan the log file
for the total for the month. Now when I first did this I used a pipeline with grep and wc and stuff
in it, but this is a really good case for using Awk. So my solution, which I've listed here, is
only 12 lines long, and it's not solely an Awk script; it's something you could put into a Bash script
or type on the command line if you were mad enough to do it. I have been, but I'm not anymore,
but just while you're developing, it's useful to do it. So I wanted to just quickly go through it.
It consists of three rules. There's a BEGIN rule, which is (there are numbers on this stuff,
so I can refer to them) lines two to five, and what happens in the BEGIN rule is that
a variable called re is filled with a regular expression. The regular expression starts with
a circumflex, which is the anchor for the start of the line, which I'm sure you'll remember from
many discussions of regular expressions. This is then followed by a piece of the date,
and the date is generated dynamically using a built-in Awk function called strftime,
"strif-time". I'm not sure how it's meant to be pronounced, but there you go. I won't go into details about
how it does it. I think we should cover this particular function in a little bit more detail
later, but we're generating the current year as a four-digit thing, a slash, the current
month as a two-digit thing, and then a slash. So that means we're looking only for records
which begin with the year and month that we're interested in. So this is then followed by two dots
and a space. Two dots mean two arbitrary characters, which match another
part of the date. Then any number of any characters, followed by a space, and then
one to four numbers and a colon. So, as I said, the two dots match
the day part of the date. The dot asterisk covers the time, I'm not interested in the time,
and it also covers the square bracket info or whatever it is. Then the line
after that has to have a number in it, which will be the show number, followed by a colon.
We're not interested in anything else. The regular expression is actually a string concatenation;
there are three strings there. The other thing in the BEGIN rule is that a variable called
count is initialized to zero. That's just me being a good programmer, I guess. I just feel more
comfortable initializing it. So the main rule is simply comparing, using dollar zero,
tilde, re, the variable re; it's comparing each line with the regular expression. And
if it matches, I should say, then it prints it out, and it prints it out with a number
preceding it, which is the current contents of the count variable, followed by the line itself.
The count is one of the arguments to printf, as is dollar zero, but count is preceded by
two pluses, which means pre-increment it. So it starts at zero, so the first time it's printed
it'll produce a one. And the format specification for it is percent zero
two d. So we'll get a two-digit number with leading zeros. Okay, I won't go into any more detail.
The END rule consists of one statement: print the string additions, followed by the contents of
the variable count, which will be the number of lines that matched. So I show some sample output
for February 2018, where all of the various additions are listed and the number is 23. That
turned out to be, I think, 25 in the end, but this was done at some point during the month.
As I say, it's better to put that stuff in a Bash script, probably the easiest thing. You could
put it in an Awk script as well. Either way would do. But the point is it's
just doing some fancy matching in a file, and Awk is very good at doing this sort of thing.
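
The three rules just walked through can be sketched roughly as follows. This is not the 12-line listing from the notes: the log lines are invented, the year/month prefix is passed in with -v instead of being generated by strftime (so the sketch gives the same result whenever it is run), [0-9]+ stands in for "one to four numbers" for portability, and the exact wording of the final line is an assumption.

```shell
# Sample log lines are invented for illustration; only the overall layout
# (date, time, [type], show number and colon) follows the description.
out=$(printf '%s\n' \
    '2018/02/03 10:15:01 [INFO] 2345: tags added' \
    '2018/01/30 08:00:00 [INFO] 0999: tags added' \
    '2018/02/07 09:00:12 [INFO] 1234: summary added' \
    '2018/02/07 09:00:13 [WARN] message ignored' |
awk -v prefix='2018/02/' '
    BEGIN {
        # Anchor, year/month prefix, two characters for the day, then
        # anything, a space, a show number and a colon
        re = "^" prefix ".. .* [0-9]+:"
        count = 0
    }
    $0 ~ re { printf "%02d %s\n", ++count, $0 }
    END     { print "Additions: " count }')
printf '%s\n' "$out"
```

Only the two February lines that carry a show number are numbered and printed; the January line and the line without a show number are skipped.
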
Okay, the second example is a bit more complex. It's actually a fairly simple script, but
some preamble is necessary, so I'll do that first. So what I'm trying to do is to parse
a tab-delimited file with columns in it. So I currently look after the process of uploading HPR
episodes to the Internet Archive. And I use a Python library to do this, and the library is called
internetarchive, and one of the tools that comes with it is a command line tool. It uses the
library, and it's called ia, which is short for Internet Archive. The ia tool lets me interrogate
the archive, get data about existing shows, and lets me upload stuff and delete stuff and change
stuff and whatever. So it's an excellent generic tool for doing all this stuff. And in some cases,
I've found it necessary to replace the audio formats on the Internet Archive with copies that
have been generated on the HPR system. And that's because we want to ensure these audio files contain
metadata, audio tags that you'd see in your podcast or whatever. And these files are created by the
Internet Archive software by default. So you just need to upload one file, ideally a WAV file,
which is what we have been doing, and the archive.org software, the Internet Archive software,
does a process that it refers to as derivation and creates a whole list of other formats. But in
doing so, it does not propagate the metadata. Since we're now pointing our podcast feeds at the
Internet Archive, the fact that they contain no metadata is causing some problems for people.
So we've been making versions which do contain tags and uploading them. So I need to be able to tell
which HPR episodes have got the tagless derived audio and which have got audio that we've generated.
The ia tool can do this, but it does it in a format that's hard to parse. So I wrote an Awk
script to help me. What the ia tool produces is tab-delimited lines. The first line contains the names
of all of the columns. So it's like a spreadsheet type of layout. Not very readable for humans,
but fine for a computer. You could put it in a spreadsheet probably, but that would be a pain.
The way it's formatted, and I don't really understand why, is that sometimes certain columns
are missing or they're in different orders. So you can't say, give me column one and column six,
because column one's fairly constant, but column six varies. So you get different data out depending
on which show on the Internet Archive you're looking at. So I wrote a script to try and overcome this.
The script is called parseIAaudio.awk, and it's an executable Awk script. I've listed it in the notes
here. It's not particularly long, but it's got a fair number of comments in. I need comments in
my scripts because I come back to them later and I don't know what the hell this is. So I've left
them in for your edification. So there's a BEGIN rule in this script, and it simply sets FS,
the field separator, to the tab character, backslash t. And there are two other rules. The first one
only runs when the first record is encountered. So it's checking for NR, the record number, being one.
Now this is the header line. There is in these notes a snapshot, the first three lines of one of these
files for show 2450, and you'll see it's got a bunch of column titles
separated by tabs. So that's the one we're reading. We have a for loop which uses a
numeric variable i to iterate from one to the number of fields. Remember NF, capital N-F,
is the number of fields variable. And for each instance, it stores some values in an array called
fld. The index to fld is dollar i. Now you'll remember we have covered this before.
i will be a number from one to whatever the number of fields is. If you put a dollar in front
|
|
of the variable, it means take that number and treat it, treat the whole thing as if you're referring
|
|
to a field. Remember the fields are called dollar one, dollar two, dollar three. So the first time
|
|
we're we're using dollar one, the next time around the loop dollar two. Each of these fields will
|
|
contain one of the names that we saw. So the first one is actually the word name. So we will be
|
|
storing that. And what we're storing is the value of the counter. So in the notes, I've got a list
|
|
of the sort of stuff that would be stored in this array based upon the example in the notes.
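A minimal sketch of that header rule, assuming the array is spelled fld in lower case (the spelling is a guess from the spoken description) and using only the first three column titles mentioned in the example. A real header line has more columns than this.

```shell
# Hypothetical header line holding just the first three column titles.
out=$(printf 'name\tsha1\tformat\n' |
  awk 'BEGIN { FS = "\t" }
       NR == 1 {
           # $i is the title found in column i, so the title becomes the
           # array index and the column number becomes the stored value.
           for (i = 1; i <= NF; i++)
               fld[$i] = i
           print fld["name"], fld["sha1"], fld["format"]
       }')
echo "$out"
```

The array turns each column title into its position, so later rules can ask for a column by title rather than by number.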
|
|
So FLD indexed by name, set to one, FLD indexed by sha1, set to two, FLD indexed by format,
|
|
set to three and so on. The second rule is invoked if two conditions are met. First condition is
|
|
the record number is greater than one. We already dealt with the first line anyway, so that's pretty
|
|
obvious. The second test is a bit more weird. What it says is the field whose number is whatever
|
|
the header called name returned, which will be one, field one, in the example. And the name in there,
|
|
the string that you get out of there should end with, this is a regular expression match,
|
|
any of the strings flac, mp3, ogg, opus, spx and wav. I'm looking only for the audio
|
|
file. There are a whole bunch of other files associated with the show on the internet archive,
|
|
but I'm only interested in the audio file, so it's just looking for those based on the extension.
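The two conditions might be written something like this. The regular expression is a reconstruction from the spoken list of extensions and not necessarily the exact one in the script, the file names in the sample are invented, and fld is the assumed spelling of the header array.

```shell
# Hypothetical sample: a header line, two audio files and one metadata file.
out=$(printf 'name\tsource\nhpr2450.flac\toriginal\nhpr2450.mp3\tderivative\nhpr2450_meta.xml\toriginal\n' |
  awk 'BEGIN { FS = "\t" }
       NR == 1 { for (i = 1; i <= NF; i++) fld[$i] = i }
       # Past the header, keep only rows whose name field ends in one of
       # the audio extensions.
       NR > 1 && $(fld["name"]) ~ /\.(flac|mp3|ogg|opus|spx|wav)$/ {
           print $(fld["name"])
       }')
echo "$out"
```

The dollar sign at the end of the pattern anchors the match to the end of the string, which is what makes it an extension test.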
|
|
If we match both of these conditions, we'll print a couple of fields from that line. The way
|
|
we do it is we use the FLD array to get the particular fields. So FLD you remember contains
|
|
indexes which are the titles in that first header line. I'm interested in the field which is
|
|
called name in there and the field called source. I just want them printed out and I want
|
|
them formatted in a particular way and you see there's a printf that does that.
|
|
It prints out two strings in one line. The first one is 15 characters long, and for the second one we don't care.
|
|
So what is all this stuff here? This dollar parentheses FLD square brackets quote name,
|
|
close square bracket, close parenthesis. What that means is use the value stored in the array element FLD
|
|
square brackets name which in the example was one and use that to reference a field. So the
|
|
thing I just mentioned the thing in parentheses would resolve to dollar parentheses one.
|
|
And that's just a way of writing dollar one. So just print out dollar one from the input line.
|
|
We have to put parentheses around this because we're using an array inside the parentheses and
|
|
if we simply stuck the dollar on the front of FLD, awk could be confused about what was meant. So
|
|
what the script is doing is printing specific columns for certain lines. Selecting particular
|
|
lines and selecting particular columns, but we don't know exactly which columns we want. We just know
|
|
what the name is. So we're simply using the field names to get the columns that we want.
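Putting the pieces together, the whole approach can be sketched as below. This is a reconstruction from the description and not the script from the notes: the fld spelling, the exact regular expression and the left-justified %-15s format are assumptions, and the sample data is invented. The sample deliberately puts the columns in an unusual order to show that only the titles matter.

```shell
# Hypothetical sample with the name column last instead of first.
out=$(printf 'source\tformat\tname\noriginal\tFlac\thpr2450.flac\nderivative\tVBR MP3\thpr2450.mp3\noriginal\tMetadata\thpr2450_meta.xml\n' |
  awk 'BEGIN { FS = "\t" }
       # Header line: remember which column number each title lives in.
       NR == 1 { for (i = 1; i <= NF; i++) fld[$i] = i }
       # Data lines naming an audio file: print name and source by title.
       NR > 1 && $(fld["name"]) ~ /\.(flac|mp3|ogg|opus|spx|wav)$/ {
           printf "%-15s %s\n", $(fld["name"]), $(fld["source"])
       }')
echo "$out"
```

Because the column numbers are looked up afresh from each file's own header, the same script keeps working when a column is missing or has moved.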
|
|
Hopefully that's made sense of that. It's actually a really simple concept but the way it's done is
|
|
I think probably the only way it could be done. If you think otherwise, then come back with some
|
|
suggestions, but it deals with this problem. I should say that most of the queries
|
|
that you send to the Internet Archive interface return JSON-formatted results. JSON's a whole
|
|
other ballgame. It's not something that awk can easily parse. It probably could, but it would be such a pain
|
|
to do. For some reason this one returns a tab-delimited file and it varies where all the fields are
|
|
but awk was well able to deal with that. So I hope that was useful and helpful and possibly even
|
|
interesting and that's all I'm going to say about this one. That's it really. All right then. Bye bye.
|
|
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast
|
|
network that releases shows every weekday Monday through Friday. Today's show like all our shows
|
|
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then
|
|
click on our contribute link to find out how easy it really is. Hacker Public Radio was founded
|
|
by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution
|
|
at binrev.com. If you have comments on today's show, please email the host directly, leave a comment
|
|
on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is
|
|
released under the Creative Commons Attribution-ShareAlike 3.0 license.
|