Episode: 2544
Title: HPR2544: How I prepared episode 2493: YouTube Subscriptions - update
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2544/hpr2544.mp3
Transcribed: 2025-10-19 05:13:39
---
This is HPR episode 2,544 entitled How I Prepared Episode 2,493: YouTube Subscriptions - Update.
It is hosted by Dave Morris and is about 33 minutes long and carries an explicit flag.
The summary is: In show 2,493 I listed some of my YouTube subscriptions. Here's how.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello everybody, it's Dave Morris. Got a weird show for you this time.
I hesitated over whether I should do this to be honest but I thought it might be of interest to somebody.
So what it is is that in show 2,493 I listed some of the new YouTube channels I've added to my
subscription list and because I'm very very lazy and I'm always looking for shortcuts,
I used programming techniques, data manipulation techniques, to prepare the notes.
So I cut and pasted stuff from the YouTube pages for some of the text.
I couldn't find a simple way to automate that but the basic list of YouTube channels was generated programmatically.
So I thought it was worth making a show about how I did this.
So I hope that's enough information for you.
So if this is going to be something incredibly boring to you, you can switch off now.
But I hope you might find it quite interesting because there's a bunch of different techniques
that are being used here to achieve what I wanted to achieve.
And it's part of the process of data manipulation which is what I've done for pretty much all of my
working life. Somebody gives you data in a weird, weird format and you have to turn it into something
more usable. And you've heard lots of other people talk about this. Josh from An Honest Host was
talking about the whole business of being lazy and coming up with scripted ways of achieving
things of this nature on the New Year's Eve show. So in order to do what I wanted to do,
I needed the YouTube subscription list, my YouTube subscription list.
And I'll explain how I got that in a moment. I needed the xmlstarlet tool. This is
something that Ken has mentioned on many occasions. He uses it regularly, I think.
I'd not really got very deeply into it until I did this project. It's not very deep yet,
but I certainly learnt some stuff about it. Then the third component was a package
called the Template Toolkit, which I'll enlarge upon in a moment. And I use that to generate the
Markdown that I use for my show notes. And then I use Pandoc to convert Markdown into HTML.
I won't talk about Pandoc in this episode, but I'll talk about the three other steps.
So first off, if you are a YouTube user and you want to get your subscription list,
then there's one technique. Maybe there are other techniques, you could scrape the page and stuff,
but I discovered that there's a thing called the subscription manager,
which should be available to you as a YouTube user. And I've given the link to it and so forth
in the notes. You select the Manage Subscriptions tab, and at the bottom of the page is an export
option which, when you click it, generates OPML. And this is by default written to a file called
subscription_manager on whatever you're using at that time. So what's OPML?
I've certainly mentioned it before, but I've never gone into much detail. I don't plan to do a
lot of detail now. It's an XML data format and it's designed to be used by some sort of application,
like a podcatcher or something that uses RSS feeds. You can also use it if
you're dealing with videos, if you're using some sort of offline video viewer or something.
I thought it would be a convenient format to parse in order to get the basic channel
information that I wanted, so the list of channels and stuff. And as I say in the notes,
it's possible to do this by scraping the YouTube website, but you'd need to write something very
sophisticated, in my terms sophisticated anyway. If you have done this type of thing and you know
of a better way to achieve this, then let us know. Send in a comment or do a show about it, perhaps.
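As an illustration of why OPML is convenient to parse, here is a minimal sketch in Python using only the standard library. It is not the xmlstarlet method used in the episode. The sample document is hypothetical; its nesting (opml, body, outline, outline) and its attribute names (text, title, type, xmlUrl) follow what the episode reports later about the exported file, and the channel names and IDs are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the export described in the episode:
# opml/body/outline/outline, with text, title, type and xmlUrl attributes.
SAMPLE_OPML = """<opml version="1.1">
  <body>
    <outline text="YouTube Subscriptions" title="YouTube Subscriptions">
      <outline text="Example Channel B" title="Example Channel B" type="rss"
               xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCbbbbbbbbbbbbbbbbbbbbbb"/>
      <outline text="Example Channel A" title="Example Channel A" type="rss"
               xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCaaaaaaaaaaaaaaaaaaaaaa"/>
    </outline>
  </body>
</opml>
"""


def list_channels(opml_text):
    """Return (title, feed URL) pairs for the innermost outline elements, sorted by title."""
    root = ET.fromstring(opml_text)  # root is the <opml> element
    channels = [
        (node.get("title"), node.get("xmlUrl"))
        for node in root.findall("./body/outline/outline")
    ]
    return sorted(channels)


for title, feed in list_channels(SAMPLE_OPML):
    print(f"{title}: {feed}")
```

With a real export, the text of the file would be passed in place of SAMPLE_OPML; nothing here has been checked against an actual YouTube export.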
Given that I've got the subscription_manager file, I used the xmlstarlet tool to parse it.
It's a command line tool. I run Debian testing and I was able to install it from the
repository with a simple apt-get. There are other tools that can be used to do this, but xmlstarlet
is a very powerful and quick Swiss Army knife type tool for doing analysis and
parsing of XML. Ken has mentioned that he at one time was going to do a show about this, or even
more than one show, because it's quite complex. So I hope he'll do that at some point.
It certainly deserves some description on HPR I would have thought. It's even worth a short
series. I'm just going to mention how I used it to generate a simple comma-separated variable file
from the OPML. The first thing I did was rename this file called subscription_manager to the name
yt_subs.opml (yt underscore subs dot opml), just so I knew what on earth it was in the future. Then I discovered
how to use xmlstarlet to do an analysis of a bit of XML. XML is a sort of hierarchical tree
structure of what I guess you could call objects, entities or something of that sort, and the
command I used was xmlstarlet, followed by a sort of sub-command, el
(letters E and L in lowercase), then space, -u, and then the name of the file, yt_subs.opml.
What that does is simply show you that within the tree structure of the XML
there's a top-level opml (it's a bit like a directory structure), then /body, then /outline,
then /outline again. It's a fairly simple structure. You can work out the structure of
XML by using various tools which will print it out in a well-formatted way. One of them is called
xmllint, which is part of the libxml2-utils package, on Debian anyway, which also requires
libxml2. If you're interested in that, I do actually use xmllint from time to time. I should
probably use xmlstarlet because I think it can do a similar job, but I've been using
xmllint for many years. The problem is that the layout of XML is not usually designed for human
readability, so often it's squashed together all on one line or on many, many long lines.
xmllint can reformat it, and I demonstrated briefly how you could do this, just showing the first
seven lines of what was in my file, but I'm not going to talk about that any more. Now within
the XML. XML consists of objects, if you like, or tags if you prefer, because it's akin to
HTML, and the tags are enclosed in less-than and greater-than signs. You'll see in the xmllint
output that there's an instance where it just contains body, just the word body in the symbols. But there are
other cases where you might want to modify the particular object that you're defining, and you
can put further sequences of name, equals, and then a quoted string, and that type of thing, within
it. There are many of these instances in the OPML format. So you can
ask xmlstarlet to report back the structure including these things, which are called attributes.
You'll see all of them if you use xmlstarlet to do this. So I've
just run it with a head command on the end, just to show the first 11 lines. I chose 11 because, after
trial and error, it showed a single sample of what's in there. So the command would be xmlstarlet,
space, el as the subcommand, space, -a, space, and then the name of the file, yt_subs.opml,
piped into head -11. What that shows is that the opml tag can contain the attribute
version. It shows it as opml/@version. The version is an attribute, and it's used in this
particular file; it's just the first line of the OPML definition, which says that it is a version
1.1 OPML file. You don't really need to know more than that. But there are other things: the deepest
branch of the tree, or the furthest branch of the tree, contains a tag which contains the
attributes text, title, type and xmlUrl. So with that in mind, it tells you what type of layout
the XML contains, and you can then write a much more complex xmlstarlet command
which will pull all of the relevant information out. I've demonstrated this with an xmlstarlet
command which took a little bit of trial and error to work out, reading of the documentation
etc. It's just one long line; it's a pipeline with a bunch of commands in it, and in order to
show it in these notes I've split it up into separate lines where each one ends with a
backslash. So this could be the actual contents of a file, or indeed you could type this
in on the command line; the notes show it being typed on the command line. You put a backslash
on the end of the line, and that means that the command's not finished and it's to continue. It's not
the only way to do it, but I thought this would be a way of showing what was going on. It gets quite
complicated in terms of what I'm doing here, but let's see if I can break it down into some
reasonable pieces that are understandable. What we have here is a pipeline, and the first element of
the pipeline is a bracketed list of commands, so it's an open parenthesis, then some stuff, and
then a close parenthesis. Everything that comes out of that parenthesised list is piped into,
in this particular case, head -5. That's just to demonstrate it, and it just shows the first
five lines that are output by this pipeline. Going into the parentheses, the first command we see
in there is an echo, and what it echoes is simply a string which consists of the words title,feed,seen,skip
(title comma feed comma seen comma skip), all in single quotes, and then a semicolon. What that does is cause
that particular string, title,feed,seen,skip, to be output by the pipeline. What we've
effectively got here is a bunch of commands. The parentheses here are a
bashism, also available in other shells, which causes all of the commands within the parentheses
to be executed and the output to be written as a single stream from them all. So this is just the first line
that's to be written out. We're making a comma-separated variable file, and the
requirement is that the first line be the titles of the columns within the file, so you
could use this in a spreadsheet, for example, where you use these as titles in your spreadsheet.
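The header-first layout can be sketched in Python as follows. This is an illustration using the standard csv module, not the shell pipeline from the episode. The column titles are the ones named above; the rows are hypothetical, and the zeros in the last two columns are placeholder values.

```python
import csv
import io

# Column titles named in the episode; they must be the first line of the file.
HEADER = ["title", "feed", "seen", "skip"]

# Hypothetical rows standing in for real channel data.
ROWS = [
    ["Example Channel A", "https://example.com/feed-a", 0, 0],
    ["Example Channel B", "https://example.com/feed-b", 0, 0],
]


def write_csv(rows):
    """Return CSV text with the header line first, then one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = write_csv(ROWS)
print(csv_text, end="")
```

One design point: the csv module quotes any field that itself contains a comma, so a channel title with a comma in it would still load as a single column.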
So after the semicolon we then go into xmlstarlet itself, and there's a subcommand that starts
this off, which is sel (S, E, L). That means to select data or to query an XML document; that's what it
says in the manual page. So we're asking xmlstarlet to do some specific query of the contents.
Then the next thing we see is -t, which defines that there's to be a template used. -m
defines a thing called an XPath expression, which is part of the template. Now an XPath
expression is conceptually similar to a path within the file system, so the XPath is
actually /opml/body/outline/outline. We already saw that when analyzing
the contents of this file. It's just saying that the deepest node within this tree
structure is the thing I just said, so we actually want to pull data out of there. We don't care
about intermediate data; we just want this specific path, as if we were looking in a file system
path to find specific files at that level. In this case we're going to be getting attributes. So
that's then followed by -s. The -s option is a sort specification. I won't go into
details as to what this means, but just briefly, A:T:- (capital A, colon, capital T, colon,
hyphen) is the type of sort to do, and the thing to sort by, I suppose you'd say,
is the title attribute, so @title is in there. That's how we're going to sort the output. Then the
next thing is the specification of what is to be reported. That is a sub-expression which begins
with the option -v, and there's a string containing an expression which will pull
particular pieces of data out of the XML. What it says is concat, so it wants to concatenate
a bunch of things together, and in parentheses it then says @title, so we want to know the title,
that's the title of each channel in the YouTube output; then comma; then a string containing
a comma; then comma, @xmlUrl, comma; then a string containing comma 0 comma 0; close string, close
parentheses, and close the whole enclosing double quotes. In other words, concat(@title,',',@xmlUrl,',0,0').
What that's saying is: just pull the title out and the xmlUrl field out, and then put them together,
comma-separated, with a couple of zeros on the end. Then we have -n, which prints a newline after
each result, followed simply by the name of the file that's
to be processed. So that's everything within the parentheses, and what's happening there
is xmlstarlet is being told how to go and process this file and what to output, and it's just
going to output these two fields with a couple of extra zeros on the end. The output of this
parenthesised list, this pipeline I guess, is to be written somewhere. In reality I wrote it to a file
called yt_data.csv (yt underscore data dot csv) and I used a greater-than sign to redirect it, but in this
particular case I'm just showing you what it looks like by demonstrating the first five lines
that come out of it. So this is a fairly advanced bashism, which I think I will get into
at some stage in my Bash series, but this is a case of actually using it to do some
data manipulation. So we're going to have a four-column comma-separated variable file, and
remember that the last two columns are all going to be zeros in the file that's generated,
but that's to allow me to fiddle around with it and change these values to control things to
do with the file. The column marked seen, that's the third column, is for marking the channels
which I have already talked about in an earlier episode about YouTube subscriptions. That was
2,202. I didn't want to talk about them again, so I wanted to mark them as ignore, effectively. The
skip column is for channels that I just didn't want to include because I didn't think they were
relevant to that particular thing. I've got a lot more channels than I've talked about so
far. That was a very long-winded way of explaining a thing that pulls data out of this OPML file. So the
next thing I wanted to do was to generate HTML for the HPR show notes. To do that I used this tool
called Template Toolkit. It's a templating system, not too surprisingly, and there are many
templating systems for different programming languages and applications. This particular one I've
been using for over 15 years, I think, and I used it a lot when I was working. I really find it very
usable, and it has tons of features. I actually use it on a regular basis when generating show notes
for HPR shows that I do, and I also use it in some of the scripts, the admin scripts, that I've
written to do work for HPR. Template Toolkit is a Perl application, so you need to have Perl
installed on your machine, but just about everything does these days, including Raspberry Pis, so
it's pretty much a matter of course that you get it. You need to have a version of Perl later
than 5.6.0, and my Debian testing box has 5.26.1, so 5.6.0 is pretty old. The toolkit can be
installed in the normal Perl way using the Comprehensive Perl Archive Network, CPAN, but
you do need to do some preliminary work to set that up. If you don't want to do that,
then there's a method of installing it which is defined on the Template Toolkit site,
and I've copied the instructions into the notes. Basically you need to grab a tar file and
untar it, you cd into it, then you use Perl to run the first
stage, use the make command to build it, and you can use sudo to install it across your
system. Template Toolkit is currently version 2.26, but if you look at the main Template Toolkit site,
whatever the version happens to be at the time, the instructions will
reflect that. So Template Toolkit is a big subject and I'm not going to go into detail here. I have
pencilled in the possibility of doing an episode or two on it in the future, and if it sounds
interesting do let me know if you want me to do it. The principle is that you prepare a template,
and in the template are directives which conform to a syntax specific to Template Toolkit, or TT as it's
usually referred to. The template is usually called out of a script written in Perl, or indeed
Python (there's a Python version of Template Toolkit), and then the template is given data from
the script, or it can obtain data itself, and we're going to use that in this particular
process. Then it does things to the data and formats it. Template Toolkit directives
are enclosed in square-bracket-percent sequences: open square bracket, percent, then
a directive, then percent, close square bracket ([% ... %]). That separates it from the data. So you'd put that
in to represent a piece of data that was to be inserted, or to provide directives such as loops
and variables and control statements and so on and so forth. So it's a sort of mini-language
all of its own. Now Template Toolkit can access CSV data. It
has a plugin system, so you can enhance the basic toolkit, and there's one called
Template::Plugin::Datafile which just comes as standard with Template Toolkit.
It allows you to open an arbitrary data file. By default I think the data is expected to be
in fields separated by colons, but you can also tell it to separate by commas, and that's what I did here.
I could have written the thing with colons rather than commas, but I've told it in my particular
case to use commas throughout, just because I felt like it, I guess. So there's an example of how you
would, in your template, define the connection to your data. It consists of, in these square-bracket-percent
sequences, the word USE in uppercase, then some name, equals, then datafile, which is a
function, and the first argument to it is the path to the file, which I've just written
as file path here. Then if you want to change the delimiter to something else, you put delim
equals and then a string containing the single-character delimiter. So I've defined in general
terms the thing here, which points at a file with the fields separated
by commas. The thing referred to as name in this example is actually a data structure
which is collected by Template Toolkit and made available within the template. It's actually a
list of hashes. A hash is an associative array and a list is a non-associative array, so it's
an array of hashes, if you like. But you probably don't need to know that in a huge amount of detail,
because I'll hopefully be explaining it to you in a moment in the example of how I've used it.
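For readers more at home in Python, the "list of hashes" idea can be sketched with the standard csv module, where each row becomes a dict keyed by the header line. This is an analogy only, not the Template Toolkit plugin itself. The data is hypothetical, and the names load_rows and yt_list are made up for this sketch.

```python
import csv
import io

# Hypothetical data in the four-column layout described in the episode.
CSV_TEXT = """title,feed,seen,skip
Example Channel A,https://example.com/feed-a,0,0
Example Channel B,https://example.com/feed-b,1,0
"""


def load_rows(text, delim=","):
    """Read delimited text into a list of dicts keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delim))


yt_list = load_rows(CSV_TEXT)
for chan in yt_list:
    # Every value is a string, including the seen and skip flags.
    print(chan["title"], chan["seen"])
```

The delim argument plays the same role as the delimiter setting described above: passing ":" instead would read colon-separated fields.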
So I've got the actual template that I used to do this sort of stuff, and it's got a
USE directive in it where I created a name, ytlist (YouTube list), and then set that to the
output from the datafile function, where I pointed it at a file called yt_data.csv, the one
I mentioned earlier that was created by xmlstarlet. The delimiter is comma. Then in my template the
next line just consists of hyphen, space, YouTube channels, colon. That's a piece of text that's to be
output by the template, so I want that to be output, and it's a piece of Markdown
syntax; it's the way you specify a list element. The next directive is a FOREACH. It's a
loop, and it's FOREACH, then a variable name, IN, and then some data structure. So I've got
FOREACH chan IN ytlist. So ytlist is a list of this data structure I mentioned; it's
a list of channels, basically, and each channel contains bits of data about the channel. So I'm
setting a variable chan to point to each one. Then the next statement is a NEXT statement. NEXT (N, E, X, T) is
the verb in the command language which means skip to the next iteration in the loop, and it's to skip
if the seen element of the chan variable or the skip element of the chan variable
is set to true, that is, value one. In other words, if I have set either of these fields
to one, then it's not going to be included in the output. The next line is a piece of text, effectively,
with embedded bits of Template Toolkit stuff. It begins with an indentation. The indentation is
important because it's needed by Markdown. The indentation is followed by a
hyphen and a space, then an open square bracket, then an asterisk, and after the asterisk is
an open square bracket, percent, then chan.title, percent, close square bracket. So that's
a substitution of the value of the title of the particular channel, with an asterisk either side of it,
and then there's a close square bracket, so there are square brackets around it. There's an example a
bit lower down in the notes. Then we do something very similar, enclosing in parentheses
another Template Toolkit expression in square brackets and percents, and in this case it's chan.feed.
feed is the URL of the feed. In the OPML the URLs are
RSS URLs, not the channel (I am confusing channel and feed).
It's not the channel that we want, the one you'd load into your search bar in your
browser; it's a feed for giving to an RSS feed reader. But the difference between the two is tiny,
so the expression chan.feed.replace causes a substitution to be done on that string,
and the original one is changed to a new one which references the channel. So you get out
a channel pointer. I think you'll probably see that later on without me trying to explain it.
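The feed-to-channel substitution can be sketched in Python like this. The exact strings used in the episode's replace call are not spelled out in this transcript, so the two URL shapes below are an assumption based on the common forms of YouTube feed and channel addresses, and the channel ID is made up.

```python
import re


def feed_to_channel(feed_url):
    """Turn a YouTube RSS feed URL into the matching channel page URL.

    Assumes the feed URL has the form .../feeds/videos.xml?channel_id=<id>
    and the channel page has the form .../channel/<id>.
    """
    # The dot and question mark are escaped because the pattern is a regular expression.
    return re.sub(r"feeds/videos\.xml\?channel_id=", "channel/", feed_url)


feed = "https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx"
print(feed_to_channel(feed))
```

A URL that does not contain the feed pattern is returned unchanged, which makes the function safe to apply to every row.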
Then the last piece is an END statement for Template Toolkit, again enclosed in these
open-square-bracket-percent and percent-close-square-bracket sequences, and that's the
end of the loop. And that's it. So there are six lines here, and that's all you need in the template.
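The logic of the six-line template can be sketched in Python as below. It is a rough equivalent, not the Template Toolkit template itself. The rows are hypothetical, the four-space indentation is an assumption (the transcript only says the line is indented), and the URL substitution assumes the common YouTube feed and channel address forms.

```python
# Hypothetical rows, as they would look after loading the CSV file.
ROWS = [
    {"title": "Example Channel A",
     "feed": "https://www.youtube.com/feeds/videos.xml?channel_id=UCaaaaaaaaaaaaaaaaaaaaaa",
     "seen": "0", "skip": "0"},
    {"title": "Example Channel B",
     "feed": "https://www.youtube.com/feeds/videos.xml?channel_id=UCbbbbbbbbbbbbbbbbbbbbbb",
     "seen": "1", "skip": "0"},
]


def render_markdown(rows):
    """Build a Markdown list: one heading item, then an indented link per channel."""
    lines = ["- YouTube channels:"]
    for chan in rows:
        # CSV fields are strings here, so compare with "1" explicitly;
        # in Python the string "0" would otherwise count as true.
        if chan["seen"] == "1" or chan["skip"] == "1":
            continue
        url = chan["feed"].replace("feeds/videos.xml?channel_id=", "channel/")
        # Asterisks inside the square brackets italicise the link text.
        lines.append(f"    - [*{chan['title']}*]({url})")
    return "\n".join(lines)


markdown = render_markdown(ROWS)
print(markdown)
```

Channel B is marked as seen, so only Channel A appears under the heading item.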
So to run it you don't need to write a program at all; you can use a command that comes
with Template Toolkit, which is tpage (T, P, A, G, E). What that does is simply run a template. You
give it as an argument the name of a template and it will run Template Toolkit on it. Because
the template says what file it's to process, that's all you need in this particular
example. I am piping the output into the head command, where I'm using -5 to get the first five
lines. So you'll see that what you get is, in column one, a hyphen, then space, YouTube
channels; that was the bit of text that the template outputs. Then the loop starts, and it
starts to print out indented hyphen things which are actually Markdown links. The Markdown link
consists of a bit of text in square brackets followed immediately by a URL in
round brackets, in parentheses. I've used asterisks in these square brackets because that
produces an italicized string. So this is Markdown magic, which is not really very magic, but
there you go. So you give that to Pandoc, and the next example shows the tpage output being
piped directly into Pandoc, and then I've shown the first five lines of that. You see it's HTML, where it's
setting up a list and then a sub-list within it, which is triggered by the
indented list specifications. So that was what was used in show 2,493, and there's a link
in these notes that takes you to the place where it's actually used. So as I got to this point and
writing these notes I was thinking, wow, I've probably lost 90% of the audience here, and anybody who's
left is probably saying, why on earth did you do this? This is entirely overkill. I'm sure Ken is.
But it's just the way my mind works, that's it. It's that thing that Josh was saying: you
tend to come up with programming solutions to avoid the boredom of actually cutting and pasting
a whole bunch of things out of a web page or something of that sort. It made a tedious process
a little bit more interesting. I know Josh mentioned this, but it's also something that I've heard
said in the community of programming and people managing computers and that type of thing
for many, many years: that there's a tendency to come up with solutions so that you don't have
to do boring things, and if you do have to do boring things, you only do them once, and
thereafter you've built something to short-circuit it. It's just a piece of psychology, I
guess, that goes along with the territory. What I have here, then, is a bit of scripting which
I can use again if I ever want to do another episode on YouTube subscriptions. I probably won't,
but if I ever did want to say, oh, I found this cool one, and this one you might like, and stuff,
then I can easily go through the same process again and generate such a list and talk about it
a lot more straightforwardly than doing it the long, hard way. So what I've done is not necessarily
wasted effort, and along the way I learned about how the hell you get stuff out of YouTube, which
just seems to be very reluctant to release information about what it is that you're subscribed to.
I also learned how to use xmlstarlet. I hope I might have passed on a bit of interest and
a recipe for doing strange things with xmlstarlet. And I also learned some new things about
Template Toolkit; even though I use it quite a lot already, I found out things I didn't know at all.
I'd never used it to process a CSV file. And of course there's a Hacker Public Radio
episode at the end of it. You might not agree, but I think this is a cool process. So if you made it
through to the end of this, congratulations, and thank you for listening. OK, bye now.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast
network that releases shows every weekday, Monday through Friday. Today's show, like all our shows,
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded
by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution
at binrev.com. If you have comments on today's show, please email the host directly, leave a comment
on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is
released under the Creative Commons Attribution-ShareAlike 3.0 license.