Episode: 2720
Title: HPR2720: Download youtube channels using the rss feeds
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2720/hpr2720.mp3
Transcribed: 2025-10-19 15:32:28

---
This is HPR episode 2720 entitled "Download YouTube channels using the RSS feeds". It is hosted by Ken Fallon, is about 24 minutes long, and carries an explicit flag. The summary is: Ken shares a script that will allow you to quickly keep up to date on your YouTube subscriptions.

This episode of HPR is brought to you by AnHonestHost.com. With 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hi everybody, my name is Ken Fallon and you're listening to another episode of Hacker Public Radio. Today is going to be the second in a mini-series that I'm doing on YouTube subscriptions.

The last episode I recorded was in response to Ahuka's episode, where he wanted to watch YouTube channels in reverse order, watching the oldest first and following the video producer as they progressed through their voyage of discovery. That's quite useful, because quite a lot of stuff is episodic. For the channels that I watch, I also wanted to do that, because tutorials build on tutorials build on tutorials: you end up watching a video and it says, well, if you haven't seen my video on this then you should go back and watch that, and then that has a reference to older stuff. So if you like the channel, it's always easier to go back, get all the videos, download them and play them.

I download them in order to save on bandwidth costs, but also so that I have them safely offline and I can watch them on the train; and also sometimes the videos disappear for one reason or another, and then I still have them locally. I do usually delete them after I watch them; other stuff I save, but okay.
Anyway, I digress. The disadvantage of that approach is that, for people who produce a lot of content and have been doing it for quite a long time, the number of videos can be in the hundreds if not thousands. The EEVblog took maybe an hour or two to parse using my previous script, and that's a bit of a disadvantage. It's fine the first time you download a channel, because it's going to take hours and hours to go through each of the videos and download them anyway. But if you just want to check for new videos, checking the last downloaded time means you need to do a query on each video, download all the information about it, look at the last download time and then check to see: is that older than the one that I want to download now? Yes it is, okay, I'll skip to the next one. And that takes time.
Now, I was struggling with this, and by happy coincidence Dave Morris mentioned that YouTube has an RSS feed per channel for the channels you subscribe to, and that you can export all your channels into an OPML file. Well, this is absolutely ideal, because OPML files are basically a playlist of RSS feeds, and you should be familiar with RSS feeds, because you probably wouldn't be listening to HPR if it wasn't for RSS feeds. They allow your podcatcher to go and get the media; RSS provides an XML method to go get media.

So there are a few things in this chain. First of all, you need to be subscribed to the channels, which means people know what your subscriptions are. Of course you can create a pseudo-anonymous account if you want, but I'm actually thinking that this is kind of the currency of YouTube, the number of subscribers that you have, so I'm quite happy to subscribe to the people that I'm subscribed to.
Now, a lot of YouTube content producers will complain about things like the bell not working, about needing to get an email when a video comes in, about people missing videos because they didn't click the bell, and about YouTube only recommending some videos for you. This bypasses all of that: you will always get all the videos that somebody uploaded, or at least all the current ones. So if somebody has uploaded fewer than 10 videos, there will be fewer than 10 in the feed; if they've uploaded more than 10, only the 10 newest ones will be in the feed. But if you check once or twice a week, then, assuming somebody posts one video a day, you still have 10 days to get all the video downloads. And because it is 10 multiplied by the number of feeds in your list, that's the maximum number of videos that you're going to be checking. I'm not sure if 10 is the exact number; let's say it is. Excuse me one moment.
So you need to be subscribed to people on YouTube, and you can go to the subscription manager. This is actually the only part that you need to be logged in for, because the RSS feed itself doesn't require you to be logged in or authenticated to Google. So if you discover a new feed, you log in, you subscribe to that feed, you export your OPML file, and then you can log out of Google, and of YouTube, and you're good to go. The secret URL for the subscription manager is youtube.com/subscription_manager.
When you type that in, you'll get to a secret page. Well, it's pretty much a secret page; it looks very similar to the subscriptions page, but it's different, because right down at the end (you'll see I have 69 subscriptions at the minute) is "Export to RSS readers", with a button, "Export subscriptions". When I click that, it asks me to save the file, and I always paste in the file name I'm going to save it as: subscription_manager.opml.
Excuse me. I also run it through xmllint --format to basically make it human readable, and I pipe it through the sponge command back into the same file name again. Sponge is an excellent command, which I think I was actually introduced to here on HPR; it allows you to write to the same file without having to write a temp file and overwrite it. Very, very cool command.
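As a small sketch of that step (assuming the export was saved as subscription_manager.opml in the current directory), it looks something like this:

    # pretty-print the OPML in place; sponge soaks up all of stdin
    # before writing, so reading and writing the same file is safe
    xmllint --format subscription_manager.opml | sponge subscription_manager.opml

Sponge comes from the moreutils package, which is why the pipeline can safely read from and write to the same file without a temp file.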
Anyway, I have a bit of a script that I put together, and it's broken down into some predictable sections. The first part is some variables which determine what's going to happen, some of the settings that I can tweak. The second part is some sanity checks: making sure that I have a subscription file, creating log files if they're not there, and maintaining a copy of all the files that have been downloaded.
Then I parse the OPML file to get a list of all the RSS feeds that are listed in it, and inside that loop I go to each and every one of those subscriptions, all 69 of them, and extract all the video URLs within each of those feeds. So 70 multiplied by 10 is roughly 700 URLs at the end; maybe there's more, maybe there's less, I'm not sure, could be 20. That basically gives me a list of the URLs that I need to get. Then I go through all that list and check to see if I have already downloaded each URL; if I have, then I don't do anything.
If I haven't, then I do some checks, and after all that's done I do some tidying. The loop produces a cut-down list of all the possible files: some of them I don't want, some of them I've already downloaded, and what's left is the list of new files that I want to download. Then it uses youtube-dl again, but this time just downloading that list of files: here, youtube-dl, you don't need to figure anything out, these are the ones that I want, and here's where I want you to save them.

So let's step through this bash file. It will be in the show notes, as well as a copy of my OPML file as it currently stands today, and some other links and stuff. If stepping through a bash file doesn't float your boat, I would suggest you basically go to tomorrow's episode right now, but I'll go through this as it is.

The first section is the save path, and that's where I want to save my files. My subscriptions variable is the URL, or rather the location, of my OPML file.
What an OPML file actually looks like is an XML file. You've got the opening XML declaration, version 1.0, then the opml element, which opens with version 1.1, and inside of that is a body, and inside of that is an outline, and inside of that is another outline. In the final list of outlines there's essentially one line per channel that you subscribe to, and each channel that you subscribe to has an RSS feed. Within the outline element tag itself there are four different attributes. The text and the title seem to be the same; they just seem to be the name. The type is rss, so that's always there. And the magic bit that we're looking for is the xmlUrl: that is the URL to this particular channel's RSS feed. The first part is always the same, youtube.com/feeds/videos.xml?channel_id=, and then each of the channels has a unique identifier.
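To make that concrete, here is a minimal sketch of the shape of the exported file; the channel name is just an example, and the channel ID is an illustrative placeholder:

    <?xml version="1.0"?>
    <opml version="1.1">
      <body>
        <outline text="YouTube Subscriptions" title="YouTube Subscriptions">
          <outline text="Primitive Technology" title="Primitive Technology"
                   type="rss"
                   xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx"/>
        </outline>
      </body>
    </opml>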
So what we're going to do in our script is take that file and, using XMLStarlet, pull out the xmlUrl and the title. We're just taking the title so that we can print something nice to the screen, so that we know what's going on.

If I take the first one of those xmlUrls and copy and paste it into a browser, I get basically an RSS feed, which is an Atom feed to be honest, so it's not, as it claims, an RSS feed; it gives me the Atom feed of this particular channel. In that feed there's an entry for each and every one of the videos: published date, upload date, lots of cool information about each video from this content producer. At this point, the only thing we're interested in is the URL, which is stored in the feed element, in the entry element, in the media:group element, in the media:content element, where it's an attribute called url. So in order to get that, we first need to parse the OPML file.
So let's start doing that. XMLStarlet: we use sel for select mode, dash capital T for "output as text", dash lowercase t to say "here comes the template", and dash m to say match /opml/body/outline/outline, which we described before. Okay, so now we're on the outline elements themselves. Then we're going to produce a list of space-delimited values, two of them, using the concat function (a function of XSLT, actually, that is exposed by XMLStarlet). We're looking for the attribute at that location (opml, body, outline, outline): the xmlUrl. Then we put a space, and then the title of the channel. And we're using dash n, so there's going to be a new line between each, which is great, because we're feeding it into a loop, reading from our subscriptions OPML file. So that's fantastic; that gives us a list. In the case of my subscription_manager file, it would produce https://www.youtube.com/feeds/videos.xml?channel_id= blah blah blah, space, Wintergatan; and then the next one, blah blah blah, space, Primitive Technology.
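Putting that together, the extraction step looks something like this sketch (the ${subscriptions} variable, holding the path to the OPML file, is my guess at the naming):

    # one line per channel: feed URL, a space, then the channel name
    xmlstarlet sel -T -t -m "/opml/body/outline/outline" \
        -v "concat(@xmlUrl,' ',@title)" -n "${subscriptions}"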
The cool thing here is that I'm piping that into "while read ... do". Instead of the usual "while read i", you can actually put two variables in: the first one is subscription and the second one is title. The first one will always be the URL, and everything else gets dumped into title, and that allows me to echo out the title, so I know I'm getting Wintergatan, getting Primitive Technology, getting John Ward, etc., etc.
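A minimal sketch of that loop, under the same naming assumptions:

    xmlstarlet sel -T -t -m "/opml/body/outline/outline" \
        -v "concat(@xmlUrl,' ',@title)" -n "${subscriptions}" | \
    while read -r subscription title; do
        echo "Getting ${title}"
        # fetch and parse this channel's feed here (next step)
    done

Because read splits on whitespace, the first word (the URL, which contains no spaces) lands in ${subscription} and the rest of the line, the channel name, lands in ${title}.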
Then I just do a wget, because now I'm dealing with an XML file, so all I need to do is wget the RSS file. I use wget -q and ${subscription}, which is the variable holding the URL; dash capital O specifies the output location, and the dash after it means "send it to standard out", which is fairly common in a lot of Unix tools. Then I pipe that into, you guessed it, XMLStarlet again. I use the select command, capital T for (yes, you've been paying attention) text, lowercase t for template, dash m for match, and then underscore colon, which is a really cool thing that XMLStarlet does, because XML itself has this horrible thing with namespaces. Namespaces are like having two Daves on the channel: one is called Dave, the other is called Dave from Mars, and the namespace distinguishes one from the other. What the underscore colon does is say "use the default namespace, don't bug me about that", and then I can specify the path: underscore colon feed, forward slash underscore colon entry, forward slash media colon group (because that's a separate media namespace), and under that element, media colon content. Don't worry too much about it; just think of it like Unix directory paths, because that's roughly what XPath does. In there, it says, you're going to find an attribute called url, and that's the one I want you to print out. And print it out you will, because I've told you to, using the dash v command for value, then dash n for a new line, and then the final dash to say "we're not finished, read the document from standard input".
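As a sketch, the fetch-and-extract step for one channel would look like this:

    # pull the Atom feed and print the url attribute of every
    # media:content element; _: stands for the default namespace
    wget -q "${subscription}" -O - | \
        xmlstarlet sel -T -t \
            -m "_:feed/_:entry/media:group/media:content" \
            -v "@url" -n -

The trailing dash tells XMLStarlet to read the document from standard input, i.e. from wget's output.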
Then we send that to awk, and we use awk space dash capital F to specify a delimiter outside of awk, so going into awk we will already have specified what the delimiter is, and it's a question mark. The reason for that is that the media content URLs give you versions: version equals 3, version equals 1, if somebody has re-uploaded it. Personally I don't care; I will always get the latest one, so that's what I want, so I just return dollar one, and that gives me a clean URL.
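So the tail of the pipeline strips the query string, something like this (the name of the list file is an assumption on my part):

    # field separator is '?', so $1 is the bare video URL
    ... -v "@url" -n - | awk -F '?' '{print $1}' >> "${logfile}_getlist"

With the field separator set to the question mark, $1 is everything before it: the video URL with the version parameter thrown away.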
So basically what we've done is gone to the channels in the OPML file, opened up each RSS feed, and stripped out all the YouTube URLs that are now current, all the possible ones that we can get. We now have them in a list, which is logfile underscore getlist. What we're going to do then is loop through that list, and we're going to do some cool things. First of all we're going to have a count; then we sort the list, get the unique entries, and do a word count of it, so we have a count and a total, and we keep track of the count so we can say "downloading count of total" and update a progress line at the bottom of the screen, showing how many we've processed as we go, which is kind of cool.
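A sketch of that bookkeeping, with the file name again assumed:

    count=1
    total=$(sort "${logfile}_getlist" | uniq | wc -l)   # unique URLs to process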
The first thing we do in the loop is check to see if this video is stored in our log file. Similar to the way Bashpodder does it, any time the script downloads a media file it keeps track of that in the log file; we check the log file, asking "have you already downloaded this?", and if it's in there, it skips it. What you'll find is that youtube-dl also maintains a list of what it has downloaded and won't download it again (you can override that as well if you wish), but I want to do it belt-and-braces style.
In order to make my life easier and nicer, there are a few things that I want to be able to do. For a start, I sometimes get links to live events or long events that have 24 or 48 hours on them, and I want to be able to limit the maximum length of the video that I download, so I have a variable, max length. And then I have another thing called, excuse my French, skip crap, which is a string that contains an egrep regular expression, where I put in a list of stuff that I don't want to download. For example: fail of the week, kids react to stuff, best pets, bloopers, kids try, all that stuff that is, you know, junk. Anything that's junk, you put it in there. I don't use it in this script; I use it in another one, which we can talk about later.
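Those settings might look something like this sketch (the exact variable names and values are my assumptions, based on the description):

    MAXLENGTH=7200   # skip videos longer than this many seconds
    SKIPCRAP='fail of the week|kids react|best pets|bloopers|kids try'

The second string is an extended (egrep-style) regular expression, so a title can be tested with something like: echo "${title}" | grep -qiE "${SKIPCRAP}".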
There are several things I'm able to get. One thing that youtube-dl does is allow you to go to a URL and, using the option dash dash dump dash json, get a complete JSON file of all the metadata associated with that video: all the formats that are available, the upload time, more information than you can shake a stick at. Absolutely excellent tool. And then I use the JSON equivalent of XMLStarlet, which is jq, and that allows me to strip out fields. For example, jq .uploader will give you the uploader, .title will give you the title, .upload_date will give you the upload date, .id will give you the id, and .duration will give you the duration. So this makes it absolutely, really really easy to work with these video files.
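A sketch of that metadata step (assuming youtube-dl and jq are installed, and that ${url} holds the video URL):

    metadata=$(youtube-dl --dump-json "${url}")
    uploader=$(echo "${metadata}" | jq -r '.uploader')
    title=$(echo "${metadata}" | jq -r '.title')
    upload_date=$(echo "${metadata}" | jq -r '.upload_date')
    id=$(echo "${metadata}" | jq -r '.id')
    duration=$(echo "${metadata}" | jq -r '.duration')   # in seconds

The -r flag makes jq print raw values rather than JSON-quoted strings.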
You have all the metadata you need to produce nice clean user messages. So the first thing I do is check to see if the duration is sane, because sometimes it's zero, and if the duration is strange I skip over that one. Otherwise I look to see if the duration is greater than the max length that I set before, and if it is, it skips that one: you told me not to download this one. Sometimes I'll keep an eye on those and see, oh, but that one I actually do want. Then the next one is running the grep filter, to skip over any of the stuff that I don't want to download. And then, finally, it prints off the video: it tells you who uploaded it, what the title is and what the URL is, and then pipes that into a to-do text file.
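A hedged sketch of those checks (variable names as assumed above; I'm also assuming the to-do file holds bare URLs, since it's fed to youtube-dl later as a batch file):

    if [ -z "${duration}" ] || [ "${duration}" = "null" ] || [ "${duration}" -eq 0 ]; then
        continue                      # no sane duration: skip it
    elif [ "${duration}" -gt "${MAXLENGTH}" ]; then
        continue                      # longer than we're prepared to download
    fi
    echo "${uploader}: ${title} ${url}"
    echo "${url}" >> todo.txt         # queue it for the batch download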
Another thing that I do then is save the description metadata out of the JSON, which is what you see on a YouTube channel when you go to a video and click "more info": all the information. Quite a lot of the videos that I download have links to the components that they use, or how-tos, or links to GitHub repositories, or whatever, and all that's available right there. The only thing is, it has escape codes like slash n and slash t for new lines and tabs, so what I do is run it in a dollar-bracket thing (Dave, sorry, I can't remember what that's called... command substitution, is it?). I use echo dash e, then dollar, open bracket, echo the metadata into jq, space, description, close bracket, and I write that out to a video id dot txt. So if I ever need to go back and have a look at the links somebody had in the video, I can get it right there. That's really, really useful, and it's something I keep even if I delete the video itself.
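As a sketch, mirroring the description in the audio:

    # expand the \n and \t escapes in the JSON string and save it next to the video
    echo -e "$(echo "${metadata}" | jq '.description')" > "${id}.txt"

(Using jq -r '.description' instead would emit real newlines directly, without needing echo -e.)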
So that is the loop. If I've found a video, it gets added to the to-do list; if I haven't found a video, it says "processing count of total". It uses echo dash n dash e, so it doesn't print a new line and it uses escape codes, and one cool one there is the slash r, which means when you print it, the cursor goes back to the start of the line. So it looks like only the number is being updated: processing 1, 2, 3, 4, 5, 6, 7 of total, which is kind of cool. Then I increment the count using count equals dollar, open bracket, open bracket, count plus 1, close bracket, close bracket. Thank you again, Dave.
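In bash, that looks something like:

    echo -ne "Processing ${count} of ${total}\r"   # \r rewinds to the start of the line
    count=$((count+1))                             # arithmetic expansion to bump the counter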
So now I have a list, and I check to see if there's anything in it, because sometimes there isn't: if I run the script twice, one after the other, there have been no updated files. But if there is, then I cat the to-do file into youtube-dl, and this time I'm using dash dash batch dash file with a dash, saying that it's going to read a batch file from standard input, and then the old classics: ignore errors, no mtime, restrict file names, format equals mp4, then dash o for the save path: a subdirectory per uploader, and inside that subdirectory the upload date, dash, then the title, my diamond delimiter, then the id, and then the extension.
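A sketch of that final call (the variable names are my assumptions, and I've used a plain dash where the audio mentions a "diamond" delimiter between title and id):

    cat "${todo}" | youtube-dl --batch-file - --ignore-errors --no-mtime \
        --restrict-filenames --format mp4 \
        -o "${savepath}/%(uploader)s/%(upload_date)s-%(title)s-%(id)s.%(ext)s"
    cat "${todo}" >> "${logfile}"   # remember everything we just fetched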
Once that's downloaded, I cat that entire to-do file onto the log file, so that when I run this again, all the videos that I have downloaded are already in the log file, and then there's just a little bit of cleanup. I hope you didn't find that too boring, because it's actually quite a nice script, this one. Sometimes you have scripts that just don't go anywhere (we'll be talking about that in the next one), and you have scripts that improve over time, and this one has improved over time, I think. So I welcome your feedback, constructive of course, and as always, tune in tomorrow for another exciting episode of Hacker Public Radio!
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contributing link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.