Episode: 4084
Title: HPR4084: Cloud learning
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4084/hpr4084.mp3
Transcribed: 2025-10-25 19:22:06
---
This is Hacker Public Radio episode 4,084 for Thursday the 28th of March 2024.
Today's show is entitled Cloud Learning.
It is hosted by Daniel Persson and is about 10 minutes long.
It carries a clean flag.
The summary is: my experience trying to train a model online.
Hello hackers, and welcome to another episode. Daniel here, and today I'm going to talk about cloud learning: taking a machine learning model and training it in the cloud. This was a topic I went into over the Christmas break, because I was fed up with Advent of Code; I couldn't really bear doing more of that, so I needed another topic to look into. I have this model, whatever it is; in this case it was a TTS (text-to-speech) model. I wanted to train a voice to speak a particular language and create something that I could use later on, and I needed to find somewhere in the cloud to train this model.
I had figured out that training the model on my own computer would take about 8 days to run through a full training cycle. When I looked online, I could find places where I could train it in 8 hours, or 10 hours, or 12 hours and so on, depending on which graphics cards I was using and how many of them I was running. So I wanted to try doing this in the cloud.
I talked to different cloud vendors. I reached out to Microsoft and looked at their solution; sadly, after 2 hours of watching videos, trying to learn their platform, how to set things up, where to go and which tools to use, I gave up. I still think Microsoft's way of structuring things is not intuitive to me; it's very confusing, so I couldn't really get into it. I didn't want to spend more time on it either: spending 2 hours just trying to figure out which tools to use is not fun for me.
I also tried Google. It was really easy to find my way around there, and I figured out that I wanted to use Vertex AI. What I wanted to train was the model I already had. A lot of these cloud providers give you notebooks that you are supposed to put your model into and run there, but this model was so complex that I needed to check out a Git repository or run a Docker image. In Vertex AI you can run your own Docker images and connect them to cloud storage, which was actually not that complicated; you can do that pretty simply, and I set up something that could train. Then I wanted to run it on some GPU power, and there was the problem: Google doesn't give you any graphics cards unless you ask for them, so you need to sign up and request a graphics card.
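As a rough sketch of what that Docker-plus-cloud-storage setup can look like with the gcloud CLI — the project, image name, region and GPU type here are hypothetical placeholders, not details from the episode:

```shell
# Build and push the training image to Artifact Registry
# (all names below are made up for illustration).
docker build -t europe-north1-docker.pkg.dev/my-project/tts/train:v1 .
docker push europe-north1-docker.pkg.dev/my-project/tts/train:v1

# Submit a Vertex AI custom job that runs the container on one GPU;
# the container reads and writes its data via a Cloud Storage bucket.
gcloud ai custom-jobs create \
  --region=europe-north1 \
  --display-name=tts-training \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,container-image-uri=europe-north1-docker.pkg.dev/my-project/tts/train:v1
```

As described here, the job will only actually schedule once the project has been granted GPU quota in that region.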
So on Christmas Day I asked for graphics cards: one card of one type, and four cards of an older type. It took about three to four weeks until they actually gave me access to one of those cards. I haven't been able to run any jobs on that card yet; I'm still trying to figure that out. But just the request taking that much time was not a great experience: they said it should take about two to three business days, not four weeks. Still, I got access.
Then I went over and looked at Amazon, because of course I wanted to try all of the big ones, and Amazon frankly just said no: you will not get any GPU power in our tooling; you need to use SageMaker. SageMaker is pretty much "use a notebook and train on GPUs in SageMaker", but you still need to ask for GPUs, so I could have been declined there as well, and you can't run your own custom images. So what I wanted to do, I was not allowed to do. They just said no, which was a bit sad, so I couldn't really use their service either.
Then I found a service that was pay-as-you-go: pretty much GPU as a service. I'm going to release a video about them; I'm not sure if I can actually name the company yet, but if you're following my channel you will see that video eventually, once we have released it, and I'm talking to and working with them around that video. It's a very early company with an early concept, so they haven't really released everything yet, but I think their way of doing it is really interesting, and the right way to do it.
They have two, or really three, different concepts. First there is an environment, which pretty much says "I want to do things in this data center"; so you set up an environment and say, for instance, that you want to run things in Norway. Then, once you have set up that environment, you can create volumes to put your data in. I created a volume of 100 gigabytes where I put all my data, my operating system and so on. Then you can start virtual machines: I started a virtual machine with just CPU power and my volume, and installed all the different dependencies I needed. With only CPU power that was very cheap, so it was a cheap way of getting my dependencies, my model and all my data set up and ready for some training.
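The episode doesn't name the vendor or its tooling, so here is only a generic sketch of that prepare-on-CPU, train-on-GPU pattern, with hypothetical paths and package choices:

```shell
# On the cheap CPU-only VM, with the persistent volume mounted at /mnt/volume
# (all paths and packages below are illustrative).
python3 -m venv /mnt/volume/venv            # keep the environment on the volume
/mnt/volume/venv/bin/pip install torch      # pull heavy dependencies on cheap CPU time
rsync -av ~/dataset/ /mnt/volume/data/      # stage the training data on the volume

# Later, after reattaching the same volume to a GPU machine:
/mnt/volume/venv/bin/python train.py \
    --data /mnt/volume/data \
    --out  /mnt/volume/checkpoints
```

Because everything lives on the volume rather than the VM's own disk, the expensive GPU machine only runs for the training itself.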
Then I shut down that CPU-powered machine and took a machine with graphics cards. They were running 4000s, 5000s and 6000s, A100s, H100s and L40s, so they had a bunch of those cards, in machines with one, two, four or eight cards each. Some of the machines were super powerful, so you could train a lot; of course it became more expensive to run with a lot of cards, but it was still affordable, I would say. So if you are running a workload and you really want it done quickly, you just put more GPUs on it, start it up with the volume you have prepared, run your workload and then shut it down, perhaps starting a CPU machine again afterwards to download the result.

The best thing about this is that you can either run things directly in the Linux environment on the virtual machine, go into a VNC session and run your things there, or give the machine its own IP on the internet and log in using SSH. You had full access to the machine; you could do whatever you wanted on it and just use the GPUs. So it was GPU as a service in its purest form, and I really like that approach to training things online. I haven't found any other service that does it similarly, and as well.

There is of course the option of running on Linode, which Akamai has bought now. They have similar solutions, but from what I have found so far they are very expensive. You could also run it on DigitalOcean; I looked at those, and they are very interesting, but when you start there you first need to give them five bucks just to sign up. Once you had signed up you could get access to actually run things, but you had to ask for machine power again, and they only accepted running notebooks. So I had to pay them five bucks and then couldn't use them when it actually came down to it. I will never see those five bucks again, because I can't use them. They said "okay, pay us five bucks and you will use that for your training later on", but because I will never train with them, I pretty much just gave them five bucks.
So this is what I have experienced when trying to train machine learning models online using GPUs. Have you tried doing this, and did you have a different experience? Perhaps you have tried Microsoft and found it very easy, and could give me some hints; perhaps record an episode explaining how to do it, so I can figure it out myself as well. Or if you have any other experience, please share it with the rest of the community; I'm very interested in this topic. I hope that you liked this episode, and I hope to see you in the next one.
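As a back-of-the-envelope illustration of the trade-off described earlier — more cards cost more per hour but finish sooner — here is a small Python sketch. The $2/hour price and the 90% scaling efficiency are made-up assumptions, not numbers from the episode:

```python
def training_cost(base_hours: float, gpu_count: int, price_per_gpu_hour: float,
                  scaling_efficiency: float = 0.9) -> tuple[float, float]:
    """Estimate wall-clock hours and total cost for a run that takes
    base_hours on one GPU, assuming imperfect (near-linear) scaling."""
    effective_speedup = 1 + (gpu_count - 1) * scaling_efficiency
    hours = base_hours / effective_speedup
    cost = hours * gpu_count * price_per_gpu_hour
    return hours, cost

# A run that takes 32 hours on one hypothetical $2/hour card:
for n in (1, 2, 4, 8):
    hours, cost = training_cost(32, n, 2.0)
    print(f"{n} GPU(s): {hours:5.1f} h, ${cost:6.2f}")
```

Under these assumptions, eight cards cut the wall-clock time by roughly a factor of seven while the total bill grows only modestly, which is why "just put more GPUs on it" can be a reasonable move when you pay by the hour.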
You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by an honesthost.com, the Internet Archive and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.