Episode: 4084 Title: HPR4084: Cloud learning Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4084/hpr4084.mp3 Transcribed: 2025-10-25 19:22:06 --- This is Hacker Public Radio episode 4,084 for Thursday, the 28th of March 2024. Today's show is entitled "Cloud Learning". It is hosted by Daniel Person and is about 10 minutes long. It carries a clean flag. The summary is: my experience trying to train a model online.

Hello hackers, and welcome to another episode. Daniel here, and today I'm going to talk about cloud learning: taking a machine learning model and training it in the cloud. This is a topic I went into over the Christmas break because I was fed up with Advent of Code. I couldn't really bear doing more of that, so I needed another topic to look into.

I have this model, whatever it is. In this case it was a TTS model: I wanted to train a voice to speak a particular language and create something I could use later on. I needed to find somewhere in the cloud to train it. I had figured out that training the model on my own computer would take about 8 days for a full training cycle. When I looked online, I found places where I could train it in 8, 10, or 12 hours, depending on which graphics cards I was using and how many of them I was running. So I wanted to try to do this in the cloud.

I talked to different cloud vendors. First I reached out to Microsoft and looked at their solution. Sadly, after 2 hours of watching videos, trying to learn their platform, how to set things up, where to go, and what tools to use, I gave up. I still think Microsoft's way of structuring things is not intuitive to me; it's very confusing, so I couldn't really get into it. I didn't want to spend more time on it: spending 2 hours just trying to figure out which tools to use is not fun for me.
I also tried Google. It was really easy to find my way around there, and I figured out that I wanted to use Vertex AI. What I wanted to use was the model I already had, so I wanted to train that. A lot of these cloud providers give you notebooks where you're supposed to put your model and run it there, but this model was complex enough that I needed to check out a Git repository or run a Docker image. In Vertex AI you can run your own Docker images and connect them to cloud storage, which was actually not that complicated; you can do it pretty simply. So I set up something that could train, and then I wanted to run it on some GPU power. There was the problem: Google doesn't give you any graphics cards unless you ask for them, so you need to sign up and request a graphics card. On Christmas Day I asked for one graphics card of one type and four graphics cards of an older type, and it took about three to four weeks until they actually gave me access to one of those cards. I haven't been able to run any jobs on that card yet; I'm still trying to figure that out. Just asking for a card and having it take that much time was not a great experience: they said it should take about two to three business days, not four weeks. But still, I got access.

Then I went over and looked at Amazon, because of course I wanted to try all of the big ones, and Amazon frankly just said no: you will not get any GPU power in our tooling, you need to use SageMaker. SageMaker is pretty much "use a notebook and train on GPUs in SageMaker". You still need to ask for GPUs, so I could have been declined there as well, and you can't run your custom images. So what I wanted to do, I was not allowed to do. They just said no, which was really a bit sad, so I couldn't use their service either.
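For reference, submitting a custom Docker image as a Vertex AI training job can be done from the gcloud CLI. This is only a sketch of the approach described above: the region, image, and bucket names are made up, and the accelerator type has to be one your GPU quota request (made separately, under IAM & Admin → Quotas in the console) was actually approved for.

```shell
# Sketch: submit a custom-container training job to Vertex AI.
# Project, region, image, and bucket names below are placeholders.
gcloud ai custom-jobs create \
  --region=europe-west4 \
  --display-name=tts-training \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,accelerator-count=1,container-image-uri=europe-docker.pkg.dev/my-project/my-repo/tts-train:latest

# Follow the job's progress from the same CLI.
gcloud ai custom-jobs stream-logs JOB_ID --region=europe-west4
```

The container image reads its training data from and writes checkpoints to Cloud Storage, which is what "connect them to cloud storage" amounts to in practice.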
Then I found a service that was pay as you go, pretty much GPU as a service. I'm going to release a video about them; I'm not sure if I can actually name the company, but if you follow my channel you will see that video eventually, once we have released it. I'm talking and working with them around this video. It's a very early company with an early concept, so they haven't really released everything yet, but I think their way of doing it is really interesting, and the right way to do it.

They have two or three different concepts, pretty much. They have an environment, where you say "I want to do things in this data center": for instance, I want to run this in Norway. Once you have set up that environment, you can create volumes where you put your data. I created a volume of 100 gigabytes where I put all my data, my operating system, and so on. Then you can start virtual machines. I started a virtual machine with just CPU power and my volume attached, and installed all the different dependencies I needed. The CPU power was very cheap, so it was a cheap way of getting my dependencies, my model, and my data set up and ready to do some training.
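This prepare-on-cheap-CPU, train-on-expensive-GPU pattern is provider-agnostic. Here is a hypothetical sketch of what it might look like from a terminal; the hostnames, repository, paths, and training command are all made up for illustration:

```shell
# Phase 1: cheap CPU instance with the persistent volume mounted at /data.
# (Hostname and repository are placeholders.)
ssh user@cpu-vm.example.net <<'EOF'
  cd /data
  git clone https://github.com/example/tts-model.git
  python3 -m venv venv
  venv/bin/pip install -r tts-model/requirements.txt
EOF

# Shut the CPU instance down, then boot a GPU instance on the same volume.
ssh user@gpu-vm.example.net <<'EOF'
  cd /data/tts-model
  nvidia-smi                    # confirm the GPUs are visible to the guest
  nohup ../venv/bin/python train.py > train.log 2>&1 &
EOF
# When training finishes, shut the GPU instance down so you stop paying
# GPU rates, and pull the checkpoints off the volume from a CPU instance.
```

The point of the split is that dependency installation and data transfer happen at CPU-instance prices, and the GPU instance only runs while it is actually training.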
Then I shut down that CPU-powered machine and took a machine with graphics cards. They were running RTX 4000-, 5000-, and 6000-series cards, A100s, H100s, and L40s, so they had a bunch of those kinds of cards, in machines with one, two, four, or eight of them. Some of the machines were super powerful, for heavy training. Of course it becomes more expensive to run with a lot of cards, but I would still say it was affordable. So if you are running a load and you really want it done quickly, you just put more GPUs on it, start it up with the volume you have prepared, run your workload, shut it down, and perhaps start a CPU machine again to download the results.

The best thing is that you can either work directly in the Linux environment of the virtual machine, or go in over a VNC host and run your things there, or give the machine its own IP on the internet and log in using SSH. You had full access to the machine: you could do whatever you wanted on these machines and just use the GPUs. It was GPU as a service in its purest form, and I really like that approach to training things online. I haven't found any other service that does it similarly, or as well.

There is of course the option of running on Linode, which Akamai has bought up now; they have similar solutions, but from what I have found so far they are very expensive. You could also run it on DigitalOcean. I looked at them, and they are very interesting, but when you start there you first need to give them five bucks just to sign up. Once you had signed up, you could get access to actually run things, but you had to ask for machine power again, and they only accepted running notebooks. So I had to pay them five bucks and couldn't use them when it actually came down to it. I will never see those five bucks again, because I can't use them. So they said, okay, pay us five bucks and you will use that for your training
later on, but because I will never train with them, I pretty much just gave them five bucks.

So this is what I have experienced trying to train machine learning models online using GPUs. Have you tried to do this, and did you have a different experience? Perhaps you have tried Microsoft, found it very easy, and could give me some hints; perhaps record an episode explaining how to do that, so I can figure it out myself as well. Or if you have any other experience, please share it with the rest of the community. I'm very interested in this topic. I hope you liked this episode, and I hope to see you in the next one.

You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.