Episode: 4084 Title: HPR4084: Cloud learning Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4084/hpr4084.mp3 Transcribed: 2025-10-25 19:22:06 --- This is Hacker Public Radio episode 4,084 for Thursday, the 28th of March 2024. Today's show is entitled "Cloud Learning". It is hosted by Daniel Person and is about 10 minutes long. It carries a clean flag. The summary is: my experience trying to train a model online.

Hello hackers, and welcome to another episode. Daniel here, and today I'm going to talk about cloud learning: taking a machine learning model and training it in the cloud. This is a topic I went into over the Christmas break because I was fed up with Advent of Code. I couldn't really bear doing more of that, so I needed another topic to look into.

I have this model, whatever it is. In this case it was a TTS model: I wanted to train a voice to speak a particular language and create something I could use later on. I needed to find somewhere in the cloud to train it. I had figured out that training the model on my own computer would take about 8 days for a full training cycle. When I looked online, I found places where I could train it in 8, 10, or 12 hours, depending on which graphics cards I was using and how many of them I was running. So I wanted to try to do this in the cloud.

I talked to different cloud vendors. First I reached out to Microsoft and looked at their solution. Sadly, after 2 hours of watching videos, trying to learn their platform, how to set things up, where to go, and what tools to use, I gave up. I still think Microsoft's way of structuring things is not intuitive to me; it's very confusing, so I couldn't really get into it. I didn't want to spend more time on it: spending 2 hours just trying to figure out which tools to use is not fun for me.
I also tried Google. It was really easy to find my way around there, and I figured out that I wanted to use Vertex AI. What I wanted to use was the model I already had, so I wanted to train that. A lot of these cloud providers give you notebooks where you're supposed to put your model and run it there, but this model was complex enough that I needed to check out a Git repository or run a Docker image. In Vertex AI you can run your own Docker images and connect them to cloud storage, which was actually not that complicated; you can do it pretty simply. So I set up something that could train, and then I wanted to run it on some GPU power. There was the problem: Google doesn't give you any graphics cards unless you ask for them, so you need to sign up and request a graphics card. On Christmas Day I asked for one graphics card of one type and four graphics cards of an older type, and it took about three to four weeks until they actually gave me access to one of those cards. I haven't been able to run any jobs on that card yet; I'm still trying to figure that out. Just asking for a card and having it take that much time was not a great experience: they said it should take about two to three business days, not four weeks. But still, I got access.

Then I went over and looked at Amazon, because of course I wanted to try all of the big ones, and Amazon frankly just said no: you will not get any GPU power in our tooling, you need to use SageMaker. SageMaker is pretty much "use a notebook and train on GPUs in SageMaker". You still need to ask for GPUs, so I could have been declined there as well, and you can't run your custom images. So what I wanted to do, I was not allowed to do. They just said no, which was really a bit sad, so I couldn't use their service either.
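For reference, submitting a custom Docker image as a Vertex AI training job can be done from the gcloud CLI. This is only a sketch of the approach described above: the region, image, and bucket names are made up, and the accelerator type has to be one your GPU quota request (made separately, under IAM & Admin → Quotas in the console) was actually approved for.

```shell
# Sketch: submit a custom-container training job to Vertex AI.
# Project, region, image, and bucket names below are placeholders.
gcloud ai custom-jobs create \
  --region=europe-west4 \
  --display-name=tts-training \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,accelerator-count=1,container-image-uri=europe-docker.pkg.dev/my-project/my-repo/tts-train:latest

# Follow the job's progress from the same CLI.
gcloud ai custom-jobs stream-logs JOB_ID --region=europe-west4
```

The container image reads its training data from and writes checkpoints to Cloud Storage, which is what "connect them to cloud storage" amounts to in practice.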
Then I found a service that was pay as you go, pretty much GPU as a service. I'm going to release a video about them; I'm not sure if I can actually name the company, but if you follow my channel you will see that video eventually, once we have released it. I'm talking and working with them around this video. It's a very early company with an early concept, so they haven't really released everything yet, but I think their way of doing it is really interesting, and the right way to do it.

They have two or three different concepts, pretty much. They have an environment, where you say "I want to do things in this data center": for instance, I want to run this in Norway. Once you have set up that environment, you can create volumes where you put your data. I created a volume of 100 gigabytes where I put all my data, my operating system, and so on. Then you can start virtual machines. I started a virtual machine with just CPU power and my volume attached, and installed all the different dependencies I needed. The CPU power was very cheap, so it was a cheap way of getting my dependencies, my model, and my data set up and ready to do some training.
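This prepare-on-cheap-CPU, train-on-expensive-GPU pattern is provider-agnostic. Here is a hypothetical sketch of what it might look like from a terminal; the hostnames, repository, paths, and training command are all made up for illustration:

```shell
# Phase 1: cheap CPU instance with the persistent volume mounted at /data.
# (Hostname and repository are placeholders.)
ssh user@cpu-vm.example.net <<'EOF'
  cd /data
  git clone https://github.com/example/tts-model.git
  python3 -m venv venv
  venv/bin/pip install -r tts-model/requirements.txt
EOF

# Shut the CPU instance down, then boot a GPU instance on the same volume.
ssh user@gpu-vm.example.net <<'EOF'
  cd /data/tts-model
  nvidia-smi                    # confirm the GPUs are visible to the guest
  nohup ../venv/bin/python train.py > train.log 2>&1 &
EOF
# When training finishes, shut the GPU instance down so you stop paying
# GPU rates, and pull the checkpoints off the volume from a CPU instance.
```

The point of the split is that dependency installation and data transfer happen at CPU-instance prices, and the GPU instance only runs while it is actually training.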
Then I shut down that CPU-powered machine and took a machine with graphics cards. They were running RTX 4000-, 5000-, and 6000-series cards, A100s, H100s, and L40s, so they had a bunch of those kinds of cards, in machines with one, two, four, or eight of them. Some of the machines were super powerful, for heavy training. Of course it becomes more expensive to run with a lot of cards, but I would still say it was affordable. So if you are running a load and you really want it done quickly, you just put more GPUs on it, start it up with the volume you have prepared, run your workload, shut it down, and perhaps start a CPU machine again to download the results.

The best thing is that you can either work directly in the Linux environment of the virtual machine, or go in over a VNC host and run your things there, or give the machine its own IP on the internet and log in using SSH. You had full access to the machine: you could do whatever you wanted on these machines and just use the GPUs. It was GPU as a service in its purest form, and I really like that approach to training things online. I haven't found any other service that does it similarly, or as well.

There is of course the option of running on Linode, which Akamai has bought up now; they have similar solutions, but from what I have found so far they are very expensive. You could also run it on DigitalOcean. I looked at them, and they are very interesting, but when you start there you first need to give them five bucks just to sign up. Once you had signed up, you could get access to actually run things, but you had to ask for machine power again, and they only accepted running notebooks. So I had to pay them five bucks and couldn't use them when it actually came down to it. I will never see those five bucks again, because I can't use them. So they said, okay, pay us five bucks and you will use that for your training
later on, but because I will never train with them, I pretty much just gave them five bucks.

So this is what I have experienced trying to train machine learning models online using GPUs. Have you tried to do this, and did you have a different experience? Perhaps you have tried Microsoft, found it very easy, and could give me some hints; perhaps record an episode explaining how to do that, so I can figure it out myself as well. Or if you have any other experience, please share it with the rest of the community. I'm very interested in this topic. I hope you liked this episode, and I hope to see you in the next one.

You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.