Episode: 3428
Title: HPR3428: Bad disk rescue
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3428/hpr3428.mp3
Transcribed: 2025-10-24 23:13:28
---
This is Hacker Public Radio Episode 3428 for Wednesday, the 22nd of September 2021. Today's show is entitled, Bad disk rescue. It is hosted by Andrew Conway and is about 30 minutes long and carries a clean flag. The summary is: Bad disk rescue, tragedy or happy ending. This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello HPR folks, it's mcnalu here, also known as Andrew. I want to share with you a story. Well, it's a sad story with a, well, I won't tell you how it ends, but maybe you can guess.

What happened was, a few years ago, as my main workhorse laptop, which is literally the one I would use for my work, I bought an ASUS ZenBook UX550VD. I think I got that right. It's a pretty modern laptop and it wouldn't run my favourite distribution at the time, Slackware 14.2, but I discovered it would run the Ubuntu that was out at the time and also Slackware current. So in the end I went with Slackware current. It also came with Windows on it.

Now, it's a nice laptop. It's got half a terabyte of SSD, 8GB RAM, some super fast processor and actually quite a decent on-board graphics chip. I think it's a 1060 Ti mobile. So it's actually quite a beefy laptop. However, I've never been quite satisfied with it. I've always felt there's something a little bit wrong with it, and in particular I found that when it was shutting down it used to get stuck and I'd have to do a hard power-off: a long, five-second press on the power button. Which I don't like doing because, even though the hard disk indicator light is mostly out, I'm never that sure that's a good idea for the hard drive.
And so when I did see some hard drive errors I wasn't too worried. I thought, well, that's probably because it was doing something when I shut it down, but it was on its way down anyway, so I never lost any work. I just noticed that, yeah, there was a hard disk error, easily fixed with the old fsck, but honestly, it was every few months that this would trouble me.

Now, just over a week ago I was working away doing something fairly routine, and it was work. I was using one of my virtual machines, an old Windows 7 virtual machine that I use as a sort of test bed. Windows 7 is end-of-life now, but frankly people that use the software I support don't take note of that, so I still keep it around. This Windows 7 machine has been fairly stable, pretty good actually, but on this occasion, doing something fairly routine, it just froze. I think I was doing something with the browser inside it and it just froze. I don't know, I thought, well, it's end of life now.

I also, as a backup, have a Windows 10 machine which I can work on, so I thought I would just move over to the Windows 10 one to do my Windows tasks, and it ran into trouble. In fact it wouldn't even boot up the virtual machine, so that's odd. And then not long after that, something I was doing in the host operating system crashed. The host, as I say, is Slackware current, which is pretty stable to be honest for a sort of rolling, test-bed release.
I mean, it's actually pretty close to being released as Slackware 15 at the moment, but even when it's going through the flux of change, crashes are rare, and I hadn't updated it recently, so I didn't think that was the cause. So I took a closer look, and indeed fsck did find some errors, maybe more than usual. So I thought, okay, I've got a question mark over the health of this hard drive.

So I ran smartctl. It's an odd utility. I didn't realise this, as I hadn't used it that much before, but the first thing to say about it, if you've never used it before, is that it returns immediately. You type smartctl and all the options you want and tell it to do a test. I told it to do a short test first, and it returns immediately, but it's actually doing the test in the background and doesn't notify you when it's done. So that's one tip with smartctl. It's a good little utility, but the documentation behind it doesn't, in my opinion, tell you that important fact about it.

Anyway, once I got used to its rather idiosyncratic way of behaving, I found that it said that my hard disk was healthy, and all the indicators that Crucial, which was the manufacturer, had provided were within threshold. That is, they were fine. Now, I noticed that the values coming back were 100 or nearly 100 and the threshold was zero. I looked at what that meant, and it's when the value goes below the threshold that you've got a problem. Well, the threshold is zero and the value can't go negative, so I basically thought that most of the diagnostics that Crucial provide maybe meet the spec in theory but are actually useless in practice.

So I didn't find that smartctl told me anything that useful, but I did notice it said that it had a few read errors on the blocks. Now again, this isn't that unusual. You can go back to spinning hard drives; I believe you can
get a bad sector, which I think is a 512-byte area of the hard drive, and if the drive encounters a bad sector it doesn't even tell the operating system. I think what it does is just mark that sector as problematic and use a spare sector, which it ordinarily won't use from the get-go, to take its place. Now obviously you've only got so many spare sectors, and I noticed that I had several hundred bad ones. Now 433 blocks of 512 bytes, you know, it's 200K. It's not an awful lot of the hard drive, so I was not overly worried at this point.

Reading online, some people would say, oh, this is a disaster, unplug your hard drive and image the disk immediately, but I think that was mainly a reference to older spinning disks. Other people say, well, this kind of stuff happens, especially if you've had to do some hard power-offs, or had power cuts while working on a desktop machine or whatever. So I honestly did not find that in itself terribly, dramatically bad, but I was still concerned because I didn't understand where I stood. And I was still thinking, well, that's two virtual machines that have gone and one other unexplained incident with the disk, so what other tests can I do?

So I found this utility called badblocks. Now, I actually don't know the details of how badblocks works, but I knew that it would potentially find more errors than the other methods because it would go around looking for them. Whereas smartctl, I don't know quite how it works, but I didn't quite trust that it was doing as thorough a test as I'd like, even with a long test. Anyway, badblocks found loads of problems. In fact, when it got past the 433 bad blocks, which is what smartctl had told me, I basically decided at that point to shut down my laptop immediately, image the hard drive and declare the drive inside it as on its way to death.

Now I should say the language around disks continues to perplex me. Blocks has two meanings. It has one at the hardware
level, which means 512 bytes on the device itself, and it can also refer to the block size of the file system, which in the case of my ext4 Linux system, I think the default I was using was 4096, or 4K, blocks. So that's the first thing that confused me in all of this. The second thing that confused me is that it was talking about clusters, but I don't think clusters come into it anymore, or perhaps they're just a Windows thing. I don't know, but there's so much jargon, and it's not clear, especially with blocks, what you're talking about. Anyway, suffice to say I knew something was wrong, although at that point I couldn't quantify it.

So what did I do? Well, after doing a bit of research, I went and got a distro called SystemRescue, put that on a USB stick and booted my laptop from it. Oh, and the other thing I noticed at this point is that after I ran badblocks, the first time I rebooted, my hard drive's Linux partition was no longer reported as a bootable option. I don't know why. It was the UEFI firmware that declared this. I didn't remove it in any form. I don't know what caused it to disappear, I still don't know actually, but again it was another serious indication that something was badly wrong with the disk.

Anyway, so I'm booting now from the live rescue distro, it's actually called SystemRescue. I'd done some research before I even booted it up that way, and dd-ing the disk to image it I didn't think was good enough, so I went for ddrescue. I did some research on that and decided to pretty much run it with default parameters. In other words, it wouldn't try to read problematic blocks too many times. It would just try once and then move on, and I could go back and try again later if I wanted to, but I felt at that point I just needed to get as much data off that disk as possible.

Now, when ddrescue finished, I can't remember exactly how long it took, maybe an hour or two, it wasn't
that slow, and it does a number of passes, I should say. Maybe it was longer than that, I can't remember the exact timing. But when it finished it reported that 99.99% of my data was safe. Now, I think that's actually the most that it can report. It did give me an exact number too, but even less than 0.01 percent errors across all the data on the disk is still many megabytes.

Now, many megabytes is small beer compared to five hundred or so gigabytes, which is what the disk could hold, and it was pretty full, like eighty or ninety percent full. So am I worried about a few megabytes? Well, chances are I'm okay, but what happens if one of those megabytes was inside some critical file in the system, in which case it might not boot? The kernel, obviously, would not be good if a small section of it was effectively zeroed. And there might also be personal files: a little photograph or, I don't know, a video, or some important PDF document, receipts or something to do with, say, the house purchase or whatever. I felt like I needed to know where those errors were.

Now, the way to do this is that ddrescue tells you exactly where the problems were on the disk. It generates what's called a map file, which is an excellent thing. It's plain text, readable. It looks like gobbledygook when you first read it, but it doesn't take long, at least with the manual in hand, to decode what it's telling you. It's really telling you in bytes, sorry, I think it does blocks, not bytes, it tells you in blocks where the errors are on your disk. When I went through it I could see that there weren't big ranges of blocks identified as being bad, where it couldn't rescue the data, but there were scattered little chunks all through the file. I think, if I remember correctly, they were all quite small, so maybe a few blocks together were bad, so it wasn't like a
huge range that was wiped out, but there were small little elements dotted around across the disk that were bad and couldn't be read. So with that in mind I could more or less work out, to the byte, where the problems would lie and where, in my image, the data for those files had been replaced by zeros. But which files?

Well, the solution to this, after a bit of research, turns out to be that you can take the image of the disk, and my image is just called sda.img, and you can create a loop device using the command losetup. You give it the -o option and then you specify the offset in bytes where your partition starts inside that image file, and you can tell it that you want it to appear as one of the loop devices. The live distro I was using used /dev/loop0, so I went for /dev/loop1. With that, using the mount command on that loop device, I was able to mount that partition from the image file.

Then, using something called debugfs, which you can run in an interactive mode, I could enter, I think it was, the icheck command. If I gave it a position, I think it was in bytes, that would then tell me which inode number was present at that point in the image. There might not actually be an inode number, because that position might not be used; it might be an empty bit of the disk or used for something else. If it did give me an inode number, then from the inode number I could pretty quickly look up what the file name was using, I think it was called, ncheck, name check, in debugfs.

I was able to do this manually, with some manual calculations of which bytes were which. After I got the hang of this, I found a file and was able to look at it. It was clearly a text file, but the operating system saw it as being a binary file, and less didn't display anything, nor would cat, so I
could see that it was definitely correct: I was finding problematic files in the file system. Then I thought, well, this is going to take a while, because although it's only a few megabytes, they're dotted about all over the disk. So I just wrote a Python script that would generate lists of commands that could be fed into debugfs, and debugfs would then spew out information which was sucked in by another Python script, which took the inode numbers and then spat out all the file names.

Now, to my utter astonishment, it took several hours to execute this process. In fact the second step, no, the first step, I left running overnight, so I don't really know how long it took, but it took many hours. This may seem surprising, and it certainly surprised me, but of course it's all optimized to work in the other direction. When you type ls on the command line, you type ls filename or ls path; you don't type ls byte-at-a-byte. Obviously everything is constructed to be optimized in the other direction. Nevertheless, it seems to be ludicrously slow. If there's a faster way of doing it, please let me know in the comments, or better still, do a show for HPR folks on how to quickly look up file names from byte positions in a partition.

Anyway, I got what I wanted: a list of files that were problematic. Unsurprisingly, the bigger the file was, the more likely it was to suffer a problem, so the VDI files for both my Windows virtual machines were affected. Also some ISO files were rendered useless because they had a bunch of zeros in the middle of them. I was able to go around and remove the problematic files or mark them for salvage. Now, as it turns out, I don't think I have any serious data loss. The ISOs were disposable, of course; I can download them again if I need them. The only things I really cared about were the VDIs, which of course contained the hard disks of my
virtual machines. So I had, before I'd realized what was going on, tried to use Windows to repair its own file system. Of course that didn't work while the VDI files were living in a host OS with a drive that was suffering from read errors. But now, if I could put the VDI files on a good disk, I could potentially recover them. And indeed, I put the Windows 7 virtual machine on another PC of mine, running Windows actually, and I started it up and it booted fine. I don't think Windows tried to repair anything, or at least it didn't appear to. I'm a bit puzzled as to what happened, actually, because it just seemed to start up fine. I did a check disk, I think by right-clicking on the C drive and going to Properties, Tools, Check, and it came back with a few minor problems and corrected them. It didn't seem to think there was much of a big deal, and it was working again, so that was good.

The next task was to see if I could get my Linux partition back onto a good disk and rescue it as easily as that one Windows virtual machine. Now, there's another story here, in that I checked online and discovered that my laptop, well, what the website that offered hard drives said was that it could take either an NVMe disk or a SATA 3 disk. I suspected, looking at the specs, although I didn't know for sure, that I had a SATA 3 disk.

Actually, I should have been able to check in two ways. One was by running the laptop, but I wasn't going to start that up again unless I wanted to scrape some more data off it, so I couldn't do it that way, and I didn't remember whether it was NVMe or SATA 3. Some of you might be laughing at that, but it just never occurred to me to take a note of it before. The second way, a much easier way with a laptop that you don't want to start up, would be to unscrew the bottom of the case and look at the hard drive that I wanted to change before I ordered
it, but I didn't do that because it required, according to a video I saw online, a Torx screwdriver, a T4, which is really rather tiny, and I didn't have one. I only had a T6. So I had to order that and I had to order the disk, and I plumped for an NVMe disk because that's faster and I suspected that there would be two connectors for two hard drives on the board.

The screwdriver arrived at about the same time as the hard drive, and it was next day actually. I opened up the laptop and discovered that there was only one socket to put an SSD in, and it was the SATA 3 kind, which is what my current disk was. So it was a bit dumb of me, and I think I could have done that better, but it actually didn't bother me. I just went out and bought another drive, this time a SATA 3 one, and for the NVMe two-terabyte drive, which wasn't of great use as I couldn't use it in the laptop, I bought a little caddy, because it's quite a handy thing to have, a super fast slim hard drive. These SSDs are like a tiny pencil case compared to the old caddies, which were about the size of floppy drives, well, if you remember how big external floppy drives were.

So eventually I got this new two-terabyte SATA 3 SSD. It was actually, I think, a WD, Western Digital, Red disk, which is really made for NAS drives, but they didn't have the Blue kind, and I've read people say that the Red kind was fine and in some ways might even be better. Again, if you know differently, please let me know, but that's what's in there now. It was dead easy to fit, much easier than faffing around with the two-and-a-half-inch units of old, especially in the cramped confines of laptops. Once I had the correct screwdriver it was dead easy to do.

Incidentally, it turns out that although the Torx T4 did open up the laptop, I actually found that T5 was a better fit, and that after a very short time using the
T4, it seems to have ground off the corners of the sockets that you put it into, which is not great. So if you've got an ASUS ZenBook like me, it might be a T5 that you need. Anyway, I digress.

After I'd installed the new hard drive, which was pretty straightforward, I didn't screw on the bottom panel, which has got one, two, three, four, eight, it's got ten screws, quite fiddly. So I didn't screw them all back on. I thought I'd just start up the laptop and see if I could see the new hard drive, even though it had nothing on it, and yeah, I could, it was fine. That was great. So I turned it off again, turned the laptop over and put in all the Torx screws.

Then I turned the laptop the right way up, preparing to format and partition my new two-terabyte drive, and nothing happened. I pressed the power button: nothing. I thought, well, maybe the battery's run out, so I plugged in the power cable. The power LED didn't come on on the laptop, and I verified that power was definitely coming to it from the cable. That power LED should have indicated that it was red, charging, or white, fully charged. It was neither. Nothing happened. There was no sign of life in the laptop whatsoever, and I couldn't believe it.

So I unscrewed all ten of these Torx screws, and this is when I started to discover that I was ruining the heads of them with the T4 screwdriver. I sat and looked to see what I might have disturbed inside the laptop. I couldn't see anything. I was particularly looking around the hard drive, wondering whether there was a contact I'd knocked. I discovered that when I put the bottom plate on it hadn't quite clipped into place and the screws hadn't quite caught as well as they could have done, so I tightened them all up and made sure it clicked. Nothing.

Then I went to the ASUS website, and on its support page it said, if you're having trouble starting up your ASUS blah-blah-book, press and hold the power button for 40 seconds. Okay, and
that's a bit weird, but I'll try that. And lo and behold, after 40 seconds nothing seemed to change, but the next time I pressed the power button the whole thing burst into life. LEDs came on and it started to boot, and I booted it up into the live distro again, and that was it.

So the next thing to do was, I thought I'd just put the image of the old disk straight onto this new one. Of course the old one was only about 500 gigabytes and the new two-terabyte drive is 2000 gigabytes or thereabouts, but I could fiddle around with the partitions later. I just wanted to see if it would work and that my laptop would now be functional again. So I wrote the image using dd. I mean, literally it was as simple as dd if=sda.img of=/dev/sda, and then I did specify a block size, bs=32M, and that's the command. It took about an hour, hour and a half, five thousand seconds. I can tell you, I remember it was almost exactly five thousand seconds, and it recreated the disk fine.

Now, I should say that there was some debate about what block size you should use, but I think it doesn't really matter how big it is as long as it's over a few megabytes. That seemed to be the optimum, and what the optimum value was seemed to depend on your hardware. But I would specify it, because I think the default is 512 bytes, the size of a hardware block, and that's probably kind of small and slows things down a lot. So I would recommend you up the block size. It does no harm if it's too big.

Anyway, after I'd done that, the first thing I did was boot up the laptop into Windows. I couldn't test Slackware at this point because, as before, the UEFI firmware didn't see a Slackware boot option anymore. It seemed to remember that Ubuntu had once been installed there; that was still lurking about and I'd not taken it off. But the Windows
partition was still there and should have been recreated intact, and so I started it up, and Windows was just fine. No problem at all, just like it had always been.

So I was then able to go into the live distro, and, well, there was an interesting wrinkle here, in that I knew what I needed to do was use a command called, I think, efibootmgr, EFI boot manager. I needed to give it a label, Slackware, and the key thing is where the EFI file is on the EFI boot partition that's needed to boot Slackware. In this case I was using elilo, as it happened. Booted from the SystemRescue live distro, I could mount that EFI boot partition and I could see where that file was, and it was intact and fine, no trouble.

So I was able to do that. Well, I tried, and it wouldn't work. I can't remember exactly what it said, but I think it said something like there are no EFI boot variables. Then I realized that there were two options when booting into this live USB. One of them was to just boot the disk; it was called SanDisk, the brand of the USB stick, and that's what I'd chosen. It turns out that boots in a sort of MBR compatibility mode, so it doesn't think there's any UEFI going on. If I booted it by the thing that said SanDisk partition one, then that was the UEFI way of doing things, and then I could use efibootmgr.

Then I did that, and I was able to boot up Slackware, and running fsck was the first thing I did. Actually, no, sorry, I got that wrong, that wasn't the first thing I did. I ran fsck from the Linux live distro, of course, because you can't run it on a mounted partition, and it came back with no errors. So I did that first, and then I booted into Slackware and it worked fine, and I was back at my desktop almost exactly as if nothing had happened to my laptop. It was really quite
strange. But of course something had happened to it, and I still had several megabytes of zeros scattered about in files all over the disk. I knew what they were, so I was good; I would deal with that later.

I repartitioned the disk. That's not a terribly interesting story. I just used GParted on the SystemRescue live distro to do that and then copied data around so that, now having two terabytes, I could enlarge the Windows partition on that disk and give most of the space over to Linux. I have a nice big home partition now, over a terabyte in size in fact.

So to end the story, the ending is happy, and in fact I'm recording this on the very laptop I'm talking about, using Audacity, and everything is working just fine. So, happy ending. Thanks for listening. Bye bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.
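
The lookup described in the episode (ddrescue map file, then debugfs icheck and ncheck) can be sketched in Python roughly as follows. This is not the host's script, which is not shown in the episode; it is a minimal sketch under stated assumptions. The function names are hypothetical. It assumes the map file data lines are `pos size status` with positions and sizes in bytes, which is how the GNU ddrescue manual documents them, that any region whose status is not `+` counts as unrecovered, that the file system block size is the 4096 bytes mentioned in the episode, and that `part_offset` is the same byte offset passed to `losetup -o`. Note that debugfs `icheck` expects file system block numbers rather than byte positions, which is why the conversion is needed.

```python
GOOD = "+"  # ddrescue status character for a fully recovered region


def parse_mapfile(text):
    """Return (pos, size, status) tuples for the data lines of a map file."""
    regions = []
    seen_status_line = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not seen_status_line:
            # The first non-comment line is ddrescue's own progress record.
            seen_status_line = True
            continue
        pos, size, status = line.split()[:3]
        # int(x, 0) accepts the 0x... hex that ddrescue writes, and decimal.
        regions.append((int(pos, 0), int(size, 0), status))
    return regions


def unrecovered_fs_blocks(regions, part_offset, block_size=4096, part_size=None):
    """Map unrecovered byte ranges on the whole-disk image to file system
    block numbers, counted from the start of the partition."""
    blocks = set()
    part_end = None if part_size is None else part_offset + part_size
    for pos, size, status in regions:
        if status == GOOD or size == 0:
            continue
        start = max(pos, part_offset)  # clip to the start of the partition
        end = pos + size               # exclusive end of the bad range
        if part_end is not None:
            end = min(end, part_end)   # clip to the end of the partition
        if start >= end:
            continue                   # range lies outside the partition
        first = (start - part_offset) // block_size
        last = (end - 1 - part_offset) // block_size
        blocks.update(range(first, last + 1))
    return sorted(blocks)


def icheck_commands(blocks, per_line=64):
    """Build debugfs 'icheck' requests, several block numbers per request."""
    return [
        "icheck " + " ".join(str(b) for b in blocks[i:i + per_line])
        for i in range(0, len(blocks), per_line)
    ]
```

The resulting lines could be written to a text file and passed to debugfs with its `-f` option against the loop device, and the inode numbers it reports could then be given to `ncheck` in the same way. Putting many block numbers in one `icheck` request may reduce the number of passes debugfs makes over the file system, though whether that helps with the slowness described in the episode is untested here.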