Episode: 3428
Title: HPR3428: Bad disk rescue
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3428/hpr3428.mp3
Transcribed: 2025-10-24 23:13:28
---
This is Hacker Public Radio Episode 3428 for Wednesday, the 22nd of September 2021.
Today's show is entitled, Bad disk rescue. It is hosted by Andrew Conway and is about 30 minutes long
and carries a clean flag. The summary is: Bad disk rescue, tragedy or happy ending.
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting
with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello HPR folks, it's mcnalu here, also known as Andrew. I want to share with you a story. Well, it's a
sad story with a, well, I won't tell you how it ends, but maybe you can guess. What happened
was, a few years ago, as my main workhorse laptop, which is literally the one I would use for my work,
I bought an ASUS ZenBook UX550VD. I think I got that right. It's a pretty modern laptop and it
wouldn't run my favourite distribution at the time, Slackware 14.2, but I discovered it would run the
Ubuntu that was out at the time and also Slackware current. So, why not? In the end I went with Slackware
current. It also came with Windows on it. Now, it's a nice laptop: it's got half a terabyte of SSD,
8GB RAM, some super fast processor and actually quite a decent on-board graphics chip. I
think it's a 1060 Ti mobile. So it's actually quite a beefy laptop. However, I've never been quite
satisfied with it. I've always felt there's something a little bit wrong with it, and in particular
I found that when it was shutting down it used to get stuck and I'd have to do a hard power
off, a long five-second press on the power button. Which I don't like doing, because I'm never
that sure, even though the hard disk indicator light is mostly out, I'm never that sure that's
a good idea for the hard drive. And so when I did see some hard drive errors I wasn't too worried.
I thought, well, that's probably because it was doing something when I shut it down, but it was on
its way down anyway, so, you know, I never lost any work. I just noticed that, yeah, it was a
hard disk error, and easily fixed too with the old fsck, but honestly, you know, it was every few
months this would trouble me. And now, just over a week ago, I was working away doing something
fairly routine, and it was work. So I was using one of my virtual machines, and it's an old
Windows 7 virtual machine that I use as a sort of test bed. I mean, Windows 7 is end of life now,
but frankly people that use the software I support don't take note of that, so yeah, I still keep that
around. Now, this Windows 7 has been fairly stable, I mean pretty good actually,
but on this occasion I was doing something fairly routine and it just froze. I think I was doing something
with the browser inside it and it just froze. I thought, well, it's end of life.
Now, I also as a backup have a Windows 10 machine which I can work on, so I thought, well, yeah, I
would just move over to the Windows 10 thing to do my Windows tasks, and it ran into trouble. In
fact, it wouldn't even boot up the virtual machine, so that's odd. And then not long after that,
something I was doing in the host operating system, which as I say is Slackware current, which is
pretty stable, to be honest, for a sort of rolling test bed release. I mean, it's actually pretty
close to being released as Slackware 15 at the moment, but even when it's going through
the flux of change, crashes are rare, and I hadn't updated it recently, so I didn't think that was
the cause. So I took a closer look, and indeed fsck did find some, maybe more errors
than usual. So I thought, okay, well, I've got a question mark over the health
of this hard drive. So I ran smartctl, and it did find, well, it's an odd
utility. I didn't realise this, I haven't used it that much before, but the first thing to say about
it, if you've never used it before, is that it returns immediately. You type smartctl
and all the options you want to tell it to do a test, I told it to do a short test first, and
actually what it does is it returns immediately, but it's actually doing the test in the background
and doesn't notify you when it's done. So that's one tip with smartctl. It's a good
little utility, but, you know, the documentation behind it doesn't tell you, in my opinion,
that important fact about it. Anyway, once I got used to its rather idiosyncratic way of behaving,
I found that it said that my hard disk was healthy, and all the indicators that Crucial, which
was the manufacturer, had provided were within threshold, that is,
they were fine. Now, actually, I noticed that the values it was coming back with were like 100 or nearly
100 and the threshold was zero, and I looked at what that meant, and it's when the value goes
below the threshold you've got a problem. Well, the threshold is zero and it can't go negative, so
I basically thought that most of the diagnostics that Crucial have provided maybe
meet the spec in theory but are actually useless in practice. So I didn't find smartctl told me
anything that useful, but I did notice it said that it had a few read errors on the blocks. Now,
again, this isn't that unusual. If you go back to spinning hard drives, I believe you can get a bad
sector, which I think is a 512-byte area of the hard drive, and if it encounters a bad sector
it doesn't even tell the operating system. I think what it does is it just marks that sector as
problematic and uses a spare sector, which ordinarily it won't use from the get-go, to take its place.
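Editor's note: the threshold rule described a moment ago can be sketched in a few lines of Python. This is only an illustration, not smartctl's own code. It assumes the rule as the smartctl documentation appears to state it (an attribute counts as failed when its normalized value is less than or equal to its threshold), and the attribute names and readings below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # normalized value reported by the drive
    threshold: int  # failure threshold chosen by the vendor


def has_failed(attr: SmartAttribute) -> bool:
    """Assumed SMART rule: failed when value <= threshold."""
    return attr.value <= attr.threshold


# Hypothetical readings like those described: values near 100, threshold 0.
attrs = [
    SmartAttribute("Reallocated_Sector_Ct", 100, 0),
    SmartAttribute("Example_Attribute", 98, 0),
]
failed = [a.name for a in attrs if has_failed(a)]
print(failed)
```

With a threshold of zero and normalized values that normally stay at 1 or above, the check can never trip, which matches the complaint that such attributes say little in practice. As for the background self-test, smartctl can start one with `-t short` and show the self-test log later with `-l selftest`.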
Now, obviously you've only got so many spare sectors, and I now noticed that I had several hundred
of them. Now, several, like 433 blocks of 512 bytes, you know, it's about 200K, it's not an awful
lot of the hard drive, so I'm not overly worried at this point. And reading online, some people would
say, oh, this is a disaster, unplug your hard drive and image the disk immediately, but I think
that was mainly in reference to older spinning disks, and other people say, well, you know, this
kind of stuff happens, especially if you've had to do some hard power-downs or had power cuts
while working on a desktop machine or whatever. So I honestly did not find that in itself
terribly, dramatically bad, but I was still concerned because I didn't understand where I stood.
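Editor's note: the "about 200K" figure is easy to check. The sketch below uses the numbers quoted in the episode; the total disk size of 500 GB is an assumption based on the "half a terabyte" mentioned earlier.

```python
SECTOR_BYTES = 512        # hardware sector size mentioned in the episode
bad_sectors = 433         # count quoted in the episode
disk_bytes = 500 * 10**9  # assumption: "half a terabyte" taken as 500 GB

bad_bytes = bad_sectors * SECTOR_BYTES
print(bad_bytes)                        # 221696
print(bad_bytes / 1024)                 # 216.5 (KiB)
print(f"{bad_bytes / disk_bytes:.1e}")  # fraction of the whole disk
```

So 433 sectors come to a little over 216 KiB, a tiny fraction of the drive, which is why the raw count alone did not look alarming.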
And then I'm still thinking, well, that's two virtual machines that have gone and one other unexplained
incident with the disk, so what else can I, what other tests can I do? So I found this utility called
badblocks. Now, I actually don't know the details of how badblocks works, but I knew that it would
potentially find more errors than the other methods because it would go around looking for them,
whereas smartctl, I don't know quite how it works, but I didn't quite trust it was doing as
thorough a test as I'd like, even with a long test. Anyway, badblocks found loads of problems. In
fact, when it got past the 433 bad blocks, which is what smartctl had told me, then I was, well,
I basically at that point decided to shut down my laptop immediately and image the hard drive
and declare the drive inside it as on its way to death. Now, I should say the language around
disks continues to perplex me. Blocks has two meanings: it has one at the hardware level, which
means 512 bytes on the device itself, and it can also refer to the block size of the file system,
which, in the case of what I was using, my ext4 Linux system, I think the default I was using was 4096, or 4K,
blocks. So that's the first thing that confused me in all of this, and the second thing that
confused me is it was talking about clusters, but I don't think clusters come into it anymore, or
perhaps they're just a Windows thing. I don't know, but there's so much jargon and it's not clear,
especially with blocks, what you're talking about. Anyway, suffice to say I knew something was
wrong, although at that point I couldn't quantify it. So what did I do? Well, I went and got a
distro called SystemRescue, after doing a bit of research, and put that on a USB stick
and booted my laptop from it. Oh, and the other thing I noticed at this point is, after I ran
badblocks, the first time I rebooted, my hard drive, the Linux partition, was no longer
reported as a bootable option. I don't know why. It was the UEFI firmware that declared this. I
didn't remove it in any form. I don't know what caused it to disappear. I still don't know, actually,
but again, another serious indication that something was badly wrong with the disk. Anyway, so I'm
booting now from the live SystemRescue distro, it's actually called SystemRescue, and after I'd
done some research before I even booted it up that way, dd-ing the disk to image it I didn't think
was good enough, so I went for ddrescue. I did some research on that and I decided to
pretty much run it with default parameters. In other words, it wouldn't try and read
problematic blocks too many times; what it would do is just sort of try once and then sort of move on,
and I could go back and try again if I wanted to later, but I felt at that point
I just needed to get as much data off that disk as possible. Now, when ddrescue finished, it took,
I can't remember exactly how long it took, maybe an hour or two, it wasn't that slow, and it does a number of
passes, I should say, maybe it was longer than that, I can't remember exactly
the timing, but when it finished it reported that 99.99% of my data was safe. Now, I think that's
actually the most that it can report. It did give me an exact number too, but even less than 0.01
percent errors of all the data on the disk is still many megabytes. Now, many megabytes is small
beer compared to five hundred or so gigabytes, which is what the disk could hold, and it was pretty
full, it was like eighty, ninety percent full. So am I worried about a few megabytes? Well, chances are
I'm okay, but what happens if one of those megabytes was inside some critical file in the system,
in which case it might not boot? You know, the kernel obviously would not be good if a
small section of it was effectively zeroed. And also there might be personal files, you know,
a little photograph or, I don't know, a video, or some important PDF document, receipts or
something to do with, say, the house purchase or whatever. You know, I felt like I needed to
know where those errors were. And now, the way to do this is ddrescue tells you exactly where
the problems were on the disk. It generates what's called a map file, which is an excellent thing. It's
plain text, readable. It looks like gobbledegook when you first read it, but it doesn't take long,
with the manual in hand, to decode what it's telling you, and it's really telling you in bytes
where, sorry, I think it does blocks, sorry, not bytes, it tells you in blocks where errors are
on your disk. And when I went through I could see that there weren't big ranges of blocks
that were identified as being bad, where ddrescue couldn't read from, but there were scattered
little chunks all through the file. So there was like, maybe, you know, I think, if I
remember correctly, they were all quite small, so maybe a few blocks
together were bad. So it wasn't like a huge range that was wiped out, but there were
small little elements dotted around across the disk that were bad and couldn't be read.
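Editor's note: a minimal sketch of reading a ddrescue map file in Python. It assumes the layout described in the GNU ddrescue manual: comment lines starting with `#`, one current-status line, then `pos size status` lines, where `+` marks rescued data. As far as that manual describes it, `pos` and `size` are byte counts, usually written in hexadecimal, rather than block counts. The sample text is made up for illustration and is not the map file from the episode.

```python
def parse_mapfile(text):
    """Return (pos, size, status) tuples from ddrescue-style map file text."""
    data_lines = [ln.split() for ln in text.splitlines()
                  if ln.strip() and not ln.lstrip().startswith("#")]
    blocks = []
    for fields in data_lines[1:]:  # first data line is the current-status line
        pos, size, status = fields[:3]
        blocks.append((int(pos, 0), int(size, 0), status))
    return blocks


def bad_ranges(blocks):
    """Keep every range that is not marked '+' (rescued)."""
    return [(pos, size) for pos, size, status in blocks if status != "+"]


# Made-up sample in the assumed format.
SAMPLE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x00300000     +               1
#      pos        size  status
0x00000000  0x00100000  +
0x00100000  0x00000200  -
0x00100200  0x000FFE00  +
0x00200000  0x00000400  -
0x00200400  0x000FFC00  +
"""

blocks = parse_mapfile(SAMPLE)
bad = bad_ranges(blocks)
print(bad)
print(sum(size for _, size in bad), "bytes unreadable")
```

Summing the sizes of the unrecovered ranges gives the total amount of data replaced by zeros in the image, and the start positions say where in the image to look.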
So with that in mind I could more or less tell, I could work out to the
byte where the problems would lie and where the files were. In my image, the data for
those files was replaced by zeros, but which files? Well, the solution to this, after a bit of research,
turns out that you can take the image of my disk, and my image is just called sda.img, and you
can create a loop device using the command losetup, and you give it the -o option and then
you specify the offset in bytes where your partition starts inside that image file, and you can
tell it that you want it to appear on one of the loop devices. The live distro I was using used
/dev/loop0, so I went for /dev/loop1, and with that I was able to mount,
using the mount command on that loop device, I was able to mount that partition from
the image file, and then using something called debugfs, so you just start it up, you can run it
in interactive mode, I could enter, I think it was the icheck command. If I gave it, I think it was a
byte position, I think it was, I gave it that, that would then tell me which inode number
was present at that point in the image, and it might not actually be an inode
number, because it might not be used, it might be an empty bit of the disk or used for something else.
So if it did give me an inode number, then I could from the inode number pretty quickly look
up what the file name was using, I think it was called ncheck, name check, in debugfs, and I was
able to do this manually, do some manual calculations of which bytes were what. So after I got the
hang of this, I began to, you know, I found a file. I was able to look at it. It was clearly a text
file, but the operating system saw it as being a binary file, and less wouldn't display anything,
nor would cat. So I could see it was correct, I was finding problematic files
in the file system. And then I thought, well, this is going to take a while, because there's quite a lot,
you know, although it's only a few megabytes, they're dotted about all over the disk. So I just
wrote a Python script that would generate lists of commands that could be fed into debugfs,
and debugfs would then spew out information which was sucked in by another Python script, which took
inode numbers and then spat out all the file names. Now, to my utter astonishment, it took several
hours to execute this process. In fact, the second step, no, the first step, I left running
overnight, so I don't really know how long it took, but it took many hours.
And this may seem surprising, it certainly surprised me, but of course it's all
optimized to work in the other direction. You know, when you type ls on the
command line, you type ls file name or ls path, you don't type ls byte-at-a-byte, or, you
know, obviously everything is constructed to be optimized in the other direction. Nevertheless, it
seemed to be ludicrously slow. If there's a faster way of doing it, please let me know in the comments,
or better still, do a show for HPR folks on how to quickly look up file names from
byte positions in a partition. Anyway, I got what I wanted: a list of files that were problematic.
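Editor's note: the two-script approach described above might look something like the sketch below. This is not the script from the episode. It assumes that debugfs's `icheck` takes filesystem block numbers (so byte positions in the image have to be converted first), that the filesystem block size is 4096 bytes, and that `icheck` prints one `block inode` pair per line. The partition offset and bad ranges are hypothetical.

```python
def blocks_for_range(pos, size, part_offset, block_size=4096):
    """Filesystem block numbers touched by a byte range of the disk image.

    Ranges before the partition start are clipped; the partition end is not
    checked in this sketch.
    """
    start = pos - part_offset
    end = start + size - 1
    if end < 0:
        return range(0)
    start = max(start, 0)
    return range(start // block_size, end // block_size + 1)


def icheck_request(bad, part_offset, block_size=4096):
    """Build one debugfs 'icheck' request covering every affected block."""
    blocks = sorted({b for pos, size in bad
                     for b in blocks_for_range(pos, size, part_offset, block_size)})
    return "icheck " + " ".join(map(str, blocks))


def inodes_from_icheck(output):
    """Collect inode numbers from lines that look like '<block> <inode>'."""
    inodes = set()
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            inodes.add(int(fields[1]))
    return sorted(inodes)


def ncheck_request(inodes):
    """Build one debugfs 'ncheck' request to turn inodes into path names."""
    return "ncheck " + " ".join(map(str, inodes))


PART_OFFSET = 2048 * 512                     # hypothetical partition start in bytes
bad = [(1056768, 512), (1093536, 1024)]      # hypothetical (pos, size) byte ranges
request = icheck_request(bad, PART_OFFSET)
print(request)
```

The request lines could be saved to a file and passed to debugfs with its `-f` option, run against the loop device or the image. Putting many block numbers into a single `icheck` request, rather than one request per block, may cut the running time, though that is untested here.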
And unsurprisingly, the bigger the file was, the more likely it was to suffer a problem, so my VDI files
for both my Windows virtual machines were affected. Also some ISO files were rendered useless
because they had a bunch of zeros in the middle of them now. And I was able to go around and,
you know, remove the problematic files or mark them for salvage. Now, as it turns out,
I don't think I have any serious data loss. The only things I really cared about, well, the ISOs were
disposable, of course, I can download them again if I need them, but the VDIs of course contained
the hard disks of my virtual machines. So I had, before I'd realized what was going on, tried to use Windows
to repair its own file system. Of course that didn't work while the VDI files were living in a
host OS with a drive that was suffering from read errors, but now, if I could put
the VDI files on a good disk, I could potentially recover them. And indeed I put the Windows 7
virtual machine on another PC of mine, running Windows actually, and I started it up and it
actually booted fine. I don't think Windows actually tried a repair.
I'm a bit puzzled as to what happened, actually, because it just seemed to start up fine, and I did a
check disk, actually I think I did it by right-clicking on the C drive and going to
Properties, Tools, and did the check that way, and it came back with, you know, a few minor problems
and it corrected them. It didn't seem to think there was much of a big deal, and it was working again, so
that was good. So the next task was to see if I could get my Linux partition back onto a good
disk and revive it as easily as that one Windows virtual machine. Now, there's another story here,
in that I checked online and discovered that my laptop, well, what the website that offered
hard drives said was that it could either take an NVMe disk or a SATA 3 disk, and I suspected,
looking at the specs, although I didn't know for sure, that I had a SATA 3 disk. Actually, I should
have been able to check in two ways. One was with the running laptop, but I wasn't going to start that
up again, you know, unless I wanted to scrape some more data off it, so I couldn't do it
that way, and I didn't remember whether it was NVMe or SATA 3. Now, some of you might be laughing at
that, but it just never occurred to me to take a note of it before. The second
way, a much easier way with a laptop that you don't want to start up, would be to unscrew the
bottom of the case and look at the hard drive that I wanted to change before I ordered it. But I didn't
do that, because it required, according to a video I saw online, a Torx screwdriver, a T4, which is
really rather tiny, and I didn't have one. I only had a T6. So I had to order
that and I had to order the disk, and so I plumped for an NVMe disk, because that's faster and I
suspected that there would be two connectors for two hard drives on the board. The screwdriver arrived
at about the same time as the hard drive, it was next day actually, and I opened up the laptop
and discovered that there was only one socket to put in an SSD and it was the SATA 3 kind, which is
what my current disk was. So it was a bit dumb of me. I think I could have done that better, but
actually it didn't bother me, because I just went out and bought another SATA drive, this time
SATA 3, and for the NVMe two-terabyte drive, which wasn't of great use, or I couldn't use
in the laptop, I bought a little caddy, because it's quite a handy thing to have, a super fast,
slim hard drive, because of course these SSDs are like, you know, it's
like a tiny pencil case compared to the old caddies, which were about the size of floppy
drives. Well, if you remember how big external floppy drives were.
So eventually I got this new two-terabyte SATA 3 SSD. It was actually, I think, a WD,
Western Digital, Red disk, which is really made for NAS drives, but they didn't have the
Blue kind, and I've read people say that the Red kind was fine and in some ways might even be better.
Again, if you know differently, please let me know, but that's what's in there now.
Dead easy to fit, much easier than, you know, faffing around with two-and-a-half-inch units
of old, especially in the cramped confines of laptops. Once I had the correct
screwdriver it was dead easy to do. Incidentally, it turns out that although the Torx T4
did open up the laptop, I actually found that T5 was a better fit, and that after a very
short time using the T4 it seems I'd ground off the corners of the sockets that you
put it into, which is not great. So I think T5, if you've got an ASUS ZenBook, it might be a T5
like me that you need. Anyway, I digress. So after I'd installed the new hard drive,
which was pretty straightforward, I didn't screw on the bottom panel, which has got
one, two, three, four, eight, it's got ten screws, quite fiddly, so I didn't screw them all back on.
I thought I'd just start up the laptop and see if I could see the new hard drive, even though
it had nothing on it, and yeah, I could, it was fine, that was great. So I turned it off again, turned
the laptop over and put in all the Torx screws. Then I turned the laptop the right way up,
preparing to format and partition my new two-terabyte drive, and nothing happened.
I pressed the power button: nothing. I don't know, I thought, well, maybe the battery's run out, so I plugged
in the power cable. The power LED didn't come on on the laptop, and I verified there was power
definitely coming to it from the cable. That power LED should indicate that it was red, charging,
or white, fully charged. It was neither. Nothing happened. There was no sign of life in the laptop
whatsoever, and I couldn't believe it. So I went and I unscrewed all ten of these Torx screws,
and this is when I started to discover that I was ruining the heads of them with the T4
screwdriver, and I sat and I looked to see what I might have disturbed inside the laptop. I
couldn't see anything. I was particularly looking around the hard drive, wondering, is there
a contact? And I discovered that when I put the bottom plate on it hadn't quite clipped into
place and the screws hadn't quite caught as well as they could have done, so I tightened them all up,
made sure it clicked: nothing. And then I went to the ASUS website, and on the
support page it said, if you're having trouble starting up your ASUS blah-blah-book, press
and hold the power button for 40 seconds. Okay, that's a bit weird, but I'll try that.
And lo and behold, after 40 seconds nothing seemed to change, but the next time I pressed the power button
the whole thing burst into life. LEDs came on and it started to boot, and I booted it up into the
live distro again, and that was it. So the next thing to do was, I just thought, I'll just
put the image of the old disk straight onto this new one. Of course it was only about
500 gigabytes and the new one is a two-terabyte drive, that's 2000 gigabytes or thereabouts,
but I could fix that up with the partitions later. I thought, I just want to see
if it would work and that, you know, my laptop would now be functional again. And certainly I could.
I wrote the image using dd. I mean, literally it was as simple as
dd, space, if equals sda.img, space, of equals /dev/sda, and then I did specify the block size,
space, bs equals 32 capital M. And that's the command that recreated it. It took about an hour,
hour and a half, five thousand seconds, I can tell you, I remember it was almost exactly five
thousand seconds, and it recreated it fine. Now, I should say that there was some debate about what
block size you should use, but I think it doesn't really matter how big it is as long as it's over
a few megabytes. That seemed to be optimum, and it seemed to depend on your hardware what the
optimum value was. But yeah, I would specify it, because I think the default
is 512 bytes, the size of a hardware block, and that's probably kind of small and slows things
down a lot. So I would recommend you up the block size. It does no harm if it's too big, anyway.
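Editor's note: one way to see why the block size matters is to count how many read calls a copy needs. The Python sketch below is an illustration of block-wise copying, not a reimplementation of dd, and it uses a 1 MiB in-memory buffer as a stand-in for the disk image.

```python
import io


def copy_in_blocks(src, dst, block_size=32 * 1024 * 1024):
    """Copy src to dst in fixed-size reads; return (bytes copied, read calls)."""
    total = 0
    reads = 0
    while True:
        chunk = src.read(block_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        reads += 1
    return total, reads


data = bytes(1024 * 1024)  # 1 MiB of zeros standing in for an image file
small = copy_in_blocks(io.BytesIO(data), io.BytesIO(), block_size=512)
large = copy_in_blocks(io.BytesIO(data), io.BytesIO(), block_size=32 * 1024 * 1024)
print(small)  # (1048576, 2048)
print(large)  # (1048576, 1)
```

Each read and write carries some fixed overhead, so 512-byte blocks mean thousands of times more calls for the same data. On the figures given in the episode, roughly 500 GB written in about 5000 seconds works out at around 100 MB/s.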
So after I'd done that, the first thing I did was I booted up the laptop into Windows. Now, I couldn't
test Slackware at this point because, as before, the UEFI firmware didn't see Slackware as
a boot option anymore, and it seemed to remember Ubuntu had once been installed there, and that was still
lurking about and I'd not taken it off. But the Windows partition was still there and should have
been recreated intact, and so I started it up, and Windows, it was just fine, no problem at all, just
like it had always been. So I then was able to go into the live distro and, well, there's an
interesting wrinkle here, in that I knew what I needed to do was use a command called,
I think it's called efibootmgr, EFI boot manager, and I need to tell it a label,
Slackware, and also, the key thing, where the EFI file is on the EFI boot partition that's needed
to boot Slackware. In this case I was using elilo, as it happened. And so,
you know, booted from the SystemRescue live distro, I could mount that
EFI boot partition and I could see where that file was, and it was intact and fine,
no trouble. So I was able to do that. Well, I tried and it wouldn't work, and it just said,
I can't remember exactly what it said, but I think it said something like there are no
EFI boot variables or something like that. Then I realized that there were two options when
booting into this live USB. One of them was to just boot the disk, you know, it was
called SanDisk, the brand of the USB stick, and that's what I'd chosen, and it turns out that
boots in a sort of MBR compatibility mode, so it doesn't think there's any UEFI going on.
If I booted it by the thing that said SanDisk partition one, then that was the UEFI way of doing
things, and then I could use efibootmgr. And then I did that and I was able to boot up
Slackware as it was. Running fsck was the first thing I did. Actually, no, sorry, I got
that wrong, that wasn't the first thing I did. I ran the fsck, of course, from the Linux
live distro, because you can't run it on a mounted partition, and it came back with no errors. So I did that
first and then I booted into Slackware and it worked fine, and I was back at my desktop exactly,
almost as if nothing had happened to my laptop. It was really quite strange. But of course
something had happened to it, and I still had several megabytes of zeros scattered around in files
all over the disk. I knew what they were, so I was good, I would deal with that later. I
repartitioned the disk. That's not a terribly interesting story. I just used GParted on the SystemRescue
live distro to do that and then copied data around, so that now, having two terabytes,
I could give a larger Windows partition on that disk and give most of the space over to Linux, and I have
a nice big home partition now, over a terabyte in size in fact. So to end the story,
the ending is happy, and in fact I'm recording this on the very laptop I'm talking about, using
Audacity, and everything is working just fine. So, happy ending. Thanks for listening. Bye bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out
how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the
Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have
comments on today's show, please email the host directly, leave a comment on the website or record
a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative
Commons Attribution-ShareAlike 3.0 license.