Episode: 3428
Title: HPR3428: Bad disk rescue
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3428/hpr3428.mp3
Transcribed: 2025-10-24 23:13:28

---

This is Hacker Public Radio Episode 3428 for Wednesday, the 22nd of September 2021. Today's show is entitled, Bad Disk Rescue. It is hosted by Andrew Conway and is about 30 minutes long and carries a clean flag. The summary is: Bad disk rescue, tragedy or happy ending?

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello HPR folks, it's mcnalu here, also known as Andrew. I want to share with you a story. Well, it's a sad story with a, well, I won't tell you how it ends, but maybe you can guess. What happened was, a few years ago, as my main workhorse laptop, which is literally the one I would use for my work, I bought an ASUS ZenBook UX550VD. I think I got that right. It's a pretty modern laptop and it wouldn't run my favourite distribution at the time, Slackware 14.2, but I discovered it would run an Ubuntu that was out at the time and also Slackware current. So why not? In the end I went with Slackware current. It also came with Windows on it. Now it's a nice laptop: it's got half a terabyte of SSD, 8GB RAM, some super fast processor, and it's actually got quite a decent on-board graphics chip. I think it's a 1060 Ti mobile. So it's actually quite a beefy laptop. However, I've never been quite satisfied with it. I've always felt there's something a little bit wrong with it, and in particular I found that when it was shutting down it used to get stuck and I'd have to do a hard power off, a long 5 second press on the power button. Which I don't like doing, because even though the hard disk indicator light is mostly out, I'm never that sure that's a good idea for the hard drive. And so when I did see some hard drive errors I wasn't too worried. I thought, well, that's probably because it was doing something when I shut it down, but it was on its way down anyway, so I never lost any work. I just noticed that, yeah, there was a hard disk error, and it was easily fixed too with the old fsck, but honestly it was every few months this would trouble me.

And now, just over a week ago, I was working away doing something fairly routine, and it was work. So I was using one of my virtual machines, and it's an old Windows 7 virtual machine that I use as a sort of test bed. I mean, Windows 7 is at end of life now, but frankly people that use software I support don't take note of that, so yeah, I still keep that around. Now this Windows 7 has been fairly stable, I mean pretty good actually,
but on this occasion, when I was doing something fairly routine, it just froze. I think I was doing something with the browser inside it and it just froze. I don't know, that's just the way, well, it's end of life. Now I also have a Windows 10 machine as a backup which I can work on, so I thought, well yeah, I'll just move over to the Windows 10 thing to do my Windows tasks, and I ran into trouble. In fact it wouldn't even boot up the virtual machine. So that's odd. And then not long after that, something I was doing in the host operating system, which as I say is Slackware current, which is pretty stable, to be honest, for a sort of rolling test bed release. I mean, it's actually pretty close to being released as Slackware 15 at the moment, but even when it's going through the flux of change, crashes are rare, and I hadn't updated it recently, so I didn't think that was the cause. So I took a closer look, and indeed fsck did find some, maybe more errors than usual. So I thought, okay, well, I've got a question mark over the health of this hard drive.

So I ran smartctl, and it did find, well, it's an odd utility. I didn't realise this, I haven't used it that much before, but the first thing to say about it, if you've never used it before, is that it returns immediately. You type smartctl and all the options you want and tell it to do a test, I told it to do a short test first, and actually what it does is it returns immediately, but it's actually doing the test in the background and doesn't notify you when it's done. So that's one tip with smartctl. It's a good little utility, but the documentation behind it doesn't tell you, in my opinion, that important fact about it. Anyway, once I got used to its rather idiosyncratic way of behaving, I found that it said that my hard disk was healthy and all indicators, and Crucial was the manufacturer, all the indicators that Crucial had provided were within threshold, that is
they were fine. Now actually I noticed that the values it was coming back with were like 100 or nearly 100 and the threshold was zero, and I looked at what that meant, and it's when the value goes below the threshold that you've got a problem. Well, the threshold is zero and it can't go negative, so I basically thought that most of the diagnostics that Crucial have provided maybe meet the spec in theory but are actually useless in practice. So I didn't find smartctl told me anything that useful, but I did notice it said that it had a few read errors on the blocks.
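
The threshold rule described here can be sketched in a few lines of Python. This is only an illustration, not smartctl's own logic: the attribute names and numbers are invented, and it assumes the usual convention that an attribute counts as failed when its normalized value is at or below its threshold.

```python
# Illustrative sketch of the SMART "value versus threshold" rule.
# Attribute names and numbers are invented, not taken from the real drive.

def attribute_failed(value: int, threshold: int) -> bool:
    """Assumed convention: an attribute has failed when value <= threshold."""
    return value <= threshold

# (name, normalized value, threshold)
attributes = [
    ("Hypothetical_Wear_Indicator", 100, 0),
    ("Hypothetical_Error_Rate", 98, 0),
    ("Hypothetical_Spare_Blocks", 20, 36),
]

failed = [name for name, value, thresh in attributes
          if attribute_failed(value, thresh)]
# With a threshold of 0 the check can only trip if the value itself drops to 0.
zero_threshold = [name for name, _, thresh in attributes if thresh == 0]

print("failed:", failed)
print("threshold of zero:", zero_threshold)
```

With a threshold of zero the comparison almost never trips, which matches the impression in the talk that those indicators say very little in practice.
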
Again, this isn't that unusual. If you go back to spinning hard drives, I believe you can get a bad sector, which I think is a 512 byte area of the hard drive, and if it encounters a bad sector it doesn't even tell the operating system. I think what it does is it just marks that sector as problematic and uses a spare sector, which ordinarily it won't use from the get go, to take its place. Now obviously you've only got so many spare sectors, and I now noticed that I had several hundred of them. Now several, like 433 blocks of 512 bytes, you know, it's 200k, it's not an awful lot of the hard drive, so I'm not overly worried at this point. And reading online, some people would say, oh, this is a disaster, unplug your hard drive and image the disk immediately, but I think that was mainly in reference to older spinning disks, and other people say, well, you know, this kind of stuff happens, especially if you've had to do some hard power downs or had power cuts with a desktop machine or whatever. So I honestly did not find that in itself terribly, dramatically bad, but I was still concerned because I didn't understand where I stood.
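
The "200k" figure is easy to check. The sketch below uses only the numbers quoted in the talk (433 blocks of 512 bytes each, on a disk of roughly half a terabyte); treating the disk as a round 500 GB is an assumption made for the comparison.

```python
SECTOR_BYTES = 512
bad_sectors = 433
disk_bytes = 500 * 10**9   # assumed round figure for a "half terabyte" SSD

bad_bytes = bad_sectors * SECTOR_BYTES
print(bad_bytes)                      # 221696
print(bad_bytes / 1024)               # 216.5 (KiB)
print(f"share of disk: {bad_bytes / disk_bytes:.7%}")
```

So the reallocated sectors amount to a little over 200 KiB, a tiny share of the whole disk.
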
And then I'm still thinking, well, that's two virtual machines that have gone and one other unexplained incident with the disk, so what other tests can I do? So I found this utility called badblocks. Now I actually don't know the details of how badblocks works, but I knew that it would potentially find more errors than the other methods because it would go around looking for them, whereas smartctl, I don't know quite how it works, but I didn't quite trust it was doing as thorough a test as I'd like, even with a long test. Anyway, badblocks found loads of problems. In fact, when it got past the 433 bad blocks, which is what smartctl had told me, I basically at that point decided to shut down my laptop immediately, take an image of the hard drive and declare the drive inside it as on its way to death.

Now I should say the language around disks continues to perplex me. Blocks has two meanings. It has one at the hardware level, which means 512 bytes on the device itself, and it also can refer to the block size of the filesystem, which in the case of what I was using, in my ext4 Linux system, I think the default I was using was 4096, or 4K, blocks. So that's the first thing that confused me in all of this, and the second thing that confused me is it was talking about clusters, but I don't think clusters come into it anymore, or perhaps they're just a Windows thing, I don't know. But there's so much jargon and it's not clear, especially with blocks, what you're talking about. Anyway, suffice to say I knew something was wrong, although at that point I couldn't quantify it.

So what did I do? Well, I went and got a distro called SystemRescue after doing a bit of research, and put that on a USB stick and booted my laptop from it. Oh, and the other thing I noticed at this point is, after I ran badblocks, the first time I rebooted, my hard drive's Linux partition was no longer reported as a bootable option. I don't know why. It was the UEFI firmware that declared this. I didn't remove it in any form. I don't know what caused it to disappear. I still don't know, actually,
but again, another serious indication that something was badly wrong with the disk. Anyway, so I'm booting now from the live system rescue distro, it's actually called SystemRescue, and I'd done some research before I even booted it up that way, and dd-ing the disk to image it I didn't think was good enough, so I went for ddrescue. I did some research on that and I decided to pretty much run it with default parameters. In other words, it wouldn't try and read problematic blocks too many times. What it would do is just sort of try once and then move on, and I could go back and try again if I wanted to later, but I felt at that point I just needed to get as much data off that disk as possible.

Now when ddrescue finished, it took, I can't remember exactly how long it took, maybe an hour or two, it wasn't that slow, and it does a number of passes, I should say. Maybe it was longer than that, I can't remember exactly the timing. But when it finished it reported that 99.99% of my data was safe. Now I think that's actually the most that it can report. It did give me an exact number too, but even less than 0.01 percent errors across all the data on the disk is still many megabytes. Now many megabytes is small beer compared to five hundred or so gigabytes, which is what the disk could hold, and it was pretty full, it was like eighty, ninety percent full. So am I worried about a few megabytes? Well, chances are I'm okay, but what happens if one of those megabytes was inside some critical file in the system, in which case it might not boot? You know, the kernel, obviously, it would not be good if a small section of it was effectively zeroed. And also there might be personal files, you know, a little photograph or, I don't know, a video, or some important PDF document, receipts or something to do with, say, the house purchase or whatever. I felt like I needed to know where those errors were.

Now the way to do this is, ddrescue tells you exactly where the problems were on the disk. It generates what's called a map file, which is an excellent thing. It's plain text, readable. It looks like gobbledygook when you first read it, but it doesn't take long with the manual in hand to decode what it's telling you, and it's really telling you in bytes where, sorry, I think it does blocks, sorry, not bytes, it tells you in blocks where errors are on your disk. And when I went through, I could see that there weren't big ranges of blocks that were identified as being bad, where ddrescue couldn't read from, but there were scattered little chunks all through the file. So there was like maybe, you know, I think, if I remember correctly, they were all quite small, so maybe a few blocks together were bad. So it wasn't like a huge range that was wiped out, but there were small elements dotted around across the disk that were bad and couldn't be read.
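
Decoding a map file by hand can also be done with a short script. The sketch below assumes the layout described in the GNU ddrescue manual: comment lines starting with `#`, one status line, then one line per region giving position, size and a status character, with `-` marking a region that could not be read. The sample data is invented, and the unit of the numbers is worth confirming against the manual, which appears to describe them as byte counts written in hexadecimal.

```python
# Sketch: list the unreadable regions recorded in a ddrescue map file.
# SAMPLE_MAP is invented for illustration; it is not the map from the talk.

SAMPLE_MAP = """\
# Mapfile. Created by GNU ddrescue (sample data)
# current_pos  current_status  current_pass
0x00300000     +               1
#      pos        size  status
0x00000000  0x00100000  +
0x00100000  0x00000800  -
0x00100800  0x001FF800  +
0x00300000  0x00001000  -
"""

def bad_regions(map_text):
    """Return (start, size) pairs, in the map's own units, for '-' regions."""
    rows = [line.split() for line in map_text.splitlines()
            if line.strip() and not line.startswith("#")]
    regions = []
    for fields in rows[1:]:            # rows[0] is the status line
        pos, size, status = fields[:3]
        if status == "-":
            regions.append((int(pos, 16), int(size, 16)))
    return regions

regions = bad_regions(SAMPLE_MAP)
for start, size in regions:
    print(f"bad region: start={start} size={size}")
print("total unreadable:", sum(size for _, size in regions))
```

On a real map file the same function would be fed the file's contents, and the resulting list is the starting point for working out which files overlap the damaged areas.
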
So with that in mind, I could more or less tell, I could work out to the byte where problems would lie, and where the files were there, in my image the data for those files was replaced by zeros. But which files? Well, the solution to this, after a bit of research, turns out to be that you can take the image of my disk, and my image is just called sda.img, and you can create a loop device using the command losetup, and you give it the -o option and then you specify the offset in bytes where your partition starts inside that image file, and you can tell it that you want it to appear as one of the loop devices. The live distro I was using used /dev/loop0, so I went for /dev/loop1, and with that, using the mount command on that loop device, I was able to mount that partition from the image file. And then using something called debugfs, you just start it up, you can run it in interactive mode, I could enter, I think it was the icheck command, and if I give it, I think, bytes, I think it was, that would then tell me which inode number was present at that point in the image. And it might not actually be an inode number, because it might not be used, it might be an empty bit of the disk or used for something else. So if it did give me an inode number, then I could from the inode number pretty quickly look up what the file name was using, I think it was called ncheck, name check, in debugfs. And I was able to do this manually, doing some manual calculations of what bytes were in what file. So after I got the hang of this, I found a file, I was able to look at it, it was clearly a text file but the operating system saw it as being a binary file, and less wouldn't display anything, nor would cat. So I could see it was correct, I was finding problematic files in the file system. And then I thought, well, this is going to take a while because there's quite a lot, you know, although it's only a few megabytes, they're dotted about all over the disk. So I just wrote a Python script that would generate lists of commands that could be fed into debugfs, and debugfs would then spew out information which was sucked in by another Python script which took inode numbers and then spat out all the file names.

Now to my utter astonishment it took several hours to execute this process. In fact the second step, no, the first step, I left it running overnight, so I don't really know how long it took, but it took many hours. And this may seem surprising, it certainly surprised me, but of course it's all optimized to work in the other direction. You know, when you type ls on the command line, you type ls file name or ls path, you don't type ls byte or whatever. Obviously everything is constructed to be optimized in the other direction. Nevertheless it seems to be ludicrously slow. If there's a faster way of doing it, please let me know in the comments, or better still do a show for HPR folks on how to quickly look up file names from byte positions in a partition. Anyway, I got what I wanted: a list of files that were problematic.
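
The command-generating step might look something like the sketch below. It is not the script from the talk: the partition offset, the bad ranges and the batch size are placeholders. It assumes, as the debugfs man page describes, that `icheck` takes filesystem block numbers and that both `icheck` and `ncheck` accept several arguments in one command. Batching many blocks into one command may reduce the number of scans debugfs has to make, though that has not been timed here.

```python
# Sketch: turn bad byte ranges in a whole-disk image into debugfs commands.
# All numbers are placeholders; the real partition offset and block size
# have to be read from the partition table and the filesystem.

PART_OFFSET = 1048576   # hypothetical: partition starts 1 MiB into the image
FS_BLOCK = 4096         # ext4 block size mentioned in the talk

def fs_blocks(bad_ranges, part_offset=PART_OFFSET, block=FS_BLOCK):
    """Filesystem block numbers touched by (start, size) byte ranges."""
    blocks = set()
    for start, size in bad_ranges:
        first = (start - part_offset) // block
        last = (start + size - 1 - part_offset) // block
        blocks.update(range(max(first, 0), last + 1))
    return sorted(blocks)

def icheck_commands(blocks, batch=500):
    """One 'icheck' line per batch of block numbers."""
    return ["icheck " + " ".join(map(str, blocks[i:i + batch]))
            for i in range(0, len(blocks), batch)]

def ncheck_command(inodes):
    """A single 'ncheck' line covering every distinct inode number."""
    return "ncheck " + " ".join(map(str, sorted(set(inodes))))

bad = [(PART_OFFSET + 8192, 2048), (PART_OFFSET + 40960 + 3000, 4096)]
blocks = fs_blocks(bad)
print(blocks)
print(icheck_commands(blocks))
print(ncheck_command([12, 12, 7]))
```

The generated lines could be saved to a file and passed to debugfs with its `-f` option, with the loop device as the filesystem argument. The inode numbers shown for `ncheck` are made up; in practice they would come from the `icheck` output.
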
And unsurprisingly, the bigger the file was, the more likely it was to suffer a problem. So my VDI files for both my Windows virtual machines were affected. Also some ISO files were rendered useless because they had a bunch of zeros in the middle of them now, and I was able to go around and remove the problematic files or mark them for salvage. Now, as it turns out, I don't think I have any serious data loss. Of the only things I really cared about, the ISOs were disposable, of course, I can download them again if I need them, but the VDIs of course contained the hard disks of my virtual machines. So before I'd realized what was wrong, I'd been trying to use Windows to repair its own file system. Of course that didn't work while the VDI files were living in a host OS with a drive that was suffering from read errors. But now, if I could put the VDI files on a good disk, I could potentially recover them. And indeed, I put the Windows 7 virtual machine on another PC of mine, running Windows actually, and I started it up and it actually booted fine. I'm a bit puzzled as to what happened, actually, because it just seemed to start up fine. I did a check disk, actually I think I did it by right clicking on the C drive and going to Properties, Tools, and doing the check that way, and it came back with a few minor problems and it corrected them. It didn't seem to think there was much of a big deal, and it was working again, so
that was good. So the next task was to see if I could get my Linux partition back onto a good disk and renew it as easily as that one Windows virtual machine. Now there's another story here, in that I checked online and discovered that my laptop, well, what the website that offered hard drives said was that it could take either an NVMe disk or a SATA 3 disk, and I suspected, looking at the specs, although I didn't know for sure, that I had a SATA 3 disk. Actually I should have been able to check in two ways. One is with the running laptop, but I wasn't going to start that up again, you know, unless I wanted to scrape some more data off it, so I couldn't do it that way, and I didn't remember whether it was NVMe or SATA 3. Now some of you might be laughing at that, but it just never occurred to me to take a note of it before. The second way, a much easier way with a laptop that you don't want to start up, would be to unscrew the bottom of the case and look at the hard drive that I wanted to change before I ordered it, but I didn't do that because it required, according to a video I saw online, a Torx screwdriver, a T4, which is really rather tiny, and I didn't have one. I only had a T6. So I had to order that and I had to order the disk, and so I plumped for an NVMe disk because that's faster, and I suspected that there would be two connectors for two hard drives on the board. The screwdriver arrived at about the same time as the hard drive, it was next day actually, and I opened up the laptop and discovered that there was only one socket to put in an SSD, and it was the SATA 3 kind, which is what my current disk was. So it was a bit dumb of me, I think I could have done that better, but it actually didn't bother me, because I just went out and bought another drive, this time SATA 3, and for the NVMe two-terabyte drive, which wasn't of great use as I couldn't use it in the laptop, I bought a little caddy, because it's quite a handy thing to have, a super fast slim hard drive. Because of course these SSDs are like, you know, it's like a tiny pencil case compared to the old caddies, which were about the size of floppy drives, well, if you remember how big external floppy drives were.

So eventually I got this new two-terabyte SATA 3 SSD. It was actually, I think, a WD, Western Digital, Red disk, which is really made for NAS drives, but they didn't have the Blue kind, and I've read people say that the Red kind was fine and in some ways might even be better. Again, if you know differently, please let me know, but that's what's in there now.
Dead easy to fit, much easier than, you know, faffing around with two and a half inch units of old, especially in the cramped confines of laptops. Once I'd had the correct screwdriver it was dead easy to do. Incidentally, it turns out that although the Torx T4 screwdriver did open up the laptop, I actually found that T5 was a better fit, and that after a very short time using the T4 it seems I'd ground off the corners of the sockets that you put it into, which is not great. So I think, if you've got an ASUS ZenBook, it might be a T5 like me that you need. Anyway, I digress. So after I'd installed the new hard drive, which was pretty straightforward, I didn't screw on the bottom panel, which has got one, two, three, four, eight, it's got ten screws, quite fiddly, so I didn't screw them all back on. I thought I'd just start up the laptop and see if I could see the new hard drive, even though it had nothing on it, and yeah, I could, it was fine. That was great. So I turned it off again, turned the laptop over and put in all the Torx screws. Then I turned the laptop the right way up, preparing to format and partition my new two terabyte drive, and nothing happened. I pressed the power button, nothing. I thought, well, maybe the battery's run out, so I plugged in the power cable. The power LED didn't come on on the laptop, and I verified there was power definitely coming to it from the cable. That power LED should indicate that it was red, charging, or white, fully charged. It was neither. Nothing happened. There was no sign of life in the laptop whatsoever, and I couldn't believe it. So I went and I unscrewed all ten of these Torx screws, and this is when I started to discover that I was ruining the heads of them with the T4 screwdriver, and I sat and I looked to see what I might have disturbed inside the laptop. I couldn't see anything. I was particularly looking around the hard drive, wondering, is there a contact? And I discovered that when I put the bottom plate on it hadn't quite clipped into place and the screws hadn't quite caught as well as they could have done, so I tightened them all up, made sure it clicked. Nothing. And then I went to the ASUS website, and on a support page it said, if you're having trouble starting up your ASUS blah blah book, press and hold the power button for 40 seconds. Okay, that's a bit weird, but I'll try that. And lo and behold, after 40 seconds nothing seemed to change, but the next time I pressed the power button
the whole thing burst into life. LEDs came on and it started to boot, and I booted it up into the live distro again, and that was it. So the next thing to do was, I thought I'd just put the image of the old disk straight onto this new one. Of course the old one was only about 500 gigabytes and this new two-terabyte drive is 2000 gigabytes or thereabouts, but I could fix that by fiddling with the partitions later. I just wanted to see if it would work and that my laptop would now be functional again, and certainly I could. I wrote the image using dd. I mean literally it was as simple as dd if=sda.img of=/dev/sda, and then I did status=progress bs=32M, and that's the command that I ran. Recreating it took about an hour, hour and a half, five thousand seconds, I can tell you, I remember it was almost exactly five thousand seconds, and it recreated it fine. Now I should say that there was some debate about what block size you should use, but I think it doesn't really matter how big it is as long as it's over a few megabytes. That seemed to be optimum, and it seemed to depend on your hardware what the optimum value was. But yeah, I would specify it, because I think the default is 512 bytes, the size of a hardware block, and that's probably kind of small and slows things down a lot. So I would recommend you up the block size. It does no harm if it's too big.
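
To see why the block size matters, here is a minimal Python sketch of a dd-style copy. It is only an illustration on ordinary temporary files, not a substitute for dd, and the file names and sizes are invented. The point it shows is that the same data needs far fewer read and write calls when the block is larger.

```python
# Sketch: copy a file in fixed-size blocks and count the reads needed.
import os
import tempfile

def copy_in_blocks(src_path, dst_path, block_size):
    """Copy src to dst in chunks of block_size bytes; return the read count."""
    reads = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:              # empty read means end of file
                break
            dst.write(chunk)
            reads += 1
    return reads

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "image.bin")      # stand-in for a disk image
    with open(src, "wb") as f:
        f.write(os.urandom(1024 * 1024))      # 1 MiB of test data
    small = copy_in_blocks(src, os.path.join(tmp, "a.bin"), 512)
    large = copy_in_blocks(src, os.path.join(tmp, "b.bin"), 256 * 1024)
    print(small, large)   # 2048 4
```

As a rough cross-check of the timing quoted in the talk, about 500 gigabytes written in about 5000 seconds works out at roughly 100 megabytes per second.
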
So after I'd done that, the first thing I did was I booted up the laptop in Windows. Now I couldn't test Slackware at this point because, as before, the UEFI firmware didn't see Slackware as a boot option anymore, and it seemed to remember Ubuntu had once been installed there, and that was still lurking about as I'd not taken it off. But the Windows partition was still there and should have been recreated intact, and so I started it up, and Windows, it was just fine, no problem at all, just like it had always been. So I then was able to go into the live distro, and, well, there's an interesting wrinkle here, in that I knew what I needed to do was use a command called, I think it's called efibootmgr, EFI boot manager, and I need to tell it, give it a label, Slackware, and also the key thing is where the EFI file is on the EFI boot partition that's needed to boot Slackware. In this case I was using elilo, as it happened. And so, booted from the SystemRescue live distro, I could mount that EFI boot partition and I could see where that file was, and it was intact and fine, no trouble. So I was able to do that. Well, I tried and it wouldn't work, and it just said, I can't remember exactly what it said, but I think it said something like there are no EFI boot variables or something like that. And then I realized that there were two options when booting into this live USB. One of them was to just boot the disk, you know, it just said boot, it was called SanDisk, the brand of the USB stick, and that's what I'd chosen, and it turns out that boots in a sort of MBR compatibility mode, so it doesn't think there's any UEFI going on. If I booted it by the thing that said SanDisk partition one, then that was the UEFI way of doing things, and then I could use efibootmgr. And then I did that and I was able to boot up Slackware as it was. Running fsck was the first thing I did, actually, no, sorry, I got that wrong, that wasn't the first thing I did. I ran fsck, of course, from the Linux live distro, because you can't run it on a mounted partition, and it came back with no errors. So I did that first, and then I booted into Slackware and it worked fine, and I was back at my desktop almost exactly as if nothing had happened to my laptop. It was really quite strange. But of course something had happened to it, and I still had several megabytes of zeros scattered in files all over the disk. I knew what they were, so I was good, I would deal with that later. I repartitioned the disk, that's not a terribly interesting story, I just used GParted on the SystemRescue live distro to do that, and then copied data around so that, having two terabytes, I've got a large Windows partition on that disk and gave most of the space over to Linux, and I have a nice big /home partition now, over a terabyte in size in fact.

So to end the story, the ending is happy, and in fact I'm recording this on the very laptop I'm talking about, using Audacity, and everything is working just fine. So, happy ending. Thanks for listening. Bye bye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.