Kitz ADSL Broadband Information
Topic: Disc Crash Recovery?

WWWombat

  • Kitizen
  • ****
  • Posts: 1674
Disc Crash Recovery?
« on: March 09, 2016, 01:17:08 AM »

I've been trying to sort out a bit of a disk crash. There's a long story, where the recovery disk (System Rescue CD) made me fear for all my protected data (LVM on RAID 1), but that now doesn't appear to be the big issue I thought it was last week.

My biggest issue now becomes "only" the system disk - the one with the core Linux installation. That one, unfortunately, wasn't RAID, and now exhibits the "click of death". If I hadn't wasted time on chasing the non-problem of my RAID data, I probably could have saved it easily.

Any suggestions how to overcome the click of death?

I'd normally resort to the System Rescue CD, using partimage to copy the filesystems on each partition, or "ddrescue" if I'm getting read errors. However, I'm getting more fundamental problems here. I need to plug the disk in post-boot, with a USB IDE adapter, to even allow the CD boot to complete. The system identifies 5 of the 6 partitions on the disk, but otherwise won't read any data.

I'm somewhat stumped...

burakkucat

  • Respected
  • Senior Kitizen
  • *
  • Posts: 38300
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Disc Crash Recovery?
« Reply #1 on: March 09, 2016, 05:33:11 PM »

Hmm . . .

Would you be able to boot from a USB medium? If yes, then perhaps install a rescue ISO image onto an appropriately sized USB memory stick, boot it and then salvage the information from the OS disk?

Failing that, would you be able to send a byte-stream of the raw disk image, via any interface, to a suitable byte-bucket? (Serial V24, USB, Firewire or ?)

How about disconnecting both the data RAID and the dying system disk, installing a replacement for the system disk and performing a new OS installation? Once that is complete, reconnect the dying system disk and mount it read-only. Assuming success up to this point, then use any technique (dd conv=noerror) to salvage as much as possible.
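
For illustration, that last step might look something like the sketch below. The function name `salvage_image` and the example paths are made up. `conv=noerror` is the option mentioned above; adding `sync` is an assumption about what would be wanted here, since it pads any unreadable block with zeros so that later data stays at the right offset in the image.

```shell
#!/bin/sh
# Sketch only: salvage_image and the paths shown are hypothetical names.

# salvage_image SRC DEST
# Copy SRC to DEST in 512-byte blocks. conv=noerror keeps dd going after a
# read error; sync pads each short or failed block with zeros.
salvage_image() {
    dd if="$1" of="$2" bs=512 conv=noerror,sync
}

# Example (not run here): image the first partition of the old disk.
#   salvage_image /dev/sdX1 /mnt/rescue/sdX1.img
```

A small block size keeps the amount of data lost per read error low, at the cost of speed.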
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.


sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5369
Re: Disc Crash Recovery?
« Reply #2 on: March 09, 2016, 10:45:19 PM »

Ddrescue has worked miracles for me in the past. AFAIK it improves upon simple 'dd' recovery by narrowing down the damaged data to just the few sectors that can't be read, rather than the entire block. It sounds like here, though, maybe the partition table is corrupt?

If it were me, I'd look at a combination of...

1) ddrescue to 'recover' the entire raw device
2) testdisk to reconstruct the partition table from the 'recovered' drive

I'm not going to give any blow-by-blow instructions as it's not something I've ever tried, but web searches on these utilities might yield some inspiration.
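
For anyone wanting a starting point, a rough sketch of that combination follows. It is not tested against a failing drive; the function name `rescue_whole_disk` and the file names are invented for illustration. It assumes GNU ddrescue, whose `-n` option skips the slow scraping phase and whose `-r` option sets the number of retry passes.

```shell
#!/bin/sh
# Sketch only: rescue_whole_disk and the example file names are hypothetical.

# rescue_whole_disk DEV IMG LOG
rescue_whole_disk() {
    # Pass 1: grab everything that reads easily, skipping the scraping phase.
    ddrescue -n "$1" "$2" "$3" || return 1
    # Pass 2: go back over the remaining bad areas, retrying up to 3 times.
    ddrescue -r3 "$1" "$2" "$3"
}

# Example (not run here):
#   rescue_whole_disk /dev/sdX /mnt/rescue/sdX.img /mnt/rescue/sdX.log
# testdisk can then be pointed at the image instead of the dying disk:
#   testdisk /mnt/rescue/sdX.img
```

Working on the image means the partition-table search never touches the failing hardware again.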


WWWombat

  • Kitizen
  • ****
  • Posts: 1674
Re: Disc Crash Recovery?
« Reply #3 on: March 10, 2016, 02:40:29 PM »

Quote
Would you be able to boot from a USB medium? If yes, then perhaps install a rescue ISO image onto an appropriately sized USB memory stick, boot it and then salvage the information from the OS disk?

I'm going to try that. I'll see if I can get a running system that is as independent from the disk I/O system as possible. Unfortunately, I still really need a disk to write recovered data out to.

Currently, I've put the failing disk behind a USB-IDE adapter, so I can boot from CD correctly, and then see what happens when the failing disk is plugged in later.

Quote
How about disconnecting both the data RAID and the dying system disk, installing a replacement for the system disk and performing a new OS installation? Once that is complete, reconnect the dying system disk and mount it read-only. Assuming success up to this point, then use any technique (dd conv=noerror) to salvage as much as possible.

The failures that I'm seeing happen while the kernel is trying to identify the disk, and make the partitions available. It seems the drivers do some sanity checking (is the disk really that big?) that interferes with attempts to read the data by user-level programs. It appears my biggest issues come before I can attempt to manipulate the failing disk.

Let me explain ... Yesterday's attempts got somewhere with this, but only so far...

When I plug the failing disk in, the kernel messages show the USB device detection as normal, followed by the storage device identification too. It identifies the disk as a 160GB disk with 6 partitions (two within the extended partition; all old-school MBR format). The disk starts clicking away...

Unfortunately, things don't stay like that for long. After a few USB resets (tens of seconds apart), the identification happens again ... and again ... and again, a few times over. At some point, the messages start to complain that the partitions go off the end of the device (EOD), and the reported size gets smaller. Eventually, it settles on having 5 partitions and just 8GB, and stops repeating the identification process.

Before things have settled, basic commands (like "fdisk -l /dev/sda", or "ddrescue -n /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.log") often fail because the device has disappeared from under them.

However, once the kernel has settled on the 5-partition, 8GB device, user commands start to work.

In this state, partitions 1-3 are within the first 8GB; partition 4 is large enough to span the full (original-size) disk, and partition 5 curtails itself at 8GB (which I don't think is the proper endpoint of partition 5). No sign of partition 6.

While I was in this state yesterday, I managed to "ddrescue" the individual partitions for 1-3. I then asked it to rescue the whole of /dev/sda; this pulled the full 8GB off ... and then terminated because that was the size of the device. I couldn't get it to read further down the disk, even by specifying a size manually.

While rescuing this data, I had zero blocks with errors. The disc went through occasional phases of the head clicking away, but it didn't seem to affect extraction.

So ... my main requirement now is to get the disk through the identification process properly.

My intended trick, right now, is to try plugging it in at different physical orientations, with different sides facing upwards.

Later tricks might involve banging the drive physically in a certain direction, or freezing it. I'm not sure I want to go there just yet...

Quote
Ddrescue has worked miracles for me in the past. AFAIK it improves upon simple 'dd' recovery by narrowing down the damaged data to just the few sectors that can't be read, rather than the entire block. It sounds like here, though, maybe the partition table is corrupt?

I love ddrescue too. It is a shame, though, that I never seem to learn enough to never want to use it again!

In this case, I think something is going wrong with the head movement inside the disk, in a way that prevents it from reading the full disk. The Linux disk driver, or the disk itself, sanity-checks the last partitions out of existence.

I'll take a look at testdisk - it's not something I've seen before.

tickmike

  • Kitizen
  • ****
  • Posts: 3640
  • Yes Another Penguin !. :)
Re: Disc Crash Recovery?
« Reply #4 on: March 13, 2016, 11:55:59 PM »

On my home-built NAS I run the discs in parallel, with the data being written to two discs, so if one fails you still have a good copy on the other.

Have you taken the disc out? Try running it upside-down or on its side. I did this once for someone, recovered his data and copied it to a new disc.
« Last Edit: March 14, 2016, 12:00:55 AM by tickmike »
I have a set of 6 fixed IP's From  Eclipse  isp.BT ADSL2(G992.3) line>HG612 as a Modem, Bridge, WAN Not Bound to LAN1 or 2 + Also have FTTP (G.984) No One isp Fixed IP >Dual WAN pfSense (Hardware Firewall and routing).> Two WAN's, Ethernet LAN, DMZ LAN, Zyxel GS1100-24 Switch.

WWWombat

  • Kitizen
  • ****
  • Posts: 1674
Re: Disc Crash Recovery?
« Reply #5 on: March 14, 2016, 10:00:52 AM »

Quote
On my home-built NAS I run the discs in parallel, with the data being written to two discs, so if one fails you still have a good copy on the other.

Yeah - most of the data is in a RAID-1, using the Linux software drivers.

Quote
Have you taken the disc out? Try running it upside-down or on its side. I did this once for someone, recovered his data and copied it to a new disc.

I've been trying this too, and all sorts of angles. Sometimes, I do get some more data out. But there is nothing consistent about when or why it works.

I've now managed to pull something like 25GB off the disk. I'll report back a little later, when I've more time, with results of what I've tried so far...

WWWombat

  • Kitizen
  • ****
  • Posts: 1674
Re: Disc Crash Recovery?
« Reply #6 on: March 15, 2016, 12:51:49 PM »

A quick update. Hope this makes sense...

Failure Mode
The disk isn't failing in any kind of standard way. It doesn't simply return errors for some sectors while reading others successfully.

Instead, it appears to be failing to use the bus correctly - whether that is the ATA bus, or the USB bus when plugged in via an adapter. "dmesg" from the kernel is a regular procession of USB resets, odd USB responses, re-detection of the USB device followed by re-detection of the hard disk, and re-reporting of the disk-size and partitions. Along with the occasional "proper" failure to read the disk.

This behaviour means that the disk (eg /dev/sdc) and its partitions (eg /dev/sdc1-6) regularly disappear from the kernel, and then re-appear. There is no pattern - so it might do this after 30 seconds, or after 5 minutes. Or the device becomes unusable permanently, and has to be unplugged.

Sometimes, when it re-appeared, it reported a smaller size - sometimes by a few sectors, but sometimes down from the full 160GB to just 8GB.
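
A quick way to tell which state the disk has come back in is to ask the kernel for the size it currently reports. The sketch below is only an illustration: the function name `full_size_reported` and the byte threshold are assumptions, and it relies on `blockdev --getsize64` from util-linux.

```shell
#!/bin/sh
# Sketch only: full_size_reported and the threshold value are hypothetical.

# full_size_reported DEV MIN_BYTES
# Succeeds only if the kernel currently reports at least MIN_BYTES for DEV.
# Fails if the device has vanished or has shrunk (e.g. to the 8GB state).
full_size_reported() {
    size=$(blockdev --getsize64 "$1" 2>/dev/null) || return 1
    [ "$size" -ge "$2" ]
}

# Example (not run here): expect roughly 160GB from the failing disk.
#   full_size_reported /dev/sdc 160000000000 && echo "full size visible"
```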

All this time, the disk just keeps clicking, presumably as the heads hit some buffer. The noise might stop for a while, around times that the USB re-detection is going on, but otherwise it is pretty ever-present.

In the end, having the disk on an IDE cable is just inflexible. I can't power the disk down and start again. It also bogs down the boot process, with multiple processes trying to look for RAID arrays, etc. So the failing disk is best handled via USB, connected post-boot.

udevd
However, the disk seemed to be accessed a lot whenever it was re-announced by USB, and I eventually discovered that "udevd" was trying to run stuff - presumably requiring access to the disk device and partitions. A set of 7 processes were kicked off, which then sat there - presumably waiting for responses from the failing drive. This attempted access, whenever the drive was re-announced, really got in the way of my own rescue attempts. I had to stop the "systemd-udevd" service to avoid this behaviour. Now the post-attachment clicking is much reduced.
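
Stopping the service can be sketched as below. The function names are invented, and stopping the two udev sockets as well is an assumption on top of what is described above: on systemd systems those sockets can start the daemon again on demand.

```shell
#!/bin/sh
# Sketch only: pause_udev and resume_udev are hypothetical helper names.

# pause_udev: stop udevd, plus the sockets that could re-activate it.
pause_udev() {
    systemctl stop systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udevd.service
}

# resume_udev: put things back once the rescue session is over.
resume_udev() {
    systemctl start systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udevd.service
}
```

With udevd stopped, newly attached devices get no automatic handling from udev rules, so this only makes sense for the duration of the rescue.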

Rescue Setup
At one point, I tried performing the rescue through a normal installation of Fedora, but it attempted to auto-mount partitions whenever the disk announced itself. That was out, so I'm now booting via a copy of "System Rescue CD" installed on a USB stick (though a CD works too).

The rescued files are going onto a separate USB-storage device, but there was a problem there too. I discovered that writing my rescued files onto a USB-powered "WD My Passport" device wasn't very stable (lots of disk errors in the logs from that), so the rescued files now go onto USB storage that has a separate power supply.

ddrescue
Unfortunately, ddrescue is written under the assumption that communication with the disk controller is fine, and that the failures happen when the controller asks for things from the platters. It really doesn't work well when the disk controller can't be controlled.

In particular, it just stops when the underlying device (eg /dev/sdc) just disappears, though at least it does this gracefully.

But this subverts some of the behaviour of ddrescue: it won't move on to try new areas of the disk, in case they happen to be readable. It keeps trying the same spot, and the only alternative I have is to try "--reverse" to start at the other end of the disk.

Initial Progress
Problematic as ddrescue was, it still managed to read approximately 8GB from the start of the disk, and this worked best when the disk had self-reported the reduced size of 8GB. But it wouldn't do much when the size reported was the full 160GB.

Restricted Size
This gave me an idea ... as there is a jumper on the back of IDE disks that allows you to restrict the size. I did this, and got a disk reported to be 32GB instead. This was a little more stable, in that the clicking subsided faster.

This allowed me to get an extra 4GB off the disk quite readily, so now up to 12GB, but I needed new tactics...

Manual Control of ddrescue
Next I decided to figure out how 'ddrescue' worked more intimately. I couldn't figure any way to get it to behave automatically, but I figured some manual controls to get it to return one section at a time, by using the "-i" and "-s" options.

So I started using ddrescue with (a lot of) commands like this, where /dev/sdc is the failing disk:
Code:
$ ddrescue -vv -i0GiB -s1GiB /dev/sdc rescued.img rescued.log
$ ddrescue -vv -i1GiB -s1GiB /dev/sdc rescued.img rescued.log
$ ddrescue -vv -i2GiB -s1GiB /dev/sdc rescued.img rescued.log
$ ddrescue -vv -i3GiB -s1GiB /dev/sdc rescued.img rescued.log
...

Doing this got me around 20GB of the first 32GB on the "restricted disk".
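
Typing those out by hand gets tedious, so the same series can be generated with a loop. This is only a sketch: `rescue_chunks` is a made-up name, and it assumes the failing disk is passed as the input file (/dev/sdc being the example name used earlier) ahead of the image and log files.

```shell
#!/bin/sh
# Sketch only: rescue_chunks is a hypothetical helper name.

# rescue_chunks DEV IMG LOG FIRST LAST
# Run ddrescue over 1GiB sections FIRST..LAST (inclusive), one at a time,
# stopping at the first section where ddrescue reports failure.
rescue_chunks() {
    dev=$1 img=$2 log=$3 n=$4 last=$5
    while [ "$n" -le "$last" ]; do
        ddrescue -vv -i"${n}GiB" -s1GiB "$dev" "$img" "$log" || return 1
        n=$((n + 1))
    done
}

# Example (not run here): sections 0 to 31 of the jumper-restricted disk.
#   rescue_chunks /dev/sdc rescued.img rescued.log 0 31
```

Stopping at the first failure is deliberate: once the device has vanished, every later section would fail too.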

I then removed the jumper, and put the disk back to a full 160GB.

Process
Having found the "udevd" issue in the middle of this, and turned off the automatic access associated with it, I've made things a little more stable in the recovery process.

I won't know precisely when the disk will re-discover the ability to talk, but I now follow a sequence where I use a "dummy" ddrescue command to attempt to re-read a section that has previously been recovered successfully. If this command fails, I'll wait a while, or try plugging the disk in again. If it succeeds all the way through, I'll then issue a new ddrescue command for one of the untouched sections.
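
The probe-then-rescue sequence above could be scripted roughly as follows. This is a sketch, not the exact procedure: the names `probe` and `rescue_when_ready`, and the `WAIT_SECS` variable, are invented, and a plain `dd` read of the start of the disk is swapped in for the "dummy" ddrescue command.

```shell
#!/bin/sh
# Sketch only: probe, rescue_when_ready and WAIT_SECS are hypothetical names.

# probe DEV: try to read the first 16MiB of the device, an area assumed to
# be readable whenever the disk is talking at all. (dd stands in here for
# the "dummy" ddrescue command described above.)
probe() {
    dd if="$1" of=/dev/null bs=1048576 count=16 2>/dev/null
}

# rescue_when_ready DEV IMG LOG OFFSET_GIB [TRIES]
# Probe up to TRIES times (default 10), WAIT_SECS apart (default 30); as
# soon as a probe succeeds, rescue the 1GiB section starting at OFFSET_GIB.
rescue_when_ready() {
    dev=$1 img=$2 log=$3 offset=$4 tries=${5:-10}
    while [ "$tries" -gt 0 ]; do
        if probe "$dev"; then
            ddrescue -vv -i"${offset}GiB" -s1GiB "$dev" "$img" "$log"
            return $?
        fi
        tries=$((tries - 1))
        sleep "${WAIT_SECS:-30}"
    done
    return 1
}
```

Replugging the disk still has to be done by hand between attempts. The probe may also be answered from the kernel's cache rather than the disk, so a success is a hint, not a guarantee.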

Often the disk will not want to behave, and will keep reporting faults. Yet suddenly, it will give me a batch of up to 4GB at a time, before becoming sullen again.

I reckon I've now reached around 70GB recovered.

WWWombat

  • Kitizen
  • ****
  • Posts: 1674
Re: Disc Crash Recovery?
« Reply #7 on: March 21, 2016, 09:00:08 AM »

A proper quick update.

The 70GB recovered before has become 112GB, but will go no higher.

Now onto file system recovery. Two were easy, one needs to be played with. However, the last one is an LVM partition with a number of file systems. That isn't recognised, and I'm about to start trying to search and edit metadata directly on the copy of the disk.
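
As a first step in that search, the start of an LVM physical volume can be hunted for in the image. The sketch below is an assumption about how one might go about it, not what was actually done: `find_lvm_labels` is a made-up name, it needs GNU grep, and it relies on LVM2 writing the ASCII label "LABELONE" near the start of each physical volume (normally in its second 512-byte sector).

```shell
#!/bin/sh
# Sketch only: find_lvm_labels is a hypothetical helper name.

# find_lvm_labels IMG
# Print the byte offset of every "LABELONE" string found in the image.
# -a treats the binary image as text, -b prints byte offsets, and -o makes
# each offset refer to the match itself rather than the enclosing line.
find_lvm_labels() {
    grep -a -b -o 'LABELONE' "$1" | cut -d: -f1
}

# Example (not run here):
#   find_lvm_labels rescued.img
```

Subtracting 512 from a reported offset gives a candidate start for the physical volume, which can then be checked against the partition table.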