
Khairun
Offline
Resolved
1 vote
Hello,

I'm one of your happy ClearOS Enterprise customers and I have had ClearOS 6.6 running on our systems for the last 7 years. Now I have found symptoms that one of my disk drives is about to die.
The server always tries to check the filesystems after a reboot, and most of the time this takes hours.
Today I had this again; it got stuck at 30% and I had to hard reboot.
Any ideas how to solve this? I don't know how to get into the console to do several mdadm tasks.
Friday, February 19 2016, 02:08 AM
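For anyone landing here with the same question, the usual first diagnostic steps from a rescue console look something like this. This is only a sketch: the device names (/dev/md0, /dev/sda) are examples, not taken from this system, and the function prints the commands rather than running them, so it is safe anywhere; pipe its output to "sh" as root on the real machine to execute them.

```shell
# Sketch: first diagnostic commands from a rescue console.
# Device names (/dev/md0, /dev/sda) are examples, not from this system.
# The function only prints the commands, so it is safe to run anywhere.
raid_health_checks() {
    disk=${1:-/dev/sda}
    printf '%s\n' \
        "cat /proc/mdstat" \
        "mdadm --detail /dev/md0" \
        "smartctl -a $disk"
}
raid_health_checks /dev/sda
```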
Responses (12)
  • Accepted Answer

    Sunday, February 21 2016, 01:54 AM - #Permalink
    Resolved
    0 votes
Run the SMART test on /dev/sdb - what condition is that disk in?
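If it helps, starting a long self-test and reading back the results looks like this (a sketch: the commands are printed rather than executed, so nothing runs by accident; execute them as root on the real machine).

```shell
# Sketch: start a long SMART self-test on a disk, then read the results.
# Commands are printed, not executed; run them as root on the real machine.
smart_test_cmds() {
    disk=$1
    printf '%s\n' \
        "smartctl -t long $disk" \
        "smartctl -a $disk"
}
smart_test_cmds /dev/sdb
```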
  • Accepted Answer

Khairun
    Offline
    Saturday, February 20 2016, 11:34 AM - #Permalink
    Resolved
    0 votes
    Hi,

Here's the result of the SMART test on the disk. I believe there are a lot of problems.
Any idea what to do next?

    Thanks for any help on this.

    root@sysresccd /root % smartctl --all /dev/sda  
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-3.18.25-std471-amd64] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family: Seagate Barracuda 7200.12
    Device Model: ST3250318AS
    Serial Number: 5VY2V6RG
    LU WWN Device Id: 5 000c50 021722b96
    Firmware Version: CC38
    User Capacity: 250,059,350,016 bytes [250 GB]
    Sector Size: 512 bytes logical/physical
    Rotation Rate: 7200 rpm
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: ATA8-ACS T13/1699-D revision 4
    SATA Version is: SATA 2.6, 3.0 Gb/s
    Local Time is: Sat Feb 20 19:30:11 2016 UTC

    ==> WARNING: A firmware update for this drive may be available,
    see the following Seagate web pages:
    http://knowledge.seagate.com/articles/en_US/FAQ/207931en
    http://knowledge.seagate.com/articles/en_US/FAQ/213891en

    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x82) Offline data collection activity
    was completed without error.
    Auto Offline Data Collection: Enabled.
    Self-test execution status: ( 121) The previous self-test completed having
    the read element of the test failed.
    Total time to complete Offline
    data collection: ( 600) seconds.
    Offline data collection
    capabilities: (0x7b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 1) minutes.
    Extended self-test routine
    recommended polling time: ( 44) minutes.
    Conveyance self-test routine
    recommended polling time: ( 2) minutes.
    SCT capabilities: (0x103f) SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.


    SMART Self-test log structure revision number 1
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    # 1 Extended offline Completed: read failure 90% 51026 486461351

    SMART Selective self-test log data structure revision number 1
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
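The failing LBA in the self-test log (486461351) can be placed against the partition table posted elsewhere in this thread: sda3 spans sectors 479990070 to 488375999, so the read failure lands inside sda3. A quick arithmetic check:

```shell
# Map the failing LBA from the SMART self-test log onto the partition
# table (start/end sectors taken from the fdisk -l output in this thread).
lba=486461351
sda3_start=479990070
sda3_end=488375999
part=unknown
if [ "$lba" -ge "$sda3_start" ] && [ "$lba" -le "$sda3_end" ]; then
    part=sda3
fi
echo "LBA $lba falls in /dev/$part"
```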
  • Accepted Answer

Khairun
    Offline
    Saturday, February 20 2016, 10:38 AM - #Permalink
    Resolved
    0 votes
    Hi Tony,

I'm working on your suggestions below; I will post an update on this in the next 40 minutes.

    Best regards, Khairun

    root@sysresccd /root % smartctl /dev/sda -t long
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-3.18.25-std471-amd64] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
    Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 44 minutes for test to complete.
    Test will complete after Sat Feb 20 19:16:50 2016

    Use smartctl -X to abort test.


    [quote]Tony Ellis wrote:
2. Run the SMART long tests on both drives. Start with /dev/sda as it looks like partition sda3 is a problem. If /dev/sda is clean, try adding it back into the array with mdadm. If there are any problems I would use ddrescue from a SystemRescue like the one in my previous link to copy to a new drive of the same (preferably) or larger size. If there are bad sectors on /dev/sda, ddrescue will try very hard to read with multiple re-tries. If unsuccessful it will write zeros in that position on the new drive, then continue, whereas dd would give up... Then replace the old sda disk with the newly created one.[/quote]
  • Accepted Answer

Khairun
    Offline
    Saturday, February 20 2016, 10:22 AM - #Permalink
    Resolved
    0 votes
    Hi All,

Just an update: I have done a ddrescue of the disk onto the new clean drive. I did not see any errors, but I'm not sure how I can verify that the data is intact.
The new drive is identified as
Disk /dev/loop0: 337.6 MiB, 353955840 bytes, 691320 sectors
or sdc.

I will run the SMART test on the drives first to see if they are clean. Will keep you posted on this.

    Best regards, Khairun


    root@sysresccd /root % fdisk -l
    Disk /dev/loop0: 337.6 MiB, 353955840 bytes, 691320 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes


    Disk /dev/sda: 232.9 GiB, 250059350016 bytes, 488397168 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x000de450

    Device Boot Start End Sectors Size Id Type
    /dev/sda1 * 63 208844 208782 102M fd Linux raid autodetect
    /dev/sda2 208845 479990069 479781225 228.8G fd Linux raid autodetect
    /dev/sda3 479990070 488375999 8385930 4G fd Linux raid autodetect


    Disk /dev/sdb: 232.9 GiB, 250059350016 bytes, 488397168 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x00025dc4

    Device Boot Start End Sectors Size Id Type
    /dev/sdb1 * 63 208844 208782 102M fd Linux raid autodetect
    /dev/sdb2 208845 479990069 479781225 228.8G fd Linux raid autodetect
    /dev/sdb3 479990070 488375999 8385930 4G fd Linux raid autodetect


    Disk /dev/sdc: 298.1 GiB, 320072933376 bytes, 625142448 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x000de450

    Device Boot Start End Sectors Size Id Type
    /dev/sdc1 * 63 208844 208782 102M fd Linux raid autodetect
    /dev/sdc2 208845 479990069 479781225 228.8G fd Linux raid autodetect
    /dev/sdc3 479990070 488375999 8385930 4G fd Linux raid autodetect


    Disk /dev/md127: 4 GiB, 4293525504 bytes, 8385792 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes


    Disk /dev/md126: 101.9 MiB, 106823680 bytes, 208640 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes


    Disk /dev/md125: 228.8 GiB, 245647867904 bytes, 479780992 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes


    Disk /dev/md1: 101.9 MiB, 106823680 bytes, 208640 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
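One way to check that the copy is intact (a suggestion on my part, not something confirmed in the thread): since each partition on the copy has the same sector count as the original per the fdisk output above, checksumming the corresponding partitions (unmounted, read-only) should give identical results if the copy is exact. The sketch below demonstrates the comparison on temporary files; on the real machine the operands would be the matching partitions, e.g. /dev/sda2 and /dev/sdc2.

```shell
# Sketch: verify a rescued copy by comparing checksums.
# Demonstrated on temporary files; on the real machine the operands would
# be the matching partitions (e.g. /dev/sda2 and /dev/sdc2), unmounted.
src=$(mktemp); dst=$(mktemp)
printf 'example payload' > "$src"
cp "$src" "$dst"                         # stand-in for the ddrescue copy
sum_src=$(sha256sum "$src" | cut -d' ' -f1)
sum_dst=$(sha256sum "$dst" | cut -d' ' -f1)
[ "$sum_src" = "$sum_dst" ] && echo "checksums match"
rm -f "$src" "$dst"
```

Note that comparing the whole disks would not work here, since /dev/sdc is larger than /dev/sda; compare partition to partition. Running `e2fsck -fn` on the copied partitions is another read-only sanity check.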
  • Accepted Answer

Khairun
    Offline
    Friday, February 19 2016, 01:09 PM - #Permalink
    Resolved
    0 votes
Hi Nick,

Thank you for your suggestion. As a precautionary step, I have started the ddrescue method based on Tony's suggestion, to try to copy all my current configuration and, if I'm lucky, the Cyrus IMAP data.
I guess it will take long hours.

    Cross my fingers.
  • Accepted Answer

    Friday, February 19 2016, 12:36 PM - #Permalink
    Resolved
    0 votes
I can't add much, but I know bad superblocks on ext3 can sometimes be recovered, as ext3 keeps backup superblocks around the partition. Google "bad superblock ext3" and you'll find lots of references. It is probably a similar procedure with ext2 and ext4, but I can't guarantee it.
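For reference, the usual ext3 procedure looks like this (a sketch: the partition name is an example, and the commands are printed rather than executed, since e2fsck modifies the filesystem; `mke2fs -n` only reports where the backup superblocks would go, provided you give it the same block size the filesystem was made with).

```shell
# Sketch: recover an ext3 filesystem with a bad primary superblock.
# mke2fs -n only *lists* the backup superblock locations (it writes
# nothing); e2fsck -b then repairs using one of those backups.
# 32768 is a common backup location for 4k-block filesystems.
superblock_recovery_cmds() {
    part=$1
    printf '%s\n' \
        "mke2fs -n $part" \
        "e2fsck -b 32768 $part"
}
superblock_recovery_cmds /dev/sda2
```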
  • Accepted Answer

    Friday, February 19 2016, 10:55 AM - #Permalink
    Resolved
1 vote
This would be my approach... Do some research on the web and evaluate what you find against my suggestions. You could well find better... You will also need to learn how to use the tools if you are not familiar with them, but don't practice on this broken system.

    1. Since it looks like you can access all three partitions I would use "dd", or better "ddrescue" (see below) to copy them elsewhere. If one is swap then that should be skipped. This means you now have all your recoverable data safe together with a copy of your configuration files and everything else hopefully.

2. Run the SMART long tests on both drives. Start with /dev/sda as it looks like partition sda3 is a problem. If /dev/sda is clean, try adding it back into the array with mdadm. If there are any problems I would use ddrescue from a SystemRescue like the one in my previous link to copy to a new drive of the same (preferably) or larger size. If there are bad sectors on /dev/sda, ddrescue will try very hard to read with multiple re-tries. If unsuccessful it will write zeros in that position on the new drive, then continue, whereas dd would give up... Then replace the old sda disk with the newly created one.
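A typical two-pass GNU ddrescue invocation, as a sketch only (the device names and map-file path are examples, and the commands are printed rather than executed, because ddrescue overwrites the destination):

```shell
# Sketch: two-pass GNU ddrescue copy of a dying disk to a replacement.
# Pass 1 (-n) skips the slow scraping phase and grabs the easy sectors
# first; pass 2 (-r3) retries the bad areas up to three times. The map
# file records progress so the copy can be interrupted and resumed.
ddrescue_plan() {
    src=$1; dst=$2; map=$3
    printf '%s\n' \
        "ddrescue -f -n $src $dst $map" \
        "ddrescue -f -r3 $src $dst $map"
}
ddrescue_plan /dev/sda /dev/sdc /root/rescue.map
```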

    I am not sure the ClearOS rescue mode has all the tools that you need, I don't use it for this purpose, but a special disk designed for this type of rescue...

And if anybody else has better ideas or disagrees with the above, then respond. I work out what to do in these situations as I go along, as it is impossible to predict a series of absolute steps. It's a matter of gathering facts, making the next move, gathering some more, etc.
  • Accepted Answer

Khairun
    Offline
    Friday, February 19 2016, 10:16 AM - #Permalink
    Resolved
    0 votes
    Hi,

I have managed to boot using the ClearOS Community disk and access the rescue mode for the system.
I have mounted the system at /mnt/sysimage and have gathered some more info on my situation.
I think I can see all the configuration just fine in this mode, except for the Cyrus IMAP data, which is very important in my case.
Is there anything else that I can do after this?
  • Accepted Answer

Khairun
    Offline
    Friday, February 19 2016, 09:37 AM - #Permalink
    Resolved
    0 votes
Hi Tony, glad to hear your feedback on this; it has been a long time since I visited this forum. A lot of things to catch up on and learn again.
I will try your suggestions and will return to you with feedback.
  • Accepted Answer

Khairun
    Offline
    Friday, February 19 2016, 09:22 AM - #Permalink
    Resolved
1 vote
    Hi,

Strange thing: when I did
cat /proc/mdstat
there are no drives shown?
  • Accepted Answer

    Friday, February 19 2016, 08:55 AM - #Permalink
    Resolved
1 vote
Sounds like the file-system is corrupted. Assuming md0 is a software RAID 1, metadata version 0.90, and not using LVM: with something in that condition, I would be using a system-rescue Linux bootable CD/USB stick such as https://www.system-rescue-cd.org/ to look at each drive separately. A RAID 1 version 0.90 member is compatible with a normal Linux single disk. I would be looking for my data on each disk in turn to copy elsewhere, and then start again from scratch, depending of course on what I discovered. An alternative would be to install the two disks in another Linux machine with two spare disk connectors.

At the first sign of trouble you should have taken a full backup of the system, if you didn't have one, and started replacing the faulty disk(s). Unfortunately, if the filesystem is corrupted on one disk, then RAID 1 being a mirror, the other is likely the same. RAID is NOT a backup and shouldn't be used as an excuse for not taking one.

If you were using LVM, then I cannot help you. The last time I used that was with IBM's OS/2.
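To illustrate the "look at each drive separately" step (a sketch: the device and mount-point names are examples, and the commands are printed rather than executed): because v0.90 metadata lives at the end of the partition, a RAID 1 member can be mounted read-only on its own, or the array can be assembled degraded from just one disk.

```shell
# Sketch: examine one half of a v0.90 RAID 1 mirror from a rescue system.
# Commands are printed, not executed; device names are examples.
examine_mirror_cmds() {
    member=$1
    printf '%s\n' \
        "mdadm --examine $member" \
        "mount -o ro $member /mnt/data" \
        "mdadm --assemble --run /dev/md0 $member"
}
examine_mirror_cmds /dev/sda2
```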
  • Accepted Answer

Khairun
    Offline
    Friday, February 19 2016, 08:03 AM - #Permalink
    Resolved
    0 votes
    Hi,

I managed to boot up now, but I'm having a different problem.
It says "md0: raid array is not clean", so it is trying to auto-detect the RAID array.
When that finishes, it tries to resume from /dev/md0, but it fails to read the superblock (EXT3-fs) and gives an error while trying to mount /dev/root as ext3: invalid argument.
Then an error from setuproot, no fstab.sys, and finally a kernel panic.

    Please help! :(