
Leon
Resolved
0 votes
Hi All

I have configured my server with 8x 2TB drives as Raid 6, as per the video posted here (title as per this post).
At the time I asked the question: if the raid is created using /dev/sdx device names, will it not break when I add or remove non-raid drives, since Linux does not preserve the /dev/sdx ordering? ("Seen this movie before...")
Now it is broken... I replaced one of my non-raid drives to copy data to the Raid, and now it boots to recovery mode.
I have no idea where to start looking; any help will be appreciated.

The other thing I noted is that my 2nd NIC is also "dead".

Z
Thursday, September 14 2017, 08:30 AM
Responses (34)
  • Accepted Answer

    Thursday, September 21 2017, 04:27 PM - #Permalink
    Resolved
    0 votes
    Just thought about doing some research on SATA cards before going to bed (it's currently 2:20 am) :(
    Came across this... https://www.jethrocarr.com/2013/11/24/adventures-in-io-hell/ - a bit different to your problems, but some of the comments at the bottom are telling...

    Seems like the Marvell SATA chipsets/drivers might not be the best if this is anything to go by... I have a Silicon Image 3114 (and still use it in a backup server that's only powered up once a week for backup updates), so your PCI card is probably OK (provided the manufacturer was careful laying out the PC board traces to minimise cross-talk - mine is a different brand to yours). Just make sure the SATA cable is good quality with a nice tight fit - mine has no notches for the clips... However, based on the URL above there are suspicions about PCIe Marvell-based cards - perhaps change yours for a decent one with no Marvell chipset! Never used Marvell myself, so no first-hand experience there.

    My SI 3114 card has 1 drive attached; the motherboard has 4 SATA ports with 4 drives. The 5 drives are combined in a software raid 5 array. The OS resides on 2x IDE drives that are mirrored in Raid 1. All my drives support TLER (ERC), set to a 7-second timeout.
  • Accepted Answer

    Thursday, September 21 2017, 02:21 PM - #Permalink
    Resolved
    0 votes
    Hi Leon, once a UUID is assigned to each array during raid creation it doesn't matter what the /dev/sdx order is, provided you specify the array by UUID in mdadm.conf... mdadm scans all drives looking for the UUIDs and uses them to ascertain which drive is which.
    Did you create the script to change the disk time-out and place it in, for example, /etc/rc.d/rc.local so it runs when booting? Those drives of yours are not suitable for use in Raid 5/6 without doing that.
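    Something along these lines is the usual approach (a sketch only - it assumes the data drives are /dev/sda to /dev/sdl and that smartctl is installed; adjust the device list to your own layout before using it):

    #!/bin/sh
    # Try to set SCT ERC (TLER) to 7 seconds on every data drive.
    # Drives that do not support ERC get the kernel command timeout
    # raised to 180 seconds instead, so md does not kick them during a
    # long internal error-recovery attempt.
    for disk in /dev/sd[a-l]; do
        dev=$(basename "$disk")
        if smartctl -l scterc,70,70 "$disk" > /dev/null 2>&1; then
            echo "$disk: SCT ERC set to 7 seconds"
        else
            echo 180 > "/sys/block/$dev/device/timeout"
            echo "$disk: no ERC support, kernel timeout raised to 180s"
        fi
    done

    Run it once by hand and check the output before relying on it at boot.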

    No idea how you are setting the drives up - I never watch a "How-To" on YouTube or anything similar - I can read far faster than any narrator can speak and thus learn a lot more in the same period of time... and you can always refer to any written part instantly. A lot better than having to replay something to make sure you heard and understood a certain passage correctly... Several years ago I downloaded and used the Red Hat Administration Manuals and went from there... studied the complete set, beginning to end, while going to/from work on the train.

    If you continue to have problems, then it might be time to look at the hardware... I would be inclined to ditch the two budget PCI/PCIe controllers and get a decent 8-port one with current Linux support, e.g. a modern LSI, assuming one of your 8x or 16x PCIe slots is vacant... Are you using good quality SATA cables with clips? The old original ones with no clips are notorious for creating intermittent connections, as are cheap controllers that don't have the little notch for the clip to latch onto. I wouldn't be surprised if your two add-on controllers fall into that category. No clips means you are relying on friction - you might get away with it - but the number of drives you have provides more opportunity for vibration and connector movement...
  • Accepted Answer

    Leon
    Thursday, September 21 2017, 11:33 AM - #Permalink
    Resolved
    0 votes
    Hi Tony

    Thank you for all the help, but I am at the point of "so my data is gone... deal with it..."
    I am prepared to start from scratch. About 4TB will be lost; I think I can get most of it back over time from other sources, it might just take some time.

    I still have the issue where, on every reboot, the raid members - /dev/sd[abcefgijkl] - are not the same as on the previous boot.
    This is what seems to be the issue - we got sidetracked trying to recover the data after the reboot, and I had loaded data onto the drive.

    Any idea how I resolve my original issue?

    I shall start from scratch, as I did originally, following "ClearOS Setting up Storage Volumes with Linux RAID".
    Once it is done I shall create /etc/mdadm.conf:
    mdadm --detail --scan >> /etc/mdadm.conf

    Add the UUID to fstab, put some data on it, verify and reboot.
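    For the record, the fstab side of that would be something like this (a sketch only - the UUID comes from blkid on the assembled array, not from any member disk, and the mount point is just an example):

    # filesystem UUID of the assembled array
    blkid /dev/md0

    # /etc/fstab entry, using that UUID rather than a /dev/sdX name
    UUID=<uuid-reported-by-blkid>  /store/data1  ext4  defaults  0 0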
  • Accepted Answer

    Wednesday, September 20 2017, 11:38 PM - #Permalink
    Resolved
    0 votes
    Create is dangerous in that you need to specify the disks, in the create command, in the same order they had ended up in the raid when it broke. Since you did a grow they may not be in strict alphabetical order any more - do not write anything to the drive - mount it read-only until you have verified that the data in large files is OK. Since you have 10 drives the number of possible orderings is enormous. See the section "File system check" at https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID - you really should have done this with overlays; I pointed to this procedure before. For all you know the correct order may be /dev/sdc /dev/sdf /dev/sdd... etc. On the other hand you may have been very lucky... but what happened to the necessary "--assume-clean"?

    Comment out the entry in fstab (I assume you put it back) and do not add it back until you are sure the array will always assemble on a boot. In the meantime do a manual mount if and when the array is assembled, then reboot. Can you stop /dev/md127 and /dev/md0, and will it now assemble with a --detail --scan?
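    A minimal sketch of that sequence (verify every step against your own setup first - this assumes the array should be /dev/md0, that /mnt exists, and mounts read-only only):

    mdadm -S /dev/md127          # stop the stale, half-assembled array
    mdadm -S /dev/md0            # stop it under its intended name too, if present
    mdadm --detail --scan        # what does mdadm report now?
    mdadm --assemble --scan      # explicit assemble from the superblocks/mdadm.conf
    mount -o ro /dev/md0 /mnt    # read-only mount, then spot-check some large files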

    See, amongst many others:
    https://serverfault.com/questions/538904/mdadm-raid5-recover-double-disk-failure-with-a-twist-drive-order
    https://serverfault.com/questions/347606/recover-raid-5-data-after-created-new-array-instead-of-re-using
  • Accepted Answer

    Leon
    Wednesday, September 20 2017, 08:00 PM - #Permalink
    Resolved
    0 votes
    Hi Tony

    I created an mdadm.conf file; this is the content:
    ARRAY /dev/md/0 metadata=1.2 name=Zodiac.lan:0 UUID=785c89de:1d355c46:9b116951:21d6c4b3

    And it booted into emergency mode again.

    Any ideas?
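    Edit: one thing I still need to rule out (just my guess, not something anyone has confirmed here): on a CentOS 7 based system the initramfs carries its own copy of mdadm.conf, so a change to /etc/mdadm.conf may only take effect at boot after the initramfs is rebuilt, and the boot journal should say why it dropped to emergency mode:

    dracut -f            # rebuild the initramfs so it picks up the new /etc/mdadm.conf
    journalctl -xb       # after the next boot, look for the failing mount/raid unit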
  • Accepted Answer

    Leon
    Wednesday, September 20 2017, 02:46 PM - #Permalink
    Resolved
    0 votes
    Hi Tony

    I did:
    mdadm --create /dev/md0 --level=6 --chunk=64 --raid-devices=10 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdi1 /dev/sdl1 /dev/sdj1 /dev/sdk1

    I spent hours reading and trying to do a reassemble with zero success, so I took the dive.

    The only thing I need to check is why I only see 453GB of 456GB free and not the whole array; I shall do some Google hunting...
    Edit: I think it is because it is not mounting md0, so what I am seeing is the same size as the boot disk.
    For some reason, blkid does not show me /dev/md0.

    I presumed that the first UUID of the disks that are part of the array is the array's UUID, but mounting that UUID I get "/dev/sdc1 is already mounted".
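    Edit 2: notes to myself on telling the UUIDs apart (as I understand it - happy to be corrected): mdadm --examine on a member partition shows the md Array UUID, while blkid on the assembled /dev/md0 shows the filesystem UUID, which is the one fstab wants; blkid will only list /dev/md0 once the array is assembled and has a filesystem on it.

    cat /proc/mdstat                       # is md0 actually assembled and active?
    mdadm --detail /dev/md0 | grep UUID    # md array UUID (what mdadm.conf wants)
    blkid /dev/md0                         # filesystem UUID (what /etc/fstab wants)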
  • Accepted Answer

    Tuesday, September 19 2017, 10:40 AM - #Permalink
    Resolved
    0 votes
    Fingers crossed - :)

    What speed is it rebuilding?

    What did you do to get the array started?
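    (If you want to watch it, or speed it up: the progress and current speed are in /proc/mdstat, and the md sync speed floor can be raised while it rebuilds - general md behaviour, adjust to taste:)

    cat /proc/mdstat                                  # shows percent complete and current speed
    cat /proc/sys/dev/raid/speed_limit_min            # current floor, usually 1000 KB/s
    echo 50000 > /proc/sys/dev/raid/speed_limit_min   # optionally raise it while rebuilding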
  • Accepted Answer

    Leon
    Tuesday, September 19 2017, 10:29 AM - #Permalink
    Resolved
    0 votes
    Hi Tony

    Yes, not ideal, but I had an old P4 3GHz with 4GB of RAM and 2x SATA4000 SI 3114 SATA I 1.5 Gb/s cards as my previous server.
    It worked fine for what I needed it to do.

    The server is busy rebuilding... it should be done in 38 hours, so let's see.
    I'll either have my data back or nothing...
  • Accepted Answer

    Monday, September 18 2017, 11:45 PM - #Permalink
    Resolved
    0 votes
    Thanks Leon - a minor quibble: the SATA4000 is a PCI card using the Silicon Image 3114 controller - it is NOT PCIe.
    see http://www.sunix.com.tw/product/sata4000.html
    The SATA2600 is PCIe http://www.sunix.com/product/sata2600.html - so you have 1x PCI and 1x PCIe

    So we have quite a mixture here :)
    Drives are SATA III 6Gb/s (Compatible with SATA I and SATA II)
    Intel Motherboard Controllers SATA II 3Gb/s
    SATA2600 Marvell 91XX SATA III 6Gb/s (transfer will be limited as it has only a 1x PCIe connector)
    SATA4000 SI 3114 SATA I 1.5 Gb/s (limited even more, since the motherboard slot is 33 MHz PCI)

    Should work - but the PCI card is creating a bottleneck :(
  • Accepted Answer

    Leon
    Monday, September 18 2017, 07:43 PM - #Permalink
    Resolved
    0 votes
    Hi Tony

    One card is a SUNIX SATA2600 (2x SATA) and the other is a SUNIX SATA4000 (4x SATA).
  • Accepted Answer

    Monday, September 18 2017, 01:13 AM - #Permalink
    Resolved
    0 votes
    Leon - a question about the hardware
    [quote]
    The board has 6x SATA ports and I have 2 PCIe cards to give me an additional 6 SATA ports.
    [/quote]
    Make and Model of the "2 PCIe cards" please...
  • Accepted Answer

    Sunday, September 17 2017, 09:52 AM - #Permalink
    Resolved
    0 votes
    Thanks Leon... OK - those drives are not suitable for use in a Raid 5 or 6. See the section "Timeout Mismatch" at https://raid.wiki.kernel.org/index.php/Timeout_Mismatch - you really need to get that script working first, before anything else - you cannot afford a drive to be kicked out at this stage...

    Then use smartctl to check that no raid drive has a Current Pending Sector count warning or other serious error - this is important if you end up using only 8 of your ten drives to recover the array - you don't want a drive kicked for any reason. Run the long test on every drive and check the output. You can run the test on all drives concurrently. Lots of help on this on the web, e.g. https://www.linuxtechi.com/smartctl-monitoring-analysis-tool-hard-drive/
    An example showing Current Pending Sector:- https://community.wd.com/t/help-current-pending-sector-count-warning/3436/3
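    Something like this will do all ten in one go (a sketch - substitute your actual raid member drives for the /dev/sd[c-l] range):

    for d in /dev/sd[c-l]; do
        smartctl -t long "$d"        # start the long self-test; it runs inside the drive
    done
    # several hours later, once the tests have completed:
    for d in /dev/sd[c-l]; do
        echo "== $d =="
        smartctl -a "$d" | egrep -i 'pending|realloc|uncorrect|self-test'
    done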

    Then...
    See https://raid.wiki.kernel.org/index.php/Assemble_Run
    Are the event counts for the 'good' drives the same or very very close?
    I believe you have a Raid 6 - so you should be able to recover if 8 of the 10 drives are OK - then add the other 2 later.
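    A quick way to compare the event counts across the members, assuming partition 1 on each (substitute your own drive letters):

    mdadm --examine /dev/sd[c-l]1 | egrep '/dev/sd|Events'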

    Otherwise, it might mean working through https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID - you might also want to contact the experts on the raid mailing list...

    Do we assume you created the raid and then used grow without making a viable backup first? If so, that is highly dangerous. You don't use raid as a backup - it is there to guard against one kind of hardware failure: a drive failure. There are lots of failure modes that raid doesn't guard against, such as file corruption (software problem, power drop, etc.), human error (deleting files by mistake), viruses and other malware, and so on. With a backup, the quickest way would be to create the raid again and restore from the backup, having a procedure in place to ensure your backups are complete and usable...
  • Accepted Answer

    Leon
    Sunday, September 17 2017, 07:58 AM - #Permalink
    Resolved
    0 votes
    Hi Tony

    Let's start with the disk
    [root@Zodiac ~]# smartctl -l scterc /dev/sdc
    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.26.2.v7.x86_64] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    SCT Commands not supported


    Here is the rest of the information - I used /dev/md0 when I created the array.

    [root@Zodiac ~]# cat /proc/mdstat
    Personalities :
    md127 : inactive sdg1[3](S) sde1[1](S) sdf1[2](S) sdj1[5](S) sdl1[7](S) sdh1[4](S) sdc1[0](S) sdk1[6](S)
    15627059200 blocks super 1.2

    [root@Zodiac ~]# mdadm -S /dev/md127
    mdadm: stopped /dev/md127

    [root@Zodiac ~]# mdadm -S /dev/md0
    mdadm: error opening /dev/md0: No such file or directory

    [root@Zodiac ~]# mdadm --detail --scan
    INACTIVE-ARRAY /dev/md127 metadata=1.2 name=localhost.localdomain:0 UUID=243e5d11:e1049ad5:a4a2ce43:304fdb4f

    [root@Zodiac ~]# mdadm -vv --assemble --force /dev/md0 /dev/sd[cdefghijkl]1
    mdadm: looking for devices for /dev/md0
    mdadm: /dev/sdc1 is busy - skipping
    mdadm: no recogniseable superblock on /dev/sdd1
    mdadm: /dev/sdd1 has no superblock - assembly aborted
    [root@Zodiac ~]#


    I also attached the output of mdadm --examine /dev/sd*:

    [root@Zodiac ~]# mdadm --examine /dev/sd*
    /dev/sda:
    MBR Magic : aa55
    Partition[0] : 2097152 sectors at 2048 (type 83)
    Partition[1] : 974673920 sectors at 2099200 (type 8e)
    mdadm: No md superblock detected on /dev/sda1.
    mdadm: No md superblock detected on /dev/sda2.
    mdadm: No md superblock detected on /dev/sdb.
    /dev/sdc:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdc1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : 9e0c7861:16ceac28:516359d9:59cd656f

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : cb80fcde - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 0
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdd:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    mdadm: No md superblock detected on /dev/sdd1.
    /dev/sde:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sde1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : 3cfbd19f:052f19e4:ef9c0132:b3537526

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : d4ff0c64 - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 1
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdf:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdf1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : eebd70e4:2ff03aa7:3e22b382:cbdc2f1a

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : 212b9ea9 - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdg:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdg1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : 2ec94af6:9a7ad26a:e76988a9:11d2b8b0

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : b3fb7144 - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 3
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdh:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdh1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : 00162b82:1c3488bf:f1299cf6:bc353dae

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : df299b4e - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 4
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdi:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    mdadm: No md superblock detected on /dev/sdi1.
    /dev/sdj:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdj1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : 87366ae7:d3fcd71e:f4d5b50d:3a493d1f

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : 2bd2440d - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 5
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdk:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdk1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : b39843dc:4bb2d0de:c0753679:a52bed1d

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : 4ad4ddea - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 6
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    /dev/sdl:
    MBR Magic : aa55
    Partition[0] : 3907029167 sectors at 1 (type ee)
    /dev/sdl1:
    Magic : a92b4efc
    Version : 1.2
    Feature Map : 0x45
    Array UUID : 243e5d11:e1049ad5:a4a2ce43:304fdb4f
    Name : localhost.localdomain:0
    Creation Time : Mon Sep 11 21:40:47 2017
    Raid Level : raid6
    Raid Devices : 10

    Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
    Array Size : 15627059200 (14903.13 GiB 16002.11 GB)
    Data Offset : 262144 sectors
    New Offset : 258048 sectors
    Super Offset : 8 sectors
    State : clean
    Device UUID : 3053a5ba:77ac33c9:0ae49712:24186eaf

    Internal Bitmap : 8 sectors from superblock
    Reshape pos'n : 515829760 (491.93 GiB 528.21 GB)
    Delta Devices : 2 (8->10)

    Update Time : Fri Sep 15 21:22:14 2017
    Bad Block Log : 512 entries available at offset 72 sectors
    Checksum : 3e7bed5d - correct
    Events : 30173

    Layout : left-symmetric
    Chunk Size : 512K

    Device Role : Active device 7
    Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
    [root@Zodiac ~]#
  • Accepted Answer

    Sunday, September 17 2017, 01:42 AM - #Permalink
    Resolved
    0 votes
    As for the raid - did you stop /dev/md127 and /dev/md1 (assuming /dev/md1 is your raid device) before doing the "mdadm --detail --scan"?

    If not, then try again stopping both arrays first...

    # cat /proc/mdstat # show us output
    # mdadm -S /dev/md127 # show us output
    # mdadm -S /dev/md1 # show us output (assuming /dev/md1 is your array)
    # mdadm --detail --scan # show us output

    If the --scan fails after using both 'stop' commands, try

    # mdadm -vv --assemble --force /dev/md1 /dev/sd[abcd...]1

    that's two 'v's, not a 'w' - show us the output - where "abcd..." are the ten correct drive letters for your raid array, assuming you are using partition 1 for raid, which appears to be the case from your output... thanks
    Please do ***NOT*** use the 'create' command yet - that is dangerous and a ***last*** resort only

    Here's the output from a simulated failure

    cat /proc/mdstat
    Personalities :
    md127 : inactive sdd[2](S) sdc[4](S)
    3906767024 blocks super 1.2

    mdadm -S /dev/md127
    mdadm: stopped /dev/md127

    mdadm -S /dev/md1
    mdadm: error opening /dev/md1: No such file or directory

    mdadm --detail --scan
    mdadm: /dev/md1 has been started with 2 drives.
    Personalities : [raid0] [raid1]
    md1 : active raid1 sdc1[0] sdd1[1]
    1953382464 blocks super 1.2 [2/2] [UU]
    bitmap: 0/15 pages [0KB], 65536KB chunk

    Much good information at https://raid.wiki.kernel.org/index.php/Linux_Raid

    Tony... http://www.sraellis.tk/
  • Accepted Answer

    Sunday, September 17 2017, 12:29 AM - #Permalink
    Resolved
    0 votes
    OK - let's deal with the disks first - and I am concerned... This is from the Seagate documentation - it gives no TLER (ERC) specification...

    Barracuda XT drives—The performance leader in the family, with maximum capacity, cache and SATA performance for the ultimate in desktop computing

    Application Desktop RAID

    Cannot find a strict definition - but "desktop raid" often means Raid 0 and Raid 1 ***ONLY***. Why? Because such drives often don't support TLER (ERC), and that's a big drawback when they are used in Raid 5 and Raid 6 - search the web for all the gory details. Basically, when an error occurs the drive should time out before the software does; that lets the raid initiate error recovery. If the software times out first (which is what happens with 'desktop' drives), the raid thinks the disk is 'broken' and kicks it out of the array. With raid you want the disk to time out quickly, as this prevents 'hangs' for the users, and the raid can reconstruct the data from the other drives and then re-write the correct data to the 'failing' one. In a 'desktop' environment there is only one copy of the data, on that one drive - so the drive will try desperately, for ages if necessary, to recover the data, and the user will experience a 'hang' while this takes place...

    To test for TLER (ERC) see below - please give the results from your drives. If they do not support TLER (ERC) - then we need to change the timeouts within the Linux software disk tables...

    [root@danda ~]# smartctl -l scterc /dev/sdc
    smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.v6.x86_64] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

    SCT Error Recovery Control:
    Read: 70 (7.0 seconds)
    Write: 70 (7.0 seconds)

    [root@karien ~]# smartctl -l scterc /dev/sdc
    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.26.2.v7.x86_64] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    SCT Error Recovery Control command not supported
  • Accepted Answer

    Leon
    Saturday, September 16 2017, 08:50 PM - #Permalink
    Resolved
    0 votes
    So I get the following:
    [@Zodiac ~]# cat /proc/mdstat
    Personalities :
    md127 : inactive sdg1[7](S) sdf1[6](S) sdd1[4](S) sdc1[3](S) sdb1[2](S) sda1[1](S)
    11720294400 blocks super 1.2

    unused devices: <none>

    and
    [@Zodiac ~]# mdadm --detail -scan
    INACTIVE-ARRAY /dev/md127 metadata=1.2 name=localhost.localdomain:0 UUID=243e5d11:e1049ad5:a4a2ce43:304fdb4f


    [root@Zodiac ~]# lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sda 8:0 0 1.8T 0 disk
    └─sda1 8:1 0 1.8T 0 part
    sdb 8:16 0 1.8T 0 disk
    └─sdb1 8:17 0 1.8T 0 part
    sdc 8:32 0 1.8T 0 disk
    └─sdc1 8:33 0 1.8T 0 part
    sdd 8:48 0 1.8T 0 disk
    └─sdd1 8:49 0 1.8T 0 part
    sde 8:64 0 1.8T 0 disk
    └─sde1 8:65 0 1.8T 0 part
    sdf 8:80 0 1.8T 0 disk
    └─sdf1 8:81 0 1.8T 0 part
    sdg 8:96 0 1.8T 0 disk
    └─sdg1 8:97 0 1.8T 0 part
    sdh 8:112 0 465.8G 0 disk
    ├─sdh1 8:113 0 1G 0 part /boot
    └─sdh2 8:114 0 464.8G 0 part
    ├─clearos-root 253:0 0 456.9G 0 lvm /
    └─clearos-swap 253:1 0 7.9G 0 lvm [SWAP]
    sdi 8:128 0 465.8G 0 disk /var/flexshare/shares/torrents
    sdj 8:144 0 1.8T 0 disk
    └─sdj1 8:145 0 1.8T 0 part
    sdk 8:160 0 1.8T 0 disk
    └─sdk1 8:161 0 1.8T 0 part
    sdl 8:176 0 1.8T 0 disk
    └─sdl1 8:177 0 1.8T 0 part


    As you can see, the OS drive is now sdh, and I do not see any disks showing up as raid members here.

    Is there any way to rebuild the raid?
  • Accepted Answer

    Leon
    Saturday, September 16 2017, 07:57 PM - #Permalink
    Resolved
    0 votes
    Hi Tony / Nick

    W.r.t. the network: the 2nd NIC is only "missing" when the system goes into emergency mode; I am presuming that the OS does not get far enough to load the driver?
    Here is the info on the Network ports:
    00:19.0 Ethernet controller [0200]: Intel Corporation 82578DM Gigabit Network Connection [8086:10ef] (rev 05)
    Subsystem: Intel Corporation Device [8086:34ec]
    Kernel driver in use: e1000e
    Kernel modules: e1000e
    --
    02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
    Subsystem: Intel Corporation Device [8086:34ec]
    Kernel driver in use: e1000e
    Kernel modules: e1000e


    My system is an Intel S3420GP with a standard Xeon 3450 and 16GB of RAM.
    The board has 6x SATA ports and I have 2 PCIe cards to give me an additional 6 SATA ports.
    No hardware Raid is enabled.

    The disks are all from a QNAP NAS whose motherboard failed:
    8x Seagate Barracuda XT 2TB, plus 2 more of the same that I purchased to make the 10.

    The power supply is an 850W Gigabyte PSU.



    Here is my fstab file - thank you for pointing out that I need to "#" the raid entry in fstab to boot normally.
    #
    # /etc/fstab
    # Created by anaconda on Thu Sep 14 21:04:41 2017
    #
    # Accessible filesystems, by reference, are maintained under '/dev/disk'
    # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
    #
    /dev/mapper/clearos-root / xfs defaults 0 0
    UUID=0c365059-ad79-431a-8486-2a83ac85eb67 /boot xfs defaults 0 0
    #UUID="3e796bd4-d628-43ad-a4ba-f6c6c7f93710" /store/data1 ext4 defaults 0 0
    /dev/mapper/clearos-swap swap swap defaults 0 0

    # Mount Drives by Label
    LABEL=Torrents /mnt/Torrents ext3 defaults,noatime 0 0
    /mnt/Torrents /var/flexshare/shares/torrents none defaults,bind 0 0
    #/store/data1/Movies /var/flexshare/shares/movies none defaults,bind 0 0
    #/store/data1/Series /var/flexshare/shares/series none defaults,bind 0 0


    I still get the feeling that on reboot the drives - /dev/sd[a-l] - do not stay in sequence.
    I have seen the boot drive be /dev/sdc and then, after I add or remove a non-raid drive, become /dev/sdg.
    To me this means the raid gets broken - the members are no longer /dev/sdc to /dev/sdk - and that is why it is failing and going into "Emergency" mode?
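    If it helps the diagnosis: the persistent names under /dev/disk/ should show which physical drive is which regardless of the sdX letters, e.g.

    ls -l /dev/disk/by-id/      # stable per-drive names (model + serial) -> current sdX letter
    ls -l /dev/disk/by-uuid/    # filesystem UUIDs -> current device node
    blkid                       # shows which partitions carry a linux_raid_member signature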
  • Accepted Answer

    Saturday, September 16 2017, 08:46 AM - #Permalink
    Resolved
    0 votes
    Tony Ellis wrote:

    The other thing i noted is that my 2nd NIC is also "dead"

    Was this resolved? - otherwise could be looking at all sorts of other problems - BIOS, missing interrupts etc
    Missing drivers?
    lspci -knn | grep Eth -A 3
  • Accepted Answer

    Saturday, September 16 2017, 07:56 AM - #Permalink
    Resolved
    0 votes
    Good point Nick - since the OP was putting the system in a rack one would hope this is a decent server that can support 10 drives properly :o

    Examining the output of the command below should confirm whether all drives are accessible...
    # fdisk -l /dev/sd[a-j]

    Leon - tell us what the system hardware is like please... motherboard, raid/disk controllers and PS especially.

    One worrying thing is the comment...

    The other thing i noted is that my 2nd NIC is also "dead"

    Was this resolved? - otherwise could be looking at all sorts of other problems - BIOS, missing interrupts etc
  • Accepted Answer

    Saturday, September 16 2017, 07:31 AM - #Permalink
    Resolved
    0 votes
    Another sideways thought. With 10 RAID drives, is your PSU strong enough to reliably spin them all up together?
  • Accepted Answer

    Saturday, September 16 2017, 01:14 AM - #Permalink
    Resolved
    1 votes
    One more thing... hope these are proper NAS or Raid certified drives - not consumer - otherwise you will be continually in a world of pain with that many drives...

    Can you post the make and model of the disk drives you are using please...
  • Accepted Answer

    Saturday, September 16 2017, 01:11 AM - #Permalink
    Resolved
    0 votes
    No, that is not normal :o It sounds as if the raid array is not being assembled on boot.

    I assume here that the raid is only for data and no system directories reside on it...
    if so - the following should get you going... (assuming an assemble failure)

    When in emergency mode type your root password to get in.
    Edit /etc/fstab and put a comment ( # ) in front of the line for the raid filesystem
    Reboot - the system should come up without the filesystem on the raid

    # cat /proc/mdstat
    (please save a copy of this to review later..
    e.g. cat /proc/mdstat > /root/mdstat_on_boot.txt)

    suspect it will have an entry re. md127
    if so
    # mdadm -S /dev/md127

    for good measure in case it is stuck
    # mdadm -S /dev/mdx
    where "x" is your array number

    ensure you have a valid mdadm.conf - here is the complete file for one of mine as an example

    MAILADDR root
    AUTO +imsm +1.x -all
    ARRAY /dev/md10 metadata=1.2 name=karien.sraellis.com:10 UUID=41cd67d5:98593c51:a02a59db:aaa91e8e

    then

    # mdadm --assemble --scan
    you should get a message that your array has started

    # cat /proc/mdstat
    looks good?

    remove the comment you added to the /etc/fstab file and mount
    # mount -a

    all being well the filesystem should mount - then the requirement is to determine why the raid array is not being assembled on a reboot...

    There are of course other failure modes - I just picked the most likely...
    How about posting your fstab and mdadm.conf files together with the output from "cat /proc/mdstat"
    Also "mdadm --detail /dev/mdx" where "x" is your array number when you get it going...

    Also, before rebooting - comment out the raid filesystem line in /etc/fstab until you have resolved the problem. This may save the "Emergency mode" after a reboot. You can tell from /proc/mdstat after reboots whether you have solved the problem; then you can leave the line active in fstab. Another solution is to leave it commented out until the problem is fixed, and in the meantime mount the filesystem from the command line after you have manually got the raid going correctly,
    e.g. mount -o noatime /dev/md1 /data - substitute your own values...


    Edit - fixed the odd typo - **verify** my commands before using...
  • Accepted Answer

    Leon
    Friday, September 15 2017, 07:45 PM - #Permalink
    Resolved
    0 votes
    So it is official.... I HATE RAID.....

    Did a shutdown to move the server back to its rack..... Emergency mode, again... What The Fudge???
    Is this normal?
  • Accepted Answer

    Leon
    Friday, September 15 2017, 06:13 PM - #Permalink
    Resolved
    0 votes
    So I could not get ClearOS to boot normally; no matter what I did, it went to the recovery console.
    So I did a clean install again with only the OS drive connected, and set ClearOS up to the point where the network is configured and you have to reboot.

    I then connected the 11 other drives and rebooted.
    While going through the updates and installing the apps I checked with PuTTY, and the raid was rebuilding itself - 1200-odd minutes later the Raid 6 was fully online.

    I am now growing the Raid 6 to 10x 2TB drives; that is going to take some time...
    Once that is done I shall see how to release the 5% (about 1TB of space), and I should have about 14TB of Raid 6 storage.
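    My rough plan for those two steps, in case anyone spots a problem with it (this assumes the array is /dev/md0 with an ext4 filesystem, and that the 5% is the ext4 reserved-blocks figure - both are my assumptions):

    mdadm --grow /dev/md0 --raid-devices=10    # reshape from 8 to 10 members (slow)
    cat /proc/mdstat                           # watch the reshape progress
    resize2fs /dev/md0                         # grow the filesystem into the new space
    tune2fs -m 0 /dev/md0                      # release the 5% reserved-for-root blocks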

    As this is my first dive into software raid, I have come to the conclusion that mixing raid and non-raid drives on the same system is NOT for me.
  • Accepted Answer

    Leon
    Thursday, September 14 2017, 05:25 PM - #Permalink
    Resolved
    0 votes
    Hi Tony

    Thank you for the pointer; let me do the Google thing and I shall post here the moment I have some news.
  • Accepted Answer

    Thursday, September 14 2017, 05:23 PM - #Permalink
    Resolved
    0 votes
    You create it... search the web and look at the 'man' page for mdadm...

    EDIT: Sorry for being curt - but it is 3:25 am in Aus and I must get to bed :-(
  • Accepted Answer

    Leon
    Thursday, September 14 2017, 05:20 PM - #Permalink
    Resolved
    0 votes
    Hi Tony

    Where can I find mdadm.conf?
  • Accepted Answer

    Thursday, September 14 2017, 05:20 PM - #Permalink
    Resolved
    0 votes
    We crossed

    Is there a way to use the raid drives' UUID for mdadm, as the UUID never changes

    See my append just before yours :-)
  • Accepted Answer

    Leon
    Thursday, September 14 2017, 05:16 PM - #Permalink
    Resolved
    0 votes
    Hi Nick

    What I mean is that if I run cat /proc/mdstat, it shows md127 : active raid6 sda[1] sdb[2] sdd[5]... you get the picture.
    Now when I remove or add drives, sda or sdd might not be part of the raid any more, as those names may have been assigned to other, non-raid drives.

    Is there a way to use the raid drives' UUID for mdadm, as the UUID never changes?

    In fstab I have a UUID, but this is not for an individual drive; it is the UUID that mdadm gave me when I made the raid array.
  • Accepted Answer

    Thursday, September 14 2017, 05:09 PM - #Permalink
    Resolved
    0 votes
    By specifying the UUID in mdadm.conf, I would have thought a changing /dev/sdx would not be a problem - i.e. an mdadm.conf similar to the ones below; the disks are scanned for the UUIDs to start the arrays, regardless of the /dev/sdx values... The first 4 arrays are raid 1, the last raid 5 - an example from one of my machines:


    ARRAY /dev/md0 metadata=0.90 UUID=f0511672:1c4c78a6:0d13d201:d8f86b49
    ARRAY /dev/md2 metadata=0.90 UUID=9f28af19:dd735a97:f0090154:079827c8
    ARRAY /dev/md3 metadata=0.90 UUID=5682de5a:98ebd29a:df9ea576:ef6ee481
    ARRAY /dev/md4 metadata=0.90 UUID=de20142e:4f6ad390:17d8605d:b2c36998
    ARRAY /dev/md10 metadata=1.2 name=danda.sraellis.com:10 UUID=5006cf00:449f8311:2bbdec12:b86c4ef1

    an alternative file that also avoids /dev/sdx entries...

    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f0511672:1c4c78a6:0d13d201:d8f86b49
    ARRAY /dev/md2 level=raid1 num-devices=2 UUID=9f28af19:dd735a97:f0090154:079827c8
    ARRAY /dev/md3 level=raid1 num-devices=2 UUID=5682de5a:98ebd29a:df9ea576:ef6ee481
    ARRAY /dev/md4 level=raid1 num-devices=2 UUID=de20142e:4f6ad390:17d8605d:b2c36998
    ARRAY /dev/md10 level=raid5 num-devices=3 UUID=5006cf00:449f8311:2bbdec12:b86c4ef1

    # mdadm --detail --scan
    shows what could be in your mdadm.conf
    # mdadm --detail /dev/mdx
    will provide the UUID for a single array and the current /dev/sdx assignments amongst other information...

    /dev/md10:
    Version : 1.2
    Creation Time : Tue May 5 11:30:08 2015
    Raid Level : raid5
    Array Size : 1845515264 (1760.02 GiB 1889.81 GB)
    Used Dev Size : 922757632 (880.01 GiB 944.90 GB)
    Raid Devices : 3
    Total Devices : 3
    Persistence : Superblock is persistent

    Intent Bitmap : Internal

    Update Time : Fri Sep 15 02:41:29 2017
    State : clean
    Active Devices : 3
    Working Devices : 3
    Failed Devices : 0
    Spare Devices : 0

    Layout : left-symmetric
    Chunk Size : 512K

    Name : danda.sraellis.com:10 (local to host danda.sraellis.com)
    UUID : 5006cf00:449f8311:2bbdec12:b86c4ef1
    Events : 15590

    Number Major Minor RaidDevice State
    3 8 54 0 active sync /dev/sdd6
    1 8 38 1 active sync /dev/sdc6
    2 8 24 2 active sync /dev/sdb8

    Can you post your fstab and mdadm.conf files...

    EDIT: - just saw this update

    I suspect that not everyone has a raid system with non-raid drives.

    Not everyone, but at least there are some systems here with a mixture of software raid and non-raid disks :-)
  • Accepted Answer

    Thursday, September 14 2017, 05:07 PM - #Permalink
    Resolved
    0 votes
    I'm not sure what you're aiming for. The safest way to mount drives in /etc/fstab is by UUID and it is easy to change the entries round. If you do that the best thing to do is reboot for it to take effect. "mount -a " may pick up the changes, but I don't know. If it does not and you don't want to reboot you'll need to unmount the drives first before giving the "mount -a" command.

    I'm afraid I don't understand this bit:
    Is there a way to edit the raid info to mount by UUID instead of /dev/sdx?
  • Accepted Answer

    Leon
    Thursday, September 14 2017, 04:35 PM - #Permalink
    Resolved
    0 votes
    Hi Nick

    I checked, and yes, I do not have any non-raid drives mapped in fstab.

    What I did see is exactly what I expected the issue to be:
    When I change - remove or add - non-raid drives, ClearOS does NOT assign the same /dev/sdx IDs to the drives.
    I suspect that this is the issue.

    I did query this before, but was told that you mount the raid by its UUID.
    I suspect that not everyone has a raid system with non-raid drives.

    Is there a way to edit the raid info to mount by UUID instead of /dev/sdx?
    That way it does not matter what Linux does.
  • Accepted Answer

    Leon
    Thursday, September 14 2017, 09:44 AM - #Permalink
    Resolved
    0 votes
    Hi Nick

    Yes, I did that before I shut the server down.
    I shall check again when I get home from work later today.
  • Accepted Answer

    Thursday, September 14 2017, 08:53 AM - #Permalink
    Resolved
    0 votes
    Can you look at /etc/fstab and comment out the entry which pointed to the drive you removed. In recovery mode you can use the nano editor, so something like:
    nano /etc/fstab
    or
    cd /etc
    nano fstab
    If you try to exit nano, it will prompt you to save the file.