nuke
Resolved
Over the past week, logwatch has been reporting kernel errors. Looking back through the logs, I see they started a few weeks ago: one error on one day, then nothing for weeks. In the past week there were two errors, so it is time to replace the drive.

After many attempts, I can't get the replacement hard drive to boot. I believe it is missing a link, but I'm not sure how to diagnose it. Any suggestions appreciated.

Background:
Here is a summary of my attempted fixes ...

Here is an example of the kernel errors.
--------------------- Kernel Begin ------------------------ 
WARNING: Kernel Errors Present
ata1.00: cmd 60/00:28:e0:66:c0/01:00:02:00:00/40 tag 5 ncq 131072 in#012 res 41/40:00:38:67:c0/00:00:02:00:00/00 Emask 0x409 (media error) <F> ...: 1 Time(s)
ata1.00: cmd 60/08:00:38:67:c0/00:00:02:00:00/40 tag 0 ncq 4096 in#012 res 40/00:80:a0:51:39/00:00:2c:00:00/40 Emask 0x1 (device error) ...: 1 Time(s)
ata1.00: cmd 60/08:30:38:67:c0/00:00:02:00:00/40 tag 6 ncq 4096 in#012 res 41/40:00:38:67:c0/00:00:02:00:00/00 Emask 0x409 (media error) <F> ...: 1 Time(s)
ata1.00: cmd 60/08:a0:38:67:c0/00:00:02:00:00/40 tag 20 ncq 4096 in#012 res 41/40:00:38:67:c0/00:00:02:00:00/00 Emask 0x409 (media error) <F> ...: 1 Time(s)
ata1.00: cmd 60/20:88:40:41:ad/00:00:2c:00:00/40 tag 17 ncq 16384 in#012 res 40/00:80:a0:51:39/00:00:2c:00:00/40 Emask 0x1 (device error) ...: 1 Time(s)
ata1.00: error: { UNC } ...: 3 Time(s)
blk_update_request: I/O error, dev sda, sector ...: 3 Time(s)
sd 0:0:0:0: [sda] tag#20 Add. Sense: Unrecovered read error - auto reallocat ...: 1 Time(s)
sd 0:0:0:0: [sda] tag#20 Sense Key : Medium Error [current] [descr ...: 1 Time(s)
sd 0:0:0:0: [sda] tag#5 Add. Sense: Unrecovered read error - auto reallocat ...: 1 Time(s)
sd 0:0:0:0: [sda] tag#5 Sense Key : Medium Error [current] [descr ...: 1 Time(s)
sd 0:0:0:0: [sda] tag#6 Add. Sense: Unrecovered read error - auto reallocat ...: 1 Time(s)
sd 0:0:0:0: [sda] tag#6 Sense Key : Medium Error [current] [descr ...: 1 Time(s)
---------------------- Kernel End -------------------------
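
To see how often these errors recur between logwatch mails, the raw log can be tallied directly. This is only a minimal sketch: it assumes the kernel messages are written to /var/log/messages (a common syslog default, not confirmed in this thread), and `count_ata_errors` is a hypothetical helper name.

```shell
# Tally ATA media and device errors in a kernel log file.
# Assumption: the default path below is where syslog stores kernel messages.
count_ata_errors() {
    log_file="${1:-/var/log/messages}"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    media=$(grep -c '(media error)' "$log_file" || true)
    device=$(grep -c '(device error)' "$log_file" || true)
    echo "media=$media device=$device"
}
```

Running `count_ata_errors /var/log/messages` on a few successive days gives a rough idea of whether the error rate is climbing.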


I installed smartmontools and ran it. The system disk is reporting some errors, so I decided to replace the hard drive.
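
For anyone repeating this check, the SMART attributes most relevant to unreadable sectors can be pulled out of the `smartctl -A` attribute table. A minimal sketch, assuming the usual ten-column table layout with the raw value in the last column; `smart_bad_sector_summary` is a hypothetical helper name and the attribute names can vary by drive vendor.

```shell
# Print the raw values of the SMART attributes that usually matter for a
# disk with unreadable sectors. Reads "smartctl -A" output on stdin.
# Assumption: attribute name is column 2 and the raw value is column 10.
smart_bad_sector_summary() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}
# Typical use (as root; /dev/sda is the device named in the kernel log):
#   smartctl -A /dev/sda | smart_bad_sector_summary
```

Non-zero and growing values here would support the decision to replace the drive.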

Normally I'd do a fresh install, but I have added a bunch of customizations that I didn't document, so I thought it would be easier to just clone the drive.
Sure enough, there were posts in the forum recommending Clonezilla.
Clonezilla failed the first time. I figured that must be due to the media errors reported in the kernel log, so I tried again with the --recover and fsck options. The second attempt was successful.

Then I replaced the drive in the server and rebooted. :-(

I got as far as the GRUB menu and selected an image, but it then dropped to dracut with the message: 'Entering emergency mode. Exit the shell to continue. Type "journalctl" to view system logs. You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot after mounting them and attach it to a bug report.'

Embarrassingly, I couldn't figure out how to mount the USB stick so I could copy the file off and attach it here. There is no /mnt and I was unable to create one.
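
One way this might be done from the dracut shell is to create the mount point under /run, which is a writable tmpfs, rather than under /. The sketch below is untested on this server: the partition name /dev/sdb1 and the directory /run/usbstick are assumptions, `save_rdsosreport` is a hypothetical helper, and the mount can still fail if the initramfs lacks a driver for the stick's filesystem.

```shell
# Copy the dracut report to a USB stick from the emergency shell.
# Set DRY_RUN=1 to print the commands instead of running them.
save_rdsosreport() {
    usb_part="$1"                      # e.g. /dev/sdb1 (assumed; check with blkid)
    mount_dir="${2:-/run/usbstick}"    # /run is writable in the initramfs
    run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }
    run mkdir -p "$mount_dir" &&
    run mount "$usb_part" "$mount_dir" &&
    run cp /run/initramfs/rdsosreport.txt "$mount_dir"/ &&
    run umount "$mount_dir"
}
```

Calling `DRY_RUN=1 save_rdsosreport /dev/sdb1` first shows exactly what would be executed.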

I rebooted using the ClearOS installer to try the recovery option. It got as far as the GRUB menu, but selecting the recovery entry dropped to dracut again. The same happened when I selected any of the other kernel images.

I tried to find the volumes to make sure they are there:
# lvm vgscan
# lvm vgchange -ay
# blkid


All there.
I ran xfs_repair on the LVM volumes, but clearos-root appears to be missing its superblock?
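
A further check that may help here: dracut typically drops to the emergency shell when a logical volume named on the kernel command line cannot be found. The sketch below compares the `rd.lvm.lv=` arguments against the volumes LVM reports. `missing_boot_lvs` is a hypothetical helper, and the `clearos/root` style names in the usage line are an assumption based on the clearos-root volume mentioned above.

```shell
# Print every rd.lvm.lv=VG/LV argument from a kernel command line that is
# absent from a whitespace-separated list of available VG/LV names.
missing_boot_lvs() {
    cmdline="$1"
    # Collapse newlines and repeated blanks so the match below is reliable.
    available=" $(printf '%s ' $2)"
    for arg in $cmdline; do
        case "$arg" in
            rd.lvm.lv=*)
                lv="${arg#rd.lvm.lv=}"
                case "$available" in
                    *" $lv "*) ;;          # volume exists, nothing to report
                    *) echo "$lv" ;;       # named at boot but not found
                esac
                ;;
        esac
    done
}
# Possible use from the dracut shell:
#   missing_boot_lvs "$(cat /proc/cmdline)" \
#       "$(lvm lvs --noheadings --separator / -o vg_name,lv_name)"
```

No output means every volume the boot entry asks for is present, which would point the search back at the filesystem or the initramfs rather than at LVM.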

Now stuck.

I put the old drive back. It boots and the server is working again, but the problem is not fixed. :-( Help.
Tuesday, May 25 2021, 12:33 AM

Accepted Answer

Thursday, May 27 2021, 07:52 AM
I can't think of any reason why getting the chroot line wrong would stop you from booting from USB/DVD again.

The initramfs message could be interesting. There are plenty of references for generating a new one. Google "generate initramfs centos 7". This looks interesting.
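
As a rough sketch of what regenerating the initramfs could look like on a CentOS 7 based system, the helper below only prints one `dracut -f` command per installed kernel so the list can be reviewed first. `print_dracut_commands` is a hypothetical name, and the /boot/initramfs-VERSION.img naming is assumed to match the stock layout.

```shell
# Print one "dracut -f" command per kernel found in a boot directory.
# Nothing is executed; review the output, then run it inside the chroot.
print_dracut_commands() {
    boot_dir="${1:-/boot}"
    for kernel in "$boot_dir"/vmlinuz-*; do
        [ -e "$kernel" ] || continue
        kver="${kernel##*/vmlinuz-}"
        # Skip the generic rescue kernel; it uses its own image.
        case "$kver" in 0-rescue-*) continue ;; esac
        echo "dracut -f /boot/initramfs-$kver.img $kver"
    done
}
```

In the installer's rescue mode the installed system is usually mounted under /mnt/sysimage, so the printed commands would be run after `chroot /mnt/sysimage`; that mount point is an assumption to verify on the actual system.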

I don't think /boot is in the LVM; I think it is a native partition in its own right. I am not sure you can copy /boot from one drive to another, as it contains references to the drive partitions.

Otherwise, especially if you have a significant user set up or OpenVPN certificates deployed, I'd go for option 2 over 3, but note that after you do a system restore, I'd delete and recreate websites and flexshares as a restore on its own does not generate the folders or the bind mounts in /etc/fstab. If you can still mount your old disk you may be able to recover those and the rest of your data. They should be fast to copy over if you have both disks mounted in the same machine. From /etc/fstab, don't copy over the whole file or you will really mess things up, just copy the bind mount settings.
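
To pick out just the bind mount lines from the old disk's fstab, something like the following could be used. It is a minimal sketch: `list_bind_mounts` is a hypothetical helper, and the example path in the comment assumes the old root filesystem is mounted at /mnt/old.

```shell
# Print only the bind-mount entries of an fstab file.
# Field 4 holds the mount options; comments and blank lines are skipped.
list_bind_mounts() {
    fstab_file="$1"
    awk '!/^[ \t]*(#|$)/ && $4 ~ /(^|,)bind(,|$)/' "$fstab_file"
}
# Example (assumed mount point for the old disk):
#   list_bind_mounts /mnt/old/etc/fstab
```

The printed lines can then be appended to the new /etc/fstab by hand after checking that each source directory exists.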
Responses (11)
  • Accepted Answer

    Tuesday, May 25 2021, 07:15 AM
    Clonezilla is a good tool and should work.
    I've recently had good experiences with MiniTool ShadowMaker (Free). It is a Windows tool, but it worked.
  • Accepted Answer

    Tuesday, May 25 2021, 07:31 AM
    For disk repair, if TestDisk won't do it, I think you have to go professional.
  • Accepted Answer

    nuke
    Thursday, May 27 2021, 02:48 AM
    Nick Howitt wrote:

    For disk repair, if TestDisk won't do it, I think you have to go professional.


    Thank you Nick and Patrick. I might try Clonezilla again. I don't have a Windows PC, so unfortunately I can't try MiniTool ShadowMaker.

    I used TestDisk on the drive and the partitions look to be OK. After the "analysis", TestDisk appears to have added the * boot flag to the right partition. After running TestDisk, I swapped the new drive into the server but it still won't boot. :(

    I suspect there is a missing link to the images or file or grub is buggered. When doing the ClearOS Recovery from the installer there was a part about "chroot" that I think I should have added something on the cmd line but it wasn't clear in the instructions and then I couldn't get the server to boot from the DVD or USB again. I've decided that Dell BIOS is a PITA.

    Grub shows the list of kernels but when you select any of them you drop to dracut with a message about initramfs...

    (If I knew more, I suspect this would be relatively easy to figure out but I'm a bit of a novice.)

    This leaves me with a couple of questions since I can't figure this out.
    1) LVM is supposed to be able to do live migration of data. I've started to read the Red Hat documentation, but it is a whole book just for LVM.
    I think I should be able to create a snapshot of the boot directory etc that works on the failing drive, and copy/migrate that to the new drive.
    Does this make sense or am I looking for lvm to do something it isn't set up to do?

    Or

    2) I've done backups of all the data and settings. I've got daily backups from "Configuration Backup and Restore" going back years. The /home is on a separate drive. All email is POP, so everyone has their email locally. The drive that is starting to have problems contains only the COS system (and mail & websites).

    If I do a fresh/new install on the new drive, can I boot to the webconfig and restore all the settings using "Configuration Backup and Restore"? Would this be less futzing around and get up and running again relatively quickly? I expect I'll have to put back a copy of the websites but that shouldn't be an issue. I'm not sure about the mail server but with the settings files and no mail stored on the server that shouldn't be too big a chore?

    Or

    3) Should I be rebuilding the complete server from scratch and manually set up everything?

    Thanks again for your help and suggestions.
  • Accepted Answer

    Thursday, May 27 2021, 07:58 AM
    nuke wrote:

    Thank you Nick and Patrick. I might try Clonezilla again. I don't have a Windows PC, so unfortunately I can't try MiniTool ShadowMaker.


    Besides the tips from Nick and the other methods, you can also use this tool.
    I've done this many times with my COS disk to make a complete clone.

    Use an external hard drive docking station like this one (for example).
    It clones an HDD without a PC.

    https://images-na.ssl-images-amazon.com/images/I/4136PRJnyUL._AC_SX466_.jpg
  • Accepted Answer

    nuke
    Friday, May 28 2021, 07:46 PM
    Patrick de Brabander wrote:

    Besides the tips from Nick and the other methods, you can also use this tool.
    I've done this many times with my COS disk to make a complete clone.

    Use an external hard drive docking station like this one (for example).
    It clones an HDD without a PC.


    Thank you Patrick. That is a cool piece of gear. I'll be having a look on the weekend. That would certainly save some effort.
  • Accepted Answer

    nuke
    Friday, May 28 2021, 08:14 PM
    Nick Howitt wrote:

    I can't think of any reason why getting the chroot line wrong would stop you from booting from USB/DVD again.

    The initramfs message could be interesting. There are plenty of references for generating a new one. Google "generate initramfs centos 7". This looks interesting.

    I don't think /boot is in the LVM; I think it is a native partition in its own right. I am not sure you can copy /boot from one drive to another, as it contains references to the drive partitions.

    Otherwise, especially if you have a significant user set up or OpenVPN certificates deployed, I'd go for option 2 over 3, but note that after you do a system restore, I'd delete and recreate websites and flexshares as a restore on its own does not generate the folders or the bind mounts in /etc/fstab. If you can still mount your old disk you may be able to recover those and the rest of your data. They should be fast to copy over if you have both disks mounted in the same machine. From /etc/fstab, don't copy over the whole file or you will really mess things up, just copy the bind mount settings.


    Thank you Nick.

    You are correct. The /boot is an ext2/3/4 partition and is not in the LVM.

    The second partition on the hard drive is the LVM volume. It appears that the cloned /boot has defective sectors or corrupted file(s). I suspect that is why it won't boot up properly.

    I've read the info about initramfs and will give that a try but I think I need to try a two pronged approach.

    I still have my old server box with similar hardware, one generation older. The new server is built the same way, just with a newer CPU and more RAM. Everything else is essentially the same.

    Would the following work?
    Could I install the current COS version onto my old server, restore the settings and, if everything runs OK, just swap the hard drive into the new server? This way I wouldn't touch the existing server's hard drive again until I do the swap. I think that reduces the risk of the failing drive's sectors giving out during start-up before I've got a rebuilt server.

    Is that advisable, or is the automatic hardware identification in COS going to have a problem with the switch?

    Thanks again for all the advice!
  • Accepted Answer

    Friday, May 28 2021, 08:30 PM
    I have successfully moved a disk between servers. It is how I did my 7.x installation, but note that all your network interfaces may (will) change so check the files you will need to review in the Config Backup docs. The other thing is the two servers will have to boot the same way, either UEFI or BIOS.
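
    A quick way to compare how two Linux machines booted is to look for the EFI directory in sysfs, which is only present after a UEFI boot. A minimal sketch; `boot_mode` is a hypothetical helper and the optional argument exists only so it can be pointed at a different sysfs root.

```shell
# Report whether the running system was booted via UEFI or legacy BIOS.
# /sys/firmware/efi only exists when the kernel was started by UEFI firmware.
boot_mode() {
    sys_root="${1:-/sys}"
    if [ -d "$sys_root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}
```

    Running `boot_mode` on both servers (for example from the installer's shell) should print the same value before the disk is moved.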

    Note, if doing a disk recovery, best practice says always do it on a clone of the disk and never on the disk itself.
  • Accepted Answer

    nuke
    Sunday, May 30 2021, 02:28 AM
    Nick Howitt wrote:

    I have successfully moved a disk between servers. It is how I did my 7.x installation, but note that all your network interfaces may (will) change so check the files you will need to review in the Config Backup docs. The other thing is the two servers will have to boot the same way, either UEFI or BIOS.

    Note, if doing a disk recovery, best practice says always do it on a clone of the disk and never on the disk itself.

    Thanks Nick.
    I have reinstalled on the new drive and restored all the settings via "Configuration Backup and Restore". It was mostly OK. There were issues with the Let's Encrypt certificates, flexshare, fail2ban, mounting our nas and virtual websites. I think those are fixed now.

    I have two outstanding issues.
    First, the reinstall seems to have buggered up the certificate for the webconfig. It is now issued for localhost.localdomain, but it should be using my server's Let's Encrypt certificate. I can't figure out where to change this. It doesn't appear to be in the HowTo unless I missed it.

    The second is not unexpected but frustrating ... Even after I told everyone to download their email, one family member didn't. So the second question is: what search terms would you use to figure out how to rescue the leftover mail that is on the old server disk?
  • Accepted Answer

    Sunday, May 30 2021, 07:21 AM
    nuke wrote:
    The second is not unexpected but frustrating ... Even after I told everyone to download their email, one family member didn't. So the second question is: what search terms would you use to figure out how to rescue the leftover mail that is on the old server disk?

    If you still have the original disk and a spare PC, you can use them to start up the old server within the network.
    Open the necessary ports and then try to access the old server with ssh.
    If you are able to access the server, you can sync the mailbox with imapsync.
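
    A possible shape for that imapsync call is sketched below. It only prints the command, with `--dry` included so that a first run would change nothing. The host names, the password file paths and the helper name `print_imapsync_command` are all assumptions for illustration.

```shell
# Print an imapsync command line for copying one mailbox between servers.
# Assumptions: same user name on both servers; passwords stored in the
# two files below (one password on the first line of each).
print_imapsync_command() {
    old_host="$1"; new_host="$2"; mailbox="$3"
    echo "imapsync --dry" \
         "--host1 $old_host --user1 $mailbox --passfile1 /root/old.pass" \
         "--host2 $new_host --user2 $mailbox --passfile2 /root/new.pass"
}
```

    Once the dry run output looks right, the same command without `--dry` would perform the actual copy.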
  • Accepted Answer

    Sunday, May 30 2021, 08:26 AM
    For LE, it is in the documentation for LE - https://documentation.clearos.com/content:en_us:7_ug_lets_encrypt#replace_the_self-signed_certificate_for_webconfig. The setting is not backed up, as you say. I wonder if it can be safely backed up, but the restore program would need extra functionality to restart the webconfig.

    For e-mail, if you use cyrus-imapd, all mail is under /var/spool/imap and /var/lib/imap. The raw mails are under /var/spool/imap, and there is a database and other stuff under /var/lib/imap. These can be copied/rsync'd/tar'd across, but that will put everyone's mail back to the state it was in on the old disk. If your e-mails have moved on since then, you have a bit of a pickle. You can possibly copy any new e-mails for the user as a backup, then copy in all the old e-mails under /var/spool/imap, but they will all appear as unread and you won't see which ones you have replied to. The new ones you've backed up can probably then be copied in as well, but make sure the file names don't clash. Be careful manipulating the files as the naming is odd: they all end with a "." and things don't always go as expected. I think if you edit them you initially lose the trailing ".". Make sure you also copy in the cyrus.* files in each folder or you will have to run "reconstruct" on the mailbox.

    Alternatively you can present your user with all his old files and tell him to add the "eml" extension to each one and then import it into his client.
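
    Adding the extension by hand gets tedious for a large mailbox, so the copy and rename could be scripted. A minimal sketch, assuming the message files are the ones whose names start with a digit and end with "." as described above; `export_as_eml` is a hypothetical helper and it deliberately leaves the cyrus.* files behind.

```shell
# Copy every Cyrus message file ("1.", "2.", ...) from one mailbox folder
# into a destination directory, adding the "eml" extension on the way.
export_as_eml() {
    src_dir="$1"; dst_dir="$2"
    mkdir -p "$dst_dir" || return 1
    for msg in "$src_dir"/[0-9]*.; do
        [ -f "$msg" ] || continue              # no matches: glob stays literal
        name="${msg##*/}"                      # e.g. "123."
        cp -p "$msg" "$dst_dir/${name}eml"     # becomes "123.eml"
    done
}
```

    The originals are only read, never modified, so this can be run against the old disk mounted read-only.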

    You may be able to do the import for him in a fashion. Get him to create an empty folder "old" or some other name under his inbox. Copy in all the old folder structure under this folder. If he can't then see all his old e-mails, I think one of the overnight cron jobs or a restart of cyrus-imapd will make everything visible again.
  • Accepted Answer

    nuke
    Wednesday, June 09 2021, 01:16 AM
    Nick Howitt wrote:

    For LE, it is in the documentation for LE - https://documentation.clearos.com/content:en_us:7_ug_lets_encrypt#replace_the_self-signed_certificate_for_webconfig. The setting is not backed up, as you say. I wonder if it can be safely backed up, but the restore program would need extra functionality to restart the webconfig.

    For e-mail, if you use cyrus-imapd, all mail is under /var/spool/imap and /var/lib/imap. The raw mails are under /var/spool/imap, and there is a database and other stuff under /var/lib/imap. These can be copied/rsync'd/tar'd across, but that will put everyone's mail back to the state it was in on the old disk. If your e-mails have moved on since then, you have a bit of a pickle. You can possibly copy any new e-mails for the user as a backup, then copy in all the old e-mails under /var/spool/imap, but they will all appear as unread and you won't see which ones you have replied to. The new ones you've backed up can probably then be copied in as well, but make sure the file names don't clash. Be careful manipulating the files as the naming is odd: they all end with a "." and things don't always go as expected. I think if you edit them you initially lose the trailing ".". Make sure you also copy in the cyrus.* files in each folder or you will have to run "reconstruct" on the mailbox.


    Thanks Nick.

    I wanted to close this adventure off for now. I changed the self-signed cert as per the link provided. No more messages! Thanks for the link! I'm not sure why I didn't find it myself!

    I gave up on the cyrus email extract and told the individual in question that the email was lost. It's a bit of a cop-out but I ran out of patience. :)

    As always, thanks for all your help, Nick. Much appreciated!!