Stressful Day

Today was a rather stressful day.

When I built my current PC several years ago, ECC memory was part of my specification. ECC memory is important when compiling software to distribute. Every now and then, a bit in memory gets randomly flipped that should not get flipped, and this happens with increasing frequency as semiconductors get smaller. While often harmless, a bit that flipped on my system during compilation could end up in the software I ship and cause monetary damage to someone using it.

ECC memory is error correcting: when a single bit in a memory word flips, ECC memory can detect and correct it. However, ECC memory is not supported by consumer grade Intel CPUs, so I have to use a Xeon processor to utilize it.
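To illustrate how a single flipped bit can be located and corrected, here is a toy sketch using a Hamming(7,4) code. This is an assumption chosen for illustration, not what my memory modules actually implement; real ECC memory typically protects 64-bit words with 8 extra check bits. The function names are made up for this example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    data = [(nibble >> i) & 1 for i in range(4)]  # data[0] is the least significant bit
    code = [0] * 8  # index 0 is unused so list indices match bit positions
    code[3], code[5], code[6], code[7] = data
    # Each parity bit covers the positions whose index has that binary digit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, corrected_position); position 0 means no error was seen."""
    code = [0] + list(bits)
    # XOR-ing the positions of all set bits gives 0 for a valid codeword,
    # or the position of the flipped bit if exactly one bit is wrong.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # flip the bad bit back
    data = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(data))
    return nibble, syndrome
```

Encoding a value, flipping any one of the seven bits, and decoding again returns the original value along with the position that was repaired. Two flipped bits in the same codeword would be miscorrected by this toy version.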

Many Xeon processors, mine included, do not have an integrated GPU, so you have to use a video card if you are going to use one with a Graphical User Interface (it is possible to attach a standard VGA monitor and only use the command line, but while there are CLI web browsers, how would I see all the useless memes people post?).

When first building the system, I went with a very inexpensive, outdated video card that uses the legacy Nvidia driver. It was cheap and low power, but powerful enough for all my personal graphics needs. The original card (GeForce 405):

Old low-profile nVidia video card

This card was an OEM card, meaning Nvidia did not intend it for retail sale. I bought it from an importer for around $10, as it was already deprecated and had been low end even when it was new. But it did everything I personally needed from a video card and worked with a low-profile bracket.

Years later (in 2020) the fan on it started to die. It would stop spinning and then it would overheat and then my system would crash. Not fun, but fortunately my PC is built in a Media PC case so I could have the top off and monitor whether or not the fan was spinning and power down if it stopped. But obviously the card needed to be replaced.

They no longer sell that card or any card that uses the same driver, so I went with an MSI branded GeForce GT 1030. It is not a ‘gaming’ card either, though probably good enough for many games. What I like about it is that, like the card it is replacing, it has low power consumption, so I do not need to replace my power supply. Unlike the card it is replacing, it has a massive heat sink instead of a fan. With no fan to fail, it should last until the card itself dies.

There is an open source driver for Nvidia cards and I am an open source advocate but the open source driver for Nvidia cards sucks. It just does not work very well. The problem is Nvidia does not open up the hardware specs needed for kernel developers to maintain a quality driver.

The old card needed the legacy proprietary drivers that do not work with the new card and the newer card needed the current proprietary drivers that do not work with the old card.

Switching should have been simple—

Boot to command line with old card installed, power off, switch cards, boot to command line with new card installed, remove the legacy drivers, install current drivers, verify GUI works, switch back to booting to GUI.

Should have been simple and actually being simple, however, turned out to be two different things.

I literally could not boot to the command line. The system would hang during the boot process. This was with the old card installed. Many moons ago when I built this system (when CentOS 7 was still fairly new) booting to command line worked, but it was not working now.

Booting to the single-user emergency CLI still worked, but booting to the normal multi-user CLI did not. I had to boot to the single-user emergency CLI and change the system to boot to the GUI again, so I was not able to remove the legacy drivers.

To try and diagnose the problem, I again set it to boot to multi-user CLI but removed the rhgb quiet boot options. This almost bricked the system. I did see the messages during the boot process, but nothing of value. And now, even booting to emergency single user CLI failed!

Fortunately I was able to edit the boot parameters (press e key when selecting kernel to boot in grub2 menu) and get it to boot to GUI. But no boot to CLI.

After a lot of searching online, where all I could find was the opposite problem (booting to CLI worked but not booting to GUI), I finally found the solution: add the nomodeset kernel option. That apparently tells the kernel not to even try to load the video drivers until the GUI is started (it disables kernel mode setting). To me, it seems that should be set by default; there is no need to load video drivers until then anyway.
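For anyone wanting to make that kind of kernel option change permanent, here is a minimal sketch of the text edit involved. It assumes the options live on a GRUB_CMDLINE_LINUX="..." line in /etc/default/grub, which is the usual place on CentOS 7 but is not necessarily how I made the change. The function name edit_kernel_cmdline is made up for this example, and it only transforms text; it does not touch any file.

```python
import re


def edit_kernel_cmdline(grub_defaults, add=(), remove=()):
    """Return the grub defaults text with kernel options added or removed.

    Only the GRUB_CMDLINE_LINUX="..." line is changed; other lines are kept.
    """
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX=")([^"]*)(")$', re.MULTILINE)

    def rewrite(match):
        # Drop unwanted options, then append any new ones not already present.
        options = [opt for opt in match.group(2).split() if opt not in remove]
        for opt in add:
            if opt not in options:
                options.append(opt)
        return match.group(1) + " ".join(options) + match.group(3)

    return pattern.sub(rewrite, grub_defaults)


# Example input modeled on a typical CentOS 7 defaults file (an assumption).
sample = (
    'GRUB_TIMEOUT=5\n'
    'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"\n'
)
print(edit_kernel_cmdline(sample, add=["nomodeset"], remove=["rhgb", "quiet"]))
```

After saving the edited text back to /etc/default/grub, the boot menu has to be regenerated, for example with grub2-mkconfig -o /boot/grub2/grub.cfg on a BIOS system (the output path differs on UEFI systems).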

That got things working to where I could boot to CLI. Then I was able to follow the plan and it all worked. I removed the old card, installed the new card, booted to CLI, removed the legacy driver, installed the modern driver, rebooted to CLI and manually started the GUI, then reconfigured the system to boot straight to GUI.

It is all working now but was a very frustrating experience.

Image of new card installed:

PC without top panel

The new card is the card with the big massive heat sink that is between the power supply (lower left) and CPU (tan fan on top).

Things are working, I can put the case cover back on and not have to worry about the system crashing…
