09 April 2014

571. Briefly: Dodgy/underpowered PSU?

I've built quite a few computers in the past, and in general I haven't had any issues beyond the odd dodgy RAM stick.

However, a while back I became careless and built a box ('Oxygen') where the motherboard didn't officially support the CPU. Swapping CPUs with another box seemed to solve the issues I had.

See e.g.
http://verahill.blogspot.com.au/2013/10/520-new-node-amd-fx-835032-gb-ram990-fx.html
http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

In the past couple of weeks I've begun to see some worrying signs that all isn't right. In particular I noticed the following in the dmesg output:
[693166.514897] [Hardware Error]: MC2 Error: VB Data ECC or parity error.
[693166.514926] [Hardware Error]: Error Status: Corrected error, no action required.
[693166.514934] [Hardware Error]: CPU:6 (15:1:2) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98414000010c0176
[693166.514955] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
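
The flags in the MC2_STATUS line can be read straight off the hex value. Here's a minimal sketch in Python, assuming the bit positions that (as far as I know) the Linux kernel uses for the MCi_STATUS register, plus the corrected/uncorrected ECC bits it checks on AMD. decode_mc_status is a made-up helper name, not part of any tool:

```python
# Assumed bit positions (MCi_STATUS layout as used by the Linux kernel's
# MCE code; bits 46 and 45 are the AMD corrected/uncorrected ECC flags).
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (printed as UE; unset means CE)
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register holds valid data
    "AddrV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # ECC error that could not be corrected
}


def decode_mc_status(status):
    """Return the names of all flags whose bit is set in `status`."""
    return {name for name, bit in STATUS_BITS.items() if (status >> bit) & 1}


flags = decode_mc_status(0x98414000010c0176)
print(sorted(flags))
```

For the value above this gives Val, En, MiscV and CECC, which agrees with the [-|CE|MiscV|-|-|-|-|CECC] summary that the kernel printed.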

A few days after that, the computer turned itself off without returning any additional error messages. It did cause me to look at the sensor output though (I've been logging it every two minutes for months), and I compared it with another computer ('Neon') which is completely stable. Note that both computers have been running the same types of jobs recently (large memory frequency jobs).

Hardware specs:
Oxygen: AMD FX8150, 32 GB RAM, Corsair GS700, ASRock 990 FX Extreme3
Neon: AMD FX8350, 32 GB RAM, Corsair GS800, Gigabyte 990 FXA

Anyway, this is what I found:
On Neon the power output is very stable, while on Oxygen it jumps up and down between ca 45 W and 130 W.
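
To put a number on "jumps up and down", something like the following could be run over the logged sensor output. It's only a sketch: it assumes that the log contains lines in the format that `sensors` prints for the fam15h_power driver (power1: ... W), and the function names are made up for illustration.

```python
import re

# Assumed line format (lm-sensors, fam15h_power driver), e.g.
#   power1:        45.00 W  (crit = 125.19 W)
POWER_LINE = re.compile(r"^power1:\s+([0-9.]+)\s+W", re.MULTILINE)


def power_readings(log_text):
    """Extract every power1 reading (in watts) from the logged text."""
    return [float(watts) for watts in POWER_LINE.findall(log_text)]


def power_spread(readings):
    """Return (lowest, highest, highest - lowest) for a list of readings."""
    lowest, highest = min(readings), max(readings)
    return lowest, highest, highest - lowest


# Two illustrative lines using the extremes quoted above, not real log data.
sample = (
    "power1:        45.00 W  (crit = 125.19 W)\n"
    "power1:       130.00 W  (crit = 125.19 W)\n"
)
print(power_spread(power_readings(sample)))
```

A stable box should give a small spread over a period with constant load, while the jumps described above would show up as a spread of roughly 85 W.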

Has it been a crappy PSU that has been causing the issues all along? Or do these plots mean nothing?

570. Briefly: restarting a g09 frequency job with SGE, using same queue

I've had g09 frequency jobs die on me, and in g09 analytical frequency jobs can only be restarted from the .rwf file. Because the .rwf files are 160 GB, I don't want to copy them back and forth between nodes. It's easier to simply make sure that the restarted job runs on the same node as the original job.

A good resource for SGE related stuff is http://rous.mit.edu/index.php/SGE_Instructions_and_Tips#Submitting_jobs_to_specific_queues

Either way, first figure out what node the job ran on. Assuming that the job number was 445:
qacct -j 445|grep hostname
hostname compute-0-6.local

Next figure out the PID, as this is used to name the Gau-[PID].rwf file:
grep PID g03.g03out
Entering Link 1 = /share/apps/gaussian/g09/l1.exe PID= 24286.
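
If you do this often, the PID lookup can be scripted. A minimal sketch in Python, assuming the 'Entering Link 1' line looks like the one above and that the .rwf lives in /scratch (both the function name and the default directory are just placeholders for your own setup):

```python
import re


def rwf_path_from_log(log_text, scratch_dir="/scratch"):
    """Build the Gau-[PID].rwf path from the 'Entering Link 1' line of a log."""
    match = re.search(r"Entering Link 1 = \S+ PID=\s*(\d+)\.", log_text)
    if match is None:
        raise ValueError("no 'Entering Link 1' line found")
    return "{}/Gau-{}.rwf".format(scratch_dir, match.group(1))


line = " Entering Link 1 = /share/apps/gaussian/g09/l1.exe PID= 24286."
print(rwf_path_from_log(line))
```

With the line shown above this prints /scratch/Gau-24286.rwf.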

You can now craft your restart file, g09_freq.restart -- you'll need to make sure that the paths are appropriate for your system:
%nprocshared=8
%Mem=900000000
%rwf=/scratch/Gau-24286.rwf
%Chk=/home/me/jobs/testing/delta_631gplusstar-freq/delta_631gplusstar-freq.chk
#P restart
(having empty lines at the end of the file is important) and a qsub file, g09_freq.qsub:
#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=999:30:00
#$ -l h_vmem=8G
#$ -j y
#$ -pe orte 8
export GAUSS_SCRDIR=/tmp
export GAUSS_EXEDIR=/share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
/share/apps/gaussian/g09/g09 g09_freq.restart > g09_freq.out
Then submit it to the correct queue by doing
qsub -q all.q@compute-0-6.local g09_freq.qsub
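
The node lookup and the submission can also be glued together. A rough sketch in Python, assuming the qacct output contains a 'hostname' line as shown earlier and that the queue is called all.q. The function only builds the argument list (it doesn't run anything), and its name is made up:

```python
def qsub_command(qacct_output, script, queue="all.q"):
    """Return the qsub argument list that targets the node the job ran on."""
    for line in qacct_output.splitlines():
        fields = line.split()
        # qacct prints 'hostname <node>' as one of its key/value lines
        if len(fields) == 2 and fields[0] == "hostname":
            return ["qsub", "-q", "{}@{}".format(queue, fields[1]), script]
    raise ValueError("no hostname line in qacct output")


cmd = qsub_command("hostname     compute-0-6.local\n", "g09_freq.qsub")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call() on a submit host if you would rather not copy and paste the command.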

The output goes to g09_freq.log. You know that the restart worked properly if it says
Skip MakeAB in pass 1 during restart.
and
Resume CPHF with iteration 214.
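
Checking for those two lines is easy to automate. A small sketch in Python, assuming the messages appear in the log exactly as quoted above (restart_status is a hypothetical helper name):

```python
import re


def restart_status(log_text):
    """Return (makeab_skipped, resumed_iteration) for a g09 restart log.

    resumed_iteration is None if no 'Resume CPHF' line was found.
    """
    skipped = "Skip MakeAB in pass 1 during restart." in log_text
    match = re.search(r"Resume CPHF with iteration\s+(\d+)\.", log_text)
    iteration = int(match.group(1)) if match else None
    return skipped, iteration


log = (
    " Skip MakeAB in pass 1 during restart.\n"
    " Resume CPHF with iteration 214.\n"
)
print(restart_status(log))
```

If the function returns (False, None), the job most likely started the frequency calculation from scratch.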

Note that restarting analytical frequency jobs in g09 can be a hit-and-miss affair. Jobs that run out of time are easy to restart, and some jobs that died silently have also been restarted successfully. On the other hand, a job that died because my resource allocation ran out couldn't be restarted, i.e. the restart ran the freq job from scratch. The same happened on a node of mine that has what seems like a dodgy PSU. Finally, I also couldn't restart jobs that died silently due to allocating all the RAM to g09 without leaving any for the OS (or at least that's the current best theory). It may thus be a good idea to back up the .rwf file every now and again, in spite of the unwieldy size.

03 April 2014

569. Briefly: Dual monitor set-up on Debian Wheezy with Gnome and a single nvidia graphics card

This is very easy, but I might as well document it here anyway.

This morning another group threw out two functioning monitors and I grabbed both. While I haven't yet decided on what to do with the second one I decided to use one to make a dual monitor set-up for my work station.

My desktop has both onboard nvidia graphics and a separate pci-e nvidia (GT 430) graphics card. Using lspci only the external graphics card shows up, probably because the bios prioritises the external card and disables the onboard graphics.

The nvidia GT 430 card has three output ports: vga, hdmi and dvi. My main monitor (Dell P2411H, 1920x1080) has both vga and dvi, and my 'new' monitor (HP S1932, 1366x768) only has vga.

The first step was to physically connect both monitors to my computer. I originally thought I had to use one card per monitor, which would have required me to reboot, change the BIOS settings and probably hand-craft an xorg.conf. I don't like rebooting, so I looked at the alternatives.

Apparently you can simply connect both monitors to the same card by using the different ports, so I hooked up the small screen to the vga port and the big one to the hdmi port.

After that it was a simple matter of opening 'displays' in the gnome 3 systems settings, setting both monitors to 'on', and arranging them side by side correctly by dragging them with the mouse:



I also had a look at it in nvidia-settings:

The only issue that remained was guake: it was showing up on the 'wrong' screen (i.e. the left-most, smaller one). This post showed how to edit guake to fix that: http://haifzhan.blogspot.com.au/2013/10/guake-dual-monitor-setup.html

This is how to do it on the version currently in wheezy, 0.4.3:
sudo vim `which guake`
    def get_final_window_rect(self):
        """Gets the final size of the main window of guake. The height
        is the window_height property, width is window_width and the
        horizontal alignment is given by window_alignment.
        """
        screen = self.window.get_screen()
        height = self.client.get_int(KEY('/general/window_height'))
        width = 100
        halignment = self.client.get_int(KEY('/general/window_halignment'))

        # get the rectangle just from the first/default monitor in the
        # future we might create a field to select which monitor you
        # wanna use

        #monitor = 0 # use the left most monitor
        monitor = screen.get_n_monitors() - 1 # use the rightmost monitor

        monitor_rect = screen.get_monitor_geometry(monitor)
        window_rect = monitor_rect.copy()
        total_width = window_rect.width
        window_rect.height = window_rect.height * height / 100
        window_rect.width = window_rect.width * width / 100

        if width < monitor_rect.width:
            if halignment == ALIGN_CENTER:
                window_rect.x = monitor_rect.x+(monitor_rect.width - window_rect.width) / 2
            elif halignment == ALIGN_LEFT:
                window_rect.x = monitor_rect.x
            elif halignment == ALIGN_RIGHT:
                window_rect.x = monitor_rect.x+monitor_rect.width-window_rect.width
        window_rect.y = monitor_rect.y
        return window_rect
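
The placement arithmetic can be tried outside guake. What follows is a simplified stand-alone sketch, not guake's own code: Rect, the ALIGN_* values and final_window_rect are stand-ins I made up, the monitor list is assumed to be ordered left to right, and it uses integer division throughout.

```python
from collections import namedtuple

# Stand-in for the gtk rectangle type that guake works with.
Rect = namedtuple("Rect", "x y width height")

# Placeholder values; only their distinctness matters here.
ALIGN_CENTER, ALIGN_LEFT, ALIGN_RIGHT = range(3)


def final_window_rect(monitors, height_pct, width_pct=100,
                      halignment=ALIGN_CENTER):
    """Place the window on the last monitor in the list.

    height_pct and width_pct are percentages of that monitor's size.
    """
    monitor = monitors[-1]  # same idea as get_n_monitors() - 1
    width = monitor.width * width_pct // 100
    height = monitor.height * height_pct // 100
    if halignment == ALIGN_LEFT:
        x = monitor.x
    elif halignment == ALIGN_RIGHT:
        x = monitor.x + monitor.width - width
    else:
        x = monitor.x + (monitor.width - width) // 2
    return Rect(x, monitor.y, width, height)


# Small screen on the left, big screen on the right, as in my set-up.
monitors = [Rect(0, 0, 1366, 768), Rect(1366, 0, 1920, 1080)]
print(final_window_rect(monitors, height_pct=50))
```

With the two screens above and a 50% height setting, the window starts at x=1366, i.e. at the left edge of the big screen, and spans its full 1920 pixel width.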

Note that the edited version will be overwritten when you upgrade guake.