My Variables Only Have 6 Letters

Wednesday, July 25, 2007

MPICH.NT, MPICH2, and TCP Offloading (continued)

Well we tried updated drivers for the NC373i but with no avail. We were still seeing TOE errors which resulted in MPICH.NT and MPICH2 hanging. Luckily the machines have second NIC's, NC380T PCIe DP Multifunc Gigabit Server Adapters. Unluckily for us, those network cards exhibited the same problem as the embedded NC373i.

This leads me to believe that there is a problem with the Windows 2003 R2 Scalable Networking Pack. Specifically with the Chimney TCP Offloading portion. There may even be an issue in MPICH.NT and MPICH2 and Windows 2003 R2's SNP. However, that just seems highly unlikely as MPICH.NT and MPICH2 are wildly different under the hood.

We've initiated a case with Microsoft to get to the bottom of this issue. Meanwhile our new servers have TOE disabled, which isn't bad, but it isn't good either.

Monday, July 23, 2007

MPICH.NT, MPICH2, and TCP Offloading

Recently we came across a strange problem where on some new machines MPICH.NT and MPICH2 both would fail to correctly operate across machines. On a single machine there were no problems, and actually the application partially worked under MPICH.NT.

My initial guess was MPICH.NT did not play well with Windows 2003 Server R2. Previously we have only had Windows 2000 Server for the MPI jobs, so it seemed logical that changing the server OS would cause some issues. I recompiled the application for MPICH2 (bigger, newer, better) and found the application "hung" in exactly the same fashion as under MPICH.NT.

So now I had a common failure mode across two versions of MPICH (which are wildly different under the hood) on the same OS. I started running MPICH in debug/verbose mode and spent a lot of time looking at the thousands of lines of output and noticed that both under MPICH.NT and MPICH2 they came to the same place and halted, only I couldn't tell where in the code this was.

You may think this is where I fired up the parallel debugger and did wild and crazy things, but that takes too much time. I went with good ole fashioned printf debugging. I ended up getting output like the following:

[0] calling MPI_Bcast ...
[1] calling MPI_Bcast ...
[2] calling MPI_Bcast ...
[3] calling MPI_Bcast ...
[4] calling MPI_Bcast ...
...
[N-2] calling MPI_Bcast ...
[N-1] calling MPI_Bcast ...

Where [X] is the process number and if the call was successful the MPI_STATUS code would be returned on the next line, however, none of these calls returned. An important thing to note is these MPI_Bcast calls were sending buffers of 150MiB+, which is quite large. This fact drove me to check on the network card settings.

While poking around in the network card settings, I noticed two things:

Network card drivers were out of date
Network card status tool counters showed some errors in TCP Offloading

Using the information that TOE had reported errors in the past, I reran the application and watched the TOE error counters. Sure enough, when that MPI_Bcast line was reached, the TOE error counters incremented by one. So I went and disabled TOE on my two test machines and reran the application.

Boom, problem solved. Well...not really.

Disabling TOE will kill performance for other applications that do not have an issue with offloading. However, I cannot upgrade the drivers for the network card (even as a test) without going through 10 miles of red tape. That is neither here nor there, the important part is the problem has been identified and can be solved.

So if you have an HP NC373i Multifunction Gigabit Ethernet Adapter and experience problems with MPICH or MPICH2 on Windows, it is probably the TCP Offloading Engine. Try updating your drivers or disabling TOE to solve the issue. I will post an update if the latest drivers indeed fix this issue.

Tuesday, July 17, 2007

CNN, Kashiwazaki NPP, and the 2007 Niigata Earthquake

(ed: I normally don't blog about non-technical issues, but this chaps my cheeks to no end)

If you haven't heard yesterday an earthquake happened of the coast off the Niigata prefecture in Japan. It had a magnitude of 6.8 and caused serious problems for the Kashiwazaki-Kariwa NPP. A transformer caught fire at unit 3 (there are 7 units) and radioactive liquids spilled into the ocean. This is obviously a serious event and should be treated as such, however, the news coverage was yellow journalism at best. All of the initial reports were sensationalist and biased with little facts (partially due to the tight lipped nature of TEPCO) to support any of their claims.

To see how bad the yellow journalism got just read the following headline blurb:

Radioactive leak, tremors follow Japan quake
A strong earthquake struck northwestern Japan today, causing a radioactive leak and fire at one of the world's most powerful nuclear power plants. Eight people were killed and hundreds injured. The plant leaked about 315 gallons of water, according to a Tokyo Electric official.

So, how many casualties were a result of the radioactive leak and fire:

8 deaths and hundreds of injuries
6 deaths and hundreds of injuries
1 death and tens of injuries
0 deaths and 0 injuries

If you guessed 0 and 0 you would be right!

But wait...

Didn't the blurb say that eight people died and hundreds were injured by the radioactive leak and fire?! Why yes, yes it did. This is far from ethical, yet CNN went right ahead and posted that online.

Don't believe me? Check out this screenshot of CNN's misrepresentation of the Kashiwazaki-Kariwa leak and fire.

Wednesday, July 11, 2007

Lessons Learned Flying Space Available

Flying space available is great. It is cheap, fun, and can be very flexible. However, when things get busy you can find yourself in trouble. Space available travel is prioritized and if you aren't an employee of an airline, you'll find yourself at the bottom of the pile.

You can mitigate your troubles by always arriving for the earliest flight possible. Quite often people oversleep and miss the early flights. Waking up at 3:30 AM to possibly catch the 5:00 AM flight might sound awful, but it may be the only way to make it to your destination.

If you are not able to make it onto a flight you are listed for, you are automatically rolled over to the next flight. However, twice I have not been rolled over for one reason or another and had to be manually moved. Always double check to make sure you have been added to the next flight.

When you miss a flight, it may seem tempting to use the 1-800 number most airlines have to change to a different option. However, I've found they tend to not get the change made properly. Always use the gate agents to make the change and quite often they will also suggest alternate routes (if only to get you off their back).

If you happen to be woefully unlucky and are bumped from one day to the next save your boarding pass from the day before. You can use that to skip the lines at the ticketing counter and go straight through security. Once you get to the gate double check that you are indeed listed on the flight and get yourself a new flight coupon (or if really lucky a boarding pass).

The last bit of advice is to be as sociable as possible. Hang out with the gate agents, any flight attendants who are stuck, and yuck it up with other space available passengers. I've met so many great people sitting around in airports waiting on flights. Some of these people can even help you out if you're really stuck, so it's worth being polite and friendly. Getting somebody a $3 airport coffee can go a long way towards making it onto a flight.

Sunday, June 24, 2007

Conflicting COM Interop DLL Names

Recently our job scheduling software got a version bump from 5 to 6 and I'm in charge of bringing our support libraries up to date. A quick look into the changelog showed that I would have to do some under the hood work, but I have a relatively abstracted class library which would be easy to add support for version 6.

Not so fast. The job scheduler uses COM objects as its method of interaction, and .Net has to build an interop assembly to talk to it. The easy way, Add Reference -> COM, should work. It should work, provided the COM DLL's have different names. Thanks to Murphy's Law, the two DLL's, while under different folders, have the same name!

.Net generates Interop.X.dll and Interop.X.dll, even though the two are in different folders and represent different versions, solely because our job scheduler's COM DLL is X.dll, in both folders. While this is not necessarily their fault, it certainly makes the lives of those of us who do integration harder (their COM object model is pretty bad to start with).

Thankfully, Microsoft provides the means to create your own Primary Interop Assembly from a DLL. Using TlbImp you can create your own COM Interop DLL, complete with a non-conflicting name and namespace.

TlbImp version5\X.dll /namespace:X /out:Interop.X.dll
TlbImp version6\X.dll /namespace:X6 /out:Interop.X6.dll

Now I can import these two conflict free and deal with more important issues, like the poor documentation included with the COM library.

Wednesday, June 20, 2007

KB925902: X is not a valid Win32 application

Recently we've had a number of new hires and interns come in as the business expands and schools get out. They've been given new computers, Core 2 Duo's and Xeon 51XX's. Last week a few users reported no longer being able to run certain older applications. Windows XP SP2 stated that "X is not a valid Win32 application."

Yet earlier in the day or in the week they had been able to run the same executable without any issues. On my XPSP2 machine I could run the executables fine, and Win2ksp4 had no problems either. At this point a massive search for "what changed" began.

The executables were compiled on MSVS6 and Compaq Visual FORTRAN, so perhaps it was their newer processors coupled with the way the executables were compiled. Sure enough recompiling under MSVS8 or Intel Visual FORTRAN allowed them to run the executables. However, one of the executables still had a problem. Thankfully, we got a new error (I hate old errors):

XXXXX.EXE - Illegal System DLL Relocation
The system DLL user32.dll was relocated in memory. The application will not run properly. The relocation occurred because the DLL C:\WINNT\system32\HHCTRL.OCX occupied an address range reserved for Windows system DLLs. The vendor supplying the DLL should be contacted for a new DLL.

Lovely, a Windows OCX control bumped user32. A quick google search brought up KB925902 as the offending patch. I went to one of the machines to look for the patch, but it appeared from Add/Remove Programs that this patch was never installed!

Before giving up saying the patch is not installed, a useful thing to note is you can browse to %WINDIR% and take a look at all the NTUNINSTALL$KB* folders to see every patch that has been applied. This list is much more exhaustive than the Add/Remove Programs list.

Sure enough, there was %WINDIR%\ntuninstall$kb925902\, and after uninstalling the patch, everything was fine on these machines. I wonder what KB925902 could have possibly changed to cause such a colossal error.

MS07-017: Vulnerability in GDI could allow remote code execution

So, a security bug in the graphics subsystem gets a patch which affects the ability to run console applications? You should read the article on MS07-017 to get a feel for how many subsystems are affected by their patch. Thank you Windows for making my life so wonderful.

Wednesday, June 13, 2007

New Intel Visual FORTRAN 10.0.025

Intel Visual FORTRAN 10.0.025 was just released, and of course I got my hands on it. I'd had troubles in 9.1 that were "Targeted to Fix," but I needed the codes compiled as soon as possible. So I install IVF10, and get started with my first project. I press F7 and wait...

Crash.

The new version of IVF has a static verification component. It turns out that my project (about 105KLOC) causes the verification tool to run into the per-process memory limit (2GiB). Talk about cool. I've now broken two versions of the Intel compiler right out of the box!

I can't really blame Intel, I would imagine full static verification of a project of that size would be hard to do in 2GiB of memory. Besides, it is only a nicety, so I disabled it and compiled again. Without static verification it worked, however, when I went to run, my program reported that it could not find a file.

This file is pulled from an environment variable and as soon as I stepped through the application it became obvious what happened. Visual Studio's Environment configuration option for debugging had delimited the key value pair lines with \r\n instead of just \n. A temporary solution for this problem is to bring up a C++ project and input the environment there, then copy and paste the string into the IVF10 project's Environment setting. Not sure if this is an IVF10 or VS2005 issue.

I am now known at work as the code killer. Put something in front of me and I'll break it. Whoops.

Tuesday, June 12, 2007

.Net Deployment Build Error HRESULT = '80004005'

I've got a Visual Studio 2005 Deployment project for an application I distribute internally, and came across this crazy error while rebuilding the MSI file.

------ Starting pre-build validation for project 'MyProjectInstall' ------
ERROR: An error occurred while validating.  HRESULT = '80004005'
------ Pre-build validation for project 'MyProjectInstall' completed ------

No other clues as to the actual problem. Some googling revealed that this has to do with a project building with references it does not need (or in my case stale references). A quick fix is to go to the offending 'Primary Output' project, and remove all of its non-Microsoft references. Then add them one by one until the project compiles. At this point your deployment project should compile without any hassles.

Friday, June 8, 2007

Drive Letter Economics

On Windows there is a fun property of the command line (not quite DOS) where you cannot change directory to a UNC path. This effectively makes it impossible to set your working directory to a UNC path from a batch file. To address this issue Microsoft has two methods of switching to a UNC path.

You can NET USE the path as a drive letter. However, you have to be sure that the drive letter you chose is not in use. When running in a large multi-user environment, you can see how this would become troublesome. More importantly, NET USE is semi-permanent, living for as long as the computer is on. You must unuse the drive letter assignment to free this up for other people.

Your other option is pushd which pushes a path name onto a virtual stack, making the path you specify the current working directory. If you pushd a UNC path, it is assigned a drive letter from the pool of open drive letters. Now, this too is semi-permanent (i.e. outlives the cmd instance it was done in). This assignment lives on until you unuse it or use popd. The more vexing part is on Windows 2000 Server these drive letter assignments affect everyone who uses the machine.

Let's say user A has a script that calls pushd without popd, if his script gets run enough times, eventually Windows 2000 Server machines begin running out of drive letters. So when user B's script runs on a machine without free drive letters, they are greeted with this fun message:

C:\>PUSHD \\machine\unc\path\here\
' ' is an invalid current directory path.  UNC paths are not supported.

Aren't you glad you get an error message which reflects the problem?

Now on Windows 2003 Server this problem is non-existent. Users can only muck up drive letter assignments for themselves, not for everyone logged in to a machine. However, upgrading production servers to another operating system is not always a valid fix. The problem does not go away, just users are insulated from other users.

The correct solution is to follow the best practices and have a matching popd for ever pushd call you make. Of course, it wouldn't be a best practice if nobody ignored it.

Friday, June 1, 2007

Premature Optimization is the Root of all Evil

Donald Knuth was indeed right when he said that, "premature optimization is the root of all evil." In a few FORTRAN codes I have, the original programmers made use of boolean short circuiting. This technique is extremely popular in languages which support it. If you are unfamiliar with short circuiting it goes a little something like this, given:

if (expression1 .and. expression2 ... .and. expressionN) then
 ! some code here
end if

Short circuiting relies on the fact that the language will evaluate boolean expressions in order of precedence, from left to right. So if and only if expression1 is .TRUE. then expression2 will be evaluated. If and only if expression2 is .TRUE. then expression3 is evaluated, and so on and so forth. If, from left to right, any expression is found to be .FALSE. then the entire If statement is considered to be .FALSE., which in boolean algebra makes sense.

A common use of boolean short circuiting would be to protect against out of bounds array access in loops which may not stop at the end of an array. For instance:

real, dimension(:), allocatable :: myArray
allocate(myArray(n))
...
do i = 1,m
if (i .lt. n .and. myArray(i) .op. someVal) then
  ! do something
end if
end do

Many languages support short circuiting by design, many support it by consensus, however FORTRAN does not make short circuiting part of the design and there is no consensus on its adoption. The above example works fine under Compaq Visual FORTRAN, but if you enable bounds checking on Intel Visual FORTRAN you get a run-time error.

Both CVF and IVF are following the standard with their interpretations, FORTRAN does not specify how a compiler should implement the above if statement. However, often times people adopt the unofficial standards created by compilers which interpret the standard in a certain way. CVF evaluates the statement above left-to-right and applies boolean short circuiting. IVF evaluates all components of the expression before making a decision. Both of these interpretations are correct, but they have interesting implications.

if (b .op. k .and. somefunc() .op. someval) then
! CVF and IVF may not execute this in the same fashion
end if

The problem with the above statement is that if IVF were to evaluate somefunc() before the comparison between b and k, potential side effects inside somefunc() could alter b or k, fundamentally changing the meaning of the statement. Worse still if the code was originally defined for CVF, the side effects of somefunc() could depend on being ignored when the comparison between b and k is .FALSE..

As a programmer you should mind the relevant standards and strive to rely on as few platform or compiler specific behaviors. The two above examples could be rewritten with their intentions preserved in only a few extra lines.

real, dimension(:), allocatable :: myArray
allocate(myArray(n))
...
do i = 1,m
if (i .lt. n) then
  if (myArray(i) .op. someVal) then
     ! all FORTRAN compilers will get here for the same
     ! reason
  end if
end if
end do
...
if (b .op. k) then
if (somefunc() .op. someval) then
  ! all FORTRAN compilers will get here for the same
  ! reason
end if
end if

So pay attention to the fun problems you may create for the guy who inherits your code when you get all crazy. It has been said that 60% of programming is maintaining your code, however, I find in my job that number is closer to 80 or even 90%. Don't make your life any harder than it already is.