Sunday, June 24, 2007

Conflicting COM Interop DLL Names

Recently our job scheduling software got a version bump from 5 to 6, and I'm in charge of bringing our support libraries up to date. A quick look at the changelog showed that I would have to do some under-the-hood work, but I have a relatively abstracted class library, so adding support for version 6 should be easy.

Not so fast. The job scheduler uses COM objects as its method of interaction, and .Net has to build an interop assembly to talk to it. The easy way, Add Reference -> COM, should work. And it would work, provided the COM DLLs had different names. Thanks to Murphy's Law, the two DLLs, while under different folders, have the same name!

.Net generates Interop.X.dll for version 5 and Interop.X.dll for version 6, even though the two live in different folders and represent different versions, solely because our job scheduler's COM DLL is named X.dll in both folders. While this is not necessarily their fault, it certainly makes the lives of those of us who do integration harder (their COM object model is pretty bad to start with).

Thankfully, Microsoft provides the means to create your own Primary Interop Assembly from a DLL. Using TlbImp you can create your own COM Interop DLL, complete with a non-conflicting name and namespace.
TlbImp version5\X.dll /namespace:X /out:Interop.X.dll
TlbImp version6\X.dll /namespace:X6 /out:Interop.X6.dll
Now I can import these two conflict-free and deal with more important issues, like the poor documentation included with the COM library.

Wednesday, June 20, 2007

KB925902: X is not a valid Win32 application

Recently we've had a number of new hires and interns come in as the business expands and schools get out. They've been given new computers, Core 2 Duos and Xeon 51xx machines. Last week a few users reported no longer being able to run certain older applications. Windows XP SP2 stated that "X is not a valid Win32 application."

Yet earlier in the day or week they had been able to run the same executables without any issues. On my XP SP2 machine I could run the executables fine, and Windows 2000 SP4 had no problems either. At this point a massive search for "what changed" began.

The executables were compiled with MSVS6 and Compaq Visual FORTRAN, so perhaps it was the newer processors coupled with the way the executables were compiled. Sure enough, recompiling under MSVS8 or Intel Visual FORTRAN allowed them to run the executables. However, one of the executables still had a problem. Thankfully, we got a new error (I hate old errors):
XXXXX.EXE - Illegal System DLL Relocation
The system DLL user32.dll was relocated in memory. The application will not run properly. The relocation occurred because the DLL C:\WINNT\system32\HHCTRL.OCX occupied an address range reserved for Windows system DLLs. The vendor supplying the DLL should be contacted for a new DLL.
Lovely, a Windows OCX control bumped user32. A quick google search brought up KB925902 as the offending patch. I went to one of the machines to look for the patch, but it appeared from Add/Remove Programs that this patch was never installed!

Before giving up and declaring the patch not installed, it is useful to know that you can browse to %WINDIR% and take a look at all the hidden $NtUninstallKB*$ folders to see every patch that has been applied. This list is much more exhaustive than the Add/Remove Programs list.

Sure enough, there was %WINDIR%\$NtUninstallKB925902$\, and after uninstalling the patch, everything was fine on these machines. I wonder what KB925902 could have possibly changed to cause such a colossal error.
MS07-017: Vulnerability in GDI could allow remote code execution
So, a security bug in the graphics subsystem gets a patch which affects the ability to run console applications? You should read the article on MS07-017 to get a feel for how many subsystems are affected by the patch. Thank you, Windows, for making my life so wonderful.

Wednesday, June 13, 2007

New Intel Visual FORTRAN 10.0.025

Intel Visual FORTRAN 10.0.025 was just released, and of course I got my hands on it. I'd had troubles in 9.1 that were marked "Targeted to Fix," but I needed the codes compiled as soon as possible. So I install IVF10 and get started with my first project. I press F7 and wait...

Crash.

The new version of IVF has a static verification component. It turns out that my project (about 105 KLOC) causes the verification tool to run into the 2 GiB per-process memory limit. Talk about cool. I've now broken two versions of the Intel compiler right out of the box!

I can't really blame Intel; I would imagine full static verification of a project that size would be hard to do in 2 GiB of memory. Besides, it is only a nicety, so I disabled it and compiled again. Without static verification the build worked; however, when I went to run, my program reported that it could not find a file.

The file name is pulled from an environment variable, and as soon as I stepped through the application it became obvious what had happened. Visual Studio's Environment configuration option for debugging had delimited the key/value pair lines with \r\n instead of just \n. A temporary workaround is to bring up a C++ project, input the environment there, then copy and paste the string into the IVF10 project's Environment setting. I'm not sure if this is an IVF10 or VS2005 issue.

I am now known at work as the code killer. Put something in front of me and I'll break it. Whoops.

Tuesday, June 12, 2007

.Net Deployment Build Error HRESULT = '80004005'

I've got a Visual Studio 2005 Deployment project for an application I distribute internally, and came across this crazy error while rebuilding the MSI file.
------ Starting pre-build validation for project 'MyProjectInstall' ------
ERROR: An error occurred while validating. HRESULT = '80004005'
------ Pre-build validation for project 'MyProjectInstall' completed ------
No other clues as to the actual problem. Some googling revealed that this has to do with a project building with references it does not need (or, in my case, stale references). A quick fix is to go to the offending 'Primary Output' project and remove all of its non-Microsoft references, then add them back one by one until the project compiles. At this point your deployment project should build without any hassles.

Friday, June 8, 2007

Drive Letter Economics

On Windows there is a fun property of the command line (not quite DOS) where you cannot change directory to a UNC path. This effectively makes it impossible to set your working directory to a UNC path from a batch file. To address this issue, Microsoft provides two methods of switching to a UNC path.

You can NET USE the path as a drive letter. However, you have to be sure that the drive letter you choose is not already in use. When running in a large multi-user environment, you can see how this becomes troublesome. More importantly, NET USE is semi-permanent, living for as long as the computer is on. You must delete the mapping (NET USE X: /DELETE) to free the letter up for other people.

Your other option is pushd, which pushes a path onto a virtual stack and makes the path you specify the current working directory. If you pushd a UNC path, it is assigned a drive letter from the pool of open drive letters. This too is semi-permanent (i.e. it outlives the cmd instance it was done in); the assignment lives on until you remove it or call popd. The more vexing part is that on Windows 2000 Server these drive letter assignments affect everyone who uses the machine.

Let's say user A has a script that calls pushd without a matching popd. If his script gets run enough times, the Windows 2000 Server machine eventually runs out of drive letters. So when user B's script runs on a machine with no free drive letters, it is greeted with this fun message:
C:\>PUSHD \\machine\unc\path\here\
' ' is an invalid current directory path. UNC paths are not supported.
Aren't you glad you get an error message which reflects the problem?

Now on Windows 2003 Server this problem is non-existent. Users can only muck up drive letter assignments for themselves, not for everyone logged in to the machine. However, upgrading production servers to another operating system is not always a valid fix. The problem does not go away; users are just insulated from other users.

The correct solution is to follow best practices and have a matching popd for every pushd call you make. Of course, it wouldn't be a best practice if nobody ignored it.

Friday, June 1, 2007

Premature Optimization is the Root of all Evil

Donald Knuth was indeed right when he said that "premature optimization is the root of all evil." In a few FORTRAN codes I have, the original programmers made use of boolean short circuiting. This technique is extremely popular in languages which support it. If you are unfamiliar with short circuiting, it goes a little something like this. Given:
if (expression1 .and. expression2 ... .and. expressionN) then
    ! some code here
end if
Short circuiting relies on the language evaluating the boolean expression from left to right and stopping as soon as the result is known. So expression2 is evaluated if and only if expression1 is .TRUE.; expression3 is evaluated if and only if expression2 is .TRUE.; and so on and so forth. If, going from left to right, any expression is found to be .FALSE., the entire condition is .FALSE. and the remaining expressions are never evaluated, which in boolean algebra makes sense.

A common use of boolean short circuiting would be to protect against out of bounds array access in loops which may not stop at the end of an array. For instance:
real, dimension(:), allocatable :: myArray
allocate(myArray(n))
...
do i = 1,m
    if (i .lt. n .and. myArray(i) .op. someVal) then
        ! do something
    end if
end do
Many languages support short circuiting by design, and many more support it by consensus; FORTRAN, however, neither makes short circuiting part of the language design nor has any consensus on its adoption. The above example works fine under Compaq Visual FORTRAN, but if you enable bounds checking on Intel Visual FORTRAN you get a run-time error.

Both CVF and IVF are following the standard with their interpretations; FORTRAN does not specify how a compiler should evaluate the above if statement. However, people often adopt the unofficial standards created by compilers which interpret the standard in a certain way. CVF evaluates the statement above left-to-right and applies boolean short circuiting. IVF evaluates all components of the expression before making a decision. Both of these interpretations are correct, but they have interesting implications.
if (b .op. k .and. somefunc() .op. someval) then
    ! CVF and IVF may not execute this in the same fashion
end if
The problem with the above statement is that if IVF evaluates somefunc() before the comparison between b and k, side effects inside somefunc() could alter b or k, fundamentally changing the meaning of the statement. Worse still, if the code was originally written for CVF, it could depend on somefunc()'s side effects never occurring when the comparison between b and k is .FALSE..

As a programmer you should mind the relevant standards and strive to rely on as few platform- or compiler-specific behaviors as possible. The two examples above can be rewritten with their intentions preserved in only a few extra lines.
real, dimension(:), allocatable :: myArray
allocate(myArray(n))
...
do i = 1,m
    if (i .lt. n) then
        if (myArray(i) .op. someVal) then
            ! all FORTRAN compilers will get here for the same reason
        end if
    end if
end do
...
if (b .op. k) then
    if (somefunc() .op. someval) then
        ! all FORTRAN compilers will get here for the same reason
    end if
end if
So pay attention to the fun problems you may create for the guy who inherits your code when you get all crazy. It has been said that 60% of programming is maintaining your code; in my job I find that number is closer to 80 or even 90%. Don't make your life any harder than it already is.

Thursday, May 31, 2007

.Net XmlSerializer and InvalidCastException

Many of our applications work via a plugin architecture, which allows us to be flexible in a lot of ways. A while back I ran into a problem with XML serialization and our plugin system. The error was confusing and the solution was non-obvious. The exception I received was the following:
System.InvalidOperationException: There was an error generating the XML document.
---System.InvalidCastException: Unable to cast object
of type 'MyNamespace.Settings' to type 'MyNamespace.Settings'. at
Microsoft.Xml.Serialization.GeneratedAssembly.
XmlSerializationWriterSettings.Write3_Settings(Object o)
The confusing (and vexing!) part of the error is the cast itself: apparently the XmlSerializer could not cast a type to itself? Worse still, the MSDN documentation does not list InvalidCastException as a common exception for XmlSerializer (the list that normally tells you what boneheaded mistake your program made).

After a large amount of googling, I came across a snippet which, if you place it in App.Config, makes the error disappear (even though it is a diagnostic switch and is not meant to remove any errors):
<system.diagnostics>
  <switches>
    <add name="XmlSerialization.Compilation" value="4" />
  </switches>
</system.diagnostics>
What the "4" means, I could not tell you, but this magical block of code solved my problem. However, I am never satisfied with hacks like this, so I dug deeper. The root cause apparently is due to how I load my plugin and where the assembly is that called the XmlSerializer.

In .Net there are three assembly load contexts (plus assemblies can be loaded without any context), and each gives your types a slightly different identity. If your plugin is loaded in the Load-From context (as mine was), the type MyNamespace.Settings is "branded" (so to speak) with the context it was resolved in. If your plugin uses an XmlSerializer, the temporary assemblies generated to speed up (de)serialization are part of the Load context (or perhaps are without context, I haven't found out for sure). Therefore the type the XmlSerializer attempts to create differs in context from the type in your plugin.

I found the most effective strategy to combat this interesting error is to always use the Load context. This requires that your plugin DLLs live under the ApplicationBase or PrivateBinPath directories. All in all this is the best solution, considering side-by-side deployment is the new Microsoft way of distributing applications and DLLs (to avoid DLL Hell).

Here is a short snippet of what the plugins may look like in your App.Config:
<plugins>
  <plugin name="My Plugin"
          assemblyName="MyPlugin, Version=1.0.0.0, Culture=neutral, PublicKeyToken=deadbeefbaadf00d" />
</plugins>
You could then load this plugin (after reading in the appropriate ConfigurationSection) like so, to ensure XmlSerializer works in your plugin:
PluginsSection pluginsSection =
    config.GetSection("plugins") as PluginsSection;
foreach (PluginElement elt in pluginsSection.Plugins)
{
    Assembly pluginAsm = Assembly.Load(elt.AssemblyName);
    /* Reflect across the assembly looking for types with
     * [MyAppPluginAttribute] or those that implement
     * IMyAppPlugin, so an assembly can contain more than
     * one plugin.
     */
}
The .Net world has many intricacies and most seem to stem from this notion of Assemblies and satellite assemblies and manifests and ligers and unicorns, so don't be discouraged if you have a hard time working it all out.

Wednesday, May 30, 2007

Tracking down network gremlins

I've been besieged as of late by gremlins somewhere in the ether. They have stolen our token rings and have set fire to my home. Actually, it appears our file server is crapping out (again with those technical terms) at random intervals.

Well, how do I know it is the file server?

I did not know at first; the errors returned from the FORTRAN applications were code 30, which basically means the program could not open a file, but not why. Later, I received some errors during reading and writing as well, which confirmed an issue with the file server (and not the application).

However, there were no useful error codes being returned!

Instead of rewriting these older applications to return the system error codes (newer ones include said detail), I wrote a canary application (in C, if you must know). This tester would attempt to open a few files thousands of times in random order, then read, write, and read+write each of these files thousands of times. It would do all of this in a giant loop, sleeping for a set amount of time at the end. Throughout the loop it would rigorously check the return values of the file functions and die immediately (and loudly!) with the corresponding error code.
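The real canary did quite a bit more than this, but a minimal sketch of the idea (the file names, iteration count, and sleep interval are made up for illustration, and the test files are assumed to already exist on the share being tested) looks something like:
/* canary.c - minimal sketch of a file server canary (illustrative only) */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <windows.h>    /* Sleep() */

static const char *files[] = { "t1.dat", "t2.dat", "t3.dat" };
#define NFILES (sizeof(files) / sizeof(files[0]))
#define NITER  1000

/* Die immediately (and loudly!) with the error we hit. */
static void die(const char *what, const char *path)
{
    fprintf(stderr, "CANARY DIED: %s on %s: errno=%d (%s)\n",
            what, path, errno, strerror(errno));
    exit(errno ? errno : 1);
}

int main(void)
{
    char buf[512];
    srand((unsigned)time(NULL));

    for (;;) {
        int i;
        for (i = 0; i < NITER; ++i) {
            /* Pick a file at random, open it, read it, write it back,
             * checking every single return value along the way. */
            const char *path = files[rand() % NFILES];
            FILE *f = fopen(path, "r+b");
            size_t got;

            if (!f)
                die("fopen", path);
            got = fread(buf, 1, sizeof(buf), f);
            if (got < sizeof(buf) && ferror(f))
                die("fread", path);
            if (fseek(f, 0, SEEK_SET) != 0)
                die("fseek", path);
            if (fwrite(buf, 1, got, f) != got)
                die("fwrite", path);
            if (fclose(f) != 0)
                die("fclose", path);
        }
        Sleep(30 * 1000);  /* rest a bit before poking the server again */
    }
}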

Sure enough it caught the error!

Wait, now that we know what the error is, why are we getting this error?

Preliminary analysis had it that the file server was CPU bound during the "hiccup". How could we really know the cause? Sysinternals has a lovely suite called PsTools which provides everything you could ever need to monitor processes from the command line. A simple trigger was added so that when the canary died with an error it would kick off a PsExec job:
psexec \\machinename pslist -s 90 -r 5 -x
Now we could get some output from the file server as to what it was doing when the job had the "hiccup". This worked well and we were able to identify the offending process (and even the offending thread!), yet that did not solve our problem. It only identified a cause and most likely not even the root cause! Eventually we will drill down to the actual problem and solve that (only to move on to the next issue, phew).

VAX Floating Point Numbers

So in the world of old hardware you have the DEC VAX. Big ole honkin' machines from the days of yore. They were introduced a decade before I was born, and support for them was withdrawn before I graduated high school. By the time I began interacting with them, they were the old gray mare, having been largely replaced by hardware like the DEC Alpha (AXP).

The transition from VAX to AXP was pretty smooth on OpenVMS and many companies, including the one I work for, made the move. Modern AXP processors are impressive and for a long time held the record for the fastest supercomputers in the United States.

Part of the allure of the AXP was its support for data formats found on the VAX. VAXen came along long before the IEEE 754 standard for floating point numbers, so it is not hard to see how they ended up with their own format. IBM mainframes and Cray supercomputers both have (popular) floating point formats from around that time. Interestingly, the VAX floating point format inherits some of its layout from the PDP-11 (craaaazy) format, which can really make life hell.

So why would I bring this up?

When a company has been using computers for a long time, you end up with a need to store data somewhere. Now data that is a decade old is easy to interact with. Imagine going back another ten years. Imagine another ten. You're now knocking on the door of the advent of (roughly) modern computing. FORTRAN 66 (and later 77) is in its prime. VAXen and IBM mainframes rule the earth! Kidding, but at least VAXen ruled my company.

The amount of data which has been preserved is staggering. The only issue is that the number of machines which can natively read the data is diminishing rapidly. Compaq (the new DEC) began phasing out support for the AXP in 2004, transitioning users to the Intel Itanium and Itanium 2 (cue the Itanic jokes). A nagging problem with this transition is the loss of native support for the VAX floating point format.

The two common formats I deal with are the VAX F_Float and G_Float, single and double precision respectively. The F_Float is bias-128 and the G_Float is bias-1024. Both the F and G representations have an implicitly defined hidden-bit normalized mantissa (m) like so:
0.1mmm...mmm
F_Float is held in 32 bits and G_Float is held in 64 bits. Both formats inherit (suffer from?) the PDP-11 memory layout, so the actual bits stored on disk are not true little-endian: the 16-bit word holding the sign and exponent comes first, followed by the remaining fraction word(s).
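As a rough sketch of what decoding one of these values involves (F_Float only; this assumes the four raw bytes were read straight off disk into an integer on a little-endian IEEE754 host, and the handling of the reserved operand is just a placeholder):
/* vaxf.c - sketch of decoding a VAX F_Float read from an old data file.
 * G_Float is analogous but 64 bits wide with an 11-bit, bias-1024 exponent. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

double vax_f_to_double(uint32_t raw)
{
    /* Undo the PDP-11 word order: the sign/exponent word is stored first,
     * so swap the two 16-bit halves to get a sane bit layout. */
    uint32_t v    = (raw << 16) | (raw >> 16);
    uint32_t sign = (v >> 31) & 0x1;
    uint32_t exp  = (v >> 23) & 0xFF;   /* 8 bits, excess-128 */
    uint32_t frac = v & 0x7FFFFF;       /* 23 explicit mantissa bits */

    if (exp == 0) {
        /* sign 0: true zero; sign 1: reserved operand (would trap on a VAX) */
        return sign ? -HUGE_VAL : 0.0;  /* placeholder for the fault case */
    }

    /* Hidden-bit mantissa is 0.1fff..., i.e. 0.5 + frac/2^24 */
    double m = 0.5 + (double)frac / 16777216.0;
    double val = ldexp(m, (int)exp - 128);
    return sign ? -val : val;
}

int main(void)
{
    /* 1.0 in F_Float is exponent 129, fraction 0: word-swapped raw 0x00004080 */
    printf("%g\n", vax_f_to_double(0x00004080u));
    return 0;
}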

So why is this a problem?

There are no modern processors (read: with future support) with native support for the VAX format. All of our codes which read in floating point data from old data files must make the conversion from the VAX format to their host format (which in all cases is IEEE754). This conversion is not nice and is in fact lossy.

IEEE754 S_Float and T_Float, single and double precision respectively, cannot exactly represent all VAX floating point data. S_Float is bias-127 and T_Float is bias-1023 (note this is different from F and G). Both S and T have hidden-bit normalized mantissas; however, IEEE754 also supports "subnormal" (or "denormal") forms, where the implicit leading bit is a 0 instead of a 1.
1.mmm...mmm (normal)
0.mmm...mmm (subnormal)
This does not bode well for direct conversion between the formats.

Even if the byte layout were the same, we would still have two different forms for floating point numbers. Every time we make the conversion we lose precision. What is even more insidious is that VAX and IEEE754 do not have the same rounding rules (I'm not even sure the VAX has defined rounding rules!). Floating point formats are inherently inexact, and how these inexact representations are rounded is very important.

Moreover, even if we overlooked the problems in representing ordinary floating point numbers, what about exceptional values like Infinity and the result of a divide-by-zero operation? The VAX format only defines positive and negative "excess," which, while akin to Infinity, causes an exception and cannot be used in math. IEEE754 encodes both positive and negative Infinity and includes a special case for mathematical operations which have no defined result, Not a Number (NaN). IEEE754 supports both quiet NaNs, which propagate silently through further operations, and signaling NaNs, which raise floating point exceptions.

Ok, so if we ignore Infinity and NaN we still have a problem. IEEE754 supports positive and negative zero. VAX only supports positive zero. Why is this a problem? Not only is negative zero unrepresentable on the VAX, but many common mathematical operations on IEEE754 can result in a negative zero (say converging from the "left" of zero).
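To make that concrete, here is a tiny illustration (nothing VAX-specific in it, and the values are arbitrary) of perfectly ordinary IEEE754 results that have no F_Float or G_Float counterpart; any converter has to decide what to do with each of them:
/* ieee_oddballs.c - ordinary IEEE754 results with no VAX equivalent.
 * Illustrative only; build with a C99 compiler for signbit(). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double zero     = 0.0;
    double neg_zero = -1.0e-300 * 1.0e-300; /* underflows "from the left" to -0.0 */
    double pos_inf  = 1.0 / zero;           /* +Infinity */
    double nan_val  = zero / zero;          /* quiet NaN */

    /* -0.0 compares equal to +0.0, so you need signbit() to even notice it;
     * a converter targeting the VAX format has to silently fold it into +0. */
    printf("neg_zero == 0.0: %d, signbit(neg_zero): %d\n",
           neg_zero == 0.0, signbit(neg_zero) != 0);
    printf("pos_inf = %g, nan_val = %g\n", pos_inf, nan_val);
    return 0;
}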

Wow, so basically we're screwed.

Or not. The path to go down is one where the data gets converted to the new standard (new being within the last 15 years or so), which is a more-or-less universal standard across processors. This is a time-consuming task, and one that needs to be approached carefully to ensure a high degree of fidelity. However, the effort needs to be made to ensure the longevity of both the software and the data.

Tuesday, May 29, 2007

Intel Visual FORTRAN oddity

So I come across some excellent FORTRAN77 code that I must convert to F90 and use with Intel Visual FORTRAN. Not a big deal; the code is well-formed F77 and should convert to F90 in a straightforward manner.

Ha ha ha, I know, what was I thinking.

The conversion was going smoothly until the compiler mysteriously began crapping out (yes, how very technical) with an abort code of 3. There was no error in my code; the compiler was having internal issues. The specific error from the Intel FORTRAN 9.1 compiler was:
GEM_LO_GET_LOCATOR_INFO: zero locator value
This was truly vexing, because at the time I was in a rush to get this code ported over to IVF. Sure enough, there was an internal problem with the Intel compiler, confirmed by their support staff. A specific variable name (SNGL, which also happens to be the name of a FORTRAN intrinsic), coupled with some specific compiler flags (/iface:stdref /names:as_is), caused the abort.

A patch is in the works, meanwhile SNGL becomes singleVal in the converted code, and viola the problem vanishes. I'd love to see the root cause analysis on that bug!