Thursday, May 31, 2007

.Net XmlSerializer and InvalidCastException

Many of our applications work via a plugin architecture, which allows us to be flexible in a lot of ways. A while back I ran into a problem with XML serialization and our plugin system. The error was confusing and the solution was non-obvious. The exception I received was the following:
System.InvalidOperationException: There was an error generating the XML document.
---> System.InvalidCastException: Unable to cast object
of type 'MyNamespace.Settings' to type 'MyNamespace.Settings'. at
XmlSerializationWriterSettings.Write3_Settings(Object o)
The confusing (and vexing!) part is that cast: apparently the XmlSerializer could not cast a type to itself? Worse still, the MSDN documentation does not list InvalidCastException as a common exception (the list that normally tells you what boneheaded mistake your program made).

After a large amount of googling, I came across a snippet which, if you place it in App.config (it is a trace switch, so it belongs under <system.diagnostics>/<switches>), makes the error disappear (even though a diagnostic switch is not meant to fix any errors):
<add name="XmlSerialization.Compilation" value="4" />
What the "4" means, I could not tell you, but this magical block of code solved my problem. However, I am never satisfied with hacks like this, so I dug deeper. The root cause apparently lies in how I load my plugin and where the assembly that called the XmlSerializer was loaded from.

In .Net there are 3 assembly load contexts (plus assemblies can be loaded without any context), and each one makes otherwise identical types distinct. If your plugin is loaded in the LoadFrom context (as mine was), the type MyNamespace.Settings is "branded" (so to speak) with the context it was resolved in. If your plugin uses an XmlSerializer, the temporary assemblies generated to speed (de)serialization are part of the Load context (or perhaps have no context at all; I haven't found out for sure). Therefore the type the XmlSerializer attempts to create differs in context from the type in your plugin.

I found the most effective strategy to combat this interesting error is to always use the Load context. This requires that your plugin DLLs lie under the ApplicationBase or PrivateBinPath paths. All in all this is the best solution, considering Side-by-Side is the new Microsoft way of deploying applications and DLLs (to avoid DLL Hell).

Here is a short snippet of what the plugins may look like in your App.Config:
<plugin name="My Plugin"
        assemblyName="MyPlugin, Version=,
        Culture=neutral, PublicKeyToken=deadbeefbaadf00d" />
You could then load this plugin (after reading in the appropriate ConfigurationSection) like so, to ensure XmlSerializer works in your plugin:
PluginsSection pluginsSection =
    config.GetSection("plugins") as PluginsSection;
foreach (PluginElement elt in pluginsSection.Plugins)
{
    Assembly pluginAsm = Assembly.Load(elt.AssemblyName);
    /* Reflect across the assembly looking for types with
     * [MyAppPluginAttribute] or those that implement
     * IMyAppPlugin, so an assembly can contain more than
     * one plugin. */
}
The .Net world has many intricacies and most seem to stem from this notion of Assemblies and satellite assemblies and manifests and ligers and unicorns, so don't be discouraged if you have a hard time working it all out.

Wednesday, May 30, 2007

Tracking down network gremlins

I've been besieged as of late by gremlins somewhere in the ether. They have stolen our token rings and have set fire to my home. Actually, it appears our file server is crapping out (again with those technical terms) at random intervals.

Well, how do I know it is the file server?

I did not know at first; the errors returned from the FORTRAN applications were code 30, which basically means the application could not open a file but did not know why. Later, I received some errors during reading and writing, which confirmed an issue with the file server (and not the application).

However, there were no useful error codes being returned!

Instead of rewriting these older applications to return the system error codes (newer ones include said detail), I wrote a canary application (in C if you must know). The tester would attempt to open a few files thousands of times in random order, then read, write, and read+write each of these files thousands of times. It would do all of this in a giant loop, sleeping for a set amount of time at the end. Throughout the loop it rigorously checked the return values of every call, and died immediately (and loudly!) with the corresponding error code.
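I no longer have the original source handy, but the canary boiled down to something like the following sketch (the function names, file list, and counts here are made up for illustration; the real thing hit files on the network share in random order):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* One canary probe: open the file, write a marker, seek back, read it,
 * and verify every single return value. Returns 0 on success, otherwise
 * an errno (falling back to EIO) naming exactly which operation failed. */
static int check_file(const char *path)
{
    static const char marker[] = "canary";
    char back[sizeof marker];
    FILE *fp = fopen(path, "w+b");
    if (!fp)
        return errno;
    if (fwrite(marker, 1, sizeof marker, fp) != sizeof marker) {
        fclose(fp);
        return errno ? errno : EIO;
    }
    if (fflush(fp) != 0 || fseek(fp, 0L, SEEK_SET) != 0) {
        fclose(fp);
        return errno ? errno : EIO;
    }
    if (fread(back, 1, sizeof marker, fp) != sizeof marker ||
        memcmp(back, marker, sizeof marker) != 0) {
        fclose(fp);
        return errno ? errno : EIO;
    }
    if (fclose(fp) != 0)
        return errno;
    return 0;
}

/* Hammer a set of files 'rounds' times; die loudly on the first failure.
 * The real canary wrapped this in an endless loop with a sleep at the end. */
static int canary_pass(const char *paths[], int npaths, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < npaths; i++) {
            int err = check_file(paths[i]);
            if (err) {
                fprintf(stderr, "canary died on %s: %s (errno %d)\n",
                        paths[i], strerror(err), err);
                return err;
            }
        }
    }
    return 0;
}
```

The important part is not the file shuffling; it is that every return value is checked, so the failure finally surfaces with a usable error code instead of a generic "could not open file".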

Sure enough it caught the error!

Wait, now that we know what the error is, why are we getting this error?

Preliminary analysis had it that the file server was CPU bound during the "hiccup". How could we really know the cause? Sysinternals has a lovely suite called PsTools which provides everything you could ever need to monitor processes from the command line. I set up a simple trigger: when the canary job died with an error, it ran a PsExec job:
psexec \\machinename pslist -s 90 -r 5 -x
Now we could get some output from the file server as to what it was doing when the job had the "hiccup". This worked well and we were able to identify the offending process (and even the offending thread!), yet that did not solve our problem. It only identified a cause and most likely not even the root cause! Eventually we will drill down to the actual problem and solve that (only to move on to the next issue, phew).

VAX Floating Point Numbers

So in the world of old hardware you have the DEC VAX. Big ole honkin' machines from the days of yore. They were introduced a decade before I was born and support for them was withdrawn before I graduated high school. By the time I began interacting with them, they were the old gray mare having been largely replaced by hardware like the DEC Alpha (AXP).

The transition from VAX to AXP was pretty smooth on OpenVMS and many companies, including the one I work for, made the move. Modern AXP processors are impressive and for a long time held the record for the fastest supercomputers in the United States.

Part of the allure of the AXP was its support for data found on the VAX. VAXen came along before the IEEE 754 standard for floating point numbers, so it is not hard to see how they developed their own standard. IBM mainframes and Cray supercomputers both have (popular) floating point formats from around that time. Interestingly the VAX floating point format has some formatting dependencies on the PDP-11 (craaaazy) format, which can really make life hell.

So why would I bring this up?

When a company has been using computers for a long time, you end up with a need to store data somewhere. Now data that is a decade old is easy to interact with. Imagine going back another ten years. Imagine another ten. You're now knocking on the door of the advent of (roughly) modern computing. FORTRAN 66 (and later 77) is in its prime. VAXen and IBM mainframes rule the earth! Kidding, but at least VAXen ruled my company.

The amount of data which has been preserved is staggering. The only issue is, the number of machines which can natively read the data is diminishing rapidly. Compaq (the new DEC) is phasing out support for the AXP in 2004 and transitioning users to the Intel Itanium and Itanium 2 (cue up Itanic jokes). A certain nagging problem with this transition is the loss of native support for the VAX floating point format.

The two common formats I deal with are the VAX F_Float and G_Float, single and double precision respectively. The F_Float is bias-128 and the G_Float is bias-1024. Both the F and G representations have an implicitly defined hidden-bit normalized mantissa (m) like so:
0.1mmm...mmm
F_Float is held in 32 bits and G_Float is held in 64 bits. Both formats inherit the PDP-11 memory layout, so the actual bits stored on disk are not true little endian.

So why is this a problem?

There are no modern processors (read: with future support) with native support for the VAX format. All of our codes which read in floating point data from old data files must make the conversion from the VAX format to their host format (which in all cases is IEEE754). This conversion is not nice and is in fact lossy.
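To give a flavor of what such a conversion looks like, here is a minimal F_Float-to-IEEE sketch in C. The function name is mine, and it makes two assumptions worth flagging: the raw VAX value was read into a 32-bit integer as little endian (so the PDP-11 word swap must be undone first), and the VAX reserved operand (sign set, exponent zero), having no IEEE equivalent, is mapped to NaN. Real conversion code also has to worry about rounding and exponent edge cases; this does not.

```c
#include <math.h>
#include <stdint.h>

/* Convert a VAX F_Float (read from disk into a little-endian uint32_t)
 * to the host's IEEE754 single precision.
 * VAX F:  value = (-1)^s * 0.1mmm... * 2^(e - 128), exponent in bits 30..23.
 * The two 16-bit words are stored swapped on disk (the PDP-11 legacy). */
float vax_f_to_ieee(uint32_t raw)
{
    uint32_t v    = (raw << 16) | (raw >> 16); /* undo the word swap */
    uint32_t sign = v >> 31;
    int      e    = (int)((v >> 23) & 0xFF);
    uint32_t m    = v & 0x7FFFFFu;             /* 23 explicit mantissa bits */

    if (e == 0)
        /* Exponent 0 with sign 0 is true zero; exponent 0 with sign 1 is
         * the reserved operand, which traps on a VAX. Map it to NaN. */
        return sign ? (float)NAN : 0.0f;

    /* The hidden bit sits just after the binary point: 0.1mmm... */
    double frac = 0.5 + (double)m / 16777216.0; /* m * 2^-24 */
    double val  = ldexp(frac, e - 128);
    return (float)(sign ? -val : val);
}
```

Note the off-by-two in spirit between the formats: the VAX hidden bit is before the binary point's first fraction bit (0.1m) while IEEE puts it ahead of the point (1.m), and the biases differ by one, so a bit-level conversion must adjust the exponent by 2 rather than 1.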

IEEE754 S_Float and T_float, single and double precision respectively, cannot exactly represent all VAX floating point data. S_Float is bias-127 and T_Float is bias-1023 (note this is different than F and G). Both S and T have hidden-bit normalized mantissas, however IEEE754 supports "subnormal" or "denormal" forms, where the leading bit could be a 1 or a 0.
1.mmm...mmm (normal)
0.mmm...mmm (subnormal)
This does not bode well for direct conversion between the formats.

Even if the byte layout was the same, we still have two different forms for floating point numbers. Every time we make the conversion we lose precision. What is even more insidious is that VAX and IEEE754 do not have the same rounding rules (I'm not even sure the VAX has defined rounding rules!). Floating point formats are inherently inexact and how these inexact representations are interpreted with respect to rounding is very important.

Moreover, even if we overlooked the problems in representation of floating point numbers, what about exceptional numbers like Infinity and the result of a divide by zero operation? The VAX format only defines positive and negative "excess," which, while akin to Infinity, causes an exception and cannot be used in math. IEEE754 encodes both positive and negative Infinity and includes a special case for mathematical operations which have no defined result, Not A Number (NaN). IEEE754 supports both quiet NaNs, which propagate silently through further operations, and signaling NaNs, which raise floating point exceptions.

Ok, so if we ignore Infinity and NaN we still have a problem. IEEE754 supports positive and negative zero. VAX only supports positive zero. Why is this a problem? Not only is negative zero unrepresentable on the VAX, but many common mathematical operations on IEEE754 can result in a negative zero (say converging from the "left" of zero).
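A two-line example shows how easily ordinary IEEE754 math manufactures a negative zero, a bit pattern the VAX simply has no way to store (the helper names here are just for illustration):

```c
#include <math.h>

/* IEEE754 preserves the sign through multiplication, so a perfectly
 * ordinary product can yield -0.0. It compares equal to +0.0, but its
 * sign bit is set, and a VAX has no representation for it. */
double neg_zero(void)
{
    return -1.0 * 0.0;
}

int is_negative_zero(double d)
{
    return d == 0.0 && signbit(d);
}
```

So even "zero" is not safe from the conversion problem: round-tripping a computed -0.0 through the VAX format silently drops the sign.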

Wow, so basically we're screwed.

Or not. The path to go down is one where the data gets converted to the new standard (new being within the last 15 years or so), which is a more-or-less universal standard across processors. This is a time consuming task, and one that must be approached carefully to ensure a high degree of fidelity. However, it needs to be done to ensure the longevity of both the software and the data.

Tuesday, May 29, 2007

Intel Visual FORTRAN oddity

So I come across some excellent FORTRAN77 code that I must convert to F90 and compile with Intel Visual FORTRAN. Not a big deal; the code is well formed F77 and should convert to F90 in a straightforward manner.

Ha ha ha, I know, what was I thinking.

The conversion was easy going until the compiler mysteriously began crapping out (yes, how very technical) with an abort code of 3. There was no error in my code; the compiler was having internal issues. The specific error from the Intel FORTRAN 9.1 compiler was:
GEM_LO_GET_LOCATOR_INFO: zero locator value
This was truly vexing, because at the time I was in a rush to get this code ported over to IVF. Sure enough, there was an internal problem with the Intel compiler, confirmed by their support staff. A specific variable name (SNGL), coupled with some specific compiler flags (/iface:stdref /names:as_is) caused the abort.

A patch is in the works; meanwhile SNGL becomes singleVal in the converted code, and voilà, the problem vanishes. I'd love to see the root cause analysis on that bug!

Finally got one of these for work

I now have a blog for work related things, finally. I found my company's "Social Media & Blogging Guidelines" document, and we're allowed to blog. We have to keep things appropriate, of course, but otherwise we are golden.

So I work for GE Energy, Nuclear (ed: now GE-Hitachi Nuclear Energy Americas; name change as of 4 June 2007) as a software engineer. I'm the responsible engineer for codes ranging across FORTRAN 77/90, K&R C, C++, VB, VB.Net, Java, and C# 2.0. Mostly I work on GUI's (C# and Java) and support libraries (C, C++, FORTRAN, C#, Java); however, being a jack of many trades I also get in on the technology codes in FORTRAN.

Our systems include Windows 2000 and XP on the desktop, Windows 2000 and 2003 on the servers, OpenVMS 7.X and 8.X servers, and a few scattered Linux/HP-UX/Tru64 boxen. We're trying to consolidate all of these systems, but personally I would rather the effort be placed on ensuring interoperability across all of them (while least common denominator programming is at times frustrating, it keeps your code simple and, most of the time, easier to debug).

I spend a lot of time ensuring that our software remains well integrated, mainly utilizing API's which were set in stone before I was born. I get called upon to debug the crazy situations which happen when you bring together such an unholy trinity as FORTRAN, C, and C#. Yet the work is challenging and fun; my biggest grief being hard-to-find bugs and managing to break things which should not break. Ok, I lied, my biggest grief is procedures, but I think any engineer will tell you that.

I will be posting lots of technical issues that I come across and how I made it around them (or why I cannot seem to get around them). We'll see how this goes.