Well, how do I know it is the file server?
I did not know at first, the errors returned from FORTRAN applications were code 30, which basically means it could not open a file, but it did not know why. Later, I received some errors during reading and writing, which confirmed an issue with the file server (and not the application).
However, there were no useful error codes being returned!
Instead of rewriting these older applications to return the system error codes (newer ones include said detail) I wrote a canary application (in C if you must know). This tester would attempt to open a few files thousands of times in random order. Then read, write, read+write to each of these files thousands of times. It would do all of this in a giant loop, sleeping for a set amount of time at the end. During this loop it would rigorously check the return values of the functions, and die immediately (and loudly!) with the corresponding error code.
Sure enough it caught the error!
Wait, now that we know what the error is, why are we getting this error?
Preliminary analysis had it that the file server was CPU bound during the "hiccup". How could we really know what was the cause? Sysinternals has a lovely suite called PsTools which provides everything you could ever need to monitor processes from the command line. A simple trigger for the canary job to run a PsExec job when it died with an error was implemented:
psexec \\machinename pslist -s 90 -r 5 -xNow we could get some output from the file server as to what it was doing when the job had the "hiccup". This worked well and we were able to identify the offending process (and even the offending thread!), yet that did not solve our problem. It only identified a cause and most likely not even the root cause! Eventually we will drill down to the actual problem and solve that (only to move on to the next issue, phew).