Wednesday, July 25, 2007

MPICH.NT, MPICH2, and TCP Offloading (continued)

Well we tried updated drivers for the NC373i but with no avail. We were still seeing TOE errors which resulted in MPICH.NT and MPICH2 hanging. Luckily the machines have second NIC's, NC380T PCIe DP Multifunc Gigabit Server Adapters. Unluckily for us, those network cards exhibited the same problem as the embedded NC373i.

This leads me to believe that there is a problem with the Windows 2003 R2 Scalable Networking Pack. Specifically with the Chimney TCP Offloading portion. There may even be an issue in MPICH.NT and MPICH2 and Windows 2003 R2's SNP. However, that just seems highly unlikely as MPICH.NT and MPICH2 are wildly different under the hood.

We've initiated a case with Microsoft to get to the bottom of this issue. Meanwhile our new servers have TOE disabled, which isn't bad, but it isn't good either.

Monday, July 23, 2007

MPICH.NT, MPICH2, and TCP Offloading

Recently we came across a strange problem where on some new machines MPICH.NT and MPICH2 both would fail to correctly operate across machines. On a single machine there were no problems, and actually the application partially worked under MPICH.NT.

My initial guess was MPICH.NT did not play well with Windows 2003 Server R2. Previously we have only had Windows 2000 Server for the MPI jobs, so it seemed logical that changing the server OS would cause some issues. I recompiled the application for MPICH2 (bigger, newer, better) and found the application "hung" in exactly the same fashion as under MPICH.NT.

So now I had a common failure mode across two versions of MPICH (which are wildly different under the hood) on the same OS. I started running MPICH in debug/verbose mode and spent a lot of time looking at the thousands of lines of output and noticed that both under MPICH.NT and MPICH2 they came to the same place and halted, only I couldn't tell where in the code this was.

You may think this is where I fired up the parallel debugger and did wild and crazy things, but that takes too much time. I went with good ole fashioned printf debugging. I ended up getting output like the following:
[0] calling MPI_Bcast ...
[1] calling MPI_Bcast ...
[2] calling MPI_Bcast ...
[3] calling MPI_Bcast ...
[4] calling MPI_Bcast ...
[N-2] calling MPI_Bcast ...
[N-1] calling MPI_Bcast ...
Where [X] is the process number and if the call was successful the MPI_STATUS code would be returned on the next line, however, none of these calls returned. An important thing to note is these MPI_Bcast calls were sending buffers of 150MiB+, which is quite large. This fact drove me to check on the network card settings.

While poking around in the network card settings, I noticed two things:
  1. Network card drivers were out of date
  2. Network card status tool counters showed some errors in TCP Offloading
Using the information that TOE had reported errors in the past, I reran the application and watched the TOE error counters. Sure enough, when that MPI_Bcast line was reached, the TOE error counters incremented by one. So I went and disabled TOE on my two test machines and reran the application.

Boom, problem solved. Well...not really.

Disabling TOE will kill performance for other applications that do not have an issue with offloading. However, I cannot upgrade the drivers for the network card (even as a test) without going through 10 miles of red tape. That is neither here nor there, the important part is the problem has been identified and can be solved.

So if you have an HP NC373i Multifunction Gigabit Ethernet Adapter and experience problems with MPICH or MPICH2 on Windows, it is probably the TCP Offloading Engine. Try updating your drivers or disabling TOE to solve the issue. I will post an update if the latest drivers indeed fix this issue.

Tuesday, July 17, 2007

CNN, Kashiwazaki NPP, and the 2007 Niigata Earthquake

(ed: I normally don't blog about non-technical issues, but this chaps my cheeks to no end)

If you haven't heard yesterday an earthquake happened of the coast off the Niigata prefecture in Japan. It had a magnitude of 6.8 and caused serious problems for the Kashiwazaki-Kariwa NPP. A transformer caught fire at unit 3 (there are 7 units) and radioactive liquids spilled into the ocean. This is obviously a serious event and should be treated as such, however, the news coverage was yellow journalism at best. All of the initial reports were sensationalist and biased with little facts (partially due to the tight lipped nature of TEPCO) to support any of their claims.

To see how bad the yellow journalism got just read the following headline blurb:
Radioactive leak, tremors follow Japan quake
A strong earthquake struck northwestern Japan today, causing a radioactive leak and fire at one of the world's most powerful nuclear power plants. Eight people were killed and hundreds injured. The plant leaked about 315 gallons of water, according to a Tokyo Electric official.
So, how many casualties were a result of the radioactive leak and fire:
  • 8 deaths and hundreds of injuries
  • 6 deaths and hundreds of injuries
  • 1 death and tens of injuries
  • 0 deaths and 0 injuries
If you guessed 0 and 0 you would be right!

But wait...

Didn't the blurb say that eight people died and hundreds were injured by the radioactive leak and fire?! Why yes, yes it did. This is far from ethical, yet CNN went right ahead and posted that online.

Don't believe me? Check out this screenshot of CNN's misrepresentation of the Kashiwazaki-Kariwa leak and fire.

Wednesday, July 11, 2007

Lessons Learned Flying Space Available

Flying space available is great. It is cheap, fun, and can be very flexible. However, when things get busy you can find yourself in trouble. Space available travel is prioritized and if you aren't an employee of an airline, you'll find yourself at the bottom of the pile.

You can mitigate your troubles by always arriving for the earliest flight possible. Quite often people oversleep and miss the early flights. Waking up at 3:30 AM to possibly catch the 5:00 AM flight might sound awful, but it may be the only way to make it to your destination.

If you are not able to make it onto a flight you are listed for, you are automatically rolled over to the next flight. However, twice I have not been rolled over for one reason or another and had to be manually moved. Always double check to make sure you have been added to the next flight.

When you miss a flight, it may seem tempting to use the 1-800 number most airlines have to change to a different option. However, I've found they tend to not get the change made properly. Always use the gate agents to make the change and quite often they will also suggest alternate routes (if only to get you off their back).

If you happen to be woefully unlucky and are bumped from one day to the next save your boarding pass from the day before. You can use that to skip the lines at the ticketing counter and go straight through security. Once you get to the gate double check that you are indeed listed on the flight and get yourself a new flight coupon (or if really lucky a boarding pass).

The last bit of advice is to be as sociable as possible. Hang out with the gate agents, any flight attendants who are stuck, and yuck it up with other space available passengers. I've met so many great people sitting around in airports waiting on flights. Some of these people can even help you out if you're really stuck, so it's worth being polite and friendly. Getting somebody a $3 airport coffee can go a long way towards making it onto a flight.