Recently we got an escalation from the field on AIX in which gab panicked the node because it received a packet with a wrong sequence number. On further debugging we found that everything was correct except that 1 bit in the gab sequence number had flipped from 0 to 1, causing gab to panic the node since it suspected data corruption.
So we asked the customer to enable LLT checksums (which are disabled by default) and see what happens. With LLT checksums enabled, one of the LLT links reported a few LLT checksum errors, and then gab panicked the node again with the same error - the same sequence number bit flipped from 0 to 1.
So we then asked the customer to turn on LLT checksum level 2. This level calculates 2 checksums. The first, called mcksum, covers the message received from gab and is verified on the receiver just before we deliver the msg to gab. The second, called pcksum, covers the full packet; it is calculated just before we give the pkt to DLPI and is verified just after we receive the pkt on the receiver, to detect on-the-wire corruption. If mcksum does not add up, LLT panics the box. With this setting, one of the LLT links reported a few LLT checksum errors, and then LLT panicked the node with a checksum error (mcksum) just before delivering the packet to gab.
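To make the two levels concrete, here is a minimal sketch of the flow in Python. It is not LLT code: the `Packet` layout, the function names, and the choice to drop a pkt on a pcksum mismatch are all assumptions made for illustration, and for simplicity pcksum here covers only the header and the message.

```python
from dataclasses import dataclass
from typing import Optional


def checksum16(data: bytes) -> int:
    """16-bit one's complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


@dataclass
class Packet:  # hypothetical layout, not the real LLT packet format
    header: bytes
    msg: bytes
    mcksum: int
    pcksum: int


class MessageCorruption(Exception):
    """Stands in for the panic on an mcksum mismatch."""


def send(header: bytes, msg: bytes) -> Packet:
    mcksum = checksum16(msg)           # taken when the msg arrives from the upper layer
    pcksum = checksum16(header + msg)  # taken just before the pkt goes to the driver
    return Packet(header, msg, mcksum, pcksum)


def receive(pkt: Packet) -> Optional[Packet]:
    # Verified right after the pkt comes off the wire.
    if checksum16(pkt.header + pkt.msg) != pkt.pcksum:
        return None  # assumed behaviour: count a checksum error and drop
    return pkt


def deliver(pkt: Packet) -> bytes:
    # Verified just before the msg is handed to the upper layer.
    if checksum16(pkt.msg) != pkt.mcksum:
        raise MessageCorruption("mcksum mismatch")
    return pkt.msg


pkt = send(b"LLTH", b"gab message")
good = deliver(receive(pkt))

# Corruption on the wire is caught by pcksum.
on_wire = Packet(pkt.header, b"gab messagf", pkt.mcksum, pkt.pcksum)
dropped = receive(on_wire) is None

# Corruption after the pcksum check is caught by mcksum.
in_memory = receive(pkt)
in_memory.msg = b"gab messagf"
try:
    deliver(in_memory)
    caught = False
except MessageCorruption:
    caught = True
```

The point of the split is that the two checks bracket different parts of the path: pcksum covers the wire, mcksum covers everything between the sender's gab and the receiver's gab.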
Now, this was very puzzling. pcksum verified that the pkt was received fine from the network, but mcksum verification showed that the pkt got corrupted before LLT gave it to gab. So LLT received a good pkt but handed up a bad one - hence it looked as if LLT itself had corrupted the pkt.
After a long debugging session we found that 1 bit in the LLT header had flipped from 0 to 1, and 1 bit in the gab header had flipped from 1 to 0. These bits were at the same bit position if you broke the pkt up into 16-bit chunks. Thus pcksum was not able to catch the corrupt pkt, because the two bit flips cancelled each other out in the sum. The bit flip in the LLT header did not cause a problem for LLT because it landed in a bitfield that was not being used. But mcksum was able to catch the error, since only 1 bit had flipped in the msg (gab header).
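The cancellation is easy to reproduce. The sketch below uses made-up 4-byte "LLT" and "gab" headers (the real header layouts are not shown here, so the byte values are purely illustrative) and flips the same bit position in one 16-bit word of each, in opposite directions.

```python
def checksum16(data: bytes) -> int:
    """16-bit one's complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


# Hypothetical headers, 4 bytes each.
llt_header = bytes([0x12, 0x00, 0x00, 0x01])
gab_header = bytes([0x00, 0x10, 0x00, 0x2A])

pcksum = checksum16(llt_header + gab_header)  # covers the whole pkt
mcksum = checksum16(gab_header)               # covers only the msg

# Flip bit 0x0010 of the first 16-bit word in each header:
# LLT word 0x1200 -> 0x1210 (0 to 1), gab word 0x0010 -> 0x0000 (1 to 0).
bad_llt = bytes([0x12, 0x10, 0x00, 0x01])
bad_gab = bytes([0x00, 0x00, 0x00, 0x2A])

pcksum_still_matches = checksum16(bad_llt + bad_gab) == pcksum
mcksum_still_matches = checksum16(bad_gab) == mcksum

print(pcksum_still_matches, mcksum_still_matches)  # prints "True False"
```

One flip adds 0x0010 to the sum and the other subtracts it, so the sum over the whole pkt is unchanged, while the sum over the msg alone sees only the subtraction.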
Thus we found that 16-bit 1's complement checksums (which is what TCP and IP also use) do have their limitations and cannot catch all errors. In fact, if the whole pkt were swapped in endianness (at 16-bit boundaries), the 16-bit 1's complement checksum would still verify and we would not be able to catch the error.
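Here is a small sketch of that blind spot, again with arbitrary example bytes rather than a real pkt. It checks two rearrangements: byte-swapping every 16-bit word (checksum field included), and swapping adjacent 16-bit words. The helper names are hypothetical.

```python
def ones_complement_sum16(data: bytes) -> int:
    """16-bit one's complement sum over big-endian words (even length assumed)."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def verifies(pkt: bytes) -> bool:
    """A pkt that carries its own checksum sums to 0xFFFF when intact."""
    return ones_complement_sum16(pkt) == 0xFFFF


def byte_swap16(data: bytes) -> bytes:
    """Swap the two bytes inside every 16-bit word."""
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


def word_swap32(data: bytes) -> bytes:
    """Swap the two 16-bit words inside every 32-bit group."""
    return b"".join(data[i + 2:i + 4] + data[i:i + 2] for i in range(0, len(data), 4))


payload = bytes.fromhex("450000541c4640004001")  # arbitrary example bytes
cksum = ~ones_complement_sum16(payload) & 0xFFFF
pkt = payload + cksum.to_bytes(2, "big")         # 12 bytes: payload + checksum

byte_swapped = byte_swap16(pkt)
word_swapped = word_swap32(pkt)

print(verifies(pkt), verifies(byte_swapped), verifies(word_swapped))
```

Reordering words cannot change a sum, and byte-swapping every word just byte-swaps the sum, which leaves 0xFFFF as 0xFFFF. Both mangled pkts differ from the original and both still pass verification.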
The CRC does not help here either, since it only checks for on-the-wire errors - not for errors that happen after the pkt has been received.
Here's a paper (link provided by Dave Thompson) which describes in detail the CRC and TCP checksums:
When The CRC and TCP Checksum Disagree
So much for trusting the hardware. A NIC can do funny things which will make your checksum calculations useless. If you are paranoid, it's best to have the applications checksum their own data - even then some errors will go undetected.
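As a rough idea of what an application-level check could look like, here is a minimal sketch that appends a CRC-32 to each payload using Python's standard `zlib.crc32`. The `wrap`/`unwrap` names and the trailer format are assumptions for illustration, not anything LLT or gab does.

```python
import struct
import zlib


def wrap(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer computed by the sending application."""
    return payload + struct.pack("!I", zlib.crc32(payload))


def unwrap(frame: bytes) -> bytes:
    """Verify the trailer in the receiving application and strip it."""
    payload = frame[:-4]
    (expected,) = struct.unpack("!I", frame[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("application checksum mismatch")
    return payload


frame = wrap(bytes([0x12, 0x00, 0x00, 0x01, 0x00, 0x10, 0x00, 0x2A]))
round_trip = unwrap(frame)

# Two opposite bit flips at the same position in different 16-bit words,
# the pattern that slips past a 16-bit 1's complement sum.
corrupted = bytearray(frame)
corrupted[1] ^= 0x10
corrupted[5] ^= 0x10
try:
    unwrap(bytes(corrupted))
    detected = False
except ValueError:
    detected = True
```

A CRC is not fooled by this particular pattern, but it is still a 32-bit check, so a small fraction of random corruptions will always get through.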