we only rarely schedule to multiple machines at once, and when i do, it doesn't fail any more often than with a single machine. we do have a pretty huge chained build job for both XP and Win7, and i've done a good bit of work (all of it a while ago, none recently) to modularize it and break it into lots of smaller parts rather than one massive job. i don't use file copy tasks; anything i copy goes through robocopy. the steps it fails on most often are anything but long-running: the most common one is a little vbscript that adds an AD group to the local admins group, which runs nearly instantly when it doesn't fail this way. the step before that simply sets the machine to a specific power profile, which is also near instant.
no ports are being blocked. windows firewall is disabled and the machine has no antivirus at that point in the build process. the fact that simply retrying the task works also rules out both a blocked port and an oversized transaction log.
i already use wlogevent quite a bit to put more useful status messages in the console for my techs.
i have noticed that it mostly happens on tasks that use vbscript, and only during production. thing is, it only fails on about 1 in every 10 machines, so i can't easily reproduce it.
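since a plain retry works and i can't reproduce the failure on demand, one stopgap i could try is wrapping the flaky step in a small retry loop that also records every exit code, so there is at least some evidence when it does happen. below is a rough sketch of that idea only, in python purely for illustration. `run_with_retry` and its parameters are names i made up, not anything from the deployment tool, and the defaults (3 attempts, 5 seconds apart) are guesses.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or the attempts run out.

    Returns (succeeded, exit_codes), where exit_codes holds the return
    code of every attempt in order. runner and sleep are injectable so
    the loop can be exercised without launching a real process.
    """
    exit_codes = []
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            return True, exit_codes
        # Pause only if another attempt is still coming.
        if attempt < attempts:
            sleep(delay)
    return False, exit_codes
```

for the admin-group step that would look something like `run_with_retry(["cscript", "//nologo", "add_admin_group.vbs"])`, where the script name is a placeholder. the list of exit codes could then go into whatever status message already gets sent to the console. this only papers over the problem, but the codes from the failed attempts might help narrow down what is actually going wrong.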