Deployment Solution

  • 1.  DS 6.9 SP5 MR3 - intermittent failure to copy unattend.xml

    Posted Jan 20, 2014 09:32 PM

    Got a strange one here.

    Two organisations have both migrated their DS to new (virtual) servers. Both were on Server 2003 R2 and are now on 2008 R2, running DS 6.9 SP5 MR3, i.e. the latest version prior to SP6.

    The image deployment process is fairly generic, although I don't use the Altiris-provided GUI for image deployment.  It's a Windows 7 Enterprise WIM image - 1 of them is 32 bit, the other is 64 bit.

    My image deployment works in the following way, and it has worked flawlessly ever since Windows 7 was released:

    • Boot to WinPE (32-bit)
    • Map the drives via PXE redirection
    • Create partitions/format the drives
    • Apply the image via imagex
    • Perform token replacement on unattend.xml - the only token is the computername.  One of the environments uses %SERIALNUM% for the computername, the other one uses %ASSETTAG%
    • Copy the unattend.xml down to \windows\panther\unattend.xml
    • Copy drivers down based on the model
    • Inject the drivers
    • Reboot - and Windows takes care of the rest - the dagent installs via setupcomplete.cmd

    The intermittent issue I'm seeing is that the PCs are failing to get the unattend.xml copied down properly. I know the token replacement itself works, because the token-replaced file is sitting in the express share under \TEMP\%ID%.xml, and the deployment process "thinks" that the file copies down to the client machine OK. But when the machine boots up the first time, it doesn't get the right host name. The unattend.xml file in the image uses an "*" for the hostname, so the machines that fail to get the token-replaced file copied down boot up with an ORGNAME-SERIAL type of hostname, which is proof that the token-replaced XML file didn't copy down.

    Of course, when I put a pause (press any key) in the job after the file copy, it's fine - which leads me to believe that it's a timing issue.

    I'm currently trialing the use of a PING -n 10 127.0.0.1 >NUL after the file copies down, in the thought that it's a timing thing.  This never occurred when the server was 2003, so I can only think that it's an issue with the express share now being on a 2008 R2 server and Windows PE 2.1.

     

    This is one of the most bizarre bugs/issues I've seen in DS in the years I've been working with it, and I'm wondering if anyone has any suggestions/fixes or if anyone has seen it before.

     

     



  • 2.  RE: DS 6.9 SP5 MR3 - intermittent failure to copy unattend.xml

    Posted Jan 21, 2014 04:22 PM

    Question... are you using a copy file task or are these through batch/command-line jobs?

    We're currently using a mix of DS 6.9 SP5 MR3 and DS 6.9 SP6 on both 2k3 and 2k8R2 servers (the 2k8R2 servers have both versions, the 2k3 ones only SP5 MR3 of course) and haven't had these sorts of problems. Our jobs range from the overly complicated (a job to wipe the hard drive and then deploy a Win8 VHD in Native Boot, which uses 3 separate diskpart scripts) to the simple standard Deploying Image jobs.



  • 3.  RE: DS 6.9 SP5 MR3 - intermittent failure to copy unattend.xml

    Posted Jan 21, 2014 07:45 PM

    Thanks for the response Maymne.  I am simply using a straightforward batch command in a run script job to copy the file down, e.g. copy /y k:\temp\%ID%.xml q:\windows\panther\unattend.xml (I am mapping express to the K: drive, and assigning the local drive as Q: because local optical drives, USB drives etc. often steal the C: and D: drive letters).
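A minimal sketch of the same copy with explicit checks, so that the run script job returns a non-zero code when something is visibly wrong. It assumes K: and Q: are mapped as described above, that %ID% is replaced by Deployment Server before the script reaches the client, and that the job is set up to treat a non-zero return code as a failure. The exit code numbers are arbitrary choices for illustration.

```batch
@echo off
rem Sketch only. K: = eXpress share, Q: = local Windows volume, as in the post.
rem %ID% is a Deployment Server token, replaced before the script runs.

rem Make sure the token-replaced file is really there.
if not exist "k:\temp\%ID%.xml" (
    echo Token-replaced file k:\temp\%ID%.xml not found
    exit /b 2
)

rem Make sure Q: is really the freshly imaged Windows volume.
if not exist "q:\windows\panther\" (
    echo Q: does not look like the Windows volume
    exit /b 3
)

copy /y "k:\temp\%ID%.xml" "q:\windows\panther\unattend.xml"
if errorlevel 1 (
    echo Copy reported an error
    exit /b 1
)
exit /b 0
```

This would not catch a copy that reports success but never lands on disk, which is what the symptoms point to, but it does rule out the simpler causes (missing source file, wrong drive letter).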

    I've also tried using robocopy via a run script job, and also xcopy.  Same results - intermittent failure to copy the file even though the job "thinks" the file has been copied down.

    I might just try your method of using a copy file task via the gui instead of the script/batch method.



  • 4.  RE: DS 6.9 SP5 MR3 - intermittent failure to copy unattend.xml

    Posted Jan 23, 2014 06:26 PM

    Yeah, we change the PXE eXpress drive letter mapping from F to Q as part of our automatic changes that happen as soon as a new server is set up, because we've been bitten by that too many times.

    I'd suggest setting up a diskpart script to remap drive letters if you want to be sure that your drive is mapped where you expect it with the batch script. If you want to just use the copy task, that should hopefully work, but it might have the same problems if the C or Q drive isn't the local drive.

    If this is something that you can actually reproduce on demand, throw a "pause" after your copy and check what happened. If it's down to raw luck, then you can either throw the pause in and watch it, knowing that it's no longer fully automated for this period, or you can be sneaky and output the results of your copy command to a log file. I suggest something like: copy /y k:\temp\%ID%.xml q:\windows\panther\unattend.xml > k:\copylogs\%ALTIRIS_PXE_CLIENTMAC%.txt

    That would log the copy message to a file with the computer's MAC address as its name in the copylogs folder (if you've made one) on your eXpress share. Then, next time there's a failure, just check the log file to see what it said. If it said that it finished, then it does look like you'd just need to delay the copy for a few seconds so that the install doesn't move on until it's finished. If it says it failed... that failure message should help you know what happened - if it mapped the wrong drive letters, if it didn't see the file, or whatever caused it.
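A slightly fuller version of that logging idea might look like the sketch below. A plain ">" captures only standard output, so "2>&1" is added to send any error text to the same file. The exit code and a directory listing of the destination are logged as well. The k:\copylogs folder and the LOG variable name are assumptions for illustration.

```batch
@echo off
rem Sketch only. One log per client, named after its MAC address.
if not exist "k:\copylogs\" md "k:\copylogs"
set "LOG=k:\copylogs\%ALTIRIS_PXE_CLIENTMAC%.txt"

>> "%LOG%" echo %DATE% %TIME% starting copy of unattend.xml

rem Capture both normal output and error messages from copy.
copy /y "k:\temp\%ID%.xml" "q:\windows\panther\unattend.xml" >> "%LOG%" 2>&1
>> "%LOG%" echo copy exit code: %ERRORLEVEL%

rem Record what actually ended up on the local disk (size and timestamp).
dir "q:\windows\panther\unattend.xml" >> "%LOG%" 2>&1
```

The redirection is placed in front of the echo commands so that a trailing digit in the message can never be read as a file handle number.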

    One caveat here: if you're using DS 6.9 SP6, the environment variables are broken in the release version. You'll need to follow one of the sets of directions on fixing those and recompile the PXE files. If you're using DS 6.9 SP5 MR3 or before, you should be fine. :)



  • 5.  RE: DS 6.9 SP5 MR3 - intermittent failure to copy unattend.xml

    Posted Jan 23, 2014 07:24 PM

    Hey thanks for the suggestions Maymne,

    Whenever we pause things it's fine, which leads me to believe that it's a timing thing.  I have now changed the jobs that copy down unattend.xml and the dagent msi so they use a copy file task via the good old DS GUI, and so far it seems pretty good.  I suspect the dagent is moving on before the work is done when using batch, whereas with the GUI it specifically waits for a return code from the Altiris-specific task. Well, that's my theory so far!

    Will be deploying 100 machines next week so I hope it holds out...

     

    With diskpart, we are mapping the local C: (BitLocker/recovery partition, 300 MB) and D: (rest of the drive) drives to P: and Q: for the purpose of the deployment.  As you know, Windows maps Q: back to C: when sysprep takes over, so that part's OK.  PXE maps K: for express and L: for the local site's image store.
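For anyone following along, a letter remapping of this kind could be sketched as a batch wrapper that writes a small diskpart script and runs it. The disk and partition numbers here (disk 0, partitions 1 and 2) are assumptions and must match the actual layout. The DPS variable and the remap.txt file name are made up for the example.

```batch
@echo off
rem Sketch only: assumes disk 0 holds a 300 MB partition 1 and the OS partition 2.
set "DPS=%TEMP%\remap.txt"

rem Redirection goes first so lines ending in a digit are written correctly.
>  "%DPS%" echo select disk 0
>> "%DPS%" echo select partition 1
>> "%DPS%" echo assign letter=P
>> "%DPS%" echo select partition 2
>> "%DPS%" echo assign letter=Q

diskpart /s "%DPS%"
if errorlevel 1 (
    echo diskpart reported an error, drive letters may not be as expected
    exit /b 1
)
exit /b 0
```

If P: or Q: is already taken by another device at that point, the assign step should fail, and the non-zero exit code gives the job a chance to stop before the copy runs against the wrong drive.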

    We found that using ping -n 10 127.0.0.1 after the copy, even with delays of up to 30 seconds, didn't solve the issue. It certainly reduced the incidence of failures, but didn't eliminate them. I was even going to set up an "if exist" batch file to look for the unattend.xml after the copy and, if it wasn't there, run another job to copy it again, looping over and over until the silly thing copied.
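The retry idea could be sketched roughly as follows. Because the image already carries its own unattend.xml, a bare "if exist" test would pass even when the copy failed, so this version compares file sizes instead. The limit of 5 attempts, the 5 second wait and the variable names are arbitrary assumptions, and a matching size is only a rough indication that the right file arrived.

```batch
@echo off
setlocal
rem Sketch only. K: = eXpress share, Q: = local Windows volume.
set "SRC=k:\temp\%ID%.xml"
set "DST=q:\windows\panther\unattend.xml"
set TRIES=0

:retry
set /a TRIES+=1
copy /y "%SRC%" "%DST%" >nul 2>&1
if errorlevel 1 goto wait

rem Compare sizes: %%~zF expands to the file size, or to nothing if missing.
set SRCSIZE=
set DSTSIZE=
for %%F in ("%SRC%") do set SRCSIZE=%%~zF
for %%F in ("%DST%") do set DSTSIZE=%%~zF
if defined SRCSIZE if "%SRCSIZE%"=="%DSTSIZE%" goto done

:wait
if %TRIES% GEQ 5 goto failed
rem Roughly 5 seconds of delay, using the same ping trick as above.
ping -n 6 127.0.0.1 >nul
goto retry

:done
echo unattend.xml copied after %TRIES% attempt(s)
exit /b 0

:failed
echo unattend.xml still not in place after %TRIES% attempts
exit /b 1
```

Returning a non-zero code on the final failure means the job can at least be flagged in the console rather than letting the machine boot with a generated hostname.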

    Will keep bashing away at it this weekend, it's getting better... Time to torture test it and see if it can handle multiple deployments at once.  Otherwise it's out the window.   We haven't gone with SP6 yet - the client wants a reliable deployment first and foremost :-)  But useful to know that, thanks for the info.