A Tale of Upgrades, Blue Screens of Death and Trains 

Oct 31, 2011 12:36 PM

This article is for anyone who just can't get their head into work-mode. The text below contains Altiris keywords just to get through the Connect content filters, and might also (but no guarantees) convince your boss that you're working...  What follows is merely a tale of what I endured for a 'simple'  disk upgrade. The tale's moral can be nicely summarised as "Try, Try and Try Again. And then Give Up. Really. Just Give Up". 

 

Introduction

Not so long ago, and in a country not too far away, I became very frustrated with the speed of my Lenovo T400 laptop. I have a habit of leaving lots of docs open, and that kind of slows things down. This is something that I naturally caution users against (I also advise against neglecting regular backups, the importance of which we'll see later). If you've read my articles before, you'll know I have a lot of train commute time. It's a picturesque commute through the rolling hillsides of the Worcestershire and Oxfordshire Cotswolds. Aside from being pretty, the journey is long enough to undertake detailed simulations, so on the commute run I invariably add a few virtual machines to this plethora of open documents. These VMs range from lightweight Linux routers to Windows clients and servers. By the time all these have powered up, my poor overburdened laptop is pretty much ready to fry a couple of eggs.

So, time for an upgrade I thought. Over the last year I'd been hearing a lot of good things about the improving reliability of Solid State Disk (SSD) harddisk technology. Apparently, the prices of this 'Must Have' technology have been falling, and as my poor Lenovo needed a bit more kick I thought it was about time I got me one of those... ;-)

Now to the bit which makes your boss think you're working. Current SSDs are large capacity, high quality, NAND Flash memory devices. There are two categories of NAND technology used in SSDs: Single-Level-Cell (SLC) and Multi-Level-Cell (MLC). The core distinction between them is that SLC stores one bit per cell, whereas MLC stores two. This is great for MLC because it means you instantly get double the data density. This higher data density comes at a cost however: SSDs based on Multi-Level-Cell technology are lower performing, have a shorter lifetime and suffer from a higher error rate. Ouch.
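For anyone who wants the density argument as arithmetic, here is a rough Python sketch. The cell count is a made-up figure chosen to give round numbers, and the function names are purely illustrative. The point it shows is that storing two bits means each cell has to distinguish four charge levels rather than two, which is where both the doubled capacity and the tighter margins come from.

```python
# Illustrative arithmetic only; the cell count below is a made-up figure.

def voltage_levels(bits_per_cell: int) -> int:
    """A cell storing n bits must distinguish 2**n charge levels."""
    return 2 ** bits_per_cell


def capacity_gb(cell_count: int, bits_per_cell: int) -> float:
    """Raw capacity in decimal gigabytes for a given number of cells."""
    return cell_count * bits_per_cell / 8 / 1e9


HYPOTHETICAL_CELLS = 512 * 10**9  # assumption, not a real part's geometry

for name, bits in (("SLC", 1), ("MLC", 2)):
    print(f"{name}: {voltage_levels(bits)} levels per cell, "
          f"{capacity_gb(HYPOTHETICAL_CELLS, bits):.0f}GB raw")
```

Same silicon, twice the gigabytes, but twice as many levels to tell apart in every cell.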

In the enterprise, MLC-based SSD devices have been pretty much frowned upon as the poor brother of SLC. This is changing though: higher quality chips, improved error correction algorithms and over-provisioning have enabled MLC SSDs to be touted now for major enterprise use. FusionIO for example (a high-end enterprise SSD vendor), offer in the same price bracket the option of either a 160GB SLC card or a 320GB MLC card.

So, with the SSD market looking less bleak, my first port of call was to take a look at some prices on the consumer side of the fence.

  • Intel x25-E 64GB SLC - £600
  • Intel x25-M 80GB MLC - £150
  • Western Digital 64GB N1X SSD - £460
  • Western Digital 64GB Silicon Edge MLC - £175

Wow. OK. That's expensive. And let’s be honest, 64GB isn’t enough.

I quickly realised that for my upgrade to come to pass, I might have to downgrade my expectations. My current laptop is consuming about 120GB of its 360GB C:\ Drive at the moment, which I generally consider to be my working limit of live drive usage (I never like to use more than 1/3 of a mechanical drive - see my IOPS article for reasons why). Consequently, if I move to SSD I'm pretty much going to have to drop my virtual machines or run them from an external disk. Hmm.... Run my bulky vmdk chunks of developer heaven over USB2 and also suffer a reduced battery life on the train? Not ideal - so what else is there?

Well, it turns out there is a hybrid option. A drive technology exists which combines the enormous and cheap disk capacities readily available through mechanical disks with the super-speedy performance of the SLC SSD as cache. And the best thing is, if you're worried about the SSD reliability, you'll still be left with the mechanical harddisk at the end.

The only drive which I could find with this technology is the Seagate Momentus XT. The offering was a 500GB drive with 4GB of SSD cache. All at a cost of £100 - an absolute bargain I thought and sensibly within budget for laptop upgrades at work. So brimming with excitement, I begged for one, promising to gather before and after performance stats to detail how effective the drive upgrade was.

 

First Problem -Err... Challenge!

Within two days of ordering, my new disk arrived. An excellent start. My excitement diminished as I realised that I was going to have a problem getting my data onto it if I didn’t want to re-install the OS. I hadn't thought far enough ahead to get a USB SATA caddie, so the workaround was to upload the existing disk image somewhere. As I didn’t have enough spare capacity on my Altiris server, the fastest option was to add an automation drive mapping in Boot Disk Creator pointing to my Terabyte desktop. I then created the new USB flash files for Linux, dropped them on my automation boot USB stick, and then booted the laptop for the image upload. This was a slightly convoluted solution (a USB caddie would have been better), but I was keen to get moving. The upload speed quickly reported in at 1GB/min, and predicted an imaging time of 2hrs 10 minutes. Lovely. Just enough time to break the hybrid drive out of its packaging and lose myself in some technology worship.

At this point, a piece of recent history became important. You see, the RJ45 clip had broken off my ethernet cable some weeks ago, and I'd never gotten around to getting a new one. So, imagine my overwhelming sense of stupidity when an hour into imaging I decided to put my feet up and twist the laptop ever so slightly for a better view of progress. The laptop moved, brilliant, but the ethernet cable didn't. My laptop, now catastrophically separated from the network, could no longer upload the image and the session was killed. With a big intake of breath, I sat up, pushed the laptop back, rebooted it and started the upload once again.

For general imaging safety, I decided to leave the laptop on the desk and move downstairs to the test network - I could do more damage there. I returned to the office about an hour later and sat at my desk. Imaging was going well. Completely forgetting about the dud cable, I leaned back on my chair, put my feet up and with a sudden sense of deja vu I twisted the laptop to see it better. At this point the ethernet cable once again departed its socket, leaving me practising my breathing exercises in front of another dead imaging session. Feeling just a little annoyed with myself, I concluded that it was probably time to find a replacement cable. I found one that was just a fraction too short, but if I put the laptop right on the edge of the desk it just reached. I started imaging again, moved away from the laptop and sat at the spare desk. I looked at the clock... there was still time to do this and just catch my train.

 

Second Challenge -RDeploy's Time Keeping and Flimsy DVD Drives

I've had a niggling feeling for a while now that something wasn't right with how RDeploy reports imaging speeds and time. I ran through some tests on VMs when my suspicions first arose, but of course for the duration of my speed trials everything looked good. I also opened a ticket with Altiris Support just in case they’d heard of this, but they hadn’t. So I let it be. I should have been more thorough...

I started imaging with my short but intact ethernet cable at about 4:30pm, and the predicted completion time was about 6:40pm. Plenty of time for me to catch the train home at 7 I thought. When 6pm came though, I started to worry. Apparently only 54 minutes had elapsed, and it was less than half-way through the upload.

Very carefully, I opened up a second console in Linux automation. Interestingly, the system clock in Linux was reporting fine. So the Linux RDeploy engine wasn't counting the clock ticks right. This also meant of course that the reported 1GB/min upload speed was actually closer to 600MB/min. With some reluctance I forced myself to kill the imaging there and then (so I could take the laptop and work on the train). I’d try again first thing tomorrow.
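The sum behind that 600MB/min figure can be sketched in a few lines of Python. This is only a back-of-envelope check using the numbers from the tale (a 4:30pm start, a 6pm check, 54 minutes reported as elapsed). The function names are made up for illustration, the date is a placeholder, and 1GB/min is assumed to mean 1000MB/min.

```python
from datetime import datetime


def clock_ratio(reported_minutes: float, started: datetime, now: datetime) -> float:
    """Fraction of real time that the imaging tool's own clock has counted."""
    real_minutes = (now - started).total_seconds() / 60
    return reported_minutes / real_minutes


def corrected_speed(reported_mb_per_min: float, ratio: float) -> float:
    """Scale a reported throughput down to what the wall clock implies."""
    return reported_mb_per_min * ratio


# Figures from the tale: started 4:30pm, checked at 6pm, 54 minutes reported.
# The date is a placeholder; only the times of day matter.
started = datetime(2011, 1, 1, 16, 30)
checked = datetime(2011, 1, 1, 18, 0)

ratio = clock_ratio(54, started, checked)
speed = corrected_speed(1000, ratio)  # assumes 1GB/min means 1000MB/min
print(f"clock ratio {ratio:.2f}, real speed about {speed:.0f}MB/min")
```

In other words, the tool's clock was only counting about six minutes in every ten, so every speed and time-remaining figure it showed needed the same correction.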

I picked up the laptop, ready to put it in its bag, but had forgotten in my rush that there was no slack in the ethernet cable. It yanked back, sending the laptop crashing onto the desk. Forcing myself to breathe slowly, I carefully removed the ethernet cable, praying for luck. I checked the laptop over, and amazingly, it all looked OK. The internal DVD-RW tray had however slid out. Not a problem. I pushed it back in, but no click. I let go, and it came right back out again. I pushed it in again, and true to the laws of physics it came right back out. The internal retaining clip had obviously broken. Sod it. I removed the entire caddie. Who needs a DVD drive anyway?

I considered myself lucky though -at least I hadn't killed the harddisk.

 

Third Challenge -The Momentus XT Drive Firmware

The next day I came in, mentally prepared for the erroneous "Time Remaining" data. In the end, despite another 2hrs and 10 minutes being predicted, it took a full 3hrs 30 minutes (in real time) for the 110GB of image files to upload.

Not a problem though - plenty of time. I took lunch, then cracked on. After scrounging some screwdrivers, I carefully removed my trusty Seagate Momentus 7200.3 320GB disk from the Lenovo. With some trepidation, I replaced it with my shiny new 500GB Momentus XT angel. As my experience is that image downloads are generally faster than uploads, I felt completely at ease when I started the download at 3:00pm. Even if it took a full 3 hours and 30 minutes, I felt confident that I'd be done by 6:30pm, at which point I'd pack up, and be off to the train station with my fast and shiny laptop held proudly before me. Perhaps I'd even be home in time for tea.

But, we live in the world of IT Jenga. Blocks of technology are built upon other blocks of technology. And like Jenga, much of this is assembled during happy hour at the pub. So, imagine my complete lack of surprise when I realised that the image was downloading at a reported speed of 665MB/min. Taking into account RDeploy's odd method for measuring time (assuming the clock ticks were counted equally badly on uploads and downloads), this meant an actual speed of 400MB/min. So, 110GB was going to take, oh, 4 hours 45mins. In short, I was stuffed for the 7pm train.
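The estimate above can be reproduced with a short Python sketch. It assumes the same 0.6 clock ratio seen on the upload (54 reported minutes in 90 real ones) and 1024MB to the GB. Both are assumptions, and the helper names are made up for illustration. The result lands within a few minutes of the rounded figure quoted above.

```python
CLOCK_RATIO = 0.6  # assumption: same miscount as the upload (54 of 90 minutes)


def transfer_minutes(size_gb: float, mb_per_min: float, mb_per_gb: int = 1024) -> float:
    """Minutes needed to move size_gb at a sustained rate of mb_per_min."""
    return size_gb * mb_per_gb / mb_per_min


def as_hours_minutes(minutes: float) -> str:
    """Format a duration in minutes as 'Xh YYm', rounded to the minute."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h {mins:02d}m"


reported_speed = 665                          # MB/min, as shown on screen
real_speed = reported_speed * CLOCK_RATIO     # roughly 400MB/min

print("taking the screen at its word:", as_hours_minutes(transfer_minutes(110, reported_speed)))
print("after correcting the clock:   ", as_hours_minutes(transfer_minutes(110, real_speed)))
```

Either way, a 3:00pm start was never going to make the 7pm train once the clock correction was applied.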

On the positive side I had plenty of time to see what was up with this drive...

 

Seagate Policy and Firmware Updates

The first port of call in looking for Seagate drive problems I guessed would be the Seagate website. After a few minutes rummaging, I came up with the Firmware updates for Seagate Products page.  The Momentus XT drive is listed at the top here for updates, and was at that time on Firmware version SD24. I looked around for the release notes, and bizarrely found none.  Back on the Firmware updates page I found this small sentence,

Please note that Seagate does not offer details about specific firmware.

Now, anyone who has wrestled with firmware faults will likely understand my general feeling at reading those words. Manufacturers who take this secretive stance do not appear mystical and all-powerful. They are simply annoying and aloof. I ploughed through the forums, and found that many people had been suffering from performance issues with these drives. The new SD24 update looked promising, so the next thing was for me to find out exactly what firmware version I was running. For that, Seagate provides a handy Windows utility.

At that point, I still had another hour to go in the Linux imaging environment. Being impatient, I decided to see if I could fathom the firmware version from within Linux automation. I moved to the second console (ALT-F2) and typed,

cat /proc/scsi/scsi

From which the following trickled out,

Attached Devices:
Host: scsi5 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: ST95005620AS           Rev: SD24
Type: Direct-Access                       ANSI SCSI revision: 05

Which although neat, was a tad unsatisfying. My drive had been shipped with the latest firmware version after all.  
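If you need to pull the firmware revision off more than one machine, the same listing is easy enough to parse. The Python below is a minimal sketch that works on the sample output shown above. The regular expression and function name are my own, and it assumes the Vendor, Model and Rev fields always appear on one line in that order.

```python
import re

# Sample text copied from the listing above.
SAMPLE = """Attached Devices:
Host: scsi5 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: ST95005620AS           Rev: SD24
Type: Direct-Access                       ANSI SCSI revision: 05
"""

# Assumes Vendor, Model and Rev share a line, in that order.
DEVICE_LINE = re.compile(
    r"Vendor:\s*(?P<vendor>.*?)\s+Model:\s*(?P<model>.*?)\s+Rev:\s*(?P<rev>\S+)"
)


def parse_scsi_devices(text: str) -> list[dict[str, str]]:
    """Return one dict of vendor, model and firmware revision per device."""
    return [match.groupdict() for match in DEVICE_LINE.finditer(text)]


for device in parse_scsi_devices(SAMPLE):
    print(f"{device['model']} is on firmware {device['rev']}")
```

On a live Linux system you could feed it the contents of /proc/scsi/scsi instead of the sample string, if your environment happens to have Python to hand.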

 

Fourth Challenge -Catching My Train

Whilst not entirely IT related, this one does show the sacrifices we make for our art. After a quick scan of the train timetables, I discovered that if I did not catch the train in 20 minutes then I'd have problems - there were no more trains due to scheduled rail works. Missing this train would mean I’d have to suffer the rail replacement bus service with a three hour journey time. I’d get home at about 1am.

And I get motion sickness.

I looked at my laptop, so very nearly finished, but not quite. I couldn’t stop it now - I was sooo close. So I picked up the phone and called my wife. I'm going to be very, very late dear...

Half an hour later, imaging finished and I booted up the laptop. All was good. Fantastic.  Time for a coffee, then it was off to the bus for a sickening ride home.

When I got up the next day I was frankly pretty tired. But, I wasn’t grumpy. I was buoyed with positive energy from apparently emerging through the IT furnace unscathed. I cycled to the station and got on the train, pretty much looking forward to the journey with my revitalised laptop. Just as I got to Moreton-in-Marsh, my laptop froze and then offered the following greeting,

 

A problem has been detected and windows has been shut down to prevent damage
to your computer.
 
A process or thread crucial to system operation has unexpectedly exited or been terminated.
 
If this is the first time you've seen this Stop error screen,
restart your computer. If this screen appears again, follow
these steps:
 
Check to make sure any new hardware or software is properly installed.
If this is a new installation, ask your hardware or software manufacturer
for any Windows updates you might need.
 
If problems continue, disable or remove any newly installed hardware 
or software. Disable BIOS memory options such as caching or shadowing.
If you need to use SafeMode to remove or disable components, restart
your computer, press F8 to select Advanced Startup Options, and then
select Safe Mode.
 
Technical Information:
 
*** STOP: 0x000000F4 (0x00000003, 0x8A0A6990, 0x8A0A6B04, 0x805D29B4)
 
Beginning dump of physical memory

 

It’s times like this that I’m reminded why I should not work in Notepad. No auto save. I bravely fought the impulse to scream, and the resulting internal struggle resulted only in a slight whimper. This scared a couple of fellow passengers, and this is where it first occurred to me that creating curious noises on the train was a fine tactic for getting a little extra space. Importantly though, there were no tears. This enabled me to note that the hard disk light was constantly on for the duration of the blue screen.

When I forced a reboot, I was unsettled to find no indication of either a dump file or an event log alert (I always configure my systems to write a dump file and alerts to the administrative log, as I seem to attract BSODs). So, whatever critical process terminated, a symptom was that disk access was not possible.

 

 

I powered up the laptop, and started playing with a few VMs. Within a few minutes, I found that my laptop was intending to be rich and diverse with the STOP errors it could summon,

 

A problem has been detected and windows has been shut down to prevent damage
to your computer.
 
KERNEL_DATA_INPAGE_ERROR
 
If this is the first time you've seen this Stop error screen,
restart your computer. If this screen appears again, follow
these steps:
 
Check to make sure any new hardware or software is properly installed.
If this is a new installation, ask your hardware or software manufacturer
for any Windows updates you might need.
 
If problems continue, disable or remove any newly installed hardware 
or software. Disable BIOS memory options such as caching or shadowing.
If you need to use SafeMode to remove or disable components, restart
your computer, press F8 to select Advanced Startup Options, and then
select Safe Mode.
 
Technical Information:
 
*** STOP: 0x0000007A (0xC05CF9A8, 0XC0000056, 0XD9F35642, 0XB3C04860)
 
***     ftdisk.sys -Address xxxxxxxxx base at xxxxxxxx, Datestamp xxxxxxxx
 
Beginning dump of physical memory

 

I closed the laptop down, and lacking anything else to do I proceeded to eat the sandwiches I'd bought for lunch. Once I got to work, I consulted the Google oracle - MSDN came up with a Microsoft Bugcheck 0xF4 article. This told me pretty much what I saw in the text - that this is due to a critical process which has terminated unexpectedly. The second bugcheck (with ftdisk.sys being mentioned) points to something being seriously wrong with the new disk. A quick Google for that parameter 0xC0000056 came up with a listing from the Wine project, where this error is listed in the ntstatus header file as meaning STATUS_DELETE_PENDING. All in all, pretty much as informative as most BSOD screens I've encountered.
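To save squinting at hex on the next blue screen, the lookup can be sketched in Python. The two tables hold only the codes that appear in this tale (0xF4 is documented by Microsoft as CRITICAL_OBJECT_TERMINATION). Treating the second parameter of a 0x7A as the I/O status is my reading of the bugcheck documentation, so take it as an assumption. The function name is made up.

```python
import re

# Only the codes seen in this article; a real table would be far longer.
BUGCHECK_NAMES = {
    0x7A: "KERNEL_DATA_INPAGE_ERROR",
    0xF4: "CRITICAL_OBJECT_TERMINATION",
}
NTSTATUS_NAMES = {
    0xC0000056: "STATUS_DELETE_PENDING",
}

STOP_LINE = re.compile(r"STOP:\s*(0x[0-9a-f]+)\s*\(([^)]*)\)", re.IGNORECASE)


def decode_stop(line: str) -> str:
    """Turn a '*** STOP: 0x... (p1, p2, p3, p4)' line into a readable summary."""
    match = STOP_LINE.search(line)
    if match is None:
        raise ValueError("no STOP code found")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    summary = BUGCHECK_NAMES.get(code, "unknown bugcheck")
    # Assumption: for 0x7A the second parameter carries the I/O status code.
    if code == 0x7A and len(params) >= 2:
        status = NTSTATUS_NAMES.get(params[1], f"0x{params[1]:08X}")
        summary += f" (I/O status: {status})"
    return f"0x{code:02X} {summary}"


print(decode_stop("*** STOP: 0x000000F4 (0x00000003, 0x8A0A6990, 0x8A0A6B04, 0x805D29B4)"))
print(decode_stop("*** STOP: 0x0000007A (0xC05CF9A8, 0XC0000056, 0XD9F35642, 0XB3C04860)"))
```

It won't tell you why the disk went away, but it does at least put names to the numbers.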

Over the week that followed, I sat the laptop in the corner of my desk and had it execute burn-in applications. Infuriatingly, it didn’t blue-screen once.

So, I started using it again, and on the train home it blue-screened. I just stared at the screen - and decided this was tantamount to war. I resolved to continue using the laptop, and establish the pattern. So, the following week I used the laptop as normal (but for now relinquishing Notepad, and typing docs in MS Word, saving regularly). And there was a pattern. It seemed I could fairly reliably force a temporary freeze of the laptop by moving it or giving it a jolt (such as experienced on a train going over points). If I added a burst of intensive IO at the same time, by starting a VM for instance, I found I could pretty much guarantee that the laptop would throw a fit and produce a BSOD.

 

The Solution

At this point I looked at my options. The solution was to give up:

  1. My laptop drivers and firmware were current
  2. Contacting Lenovo support was pointless - the hardware they supplied worked fine
  3. Contacting Seagate was going to be a pain. From looking at the forums, it came across fairly clearly that getting support from Seagate was going to be a long, hard road.

So, I removed the XT drive, and replaced it once again with the old Seagate. Although fully-mechanical, it was reliable. I was going to stick with my trusty Seagate non-hybrid drive FOREVER.

 

Epilogue

Like a teenager in love, forever lasted about 4 months. This article was mostly written back in March, and I decided not to publish it as it didn't have an ending. That changed at the end of last month (September) when my 'reliable' harddisk decided to just die. And not just die a little bit. Die a lot, and instantly. Perhaps its demise was hastened by the short-ethernet-cable laptop-drop incident. Perhaps it was fate.

Luckily, being in IT, I of course know the value of keeping backups, so this should not have been a problem. Much to my embarrassment though, I realised I hadn't synchronised with my backup server for just over a week. Further, I'd spent the week previous to the crash programming rather intensively on ImageInvoker. All backups were therefore prior to this spate of productivity.

I admit at this point I might have panicked. I stuck the drive in the freezer for an hour. That didn't work, first and foremost (as my wife reminds me) because I'd frozen the wrong drive. I then tried freezing the right disk for an hour. Nope. A whole day. Nope. Then I tried changing the circuit board - not as easy as it looks. I then admit I banged the drive around a bit. That oddly didn't work either. So, in the end, the only answer was professional data recovery. And to be honest after what I put the drive through, they'd have to be professionals to get anything off it.

Disappointing backup regimes aside, my driveless laptop prompted me to once again check out the Seagate website for the Seagate Momentus XT. It turned out they were now on SD28 - that's four firmware updates released in the last 6 months. All in all, a pretty good indication that they've been tackling problems. So I resurrected my Seagate XT, threw my HII Windows 7 64-bit build on to it, and tried it out.

Moving forward a month, it's now the end of October and I've been running with my freshly installed hybrid drive for nearly a full month. I'm happy to say that I haven't had a single blue-screen. This is great, but with my previous drive being dead, I'm not really able to show the nice stats I wanted to highlight as the performance gains due to the SSD cache. It does feel faster though if that's any help....

And in a strange twist of fate, one of the few working folders that could be completely recovered from my harddisk was my ImageInvoker source. For the first time in 6 months, I felt lucky. My week of programming was not wasted after all. The data recovery firm were also kind enough to return my data on a 500GB Buffalo drive, with instructions on how to back up my data in future. A gesture which I found 90% kind, and 10% infuriating.

All I have to do now is get my DVD drive sorted. Looks like a fairly simple job ;-)

Kind Regards,
Ian 'May all your upgrades be dull' Atkin
