nick.org - down!
After almost four years of no unscheduled downtime, nick.org came to a screeching halt Sunday. I noticed an email from my brother, Kevin, indicating that I had farked something up. “That’s weird,” I thought, “it was working fine last night.” I went to the website and found it very unavailable, with nothing but a strange Apache directory listing. I attempted to ssh in and found that it was unavailable too. Uh oh. I happened to be talking to my parents via Skype at the time (it was Mother’s Day) while juggling Jerry on my knee.
“I’ve been hacked” was my first instinct. I started to look at other sites
that I’m hosting and found other strange Apache error messages. Oh no. I
continued to talk to my parents, figuring that I couldn’t do much about it right then
- but I let them know what was going on. My dad is very committed to his online journal and I didn’t want to disappoint his adoring fans for long.
I continued to poke around and found some log entries (sent via remote syslog)
that indicated a hardware failure of some sort. A double device failure? Oh no
- data loss! I grew more uncomfortable and decided I needed to tend to this now. Eriko wanted to go downtown for some shopping and it took us a few minutes to get ourselves and Jerry ready.
At the data center, I was confronted with a RAID controller which had gone out
to lunch. I power-cycled the machine and it believed all the disks were still
part of the RAID5 group - but once it started quotascan, the controller locked
up again. My experiences at Isilon have given me quite a bit of insight into how
drives and controllers fail, so I decided to see if I could identify which
drive (if it was a drive) was causing the problem. I rebooted again and
sat and watched the drive lights. Blink, blink, blink, stop. Drive 5. Stuck.
I pulled out drive 5 and rebooted. Success! I immediately rsynced a backup copy
of the data. The sites I’m hosting are mostly read-only, aside from uploaded
pictures, blog entries, and the websites’ own log files, but I
hadn’t yet set up an automated backup mechanism on this new server.
I left the data center with the server in a WOR (window of risk) and headed
home, glad that I had at least restored service for the time being.
On Monday, I returned to the data center with another drive. To my dismay, the
RAID controller believed the new drive was 250.9 GB instead of 251 GB. Argh!
Reallocations! My only option was to switch to the RAID1 set (which is the boot
disk) so that the data wouldn’t be at risk. I’ll rsync to the RAID5 set as well as a
remote location, which will provide some buffer until I can obtain another
disk.
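
For what it’s worth, the automated backup I have in mind is nothing fancy: a nightly rsync of the site data to the RAID5 set and to an offsite machine. Here’s a rough sketch of the kind of script I mean - the paths and hostname are made up for illustration, not my real layout:

```python
#!/usr/bin/env python3
# Rough sketch of the nightly backup job I should already have in place.
# Paths and the remote host are hypothetical placeholders.
import subprocess
import sys

SOURCES = ["/var/www/", "/home/nick/photos/"]    # hypothetical site data
LOCAL_MIRROR = "/raid5/backup/"                  # the (repaired) RAID5 set
REMOTE_MIRROR = "backup@offsite.example.com:/backups/nick.org/"

def rsync(src, dest):
    # -a preserves permissions and timestamps; --delete keeps the mirror exact
    result = subprocess.run(["rsync", "-a", "--delete", src, dest])
    if result.returncode != 0:
        print(f"rsync {src} -> {dest} failed ({result.returncode})",
              file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    status = 0
    for src in SOURCES:
        for dest in (LOCAL_MIRROR, REMOTE_MIRROR):
            status |= rsync(src, dest)
    sys.exit(status)
```

Dropped into cron, something like that would at least shrink the window of risk to a day instead of “whenever I get around to it.”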
Here’s a shameless plug for Isilon - but boy would I have loved to have an
Isilon cluster. I could have lost a head without making the data unavailable, and
I would be able to reprotect the data (at the expense of free space, of which I
have plenty) without having to immediately replace a drive. This suits me,
the busy (part-time) storage administrator, to a T. The experience left me even
more convinced that our product is light-years ahead of anything else.