He was given a list of Server 2003 servers to delete from vSphere, and one of the names in the list was incorrect. He logged in to a 2012 server of the same name (didn’t realise it was 2012, though) and ran the decommissioning script, then deleted the VM.
That was our file server for a whole site.
It’s all good, we have backups and it’s being restored, but he’s feeling a bit rosy-cheeked!
We’re sharing our “first f*ckup” stories here in the office. Why not share yours?
So there I was, a young junior who had been given the keys to SCCM to learn a bit about how it works.
I’d already customised the build for my work laptop so it could dual boot. The idea being one half was for personal use (games etc, tsk tsk I know) and the other half would be for work and be ‘clean’.
I set up a task sequence, tweaked it a few times, named it “Tekwulf SCCM Build (do not use)”, set it to mandatory assignment and went to test it with my laptop.
See, in SCCM you use mandatory assignments for apps. This way you can drop people in an AD group and they get the install. Task sequences, not so much. Making a task sequence mandatory pushes it out to everyone in the collection, in this case “all workstations”.
So there we have it: all 4000 machines in our estate starting a rebuild, with my name popping up on the screen before it started.
The saving grace was that my own lack of talent meant I’d configured the task sequence so poorly that it didn’t even run step 1 (format disk). I pulled the deployment and no harm was done, but if it hadn’t been for my own incompetence saving me from my own incompetence, I’d have just wiped an entire company.
I’ll play along. My biggest screw up: I was decommissioning an Exchange 2010 server once. At the time I had experience with 2003, but limited experience with 2010. It complained that it couldn’t remove Exchange because there were mailboxes remaining, so obviously I deleted all of the mailboxes. Didn’t know at the time that this also deleted every single user in AD. That was fun.
Production military logistics database in the middle of the Iraq war. A system that literally millions of soldiers rely on (even if they don’t know it) to get their beans and bullets. We had just upgraded, and I wanted to show off this new feature in Solaris 10 that prevented you from accidentally deleting root.
I wrote a script to disable AD accounts for leaving employees based on data from our HR system. Ran the script, and it incorrectly disabled 800 employee accounts in the middle of the day.
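The lesson is to put a sanity check in front of anything that bulk-disables accounts. Below is a minimal sketch of that idea in Python; the terminations.csv export, the column name and the disable_account() helper are all hypothetical, not the actual setup.

    import csv
    import sys

    # Hypothetical HR export and limits -- adjust to your environment.
    HR_EXPORT = "terminations.csv"   # assumed CSV with a 'sAMAccountName' column
    MAX_DISABLES_PER_RUN = 25        # more than this almost certainly means bad HR data
    DRY_RUN = True                   # flip to False only after reviewing the output

    def disable_account(sam_account_name: str) -> None:
        """Placeholder: wire this to your directory (ldap3, your own tooling, etc.)."""
        print(f"would disable {sam_account_name}")

    leavers = []
    with open(HR_EXPORT, newline="") as f:
        for row in csv.DictReader(f):
            sam = (row.get("sAMAccountName") or "").strip()
            if sam:
                leavers.append(sam)

    # The guardrail: refuse to run at all if the HR feed looks wrong.
    if len(leavers) > MAX_DISABLES_PER_RUN:
        sys.exit(f"Refusing to disable {len(leavers)} accounts in one run "
                 f"(limit {MAX_DISABLES_PER_RUN}) -- check the HR export first.")

    for sam in leavers:
        if DRY_RUN:
            print(f"[dry run] {sam}")
        else:
            disable_account(sam)

The threshold and the dry-run flag are the whole point: a broken HR export then fails loudly instead of quietly disabling half the company at lunchtime.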
Oh man, take him out for a beer afterwards and let him know we’ve all done something silly. At least he has a story to tell now :). Welcome to the club!
One of our “Senior” sysadmins deleted the Computers OU on his 3rd day of having credentials. It took us 3 days to recover, and 300+ machines had to be touched for repairs.
Changed something on the firewall, ran a restart on haproxy, and it didn’t come back up. After that, the seniors decided we weren’t allowed on the production LBs and every commit there had to be confirmed by one of them. Some time after that we automated checking config validity before restarting.
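For anyone wondering, that validity check can be as simple as running haproxy’s own parser against the config before touching the service. A rough sketch of the idea, assuming systemd and the stock config path rather than the actual setup:

    import subprocess
    import sys

    HAPROXY_CFG = "/etc/haproxy/haproxy.cfg"  # assumed default path

    # `haproxy -c -f <file>` only parses the config and reports problems;
    # it exits non-zero if the config is invalid.
    check = subprocess.run(
        ["haproxy", "-c", "-f", HAPROXY_CFG],
        capture_output=True, text=True,
    )
    if check.returncode != 0:
        print(check.stderr, file=sys.stderr)
        sys.exit("Config failed validation -- not restarting haproxy.")

    # Only touch the service once the config parses cleanly.
    # (A reload is usually gentler than a full restart if your setup allows it.)
    subprocess.run(["systemctl", "restart", "haproxy"], check=True)

If the parse fails, the service is never restarted and the old, working config stays loaded.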
New admin (not me, although I also didn’t know that feature existed) accidentally turned on the “send key to all machines” feature. Then Ctrl-Alt-Deleted.
Logged in to a server before my first coffee. Twice in my career. Both times it ended up breaking something.
Early in my career I was tasked with rolling out a new endpoint security agent to all of our workstations. While still setting up policies for testing, I thought I might as well push the agent out to my test group.
It’s kind of like hitting the “Reply All” button instead of “Reply”. I accidentally pushed the agent out to EVERYONE. And since I hadn’t even finished configuring the policies, almost everything was blacklisted by default, so all of our workstations simultaneously lost access to all of their peripherals. Including mice and keyboards.
Easy fix, right? Just uninstall the agent. After all, there’s an “Undeploy” button right next to the “Deploy to every motherfucker in the company” button. Except when I clicked it, nothing happened. In fact, nothing happened when I did anything. Because, of course, my own keyboard and mouse had been blocked as well.
Took the better part of an hour to undo everything, which was made infinitely easier by the fact that the phone would not stop ringing and/or cursing me in increasingly creative ways.
So on that day I learned two things: 1) Have a true test environment that’s completely segregated from your production environment. 2) Always read the documentation. Don’t just “throw it on a server and play around with it”, which is what I was doing.
Worked at a hosting provider, hundreds of small websites on a shared platform with a managed load balancer in front. Someone much more senior than I had built the setup and customized the provisioning scripts, complete with step by step documentation, the whole 9 yards. It worked flawlessly and was exactly what a junior such as myself needed.
We also hosted other customers on their own private stacks. One had a setup very similar to our own shared LB, with some differences that weren’t as well documented. I get a ticket to add a domain to their private stack and follow the same tried and true instructions. All hell breaks loose: every website on the stack becomes inaccessible as the LB cluster fails to come back up. Can’t revert changes from SVN for some reason. It’s 11PM and I’m up shit creek.
Luckily, one of our seniors (who happened to have built that very stack) is still around working on a different issue. Shoulder tap, escalate, fixed within 15 minutes. I then call the customer and apologize profusely.
Learned a lot that night about the pros and cons of documentation.
I’ve told this one in a few different threads now, but it’s still my best (worst??) fuckup.
I once turned off all power to our server racks on the busiest day of the month while trying to fix a b0rked APC UPS network card that was spamming us with self-test alerts. The web UI had locked up, so clever me had spotted that there was a serial port on the back with a command to safely restart the management interface without affecting power.
Helpfully, with APC this serial port is non-standard (this is barely mentioned in the manual: just one line of warning, with no info on how bad things get when you ignore it), so when I plugged in our serial-to-USB adapter I got to experience that always fun sound of silence in the server room as 1.5 racks of servers all went quiet and everything turned off.
Amazingly nothing ended up breaking but it was a butt-clenching 20 mins or so as we powered everything back up and checked to make sure it was ok.
Deleted a mislabeled SAN volume that happened to be our production file system. Oh, and by the way, we didn’t have good backups of the volume. I learned two things that day. First: always, always, always disable a volume first, never delete! Second: if that happens, don’t touch anything and call support. They were able to bring back the volume because, after I deleted it, I didn’t touch anything else or try to fix it. It also taught upper management a proper lesson about not being cheap when it comes to backups!
About 8 years ago I pulled a drive in a functioning RAID 5 array, just because. The server was running an overloaded SBS box with Exchange and SQL. The rebuild took almost 10 hours because the drives were basically at max utilization before the automatic rebuild started, and the SQL server running the electronic medical records system slowed to a crawl. I quietly walked away, hoping the rebuild didn’t kill another drive, knowing there was nothing I could do now but pray.
The rebuild started immediately when the drive was removed because there was a hot spare.
That was my first day. No one knows what happened to this day.
Just a few weeks ago I was updating remote switches in an MPLS migration for a client and uploaded the wrong startup-config to a switch. Wrong site, so the VLAN interfaces were on the wrong subnet behind the provider’s router and now completely unreachable. Props to the AT&T engineer who reconfigured the router for the site I’d just duplicated and the site I’d just orphaned so I could get to the switch and reconfigure it. The site was about 800 miles away. Luckily, we’re still in production on the old MPLS, so no outage. Remote switch and router updates make me sweat.
You never forget your first/biggest fuckup. Mine was about eighteen months into my IT career, about six months after I’d started going out as a field engineer and doing SBS installs.
Customer’s server hard drives had failed. It had been rebuilt from scratch in the office and they needed it back; they were pushing for it and none of the more senior engineers were available, so off I went to set it back up. Blitzed through it, all going well, nearly done, until someone asked me where their documents had gone. And then someone else asked. And then a third…
Unbeknownst to me (and my own stupidity for not checking), they had folder redirection and Offline Files enabled, which we would normally disable for PCs on a local network. When I put the PCs onto the new domain, Offline Files synced from the server to the PCs and not the other way around.
Lost a LOT of sleep that weekend and the next few days. Got very VERY lucky in the end, someone back at HQ had some old data recovery software that got their data back from the day the server had gone down.
Around Linux 2.2 times, I was working for a company that provided telephony services for, well, agricultural communities. Those are renowned for their… conservative approach to expenditure, which is why they ran Z80-based, wall-sized, tape-booted PBXen that the vendor hadn’t supported for 20+ years, thus feeding my employer. One of the newer services we provided to these communities was internet to every community member over the existing infrastructure, by means of cheap Chinese DSL modems and a custom-made DSLAM. (Edit: have you ever seen a DIY “layer 2.5 PoE switch”? Fun fact: they float in a bucket full of water, wrapped in like ten layers of plastic bags.)

The DSLAM was a whitebox PC with a bunch of homemade PCI boards, stacked with a new line of experimental ATM SAR/DMT chips that the company owner had bought up in bulk (naturally, shortly after that the chip vendor decided to retire the line and offered no support whatsoever). These were very early versions, prototypes essentially, and they had all sorts of hardware bugs exacerbated by a very homemade Linux driver that was used by nobody but us; the guy who wrote it had disappeared to another hemisphere.

One of those bugs was a phenomenon we called “tornado”. Basically, over a certain load, the DSLAM chips would go mad and start spewing out crazy nonsense to the kernel driver, which would crash the DSLAM box, meaning instant “no internets” for 50+ users and requiring a physical reboot. This would happen once a week or so on a box in a park of probably 50 geographically distinct sites. Usually the PBX technician on the community side would reboot the box.
To combat this phenomenon the company came up with an original solution: a hardware watchdog hooked up to a serial port and the reset switch of the DSLAM motherboard. A cron script would poke the serial port once a minute, which reset the watchdog. If there was no poke for three minutes or so, the watchdog would trigger a hardware reset of the box.
As the company was eager to deploy some QoS-related feature, I was instructed to perform kernel upgrades on all the DSLAM boxes remotely. I dutifully did so. Little did I know that the packaged version I was deploying had, in its default configuration, kind of lost the serial port driver.
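For context, the cron-side “poke” was nothing exotic; something along the lines of the sketch below, run once a minute. The device path, baud rate and poke byte are made up (the real board was homemade), but it shows why a missing serial driver is fatal: no open port, no poke, and the watchdog happily resets the box.

    #!/usr/bin/env python3
    # Cron job: kick the hardware watchdog by writing a byte to the serial port.
    # Device path, baud rate and poke byte are illustrative, not the real hardware.
    import sys
    import serial  # pyserial

    WATCHDOG_PORT = "/dev/ttyS0"
    POKE = b"\x55"

    try:
        with serial.Serial(WATCHDOG_PORT, 9600, timeout=1) as port:
            port.write(POKE)
    except serial.SerialException as exc:
        # If the kernel no longer exposes the serial port (say, after an upgrade),
        # this fails, no poke goes out, and the watchdog resets the box.
        print(f"watchdog poke failed: {exc}", file=sys.stderr)
        sys.exit(1)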
Our SCCM admin gave access to SCCM to one of our interns so he could learn how it works. He quickly showed him how to push an image to a machine and suggested the intern try it himself before going to lunch. The intern managed to deploy a base Windows 7 image to a collection titled “All Systems”. Within minutes we started getting calls about unresponsive servers as well as users’ computers restarting on them. We were able to cancel the task, but not before losing five production servers and three desktops. We lost another two servers a month later during a patching cycle; apparently they had cached the task and were waiting for a restart to apply it. It certainly wasn’t the intern’s fault. I don’t think he understood how powerful SCCM could be. The admin working with him got extremely defensive for a while and would make a scene anytime anyone mentioned the incident. He left shortly afterwards and I became the new SCCM admin.