So our intern deleted a production server by accident

Anyone heard of SSI or .shtml files? I had a whole site that was basically:

content.html

content.shtml (with header and footer)

inside content.shtml:

<!--#include file="content.shtml" -->

what it should have read:

<!--#include file="content.html" -->

what happened? infinite recursion, and a server hang/crash.
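You can catch that kind of include loop before the server does. A hypothetical sketch of a pre-deploy check (the filenames and the simple regex are assumptions, not the actual site):

```python
import re

# Matches SSI directives like <!--#include file="content.html" -->
INCLUDE_RE = re.compile(r'<!--#include\s+(?:file|virtual)="([^"]+)"\s*-->')

def find_include_cycle(pages):
    """pages: dict of filename -> file contents.
    Returns the first include cycle found as a list of filenames, or None."""
    graph = {name: INCLUDE_RE.findall(text) for name, text in pages.items()}

    def visit(node, path):
        if node in path:
            return path[path.index(node):] + [node]  # the cycle itself
        for child in graph.get(node, []):
            cycle = visit(child, path + [node])
            if cycle:
                return cycle
        return None

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None
```

A file that includes itself, like the content.shtml above, comes back as a one-hop cycle immediately.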

I once had to work all night cleaning up a mess a new employee had created.

They had been given a production database account, and had been told to change their password.

The “DBA” instructed them to update their password by running a direct UPDATE against the users table.

User neglected to specify a “WHERE user = me” condition in the query and set everyone’s password to theirs.

User was very embarrassed, and kept apologizing to me. (I was fixing it because the DBAs had no idea how to).

And I kept telling her:

“This isn’t your fault.
The DBAs gave you far too many permissions.
This was going to happen eventually.
Our process is at fault, not you.”
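The table and column names below are invented, but the failure mode is exactly this one: an UPDATE with no WHERE clause touches every row. A minimal sqlite3 illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password_hash TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "h1"), ("bob", "h2"), ("newhire", "h3")])

# What the new hire effectively ran: no WHERE clause, so every
# row in the table gets the new password.
conn.execute("UPDATE users SET password_hash = 'h4'")
print(conn.execute(
    "SELECT COUNT(*) FROM users WHERE password_hash = 'h4'").fetchone()[0])  # 3

# What they meant to run: scoped to their own account.
conn.execute("UPDATE users SET password_hash = 'h5' WHERE username = 'newhire'")
print(conn.execute(
    "SELECT COUNT(*) FROM users WHERE password_hash = 'h5'").fetchone()[0])  # 1
```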

Along with “Am I Getting Fucked Friday” can we also do,
“Fuckups Friday?” Mods?

When I was an intern I somehow fucked up Grub on my machine beyond repair, in a way that stumped even the local Wizened Beards, but nothing more than that…

My first production mess up was with Cisco Unified Call Manager 6.x.

I was pretty green and had never touched CUCM until a couple of weeks earlier. I just got thrown at it and shown the fundamental basics.

I had multiple phones on my desk, and I was working out how to have different phones act as different extensions. Both phones, at this point, had 2 extensions:

  1. My extension
  2. Helpdesk

I decided I’d delete the helpdesk line off one of the two phones.

Very quickly I realized I had deleted the helpdesk phone number for the company (~3500 users, ~40 people on helpdesk), not simply off my phone.

An automated script then ran that (luckily) only archived the helpdesk email address.

My boss had it fixed in about 15 minutes, and frankly just laughed the whole time. I was white as a ghost. They pointed out that “you now know exactly what not to do and how to avoid doing it” and “why would you be in trouble for a first time offense?”

I never did that again.

My first employer was largely a Mac OS 9/early OS X environment, and we had several Xserves in our rack. The Xserves had the “push in” drives that could simply be popped in and out of the chassis like a button. Of course, there was a locking screw to keep them from doing this by accident. And, of course, our locking screw wasn’t engaged.

I reached up to get something from the top of the rack and brushed against a drive of the main file server, popping it out of the chassis and crashing the entire server. Thankfully, it all came back online, but that was my first true screw up.

Couple other larger mistakes:

  • Deleted a good chunk of a file server at another job by using Robocopy’s /MIR switch to mirror an empty directory to the root of the file server share.
  • At that job I also removed the wrong drive from our Exchange server (was replacing a failed drive), and took down Exchange for a short period of time while the server came back up. Thank God the array wasn’t hosed.
  • At another job I was somewhat lazy when troubleshooting a broken pair of Citrix NetScalers in HA mode and caused a failover from the active, working device to the failed device, knocking out all of our external-facing web services for a good 30 minutes. I had just factory reset the second device’s network configuration, so when the failover happened, HA synchronized the blank settings to the previously working node and I had to rebuild the networking settings from scratch. I should have pinned the active node but didn’t.
  • At current job was cleaning up our backup server and accidentally deleted the NetBackup database my predecessor had moved from the default location. Thankfully, I now know how to restore the NetBackup catalog for a potential DR situation…
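On the Robocopy one: /MIR makes the destination identical to the source, deletions included, which is why mirroring an empty directory is so destructive (Robocopy itself has an /L flag for a list-only dry run). A rough Python sketch of the “list what a mirror would delete before running it” idea, with made-up paths:

```python
import os

def mirror_would_delete(src, dst):
    """Return paths under dst that mirroring src over dst would remove,
    i.e. everything in dst that has no counterpart in src."""
    def relpaths(root):
        found = set()
        for base, dirs, files in os.walk(root):
            for name in dirs + files:
                found.add(os.path.relpath(os.path.join(base, name), root))
        return found
    return sorted(relpaths(dst) - relpaths(src))
```

If the source is empty, this returns the entire destination tree, which is your cue to stop and re-read the command line.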

Everyone makes mistakes; you learn from them. In this case I’m glad to see you have your interns actually doing real work. However, I’d like to know how he mistook Server 2012 for Server 2003…

New IT apprentice and I were installing rails for a new server. The rack in question is accessible on all sides but it’s a tight fit in a small room.

As the apprentice moved to the back of the room, he kicked a switch on an electrical outlet. This cut power to all our network switches, killing Internet access, the internal network, WiFi and VOIP phones. (FYI, off switches on electrical outlets are a UK thing.)

Before anyone says “Duh! You should have those network switches on a UPS.” Well, they are. The UPS was too massive to install in the server room and sits in the electrical room instead. The outlet that was turned off is a UPS protected outlet.

Never occurred to us that the outlet switch could be a problem. Live and learn. We’re now ordering child-proof covers for these outlets to prevent a repeat.

TLDR; IT apprentice turned the network off. We’re now buying baby proofing gear to protect equipment from the apprentice.

I was going through a list unsubscribing from unused services… One of them was a Google Apps domain from an old business project. We had made the spreadsheet on a TV in a meeting room; viewing it on my laptop, the ‘notes’ column was not visible:

‘For the love of god back up all documents before deleting this account’

Luckily a former employee was able to direct us to a backup, it was a tense week.

This was 11 years ago. I was working IT for a small manufacturing company, and the owner was a huge Apple nut. Late 2006, and he’d just finished paying a company to put in brand new Mac Pro towers for the office workers, and a Mac XServe unit as a server with 3x 500GB drives in RAID 5. I was brought on shortly after all of this was installed.

One day, the XServe was randomly crashing and throwing errors. I was on the phone with the company that had installed it. They checked the logs and saw one of the three drives was in an error state. They recommended, to start, shutting down the server, pulling the sleds and reseating the drives, and we’d go from there.

I sent a shutdown command to the XServe, and waited patiently until the monitor blanked out and the fans spun down. I then started pulling the sleds. The server immediately started beeping, and the fans spun back up.

I didn’t wait long enough. It had not shut down yet.

The drive I pulled first was not the one throwing errors.

The install company had not set up backups yet.

I now had three drives from a RAID 5, one with errors, and the other two that I had interrupted.

Put the sled back in once the server had finished shutting down, and it refused to boot.

The rebuild took around 3 days IIRC, and there were tons of file sync errors to correct when it was finally back up.

Always verify the box is actually powered off, and didn’t just stop displaying video before you go messing around with its insides. They don’t like it when you do that.

I really like that this has a happy ending. People make mistakes. That doesn’t make you dumb; it just confirms you’re human.

One of my first huge f*ck ups was when I was super-green and learning IT. I was tasked with desktop support and backup tape swaps, etc…So, our server room had 4 full 72U racks on one side, and a 12-foot long cable plant on the other side (voice and data), which over the years had turned into a spiderweb-rat-nest-abomination. Cables weren’t labeled or run through the cable management. It was a mess. So little old me gets the job of recabling the whole plant.

So what does my ambitious-go-getter-be-the-ball self do? Exactly what you think. I came in on a Saturday, unplugged everything from one end (from the $80,000 Cisco Catalyst), and started plugging it all back in, nice and organized, replacing cables that were too long or too short. Oh man, the cable management looked amazing, and I was super proud. It took me most of the day to reroute all the cables, and I went home very happy and proud of the work I’d done.

Well, Monday morning, I enter my office and all hell has broken loose. My co-worker (my senior admin) has been there since about 6:00am, working with the IT team back in New York, because I had re-cabled the servers and WAN connections into the wrong VLANs (which I didn’t know were a thing back then). Of course, I had shut the whole office down: an insurance sales call center with over 100 people working in it, down for most of the morning, which as you can imagine on a Monday is pretty bad. I didn’t get fired, thanks to my senior admin, but I did get a stern talking-to from our manager.

I updated the server that hosted our OTRS ticket system.
I thought, “Oh well, what could go wrong with apt-get update and apt-get upgrade?”
I accidentally ran apt-get dist-upgrade on a machine that hadn’t been updated since the beginning of time. (Unlike plain upgrade, dist-upgrade will install or remove packages to resolve changed dependencies.)

Afterwards, I just had to edit some configs and it worked again. I was very lucky with that one.

My biggest ‘on the radar’ screw up was during an XP to 7 migration. Testing the Windows XP to 7 backup/re-image process, and had user state turned off, as I hadn’t gotten to that part yet.

Fat-fingered a system name that I was testing; voila, wiped out an Engineer’s machine, and didn’t back up the data, because why would I?

Fun times had by all.

I modified a script that loads data from remote databases, and didn’t realize the script was called by another script that uses a batch loading process. Since I didn’t provide parameters for the batch load, it loaded indefinitely, swelling the database by 2 terabytes over a weekend. Luckily it was on a test server.

edit: This was this year and I’ve been a professional for 7 years lol.

I had a closet with basically no ventilation; it did have a robust A/C unit, so temps weren’t an issue. I went on vacation; a couple of days in, I get a call asking why my computer closet was “beeping like crazy”, and also “is it always so warm in here?” Someone had thoughtfully shut down the A/C unit to help save electricity…

I may have run an UPDATE statement intended to fix a single entry’s typo on an SQL table holding configuration data for a central software system

…without a WHERE clause.

Why do you always realise these things at the very instant that it is too late to stop yourself from pressing the Enter key?
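One habit that blunts this particular footgun: run the UPDATE inside a transaction, check how many rows it touched, and only commit if the count matches what you expected. A sqlite3 sketch with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT, value TEXT)")
conn.executemany("INSERT INTO config VALUES (?, ?)",
                 [("host", "prod-01"), ("port", "8080"), ("mode", "liev")])
conn.commit()

# The typo fix we actually intend: exactly one row should change.
cur = conn.execute("UPDATE config SET value = 'live' WHERE value = 'liev'")
if cur.rowcount == 1:   # touched exactly the row we meant to
    conn.commit()
else:                   # anything else means the WHERE clause was wrong
    conn.rollback()
```

Had the WHERE clause been missing, rowcount would have been 3 and the rollback would have saved you.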

I always tell my guys it’s not the screw up but the cover up that will get you fired. If you own up to your mistakes immediately you’ll usually be ok.

Had a friend imaging a classroom with PXE and Ghost who imaged the wrong subnet. Oops. I always wondered if anyone ever PXE-booted DBAN to the wrong subnet. I can picture someone just walking themselves out and heading to the bar.

There was that time I was switching between our RDS servers, checking both for updates, though I was only going to install the updates on our backup server during business hours.

I go to install the updates and restart, and it gives me the “Other users are logged in” prompt. I think, “That’s odd… oh, it must be my user account from earlier,” and proceed to shut down the server. Cue about 10 different calls from people telling me they got kicked off the server. I had shut down the production server, but thankfully it came back up very quickly and most people chalked it up to a networking issue.

Lesson learned, now I always check the server name and IP before I proceed to shut it down.
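That “check the name before you pull the trigger” lesson can even be automated: a tiny guard that refuses to proceed unless you type the target machine’s actual hostname back. Purely a sketch; the prompt and the shutdown command are invented examples:

```python
import socket

def confirm_shutdown(typed_name, hostname=None):
    """Return True only if the operator typed this machine's hostname.
    hostname defaults to the real one; it's a parameter here for testing."""
    actual = hostname if hostname is not None else socket.gethostname()
    return typed_name.strip().lower() == actual.strip().lower()

# Example wiring (interactive, so left commented out):
# if confirm_shutdown(input("Type the hostname to confirm shutdown: ")):
#     os.system("shutdown /s /t 0")  # Windows shutdown; adjust per OS
```

Typing the production server’s name when you meant the backup box at least forces the mistake out into the open.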

We have a bunch of AIX servers for various facilities. One site has a secondary system that interfaces with it that seems to send some sort of AIX mail to it every time someone logs in.

We developed a script that takes all the user names and puts them in a text file. From there, we vi the text file to keep the next process from touching any system-critical accounts (just an extra layer of safety; it shouldn’t be a problem, but management wanted it that way).

After that, we run a command that is essentially “for every account in this list, blow away their email messages on this server.”

Well, one night I wasn’t in my right frame of mind and ran the commands in the wrong directory: the application’s prod directory, full of code snippets and the like. When I ran the second half of the “blow away email” script, I noticed that what was being cleared was not user account messages! I ctrl-c’d the shit out of it, but not before it got through about a quarter of the directory. Luckily, we have backup code on our dev systems, so I ftp’d the files over without anyone noticing. But boy did I have a puckered-up moment there.
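A cheap guard against the wrong-directory version of this mistake: have the destructive script verify it is standing where it thinks it is before touching anything. The expected path here is an invented example, not the actual system:

```python
import os
import sys

EXPECTED_DIR = "/var/spool/userjunk"   # invented; wherever the script must run

def assert_safe_cwd(expected=EXPECTED_DIR, cwd=None):
    """Abort unless the current directory is the one the script expects.
    cwd is a parameter only so the check is testable."""
    actual = cwd if cwd is not None else os.getcwd()
    if os.path.realpath(actual) != os.path.realpath(expected):
        sys.exit(f"refusing to run: in {actual}, expected {expected}")
```

Call it as the first line of the cleanup script, and running it from the application’s prod directory dies immediately instead of deleting code.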

Not a major fuckup, but nevertheless a fun story…

I was waiting for my friends to pick me up for a night out while hacking away on my computer. When the doorbell rang, I quickly issued an “init 0” and joined them outside. Quicker than my monitoring could send an alert, I already had people calling me on my cellphone… turns out, I was on the mail server’s console and shut down that machine; my workstation was still up and running.

Had to get the guys at the datacentre to power up the server again - no harm done after all.