Avoiding Code Catastrophes

Jake Worth

"Am I going to make a huge mistake in my first dev job (e.g., delete the production database) and get fired?"

I've been asked this question a few times. It's a concern that new programmers have: that they'll end up in a situation where they make the wrong call in the heat of the moment, harming their company and ending their fledgling career. Cutting the red wire fixes the production bug; cutting the green wire sets our servers on fire, or vice versa. Which one will you cut, junior developer?

I think this is a valid fear when you're new, and it's something that once crossed my mind, as a self-taught beginner. Is it possible for me to make such a gigantic mistake in my first programming job? I think the answer is no.

When you start working as a developer, your level of power will vary with the team. At a startup, it may be high. On a large team, it will be lower. On a team of any size, there should be processes in place that prevent mistakes: not just mistakes made by junior people, but by people at all levels. When anybody on the team has the power to make a giant mistake, that's a process failure.

As your career progresses, you'll begin to perceive invisible boundaries between safe and dangerous technical choices. Running an SQL query in a console on the production database is dangerous. You might make a mistake and delete a record, or all the records. Writing a database migration in development is safe. That's the type of work we all want to be doing. New programmers should be put in positions where failing to see those boundaries isn't detrimental to themselves or the team. Effective leadership does that.

Here's a practical example: the feared 'bug in production'. Imagine a bug that is affecting user data. A nullable column is leading to bad data that breaks the user experience. What should we do?

A tempting solution might be to connect to the database console and fix the records. The problem goes away in seconds, so it's the fastest choice. Doesn't that also make it the best choice?

The answer is no. Connecting to the production database console is risky; you're as likely to do harm as good, regardless of your level of experience. More importantly, you won't learn anything. You won't learn what caused the bug, and you won't be able to prevent it from happening again. And that information is much more valuable than any quick fix. We want to put ourselves in a position where we can learn, and fix the problem instead of the symptoms.

Here's a better way to approach the issue. We have a bug in production. Step one: don't panic! Unless your software is powering a shuttle to Mars, the significance of the bug is probably lower than you think. Is it a bug that affects customers, or just internal users? If it does affect customers, how many? Is the bug in a feature that is used by many people, or just a handful? What does our error monitoring software say? As you start to ask questions like these you realize that few bugs are a five-alarm catastrophe.

No matter how bad the situation seems, you almost always have time: time to fix the problem safely, and time to ensure it doesn't happen again.

Here's a list of steps I might take to fix this contrived emergency:

  1. Get a copy of the production database and sanitize it with the sanitization script you already have
  2. Load the production data into a development database
  3. Try to reproduce the bug! Use every tool you have to recreate the issue. The bug was created with the data and code in front of you; think!
  4. Fix the bug. If this requires you to change data or the database, could you write a script? Test it in development. Does it work, and also clean up after itself if something goes wrong?
  5. Write a test that describes the correct behavior. Verify your bugfix by running the test before and after your changes; it should fail before and pass after
  6. Test out your script in multiple environments. One of these environments should be staging where the data is the same or almost the same as production
  7. Fix in production
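
To make steps 4 and 5 a little more concrete, here is a minimal sketch of what such a script might look like. Everything in it is an assumption for illustration: the `users` table, the `display_name` column, the `'Unknown'` placeholder, and the function names are hypothetical, and an in-memory SQLite database stands in for the sanitized development copy. The idea is that the fix runs inside a single transaction, defaults to a dry run, and rolls back if anything looks wrong.

```python
import sqlite3


def make_dev_database():
    """Build a throwaway database that mimics the bad production data (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT)")
    conn.executemany(
        "INSERT INTO users (display_name) VALUES (?)",
        [("Ada",), (None,), (None,)],  # two rows carry the NULL that breaks the UI
    )
    conn.commit()
    return conn


def count_nulls(conn):
    """Count the rows that still have the bad (NULL) value."""
    row = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
    ).fetchone()
    return row[0]


def backfill_display_names(conn, dry_run=True):
    """Replace NULL display names with a placeholder inside one transaction.

    Returns the number of rows that were (or, in a dry run, would be) changed.
    Any failure rolls the transaction back, so the script cleans up after itself.
    """
    cur = conn.cursor()
    try:
        # The sqlite3 module opens a transaction implicitly before this UPDATE.
        cur.execute(
            "UPDATE users SET display_name = 'Unknown' WHERE display_name IS NULL"
        )
        changed = cur.rowcount
        # Sanity check before committing: no bad rows should remain.
        if count_nulls(conn) != 0:
            raise RuntimeError("backfill left NULL rows behind")
        if dry_run:
            conn.rollback()  # report the effect without keeping it
        else:
            conn.commit()
        return changed
    except Exception:
        conn.rollback()
        raise


if __name__ == "__main__":
    conn = make_dev_database()
    print("dry run would change:", backfill_display_names(conn, dry_run=True))
    print("NULL rows after dry run:", count_nulls(conn))
    print("real run changed:", backfill_display_names(conn, dry_run=False))
    print("NULL rows after real run:", count_nulls(conn))
```

Defaulting to `dry_run=True` is a deliberate choice in this sketch: you have to opt in to changing data, which is the posture you want when the same script eventually runs against staging and production. The script only repairs existing rows; preventing new bad rows (for example, with a migration that makes the column non-nullable) is the part that fixes the problem instead of the symptom.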

Does this take longer than the shortcut I proposed? Absolutely. But once you've done it a few times, that delay becomes insignificant. By taking these few extra steps, you have fixed the issue in a low-stakes environment, and you know why it happened, so your solution prevents it from recurring.

You might think that solutions like these are what they teach you in school, but not what people do in the real world. If you find yourself at a company that thinks this way, start looking for a new job. Get on a team that takes the time to do it right. There, you will learn these techniques and more as your career progresses. By doing your work in a safe and predictable environment, even when the stakes are high, you'll keep your users happy and your career secure.


Photo by darkroomsg on Unsplash