How to debug a production issue?

Amirhosein Zlf
Published in CodeX
4 min read · Aug 28, 2021

Every developer and manager who has worked in tech has faced an issue that happens only in the production environment. This type of issue is the hardest to debug because you can’t easily check the logs or run your regular tests.

Over the past couple of years, I was responsible for delivering different products to production and faced this type of issue multiple times. In this article, I’ll share my experience of how to fix them.

Find the scope

First, take a look at the bug report and try to make some guesses about the issue. Is it caused by the code, or could it be something related to the server? Could it be related to the data on the live system?

Try to find the lines of code that handle the affected requests, and write down the path those requests go through, step by step.

Make Sure

The next step is to make sure that the issue is happening only on the live system. A couple of times I got a report for a live issue that was also happening on the staging servers. In those cases, you can simply debug on the staging server and avoid taking any risks on the live system.

Check the diff

I always check the latest release, or find the release that caused the issue. Review the whole diff and take notes on suspicious lines of code. Sometimes you will find a rookie mistake that nobody paid attention to.

Staging vs. Production

Make sure you know the differences between the staging and production setups. In production, you might have scaled your project out, and you might have different configurations, multiple databases, different caching strategies, etc. Take note of the differences to get a better overview.
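One way to take that note is to compare the two configurations mechanically. The following is only a minimal sketch: the `config_diff` helper and the setting names are hypothetical and made up for illustration, not taken from any real project.

```ruby
# Hypothetical helper: list the settings whose values differ between
# two environments. A key missing on one side shows up as nil.
def config_diff(staging, production)
  keys = staging.keys | production.keys
  keys.each_with_object({}) do |key, diff|
    next if staging[key] == production[key]

    diff[key] = { staging: staging[key], production: production[key] }
  end
end

# Sample values are invented for illustration.
staging    = { cache_store: :memory_store, app_servers: 1, db_replicas: 0 }
production = { cache_store: :redis_cache_store, app_servers: 4, db_replicas: 2, cdn: true }

config_diff(staging, production).each do |key, values|
  puts "#{key}: staging=#{values[:staging].inspect} production=#{values[:production].inspect}"
end
```

Every line it prints is a place where the live system can behave differently from staging, so each one is a candidate to check.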

The life cycle

There is always a request, or a couple of requests, that triggers the bug. Write them down and check the life cycle of each one. Make sure the requests take the path you think they take; otherwise you could waste time debugging code that the request never reaches.
For instance, we had a release on the live system and some requests were failing because of a naming conflict. The developer had added a function with the same name as an existing one. We renamed the function and everything went back to normal.
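A conflict like this is easy to miss in Ruby, because defining a method a second time silently replaces the first definition. The sketch below reproduces the effect with a hypothetical `PriceHelper` module (not the code from that release) and then asks Ruby where the method that actually runs was defined, using `Method#source_location`.

```ruby
# Hypothetical module standing in for the original code.
module PriceHelper
  def self.format_price(amount)
    format("$%.2f", amount)
  end
end

# A later change reopens the module and reuses the name by accident.
# Ruby replaces the first definition without raising an error.
module PriceHelper
  def self.format_price(amount)
    amount.round.to_s
  end
end

# The second definition wins, so callers expecting "$9.99" get "10".
puts PriceHelper.format_price(9.99)

# Ask Ruby which definition is live: the file and line of the method that runs.
file, line = PriceHelper.method(:format_price).source_location
puts "format_price is defined at #{file}:#{line}"
```

Running Ruby with warnings enabled (`ruby -w`) should also report the redefinition, which can be a quick first check.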

Minimize the scope

Test and debug one thing at a time to narrow the scope. If you change multiple things together, it’s harder to find out which one caused the issue.

Caching

Sometimes the caching configuration on the live system causes the issue. Make sure the relevant caches have been updated.
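To make the failure mode concrete, here is a minimal sketch in which a tiny in-memory cache stands in for the real cache store. `TinyCache` and the key name are hypothetical; the only point is that a cached value outlives a data change until the entry is deleted or expires.

```ruby
# Hypothetical stand-in for a real cache store.
class TinyCache
  def initialize
    @store = {}
  end

  # Return the cached value, or compute and remember it on a miss.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end

  def delete(key)
    @store.delete(key)
  end
end

cache = TinyCache.new
categories = ["food", "travel"]

cache.fetch("categories_1") { categories.dup }          # warms the cache

categories << "sports"                                  # the data changes later

stale = cache.fetch("categories_1") { categories.dup }  # still the old list
cache.delete("categories_1")                            # invalidate the entry
fresh = cache.fetch("categories_1") { categories.dup }  # recomputed on the miss

puts "stale: #{stale.inspect}"
puts "fresh: #{fresh.inspect}"
```

If the bug disappears after the entry is deleted, the cache was serving old data and the invalidation logic is the place to look.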

Load Balancing

In a large-scale project where you load balance requests across multiple servers, you might have a broken node. Make sure all the nodes are updated and are receiving requests properly.
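A simple way to check this is to collect the release each node reports and compare it with the release you expect. The sketch below assumes you already have that information in a hash; the `outdated_nodes` helper, the node names, and the version strings are all hypothetical.

```ruby
# Hypothetical helper: given the release each node reports, return the
# nodes that are not running the expected release.
def outdated_nodes(reported_versions, expected)
  reported_versions.reject { |_node, version| version == expected }.keys
end

# Made-up data; in practice each value would come from the node itself,
# for example from a version endpoint or a deploy log.
reported = {
  "app-1" => "2021.08.28",
  "app-2" => "2021.08.28",
  "app-3" => "2021.08.20",
  "app-4" => nil # the node did not answer
}

outdated_nodes(reported, "2021.08.28").each do |node|
  puts "#{node} is not on the expected release"
end
```

A node that reports nothing is treated the same as a node on the wrong release, since both deserve a closer look.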

Everything is possible

The cause is not always a complicated one, and anything is possible. Just remember to create different test cases to cover every possibility. Go through your test cases one by one and try to narrow each one down to the smallest possible scope.

Meaningful log messages

When you add logs for debugging, remember to write a clear message in each one. Clear messages are easier to find, and you don’t need to check the code again to see where a log line came from.
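As a small illustration of the difference, the sketch below uses Ruby's standard `Logger` writing to an in-memory buffer. The `[categories#index]` prefix and the values are made up; any consistent, searchable prefix would serve the same purpose.

```ruby
require "logger"
require "stringio"

# Write to an in-memory buffer so the example is self-contained;
# a real app would use its configured logger.
output = StringIO.new
logger = Logger.new(output)
logger.formatter = proc { |severity, _time, _progname, msg| "#{severity} #{msg}\n" }

city_id = 42            # made-up value
categories = [1, 2, 3]  # made-up value

# Vague: hard to search for and says nothing about the state.
logger.info "here"

# Clear: names the place, the step, and the values involved.
logger.info "[categories#index] city_id=#{city_id} categories_before_filter=#{categories.count}"

puts output.string
```

With the second style you can search the production logs for the prefix and read the values directly, without opening the code to work out what "here" meant.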

This is an index action for categories. Some categories are not shown here and we want to find out why.

def index
  city_id = params[:city_id]
  city = Category.city(city_id)&.first
  # Reject the request when the city cannot be found.
  return head 400 unless city.present?

  cache_key = "categories_#{city_id}"
  expires_in = Rails.application.config.api_cache_lengths.soon

  # The block runs only on a cache miss; its last expression is what gets cached.
  cats = Rails.cache.fetch(cache_key, expires_in: expires_in) do
    global_cats = Category.get_global_categories
    cats = []
    if city.present?
      city_cats = Category.get_categories(city) || []
    end
    cats += city_cats
    cats += global_cats
  end
  cats = Category.remove_no_deals(cats)
  cats = Category.categories_sort_by_position(cats)
  categories_serializer = CategorySerializer.new(cats)
  render json: categories_serializer.serializable_hash
end

Now I’m going to add some logs to see how the code is working and understand which line of the code is creating the issue.

def index
  Rails.logger.info "Index action started .."
  city_id = params[:city_id]
  Rails.logger.info "City id: #{city_id}"
  city = Category.city(city_id)&.first
  return head 400 unless city.present?
  Rails.logger.info "The city was present"

  cache_key = "categories_#{city_id}"
  expires_in = Rails.application.config.api_cache_lengths.soon
  Rails.logger.info "Cache Key: #{cache_key}"

  cats = Rails.cache.fetch(cache_key, expires_in: expires_in) do
    global_cats = Category.get_global_categories
    Rails.logger.info "Global categories count: #{global_cats.count}"
    cats = []
    if city.present?
      city_cats = Category.get_categories(city) || []
    end
    cats += city_cats
    cats += global_cats
    Rails.logger.info "Final cached categories count: #{cats.count}"
    # Return the list as the block's last expression; otherwise the
    # logger call's return value would be cached instead of the categories.
    cats
  end
  cats = Category.remove_no_deals(cats)
  Rails.logger.info "Categories after removing no deals count: #{cats.count}"
  cats = Category.categories_sort_by_position(cats)
  Rails.logger.info "Categories after sorting: #{cats.count}"
  categories_serializer = CategorySerializer.new(cats)
  Rails.logger.info "Response should be: #{categories_serializer.serializable_hash}"

  render json: categories_serializer.serializable_hash
end

Now you can see where the categories are being removed from the list. Go and check that line of code to find out why.

You should know that there is no single correct answer for these types of issues. Always share your experience with others, and ask colleagues for help.
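When the processing is a chain of steps like this, the same check can be done in one pass: run the list through each step and report the first step after which the missing item is gone. This is only a sketch under assumptions; `first_step_dropping` is a hypothetical helper, and the two lambdas are simplified stand-ins for the real filtering and sorting methods.

```ruby
# Hypothetical helper: run items through named steps and return the name of
# the first step after which the wanted item is missing (nil if it survives).
def first_step_dropping(items, wanted_id, steps)
  present = ->(list) { list.any? { |item| item[:id] == wanted_id } }
  return "input" unless present.call(items)

  steps.each do |name, step|
    items = step.call(items)
    return name unless present.call(items)
  end
  nil # the item survived every step
end

# Simplified stand-ins for the real steps; the logic here is assumed.
steps = {
  "remove_no_deals"  => ->(cats) { cats.reject { |c| c[:deals].zero? } },
  "sort_by_position" => ->(cats) { cats.sort_by { |c| c[:position] } }
}

# Made-up categories for illustration.
categories = [
  { id: 1, deals: 3, position: 2 },
  { id: 2, deals: 0, position: 1 }
]

puts first_step_dropping(categories, 2, steps).inspect
```

A result of "input" means the item was never in the list to begin with, which points at the query or the cache rather than at the later steps.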
