I needed to run one of our Java servers in production with full traces enabled.
Under load, the server crashed with an out-of-memory error. I examined the heap dump it produced with Eclipse Memory Analyzer and found that the memory exhaustion was caused by an overflow in the incoming message queue. I concluded that running with traces degraded throughput enough to cause the outage.
First, I measured the throughput without traces and found it takes 0.18 ms to process a message. Then I ran with full traces and measured 20 ms per message, more than 100 times longer.
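The measurement can be sketched as a simple harness that processes a batch of messages and reports the average time per message. This is a hypothetical shape, not the original benchmark; the class and method names are illustrative.

```java
class Throughput {
    // Run the given per-message work n times and return the average
    // wall-clock time per message, in milliseconds.
    static double avgMillisPerMessage(Runnable processOne, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            processOne.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / n;
    }
}
```

Comparing the result of this harness with tracing disabled versus enabled gives the per-message overhead figures quoted above.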
My next step was to review the code for logger initialization and formatting. I found that the logger was initialized with a ConsoleHandler (in addition to a FileHandler) by default, even though our server runs as an NT service and has no console. This code had been inserted during development and was not needed in production. After removing it, I measured 9.32 ms per message, i.e., now only 50 times longer.
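With java.util.logging, the fix amounts to detaching any ConsoleHandler and keeping only the FileHandler. A minimal sketch, assuming java.util.logging and illustrative names (the logger name and file path are not from the original code):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Logger;

class LoggerSetup {
    static Logger configure() {
        Logger logger = Logger.getLogger("server"); // illustrative logger name
        // Don't delegate to the root logger, whose default handler writes to the console.
        logger.setUseParentHandlers(false);
        // Remove any ConsoleHandler attached directly (e.g. left over from development).
        for (Handler h : logger.getHandlers()) {
            if (h instanceof ConsoleHandler) {
                logger.removeHandler(h);
            }
        }
        try {
            // Keep file logging only; "server.log" is a placeholder path.
            logger.addHandler(new FileHandler("server.log", true));
        } catch (java.io.IOException e) {
            throw new RuntimeException("could not open log file", e);
        }
        return logger;
    }
}
```

On a host with no console (such as an NT service), writes to a ConsoleHandler are pure overhead, which is why removing it cut so much of the per-message cost.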
Next, I tried changing the way we format a trace message. In the formatter, we had code that calculated the method name from the call stack; similarly, we had code that calculated the class name from the log record. Both were designed to relieve the programmer of writing the class and method names every time he writes a trace message, by appending this information automatically. After removing this code, I measured 1.94 ms per message, i.e., now only 10 times longer. This change means that programmers will have to fix all existing messages to include class and method names, and add this information to all future traces as well. This is a heavy task, as hundreds of trace lines must be fixed manually, but the impact is a roughly fivefold improvement. If this change enables running with traces in production, then I think it is worth the work.
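The before/after can be sketched with two java.util.logging formatters. This is an assumed structure, not the original code: the expensive version asks the LogRecord for the caller's class and method, which triggers a stack-trace inference per message, while the cheap version just emits the message text as written by the programmer.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Expensive variant: getSourceClassName()/getSourceMethodName() make
// LogRecord infer the caller by walking the call stack for every record.
class CallerFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        return record.getSourceClassName() + "." + record.getSourceMethodName()
                + ": " + record.getMessage() + System.lineSeparator();
    }
}

// Cheap replacement: emit only the message; the programmer now writes
// the class and method names into the message text by hand.
class PlainFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        return record.getMessage() + System.lineSeparator();
    }
}
```

The trade-off is exactly the one described above: the runtime no longer pays for stack inspection, but the caller information must be maintained by hand in every trace message.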
Next, I replaced the time calculation appended to each trace line with System.currentTimeMillis, but this change had only a minor impact: 1.69 ms per message.
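A sketch of this change, under the assumption that the formatter previously built a formatted date per line: print the record's raw epoch-millis value instead, avoiding any per-message date formatting.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Prefix each line with the raw epoch milliseconds of the log event
// instead of a formatted date, skipping date-formatting cost entirely.
class MillisFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        return record.getMillis() + " " + record.getMessage()
                + System.lineSeparator();
    }
}
```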
I had to decide whether the burden of running a special tool to convert the raw millisecond timestamps into a date and time of day each time a trace is investigated was worth the minor throughput improvement, and decided not to go with this change.
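For illustration, the "special tool" would be little more than a millis-to-date converter like the following sketch (the class name and date pattern are assumptions, not from the original tooling):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Offline helper: turn a raw System.currentTimeMillis() value from a
// trace line back into a human-readable date and time.
class MillisToDate {
    static String convert(long millis) {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
                .format(new Date(millis));
    }
}
```

Trivial as it is, having to run it for every investigated trace line is friction for whoever reads the logs, which is why the minor speedup did not justify the change.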
Here is a graph that summarizes the results: the blue line depicts the improvement in the average message-processing time after each code change, and the pink line is the average time to process a message without traces.
The vertical axis is the average time it took to process a message, and the horizontal axis is the code changes I made:
- original code (named "base" in the graph)
- removal of tracing to console (named "console" in the graph)
- removal of the class and method names from the formatter (named "class and methods" in the graph)
- replacing Date with System.currentTimeMillis (named "currentTime" in the graph)
I conducted a load test against our application with two setups: one with our original logging, and the other with the modified logging, from which I had removed the console output and the class- and method-name calculation in the formatter.
The test results show that with the first setup the application was unavailable 100% of the time, while with the second setup it was unavailable only 15% of the time; the maximum response time was ~10 minutes and the average response time was 24.6 seconds.