A Puzzling Performance Problem
At one of the biggest VistA EMR sites, Hakeem (VistA adapted to the requirements of healthcare in Jordan) was running well, with one exception: running a MUPIP REORG operation to defragment the database created an IO bottleneck. After tracing the issue and correlating it with database growth and usage trends, we found the root cause to be low Hard Disk Drive (HDD) performance: the disks could not sustain a MUPIP REORG operation without impacting the user experience.
To resolve this issue, we decided to upgrade the server’s HDDs to Solid State Disks (SSDs). After installing the new disks, however, performance became unpredictable, and the degraded application performance hurt the user experience.
These were the symptoms we observed, alongside the degraded application performance:
- The new SSDs were operating as expected, with no IO delays or bottlenecks.
- There were no processes in the “D” state, waiting for IO.
- Apart from degraded performance, the application was behaving normally and delivering correct results.
- The only anomaly was that the CPU run queue grew in a semi-linear trend, sometimes reaching 120, even though the server’s 86 CPUs were running at 100% capacity.
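The checks behind these observations can be reproduced with standard Linux tools. A minimal diagnostic sketch (the exact commands we ran are not part of the original post, so treat these as illustrative):

```shell
# Compare runnable tasks against available CPUs: a run queue persistently
# larger than the CPU count points at scheduler/CPU pressure, not IO wait.
CPUS=$(nproc)
# Field 4 of /proc/loadavg is "runnable/total" processes; keep the runnable count.
RUNNABLE=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f1)
echo "runnable tasks: $RUNNABLE / CPUs: $CPUS"

# Processes in uninterruptible (D) sleep are blocked on IO; we saw none.
DSTATE=$(ps -eo state= | grep -c '^D' || true)
echo "processes in D state: $DSTATE"
```

Tools such as vmstat or sar give the same numbers over time, which is how the semi-linear growth of the run queue shows up.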
It took us about fifteen hours of troubleshooting to find a solution: set the tuned profile to throughput-performance:
tuned-adm profile throughput-performance
systemctl restart tuned
During the troubleshooting phase, it was clear that the CPU run queue was growing when it should not have been, and that this was causing the poor system performance. What suggested the idea of forcing the CPUs to run at maximum performance by setting the tuned profile to throughput-performance was the profile definition itself: https://github.com/redhat-performance/tuned/blob/master/profiles/throughput-performance/tuned.conf.
After enabling this profile, system performance improved and the CPU run queue emptied within a few seconds!
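To confirm that the change took effect, the active tuned profile and the per-CPU governor can be inspected. A hedged sketch (tuned-adm may not be installed everywhere, hence the guards):

```shell
# Show the active tuned profile, if the tuned tooling is installed.
if command -v tuned-adm >/dev/null 2>&1; then
    tuned-adm active
else
    echo "tuned-adm not installed"
fi

# The throughput-performance profile should leave each CPU's scaling
# governor set to "performance".
GOV_FILE=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$GOV_FILE" ]; then
    GOVERNOR=$(cat "$GOV_FILE")
else
    GOVERNOR=unavailable
fi
echo "cpu0 governor: $GOVERNOR"
```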
The Root Cause
After successfully resolving the issue, and in line with the IT operations management strategy at EHS, we initiated the problem management process to find the root cause and prevent the issue from recurring.
The server model used at this site is the HP ProLiant DL380 Gen9, whose BIOS has a Collaborative Power Control (CPC) setting that was set to OS Control mode (located under Power Management -> Power Profile -> Advanced Power Options -> Collaborative Power Control). This feature enables the OS kernel to use Processor Clocking Control (PCC). With CPC in OS Control mode, the OS controls CPU frequency scaling, and since the initial CPU frequency governor was ondemand, the OS did not utilize the full CPU performance; under the production workload this led to low performance.
The OS installed on the new disks is Red Hat Enterprise Linux (RHEL) 7.9, which selects the pcc-cpufreq CPU frequency driver for this server model. With this driver, the CPU governor defaults to ondemand.
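Which frequency driver the kernel bound can be checked through sysfs. A small sketch (on the DL380 Gen9 described here it would report pcc-cpufreq, while on other hardware intel_pstate or acpi-cpufreq is typical):

```shell
# Report the CPU frequency scaling driver the kernel selected for cpu0.
DRV_FILE=/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
if [ -r "$DRV_FILE" ]; then
    DRIVER=$(cat "$DRV_FILE")
else
    DRIVER=unavailable
fi
echo "scaling driver: $DRIVER"
```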
Therefore, we recommend overcoming this issue with one of the following solutions:
- Set the tuned profile to throughput-performance: this profile forces maximum performance from the server’s CPUs, because it includes the following CPU settings:
[cpu]
governor=performance
energy_perf_bias=performance
min_perf_pct=100
- Disable CPC in the server’s BIOS by changing it to Disabled.
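Whichever option is chosen, the resulting CPU state can be verified through sysfs. A sketch assuming the intel_pstate interface (the min_perf_pct knob only exists when that driver is active, so the script tolerates its absence):

```shell
# Verify the settings the throughput-performance profile is expected to apply.
for f in /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor \
         /sys/devices/system/cpu/intel_pstate/min_perf_pct; do
    if [ -r "$f" ]; then
        echo "$f = $(cat "$f")"
    else
        echo "$f not present on this system"
    fi
done
DONE=yes
```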
When enhancing system performance by upgrading hardware, it is essential to also review performance-related software settings!
About Yazeed Smadi
Yazeed Smadi leads the Linux System Administration team at Electronic Health Solutions, Amman Jordan. A graduate of German Jordanian University and Technische Hochschule Deggendorf, he worked on blockchain software at T-Mobile in Germany before returning to Jordan. Yazeed is passionate about DevOps automation and has scripted automated YottaDB switchovers to take place in under one minute using Jenkins. Since the world’s best falafel is found in Jordan, Yazeed and Bhaskar together try to find the best falafel in Amman when Bhaskar visits EHS!
Published on March 23, 2021