Stories about using an APM tool to find bottlenecks in Atlassian Confluence

Glowroot

Today I will tell you how to find bottlenecks in an on-prem Confluence installation in the shortest possible time, based on one production system. The environment was used for training, with Confluence serving as the knowledge base. Autumn brings an influx of student users, so it was time to audit the system and prepare changes: during previous peaks of reader traffic, it had struggled to deliver content to users on time.

Naturally, the first step was to check the OS parameters (they were fine, with a comfortable margin of resources) and the access log of the nginx reverse proxy, which turned out to be missing because of the access_log off directive. While I was there, I tidied up the SSL termination and enabled HTTP/2. I did not dwell on nginx, because it quickly became clear that the real work lay on the Tomcat side (as the vendor's recommendations also suggest).

Then, during a single restart, I added the Glowroot agent and went to look at the charts (I wrote about how to install it here). Below are three analyses at different levels of application management and maintenance that I hope the reader will find useful and interesting.

First analysis (Ready)

I continued with the analysis of the link between the DBMS and Confluence, since the Confluence–nginx link worked fine.

First of all, Glowroot showed that on every request to the database Confluence checks the connection status in the way shown below.

The check is executed 11 times per request, and each call takes 0.20 to 0.32 ms, so roughly 2 to 3.5 milliseconds of every request is spent just on status checks. When there is no caching at the application level, or when the cache has been invalidated, this becomes a bottleneck.
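To get a feel for that cost outside of Glowroot, a quick micro-check with plain JDBC is enough. This is a rough sketch under my own assumptions: the JDBC URL, user and password are placeholders, and whether isReadOnly() actually hits the server depends on the Connector/J version and connection properties.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Rough micro-check of what a Connection.isReadOnly() round trip costs.
// The JDBC URL and credentials are placeholders for your environment.
public class ReadOnlyCheckCost {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/confluence", "confluence", "secret")) {
            conn.isReadOnly(); // warm-up call
            int iterations = 1_000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                // With older Connector/J versions against MySQL 5.6.5+ this can
                // trigger a "select @@session.tx_read_only" on every call.
                conn.isReadOnly();
            }
            double avgMs = (System.nanoTime() - start) / 1_000_000.0 / iterations;
            System.out.printf("avg isReadOnly(): %.3f ms%n", avgMs);
        }
    }
}
```

If the driver really does query the session state on each call, the printed average should land in the same ballpark as the per-call time Glowroot reported.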

Checking the DBMS version and the driver version at {CONFLUENCE_URL}/admin/systeminfo.action gave further food for thought, especially together with the good old table of latency numbers, which always helps me spot bottlenecks at a glance.

Atlassian itself leans heavily towards PostgreSQL, which is evident from the list of supported databases.

Since I am still on MySQL, updating the driver has traditionally eased the symptoms. To understand the underlying cause, I was helped by the article “After updating the online database, there is a problem with excessive rollback of database transactions”, where the solution was essentially to switch to another connection pooler (https://github.com/alibaba/druid/wiki/FAQ) — and in my case, to switch to PostgreSQL. (:

To see how the useSessionStatus parameter works, you can download the connector sources and look at how the isReadOnly method is used.

The dependency on the server version is easy to spot in the very first condition of that method.
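For reference, here is a simplified paraphrase of that method as it looks in the 5.1.x connector. The class, field and helper names are my own stand-ins rather than the actual driver source, but the shape of the first condition, including the server version test, is what you will see when you open it.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// A simplified paraphrase of the check Connector/J performs in
// ConnectionImpl.isReadOnly(boolean useSessionStatus). Names are stand-ins;
// read the actual driver sources for the exact code.
class ReadOnlyCheckSketch {
    private final Connection delegate;                 // the physical connection
    private final int major, minor, patch;             // reported server version
    private final boolean useLocalSessionState;        // connection property
    private final boolean readOnlyPropagatesToServer;  // connection property
    private boolean cachedReadOnly;                    // locally tracked flag

    ReadOnlyCheckSketch(Connection delegate, int major, int minor, int patch,
                        boolean useLocalSessionState, boolean readOnlyPropagatesToServer) {
        this.delegate = delegate;
        this.major = major; this.minor = minor; this.patch = patch;
        this.useLocalSessionState = useLocalSessionState;
        this.readOnlyPropagatesToServer = readOnlyPropagatesToServer;
    }

    private boolean versionMeetsMinimum(int wantMajor, int wantMinor, int wantPatch) {
        if (major != wantMajor) return major > wantMajor;
        if (minor != wantMinor) return minor > wantMinor;
        return patch >= wantPatch;
    }

    boolean isReadOnly(boolean useSessionStatus) throws SQLException {
        // The first condition ties the behaviour to the MySQL version: the
        // session read-only flag only exists from 5.6.5, so for newer servers
        // the driver asks the database on every call unless it is configured
        // to trust the locally cached session state.
        if (useSessionStatus && versionMeetsMinimum(5, 6, 5)
                && !useLocalSessionState && readOnlyPropagatesToServer) {
            try (Statement st = delegate.createStatement();
                 ResultSet rs = st.executeQuery("select @@session.tx_read_only")) {
                if (rs.next()) {
                    return rs.getInt(1) != 0;
                }
            }
        }
        return cachedReadOnly; // otherwise fall back to the cached flag
    }
}
```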

To summarize: the short-term fix is to accept the risk and update the JDBC connector driver, and the long-term solution is migrating to PostgreSQL 11.

Second analysis (Set)

In this kind of analysis it always helps to look for patterns, such as the times when complaints come in and how they correlate with the application response time graphs — especially the percentiles, which show what is happening even to a small share of the requests.

By switching to slow traces mode in the Glowroot dashboard, you can take a closer look at the behavior of the system.

Next, I observed the following behavior.

All of this prompted me to look at the schedule of batch operations — for example, when the backup process starts.

First I checked the web interface ({CONFLUENCE_URL}/admin/scheduledjobs/viewscheduledjobs.action) and disabled the application-level XML backup. There you can also see the run history and work out that the job finishes by about 6:30 in the morning.

Then, logging into the server as root, I checked the crontab and saw that a backup script was also running from 5:00 and finishing at around 6:15 local time. This is a classic bottleneck, and the fix was simply to disable the script as redundant: backups are already taken at the virtualization level once an hour, the DBMS lives on another server, so the only thing that really matters is the Confluence home directory.

In summary, always check the simple explanations first. The backup schedule is the first thing to look at when the behavior seems to correlate with a particular time of day.

Third analysis (Go)

Since the analysis of the interaction between the DBMS and the application was not yet finished, I went to the Queries tab, sorted it by average execution time, and saw an interesting pattern, which you can see in the screenshot below.

Yes — INSERTs into the AO_7B47A5_EVENT table are occasionally slow when a large number of users open pages in Confluence. The AO_* prefix marks tables created by the Active Objects ORM, which as a rule means apps installed on the platform. Google quickly led me to the ticket https://jira.atlassian.com/browse/CONFSERVER-69474, which points to the Confluence Analytics app as the culprit. Searching Managed Apps in the admin panel (with the search scope set to All Apps), voilà — there is the application, and it is not even licensed yet, as the screenshot shows.
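Before going further, it is worth checking how much data the app has actually accumulated. A minimal JDBC sketch for that is below; the connection URL and credentials are placeholders, and the table name is taken straight from the Glowroot query list.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Counts the rows the Confluence Analytics app has accumulated in its
// Active Objects table. URL and credentials are placeholders.
public class AnalyticsEventTableSize {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/confluence", "confluence", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM AO_7B47A5_EVENT")) {
            if (rs.next()) {
                System.out.println("AO_7B47A5_EVENT rows: " + rs.getLong(1));
            }
        }
    }
}
```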

Since this was not really acceptable, I sent a request to the vendor asking them to expose metrics and to provide a retention policy for the analytics events that the asynchronous analytics jobs pile up in the database — after clearing the records in that table, content delivery sped up before our eyes.

To summarize: it is important to verify slow queries and to keep a slow query log (at that time there was none to analyze). That information also helps vendors improve their products — for example, the analytics extension in the latest version of Confluence behaves much better and affects the system far less, although it still has an issue with attachments.
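If, as in this case, MySQL runs without a slow query log, it can be switched on at runtime without a restart. The sketch below does it over JDBC; the credentials are placeholders, the account needs the SUPER (or, on MySQL 8.0, SYSTEM_VARIABLES_ADMIN) privilege, and the setting does not survive a server restart — for a persistent change, put slow_query_log and long_query_time into my.cnf.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Turns on the MySQL slow query log at runtime (non-persistent).
// Credentials are placeholders; the account needs SUPER / SYSTEM_VARIABLES_ADMIN.
public class EnableSlowQueryLog {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/confluence", "admin", "secret");
             Statement st = conn.createStatement()) {
            st.execute("SET GLOBAL slow_query_log = 'ON'");
            st.execute("SET GLOBAL long_query_time = 1"); // log anything slower than 1 s
            System.out.println("Slow query log enabled until the next MySQL restart");
        }
    }
}
```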

In conclusion: the main business metric is user experience, which was confirmed by the business customer's answer, “Yes, there have been no complaints.” But there is still plenty to do, since a rather large backlog of changes to improve the system has accumulated.

Thank you for reading to the end. If you found these situations interesting, I will be glad to keep sharing similar stories about tracing and diagnostic tools.

I am always glad to hear your questions, and to see you in the Atlassian community Telegram chat.

Have a nice day, Gonchik.
