Troubleshooting Applications
Splunk AppDynamics includes many tools and views designed to help you diagnose application-related problems.
Access Troubleshooting
When starting to troubleshoot an application problem, begin in the Troubleshoot section of the UI. You can access it from the left-hand navigation pane of the Controller UI in an application context.
The area includes pages for analyzing slow response times, errors and exceptions, and health rule violations. It also provides access to war rooms, areas of the UI dedicated to troubleshooting a specific problem.
Restricting Memory Amounts
| Agent | Memory Restriction Functionality |
|---|---|
| Analytics Agent | This functionality is configurable; see Enable the Standalone Analytics Agent. |
| C/C++ SDK Agent | This functionality is not configurable. |
| Database Agent | This functionality is configurable; see Install the Database Agent. |
| Go SDK Agent | This functionality is not configurable. |
| IIB Agent | This functionality is not configurable. |
| Java Agent | This functionality is not configurable. |
| Machine Agent | This functionality is configurable; see Machine Agent Requirements and Supported Environments. |
| .NET Agent | This functionality is not configurable. |
| Network Agent | This functionality is not configurable. |
| Node.js Agent | This functionality is not configurable. |
| PHP Agent | This functionality is not configurable. |
| Python Agent | This functionality is not configurable. |
Additional Help
If slow response time persists even after you've completed the steps outlined above, you may need to perform deeper diagnostics. If you cannot find the information you need in the Splunk AppDynamics documentation, consider posting a note about your problem in the Community Discussion Boards.
Slow Response Times
You may receive a notification based on a health rule violation, or see performance indicators in flow maps or transaction scorecards that indicate slow response times. When you do, the following guidelines provide a strategy for troubleshooting and diagnosis.
The Slow Response Times Dashboard in the Troubleshoot menu lists the Business Transactions most responsible for slow Average Response Time (ART). From that dashboard, you can select a time range centered around a spike in ART that also captures more normal-looking time spans. This information is also available through the Top Business Transactions tab on the By Contribution to App Average Response Time tile.
Step 1: Check for slow or stalled business transactions.
Splunk AppDynamics uses transaction thresholds to detect slow, very slow, and stalled transactions. Follow these steps to determine whether you should complete Step 2 or move on to Step 3.
Do you have slow or stalled transactions?
- Make sure that the time frame selected in the Controller UI encompasses the time when the performance issue occurred. If it's a continuing condition, you can keep the time frame relatively brief. Use the Time Range dropdown.
- Click Troubleshoot > Slow Response Times.
- Click the Slow Transactions tab if it is not already selected.
- Do you see one or more slow transaction snapshots on this page?
Step 2: Drill into slow or stalled transactions to determine the root cause.
- From the left menu, navigate to Troubleshoot > Slow Response Times.
- In the lower pane of the Slow Transactions tab, click the Exe Time (ms) column to sort the transactions from slowest to fastest.
- Select a snapshot from the list and click Details.
- Review the Potential Issues list to see the longest-running method and SQL calls in the transaction.
- Click a potential issue and select Drill Down into Call Graph, or click Drill Down in the transaction flow map pane to see the complete set of call graph segments retained for this transaction.
- View the Time (ms) column to see how long this method execution takes relative to the transaction execution time.
- Click the HTTP link in the last column on the right to display the information details pane.
- Note the Class, Method, and Line Number (if available) represented by the execution segment. This information provides a starting point for troubleshooting this code issue.
Step 3: Check for slow database or remote service calls on the backend.
Splunk AppDynamics collects metrics about the performance of business transaction calls to the databases and remote servers from the instrumented app servers. You can drill down to the root cause of slow database and remote service calls.
- Click Troubleshoot > Slow Response Times, then click the Slowest DB & Remote Service Calls tab.
- Select a call from the list and click the View Snapshots link to show a list of Correlated Snapshots.
- Click the Exe Time (ms) column to sort the transactions from slowest to fastest.
- Select a transaction and click Drill Down.
- Select a potential issue from the Potential Issues list.
- Click Drill Down into Call Graph to go directly to that point in the call graph, or click Drill Down in the flow map pane to see the complete set of call graph segments retained for this transaction.
- Review the Time (ms) column and select the transaction with the longest execution time for this method relative to the transaction execution time.
- Select the DB & Remote Service Calls tab.
- Do you see one or more slow calls on the SQL Calls or Remote Service Calls tabs?
Step 4: Drill into SQL or Remote Service Calls to determine the root cause.
- Slow database call – click the database call to gain information about the call.
- If you have Correlated snapshots between Java applications and Oracle databases – drill down into the Oracle database on the Transaction Snapshot to view database details captured during the snapshot.
- If you have Splunk AppDynamics for Databases – right-click the database on the Application, Tier, Node or Backend Flow Map, and choose Link to Splunk AppDynamics for Databases. You can use Splunk AppDynamics for Databases to diagnose database issues.
- If you have Database Monitoring – right-click the database on the Application, Tier, Node, or Backend Flow Map, and choose View to review database problems.
- On the SQL Calls tab of the transaction snapshot, sort the SQL calls by Avg. Time (ms).
- On the Remote Service Calls tab, sort the queries by Avg. Time (ms).
- Select the slow call.
- Click Drill Down into Downstream Call to gain further insight into the methods of the service call.
- Sort the methods by the Time (ms) column.
- Select a slow method.
- Click Details.
Step 5: Does the problem affect all nodes in the slow tier?
Step 6: Check to see whether the problem affects most business transactions.
Additional Help
If you've tried to diagnose the problem using the previous steps and haven't found the cause, see the additional information for your specific agent.
Errors and Exceptions
Splunk AppDynamics Application Intelligence Platform captures and presents information on business transaction errors in the monitored environment.
At a high-level, a business transaction is considered to have an error if its normal processing has been affected by a code-level exception or error event, including custom error events based on methods you specify.
View Error and Exception Information
The Controller UI presents information on errors and exceptions in various places in the UI, including in transaction snapshots, metrics, and dashboards.
The informational popups for tiers in flow maps have an error tab that displays error rate metrics for the tier:
On the application and tier flow maps, the error rate is for all business transactions. On the business transaction flow map, errors apply only to the current business transaction.
The Metric Browser includes Error metrics:
The Errors page shows all error transactions. The page contains two tabs, one for transaction errors and one for exceptions.
The tabs show information on the rate of errors or exceptions, and lets you drill down to the error or exception for more information, as shown:
Business Transaction Error
All transaction errors that have been detected according to the configured error detection rules in the selected time frame of the Controller UI appear in the Error Transactions tabs of the Errors page.
By default, Splunk AppDynamics considers a business transaction to be in error if it detects one of the following types of events in the context of the transaction:
- An unhandled exception or error. An exception that is thrown and never caught or caught after the business transaction terminates results in a transaction error, and the exception is presented in Splunk AppDynamics. An exception that is thrown and caught within the context of the business transaction is not considered a transaction error and the exception is not captured in Splunk AppDynamics.
- An exception caught in an exit call, such as a web service or database call.
- An HTTP error response, such as a status code 404 or 500 response.
- A custom-configured error method and error message.
Error detection configuration is described in Error Detection.
Errors that occur on a downstream tier that are not propagated to the originating tier do not result in a business transaction error. If the originating client receives a 200 success response, for example, the business transaction is not considered an error. The error contained within the downstream tier does count against the Error Per Minute metric for the continuing segment.
When a business transaction experiences an error, it is counted only as an error transaction. It is not also counted as a slow, very slow or stalled transaction, even if the transaction was also slow or stalled.
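The handled-versus-unhandled distinction above can be illustrated with a minimal sketch. The class and method names are hypothetical, chosen only to show the two shapes of code:

```java
public class ErrorDetectionExample {
    // An exception caught within the transaction's context: Splunk AppDynamics
    // does NOT count this as a business transaction error, and the exception
    // is not captured.
    static String handledLookup(String key) {
        try {
            if (key == null) {
                throw new IllegalArgumentException("key must not be null");
            }
            return "value-for-" + key;
        } catch (IllegalArgumentException e) {
            return "default-value"; // handled inside the transaction
        }
    }

    // An exception that escapes the transaction entry point uncaught: the
    // transaction is counted as an error and the exception is presented.
    static String unhandledLookup(String key) {
        if (key == null) {
            throw new IllegalArgumentException("key must not be null"); // never caught
        }
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        System.out.println(handledLookup(null));  // prints "default-value"
        System.out.println(unhandledLookup("a")); // prints "value-for-a"
    }
}
```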
Code Exceptions in Splunk AppDynamics
Code exceptions are a common cause of business transaction errors. The Exceptions tab in the Errors page shows an aggregated view of the exceptions across all transactions. For purposes of this view, Splunk AppDynamics considers the following types of incidents to be exceptions:
- Any exception logged with a severity of Error or Fatal (using Log4j, java.util.logging, Log4Net/NLog, or another supported logger). This applies even if the exception occurs outside the context of a business transaction, in which case the exception type is specified as Application Server.
- HTTP errors that do not occur in the context of a Business Transaction.
- Error page redirects.
Exceptions that are thrown and handled within a business transaction are not captured by Splunk AppDynamics and do not appear in the Exceptions tab.
When troubleshooting errors, notice that the number of business transaction errors does not necessarily correspond to the number of exceptions in a given time frame. A single transaction that counts as an error transaction can correspond to multiple exceptions. For example, as the transaction traverses tiers, it can generate an exception on each one. Troubleshooting an error typically involves finding the exception closest to the originating point of the error.
If a stack trace for the exception is available, you can access it from the Exception tab in the Controller UI. A stack trace is available for a given exception if the exception was passed in the log call. For example, a logging call in the form of logger.log(Level.ERROR, String msg, Throwable e) would include a stack trace, whereas a call in the form of logger.log(Level.ERROR, String msg) would not.
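The stack-trace rule above can be demonstrated with java.util.logging (where SEVERE corresponds to Error severity). This sketch attaches a collecting handler so you can see which log record carries a Throwable; the logger name and messages are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class StackTraceLogging {
    // Collects log records so we can inspect whether a Throwable was attached.
    static final List<LogRecord> RECORDS = new ArrayList<>();

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);
        logger.addHandler(new Handler() {
            @Override public void publish(LogRecord record) { RECORDS.add(record); }
            @Override public void flush() {}
            @Override public void close() {}
        });

        Exception boom = new IllegalStateException("boom");

        // Throwable passed in the log call: a stack trace is available.
        logger.log(Level.SEVERE, "failed with exception", boom);
        // No Throwable passed: no stack trace is available.
        logger.log(Level.SEVERE, "failed without exception");

        System.out.println(RECORDS.get(0).getThrown() != null); // prints true
        System.out.println(RECORDS.get(1).getThrown() != null); // prints false
    }
}
```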
Agent Errors
Error and Exception Limits
Splunk AppDynamics limits the number of registered error types (based on error-logging events, exception chains, and so on) to 4000. It maintains statistics only for registered error types.
Reaching the limit generates the CONTROLLER_ERROR_ADD_REG_LIMIT_REACHED event. While it is possible to increase the limit, we recommend instead refining the default error detection rules so that errors you are not interested in capturing are ignored, reducing the number of error registrations.
For more information on configuring log errors and exceptions to ignore, see Error Detection.
Configure Errors and Exceptions
Splunk AppDynamics automatically recognizes errors and exceptions for many common frameworks. You can customize the default error detection behavior as needed, for example, if you use your own custom error framework. See Error Detection.
Slow Response Times for .NET
You may become aware that your application's response time is slow in any of these ways:
- You receive an alert: If you have received an email alert from Splunk AppDynamics that was configured through the use of health rules and policies, the email message provides a number of details about the problem that triggered the alert. See information about Email Notifications in Notification Actions. If the problem is related to slow response time, see Initial Troubleshooting Steps.
- You view the Application Dashboard for a business application and see slow response times.
- A user reported slow response time that relates to a particular business transaction, for example, an internal tester reports "Searching for a hotel is slow".
Initial Troubleshooting Steps
In some cases, the source of your problem might be easily diagnosed by choosing Troubleshoot > Slow Response Times in the left navigation pane. See Slow Response Times.
.NET Resource Troubleshooting
If you've tried to diagnose the problem using those techniques and haven't found the problem, use the following troubleshooting approaches to find other ways to determine the root cause of the issue.
Step 1 - CPU saturated?
Is the CPU of the CLR saturated?
- Display the Tier Flow Map.
- Click the Nodes tab, and then click the Hardware tab.
- Sort by CPU % (current).
If the CPU % is 90 or higher, the answer is Yes. Otherwise, the answer is No.
Yes – Go to Step 2.
No – Review various metrics in the Metric Browser to pinpoint the problem.
In the left navigation pane, open the Metric Browser. Review these metrics in particular:
- ASP.NET -> Application Restarts
- ASP.NET -> Request Wait Time
- ASP.NET -> Requests Queued
- CLR -> Locks and Threads -> Current Logical Threads
- CLR -> Locks and Threads -> Current Physical Threads
- IIS -> Number of worker processes
- IIS -> Application pools -> <Business application name> -> CPU%
- IIS -> Application pools -> <Business application name> -> Number of worker processes
- IIS -> Application pools -> <Business application name> -> Working Set
You have isolated the problem and don't need to continue with the rest of the steps below.
Step 2 - Significant garbage collection activity?
Is there significant garbage collection activity?
Yes – Go to Step 3.
No – Use your standard tools to produce memory dumps; review these to locate the source of the problem.
You have isolated the problem and don't need to continue with the rest of the steps below.
Step 3 - Memory leak?
Is there a memory leak?
Yes – Use your standard tools for troubleshooting memory problems. You can also review ASP.NET metrics in the Metric Browser.
No – Use your standard tools to produce memory dumps; review these to locate the source of the problem.
Whether you answered Yes or No, you have isolated the problem.
Java Resource Issues
These troubleshooting guidelines may help you determine the root cause of many Java-related issues.
Step 1. CPU saturated?
Is the CPU of the JVM saturated?
Step 2. Significant garbage collection activity?
Is there significant garbage collection activity?
Step 3. Memory leak?
Is there a memory leak?
Step 4. Resource leak?
Is there a resource leak?
Java Memory Leaks
Automatic Leak Detection
You can access Automatic Leak Detection on the Memory tab of the Node Dashboard. Automatic Leak Detection is disabled by default because it increases overhead on the JVM. You should enable leak detection mode only when you suspect a memory leak problem. Turn off Automatic Leak Detection after you identify the cause for the leak.
Automatic Leak Detection uses On Demand Capture Sessions to capture actively used collections (any class that implements the JDK Map or Collection interface) during the capture period. The default capture period is 10 minutes.
Splunk AppDynamics tracks every Java collection that meets the following criteria:
- The collection has been alive for at least 30 minutes.
- The collection has at least 1000 elements.
- The collection Deep Size is at least 5 MB. The agent calculates Deep Size by traversing recursive object graphs of all the objects in the collection.
The following node properties define the defaults for leak detection criteria:
- minimum-age-for-evaluation-in-minutes
- minimum-number-of-elements-in-collection-to-deep-size
- minimum-size-for-evaluation-in-mb
See App Agent Node Properties.
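For example, to widen the capture criteria you could lower these thresholds. The values below are purely illustrative; set the properties as node properties through the Controller UI as described in App Agent Node Properties:

```
minimum-age-for-evaluation-in-minutes=15
minimum-number-of-elements-in-collection-to-deep-size=500
minimum-size-for-evaluation-in-mb=2
```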
The Java Agent tracks the collection and identifies potential leaks using a linear regression model. You can identify the root cause of the leak by tracking frequent access to the collection over a period of time.
After it qualifies a collection, Splunk AppDynamics monitors the collection size for a long-term growth trend. Positive growth indicates the collection is the potential source of a memory leak.
After Splunk AppDynamics identifies a leaking collection, the Java Agent automatically triggers diagnostics every 30 minutes. The diagnostics capture a shallow content dump and activity traces of the code path and business transactions that access the collection. You can drill down into any leaking collection monitored by the agent, to manually trigger Content Summary Capture and Access Tracking sessions.
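The long-term growth check can be pictured as a least-squares slope over periodic size samples. This is a simplified sketch of the idea, not the agent's actual model:

```java
public class LeakTrend {
    // Returns the least-squares slope of collection size over sample index.
    // A persistently positive slope suggests long-term growth (a leak candidate);
    // a flat or negative slope suggests normal fluctuation.
    static double growthSlope(int[] sizes) {
        int n = sizes.length;
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (int s : sizes) meanY += s;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (sizes[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    public static void main(String[] args) {
        int[] leaking = {1000, 1400, 1900, 2300, 2800}; // steady growth
        int[] stable  = {1000, 1200, 1000, 1100, 1000}; // fluctuating, flat
        System.out.println(growthSlope(leaking) > 0); // prints true
        System.out.println(growthSlope(stable) > 0);  // prints false
    }
}
```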
You can also monitor memory leaks for custom memory structures. Typically custom memory structures are used as caching solutions. In a distributed environment, caching can easily become a prime source of memory leaks. It is therefore important to manage and track memory statistics for these memory structures. To do this, you must first configure custom memory structures. See Custom Memory Structures for Java.
Workflow to Troubleshoot Memory Leaks
You can use this workflow to troubleshoot memory leaks on JVMs that have been identified with a potential memory leak problem:
- Monitor memory for potential JVM memory leaks.
- Enable automatic leak detection.
- Start an on demand capture session.
- Detect and troubleshoot leaking conditions.
Monitor Memory for Potential JVM Leaks
Use the Node dashboard to identify the memory leak. A possible memory leak is indicated by a growing trend in the heap as well as the old/tenured generation memory pool.
An object is automatically marked as a potentially leaking object when it shows a positive and steep growth slope.
The Automatic Memory Leak dashboard shows:
- Collection Size—The number of elements in a collection.
- Potentially Leaking—Potentially leaking collections display as red. You should start diagnostic sessions on potentially leaking objects.
- Status—Indicates if a diagnostic session has been started on an object.
- Collection Size Trend—A positive and steep growth slope indicates a potential memory leak.
If no captured collections display, ensure that you have the correct configuration for detecting potential memory leaks.
Enable Memory Leak Detection
Memory leak detection is available through the Automatic Leak Detection feature. Once the Automatic Leak Detection feature is turned on and a capture session has been started, Splunk AppDynamics tracks all frequently used collections. Therefore, using this mode results in higher overhead.
- Turn on Automatic Leak Detection mode only when a memory leak problem is identified.
- Click Start On Demand Capture Session to start monitoring frequently used collections and detect leaking collections.
- After you identify and resolve the leak, turn the capture session and the leak detection modes off.
- Start diagnosis on one individual collection at a time to achieve optimum performance.
Troubleshoot Memory Leaks
After detecting a potential memory leak, troubleshooting the leak involves performing these three actions:
Select the Collection Object to Monitor
On the Automatic Leak Detection dashboard, right-click the class name and click Drill Down.
For performance reasons, start the troubleshooting session on a single collection object at a time.
Use Content Inspection
Content Inspection identifies which part of the application the collection belongs to so that you can start troubleshooting. It lets you monitor histograms of all the elements in a particular collection.
Enable Automatic Leak Detection, start an On Demand Capture Session, select the object you want to troubleshoot, and then follow the steps listed below:
- Click the Content Inspection tab.
- Click Start Content Summary Capture Session to start the content inspection session.
- Enter the session duration. Allow at least 1 – 2 minutes for data generation.
- Click Refresh to retrieve the session data.
- Click on the snapshot to view details about an individual session.
Use Access Tracking
Access Tracking shows the actual code paths and business transactions accessing the collection object.
As described above in Workflow to Troubleshoot Memory Leaks, enable Automatic Leak Detection, start an On Demand Capture Session, select the object you want to troubleshoot, and then follow the steps listed below:
- Select the Access Tracking tab.
- Click Start Access Tracking Session to start the tracking session.
- Enter the session duration. Allow at least 1-2 minutes for data generation.
- Click Refresh to retrieve session data.
- Click the snapshot to view details about an individual session.
Java Memory Thrash
Memory thrash is caused when a large number of temporary objects are created in very short intervals. Although these objects are temporary and are eventually cleaned up, the garbage collection mechanism may struggle to keep up with the rate of object creation. This may cause application performance problems. Monitoring the time spent in garbage collection can provide insight into performance issues, including memory thrash.
For example, an increase in the number of spikes for major collections either slows down a JVM or indicates potential memory thrash. Use object instance tracking to isolate the root cause of the memory thrash. To configure and enable object instance tracking, see Object Instance Tracking for Java.
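Memory thrash is often caused by code like the first method below, which allocates a fresh temporary object on every iteration; the second variant reuses one buffer. This is a simplified illustration of the allocation pattern, not agent-specific code:

```java
public class ThrashExample {
    // Allocates a new String on every iteration: a stream of short-lived
    // objects for the garbage collector to clean up (the classic thrash pattern).
    static String concatInLoop(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s = s + i; // each + creates a new temporary String
        }
        return s;
    }

    // Reuses a single StringBuilder: far fewer temporary allocations.
    static String buildInLoop(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(concatInLoop(5)); // prints "01234"
        System.out.println(buildInLoop(5));  // prints "01234"
    }
}
```

Both methods produce the same result; only the allocation behavior differs, which is exactly what Object Instance Tracking surfaces.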
Splunk AppDynamics automatically tracks object instances for the top 20 core Java (system) classes and the top 20 application classes.
The Object Instance Tracking subtab provides the number of instances for a particular class and graphs the count trend of those objects in the JVM. It also provides the shallow memory size (the memory footprint of the object and the primitives it contains) used by all the instances.
Analyze Memory Thrash
Once a memory thrash problem is identified in a particular collection, start the diagnostic session by drilling down into the suspected problematic class.
Select the class name to monitor and click Drill Down at the top of the Object Instance Tracking dashboard or right-click the class name and select the Drill Down option.
After the drill down action is triggered, data collection for object instances is performed every minute. This data collection is considered a diagnostic session, and the Object Instance Tracking dashboard for that class is updated with an icon to indicate that a diagnostic session is in progress.
The Object Instance Tracking dashboard indicates possible cases of memory thrash. The prime indicators of memory thrash problems on the dashboard are:
- Current Instance Count: A high number indicates the possible allocation of a large number of temporary objects.
- Shallow Size: The approximate memory used by all instances of a class. A large shallow size signals potential memory thrash.
- Instance Count Trend: A sawtooth pattern is an immediate indication of memory thrash.
If you suspect you have a memory thrash problem at this point, verify that this is the case. See Verify Memory Thrash.
Verify Memory Thrash
Select the class name to monitor and click Drill Down at the top of the Object Instance Tracking dashboard. On the Object Instance Tracking window, click Show Major Garbage Collections.
If the instance count does not vary with the garbage collection cycle, it is an indication of a potential leak and not a memory thrash problem. See Java Memory Leaks.
Troubleshoot Java Memory Thrash Using Allocation Tracking
Allocation Tracking tracks all the code paths and business transactions that are allocating instances of a particular class. It detects the code paths and business transactions that are creating and discarding instances.
To use allocation tracking:
- Using the Drill Down option, trigger a diagnostic session.
- Click the Allocation Tracking tab.
- Click Start Allocation Tracking Session to start tracking code paths and business transactions.
- Enter the session duration and allow at least 1 to 2 minutes for data generation.
- Click Refresh to retrieve the session data.
- Click a session to view its details.
- Use the information presented in the Code Paths and Business Transaction panels to identify the origin of the memory thrash problem.
Monitor Java Object Instances
If the application uses a JRE (rather than a JDK), use these steps to enable object instance tracking:
- Ensure the tools.jar file is in the jre/lib/ext directory.
- On the Node Dashboard, click the Memory tab.
- On the Memory tab, click the Object Instance Tracking subtab.
- Click On and then OK.
Code Deadlocks for Java
By default, the Java Agent detects code deadlocks. You can find deadlocks and see their details using the Events list or the REST API.
Code Deadlocks and Their Causes
In multi-threaded development environments, it is common to use more than a single lock. However, deadlocks sometimes occur. Here are some possible causes:
- The order in which locks are acquired is inconsistent across threads
- The context in which they are being called (for example, from within a callback) is not correct
- Two threads wait for each other to signal an event
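The lock-ordering cause can be sketched as follows: two threads that acquire the same pair of locks in opposite orders can deadlock, while threads that always acquire them in one global order cannot. A minimal illustration (class and field names are hypothetical):

```java
public class LockOrdering {
    static final Object LOCK_A = new Object();
    static final Object LOCK_B = new Object();
    static int counter = 0;

    // Deadlock-prone shape (shown as a comment only; do not run these
    // two acquisition orders concurrently):
    //   thread 1: synchronized (LOCK_A) { synchronized (LOCK_B) { ... } }
    //   thread 2: synchronized (LOCK_B) { synchronized (LOCK_A) { ... } }

    // Safe shape: every thread takes LOCK_A before LOCK_B.
    static void incrementSafely() {
        synchronized (LOCK_A) {
            synchronized (LOCK_B) {
                counter++;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) incrementSafely(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) incrementSafely(); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // prints 2000
    }
}
```

Fixing a detected code deadlock usually means converging on one such global acquisition order, or replacing nested locks with a single lock or a java.util.concurrent construct.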
Finding Deadlocks Using the Events List
Select Code Problems (or just Code Deadlock) in the Filter By Event Type list to see code deadlocks in the Events list.
To examine a code deadlock, double-click the deadlock event in the events list and then click the Code Deadlock Summary tab. Details about the deadlock are in the Details tab. See Monitor Events.
Find Deadlocks Using the REST API
You can detect a DEADLOCK event-type using the Splunk AppDynamics REST API. For details, see the example Retrieve event data.
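As a sketch of such a query, the following assembles a typical events URL filtered to DEADLOCK events; the controller host and application name are placeholders, and the exact parameter set is documented in the Events API reference:

```java
public class DeadlockEventsUrl {
    // Builds an events query URL for DEADLOCK events over the last N minutes.
    // The controller base URL and application name are placeholders for your
    // environment.
    static String buildUrl(String controller, String app, int minutes) {
        return controller + "/controller/rest/applications/" + app
                + "/events?event-types=DEADLOCK"
                + "&severities=INFO,WARN,ERROR"
                + "&time-range-type=BEFORE_NOW"
                + "&duration-in-mins=" + minutes;
    }

    public static void main(String[] args) {
        // Fetch the resulting URL with any HTTP client using your Controller
        // credentials, for example:
        //   curl -u user@account:password "<url>"
        System.out.println(buildUrl("https://controller.example.com:8090", "MyApp", 60));
    }
}
```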
Thread Contention
Thread contention arises when two or more threads attempt to access the same resource at the same time. This page describes how Splunk AppDynamics helps you diagnose and resolve thread contention issues.
Performance Issues Resulting from Thread Contention
Multithreaded programming techniques are common in applications that require asynchronous processing. Although each thread has its own call stack in such applications, threads may need to access shared resources, such as a lock, cache, or counter. See Enabling Thread Correlation.
While synchronization techniques can help to prevent interference between threads in such scenarios, they may nevertheless compete for access to shared resources. This can result in application performance degradation or even data integrity issues.
Splunk AppDynamics can help you identify and resolve problems relating to thread contention in business transactions and service endpoints. See Trace Multithreaded Transactions for Java.
Thread Contention Detection
Splunk AppDynamics detects thread contention based on the thread state of the instrumented application.
It identifies these blocked or waiting states in the JVM:
- Acquiring a lock (MONITOR_WAIT)
- Waiting for a condition (CONDOR_WAIT)
- Sleeping (OBJECT_WAIT)
- A blocking I/O operation
The OBJECT_WAIT state covers calls such as Thread.sleep, Object.wait, Thread.join, LockSupport.parkNanos, LockSupport.parkUntil, and LockSupport.park.
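The JVM exposes these thread states through java.lang.management, which is one way to observe a waiting thread directly. This is a standalone sketch of the underlying JVM facility, not the agent's internal mechanism:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadStateProbe {
    // Starts a thread that parks itself in Object.wait(), observes its state
    // through ThreadMXBean, then wakes it and returns the observed state.
    static Thread.State observeWaitingState() throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try { lock.wait(); } catch (InterruptedException ignored) {}
            }
        }, "waiter");
        waiter.start();

        // Poll until the waiter has actually entered Object.wait().
        while (waiter.getState() != Thread.State.WAITING) Thread.sleep(10);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(waiter.getId());
        Thread.State observed = info.getThreadState();

        synchronized (lock) { lock.notifyAll(); } // release the waiter
        waiter.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observeWaitingState()); // prints WAITING
    }
}
```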
The Controller alerts you to possible thread contention problems in the Potential Issues pane of the Business Transaction Flow Map. From there, you can use the browser to access additional information about blocked and waiting threads in business transactions or service endpoints, and determine the cause of the performance problem.
The following sections explain how you use the browser to surface contention information for business transaction and service endpoints.
Thread Contention in Transaction Snapshots
To view information about thread contention:
- In the transaction snapshot navigation page, look for items labeled as Thread Contention issues in the Potential Issues pane. The time column indicates blocked or wait time.
- To display more information about the blocked method, click the thread contention item and select Drill Down into Call Graph. The call graph shows the following information relevant to thread contention:
- In the Call Graph header, Wait Time and Block Time indicate aggregate measures for the thread in one segment of the business transaction.
- In the Call Graph header, Node specifies the name of the node hosting the contending threads, PojoNode in the example above.
- The Percent% column shows the amount of time spent in the method as a percentage of overall time for the thread.
- The Thread State column indicates the degree of thread contention issues for the method. Gray means no problems; yellow to red shading signals the severity of contention problems. (When you hover over the bar, a breakdown of the elements that make up the thread state is shown. This includes Block time and Wait time by default. To include Cpu Time in the Thread State detail, Dev mode must be enabled.)
- Right-click on any method with a thread state that indicates block or wait times and select View Details. The Thread Contention details pane appears.
The Thread Contention details pane displays the name of the blocked method in the top left corner and adds the following information in the Thread Contention table:
| Element | Meaning |
|---|---|
| Blocking Thread | The thread holding a lock on the blocking object. |
| Blocking Object | The object that the blocked thread is waiting to access. |
| Block Time | The amount of time spent waiting to access the object. |
| Line Number | The line number in the blocked method where the blocking object is being accessed. |

With respect to the example above, run is attempting to access a locked object at line 114. The order in which blocking threads are shown in the table is not significant; it does not imply a call order or time sequence.
Thread Contention in Service Endpoints
You can view thread contention information for service endpoint methods in Splunk AppDynamics. Call graphs identify service endpoint methods with a service endpoint icon.
Select More > Service Endpoints from the menu bar to view thread contention information by service endpoint.
Export Contention Information
- The Summary pane includes Block Time data: the block time specified is the sum of all block times for the blocked methods shown in the Call Graph pane.
- The Call Graph pane lists block time by method.
Event Loop Blocking in Node.js
You can use process snapshots to examine Node.js event loop activity and identify functions with high CPU times that are blocking the event loop.
Latency in Node.js Event Loops
The event loop of a Node.js process is a single thread that polls for incoming connections and executes all application code. When a Node.js request makes a call to an external database, remote service or the filesystem, the event loop automatically directs the application's control flow to some other task, including other connections or callbacks.
CPU-intensive operations block the event loop, preventing it from handling incoming requests or finishing existing requests. A CPU-intensive operation in one business transaction may cause slowness in other business transactions.
Process Snapshots in Splunk AppDynamics
A process snapshot describes an instance of a CPU process on an instrumented Node.js node. It generates a process-wide flame graph for a Node.js process over a configurable time range.
Process snapshots provide visibility into the Node.js event loop across all business transactions for the duration of the process snapshot. Process snapshots are useful when the main troubleshooting tools (such as business transaction snapshots) are inconclusive because the source of latency is a CPU-intensive operation in another business transaction. You can use lists of process snapshots to identify which functions have high CPU times. From the list, you can select and examine process snapshots to identify exactly which functions in your code are blocking the CPU.
For a given Node.js node or tier, you can access the list of process snapshots from the Process Snapshots tab of the node or tier dashboard. You can filter the process snapshot list to display only the snapshots that you are interested in. You can filter by execution time, whether the snapshot is archived, and the GUID of the request. If you access the list from the tier dashboard, you can also filter by node.
For more information on how process snapshots are generated and how to configure them, see Manage Node.js Process Snapshots.
To learn how process snapshots and business transaction snapshots are created, see Process Snapshots and Business Transaction Snapshots.
Process snapshots persist for 14 days unless you archive them, in which case they are retained indefinitely.
A process snapshot contains these tabs:
- Overview
- Flame Graph
- Call Graph
- Allocation Call Graph
- Hot Spots
Overview
Summarizes the snapshot. Contents vary based on the available information.
Usually contains at least the total execution time, tier and node of the process, timestamp, slowest method and request GUID.
Flame Graph
Provides a visualization of each stack frame's frequency on the CPU over the duration of a process snapshot. A frame's vertical position relative to the bottom-most frame depicts its call-stack depth.
The flame graph contains the same information as the call graph, but allows you to quickly spot methods that are consuming more CPU resources relative to others.
The width of a stack frame on the top edge of the flame graph indicates how frequently the corresponding method was sampled running on the CPU.
To identify long-running CPU executions, look for long horizontal cells on the top edge of the flame graph.
A healthy Node.js process has minimal CPU-blocking activity; correspondingly, its flame graph has few long horizontal cells along the top edge. See The Flame Graph.
Call Graph
Shows the total execution time and the percentage of the total execution time of each method on the process's call stack. The numbers at the ends of the methods are the line numbers in the source code. You can filter out methods below a certain time to simplify the graph and isolate the trouble spots.
The Time and Percentage columns identify which calls take the longest time to execute.
To see more information about a call, select the call and click Details.
Allocation Call Graph
Available only for process snapshots that are collected manually. See Manage Node.js Process Snapshots.
Shows the amount and percentage of the memory allocated and not freed by each method on the process's call stack during the process snapshot. You can use the Method Size slider to configure how much memory a method must allocate to be displayed in the allocation call graph. You can also filter out methods that consume less than a certain amount of memory to simplify the graph and isolate the trouble spots.
The Size and Percentage columns identify which calls consume the most memory.
The agent cannot report allocations made prior to the beginning of the allocation snapshot.
The allocation reported in the snapshot is the memory that is still referenced when the snapshot ends: memory allocated during the snapshot period minus memory freed during the snapshot period.
For more information about a call, select the call and click Details.
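The "allocated minus freed" accounting can be illustrated with a plain JavaScript sketch (this is ordinary application code, not agent API; the names are hypothetical). Short-lived allocations that become unreachable before the snapshot ends are garbage-collected and do not appear in the allocation call graph, while objects still referenced at snapshot end do:

```javascript
// Illustrative sketch: only memory still referenced when the snapshot
// ends is reported in the allocation call graph.
const cache = []; // long-lived: objects pushed here stay referenced

function processRequest(id) {
  const scratch = new Array(10000).fill(id); // allocated during the snapshot,
                                             // unreachable after return: freed, not reported
  const summary = { id, first: scratch[0] }; // small object that survives
  cache.push(summary);                       // still referenced: reported
  return summary;
}

processRequest(1);
```

Under this model, `processRequest` would be charged only for the retained `summary` objects, not for the much larger `scratch` arrays it allocated and released.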
Hot Spots
This tab displays the calls by execution time, with the most expensive calls at the top. To see the invocation trace of a single call in the lower panel, select the call in the upper panel.
Use the Method Time slider in the upper right corner to configure how slow a call must be to be considered a hot spot.
Manage Node.js Process Snapshots
This page describes how process snapshots are generated and viewed.
Automatic Process Snapshot Generation
When a business transaction snapshot is triggered by periodic collection or by a diagnostic session, a ten-second process snapshot is automatically started. By default, the agent starts no more than two process snapshots per minute automatically, but this behavior is configurable.
You can also start process snapshots manually on demand. See Collect Process Snapshots Manually.
Configure Automatic Collection
You can configure automatic process snapshot collection using these settings:
- processSnapshotCountResetPeriodSeconds: Frequency, in seconds, at which the automatic process snapshot count is reset to 0. The default is 60 seconds.
- maxProcessSnapshotsPerPeriod: Number of automatic process snapshots allowed in processSnapshotCountResetPeriodSeconds seconds. The default is 2 snapshots.
- autoSnapshotDurationSeconds: Duration of an automatically generated process snapshot. The default is 10 seconds.
To configure these settings, add them to the require statement in your application source code as described in Install the Node.js Agent. Then stop and restart the application.
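As a sketch, assuming the agent is loaded with the usual `require("appdynamics").profile(...)` call described in Install the Node.js Agent, the snapshot settings can be added alongside your existing connection settings. All connection values below are placeholders for your environment:

```javascript
// Sketch only: connection values are placeholders for your environment.
require("appdynamics").profile({
  controllerHostName: "controller.example.com", // placeholder
  controllerPort: 8090,                         // placeholder
  accountName: "my-account",                    // placeholder
  accountAccessKey: "my-access-key",            // placeholder
  applicationName: "MyApp",
  tierName: "web",
  nodeName: "web-1",

  // Automatic process snapshot settings (values shown are the defaults):
  processSnapshotCountResetPeriodSeconds: 60, // reset the snapshot count every 60 s
  maxProcessSnapshotsPerPeriod: 2,            // at most 2 automatic snapshots per period
  autoSnapshotDurationSeconds: 10             // each automatic snapshot lasts 10 s
});
```

Remember to stop and restart the application after changing these settings, as noted above.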
Collect Process Snapshots Manually
If you want to generate some process snapshots now, you can start them manually.
- Navigate to the dashboard for the tier or node for which you want to collect process snapshots.
- Click the Process Snapshots tab.
- Click Collect Process Snapshots.
- If you are in the Tier dashboard, select the node for which you want to collect snapshots from the Node dropdown. If you are in the Node dashboard, you can only set up snapshot collection for that node.
- Enter how many seconds you want to collect process snapshots for this node. The maximum is 60 seconds.
- Click Create.
Process Snapshots and Business Transaction Snapshots
This page explains the relationship between transaction snapshots and process snapshots created by the Node.js Agent.
V8 Sampler
Node.js is built on the V8 JavaScript engine, which includes a code sampler.
The Node.js Agent uses the V8 sampler to create process-wide process snapshots, which contain call graphs of the methods on the Node.js process's call stack.
Call Graph Data in Snapshots
Call graph data displays in business transaction snapshots as well as process snapshots.
When you view a business transaction snapshot, the displayed call graph specific to the transaction instance is derived from the concurrent process snapshot call graph.
When you view a process snapshot, the complete call graph of all the business transactions executed while the process snapshot was captured is displayed.
The call graph in a business transaction snapshot displays a view of the data from a concurrent process snapshot that is filtered to display only time in methods attributable to the specific business transaction. It is a subset of the concurrent process snapshot call graph.
For this reason, you might see an execution time for a method in a business transaction call graph that is less than the execution time for the same method in the concurrent process snapshot call graph. This would indicate that some calls to that method were made outside the context of the business transaction instance captured by the transaction snapshot.
The summary tab of a transaction snapshot includes a link to the process snapshot that was taken during the time covered by the transaction snapshot.
Business Transaction Snapshots Trigger Process Snapshots
To provide call graph data associated with business transaction snapshots, the agent starts a ten-second process snapshot whenever it starts a business transaction snapshot triggered by periodic collection or a diagnostic session, provided no process snapshot is already in progress for the current process. Process snapshots do not overlap. Periodic collection means that business transaction snapshots are collected at periodic intervals; the default interval is ten minutes, but it is configurable. A diagnostic session means that either the agent has detected a pattern of possible performance issues and automatically started capturing transaction snapshots, or a user has manually started a diagnostic session for the same reason.
Concurrent Business Transaction and Process Snapshots
The result presented is a process snapshot that ran concurrently with a business transaction. How well the two snapshots line up depends on the relative durations and start times of the transaction and the process snapshots.
In the scenario sketched below, all of the five-second blue transaction's calls and most of the 10-second green transaction's calls are captured by a 10-second process snapshot, but only about half of the 14-second orange transaction snapshot's calls are captured.
If you find that your business transactions are running longer than your process snapshots, you can increase the default length of a process snapshot with the autoSnapshotDurationSeconds setting.