Troubleshooting Applications
Splunk AppDynamics includes many tools and views designed to help you diagnose application-related problems.
Access Troubleshooting
When starting to troubleshoot an application problem, begin in the Troubleshoot section of the UI. You can access it from the left-hand navigation pane of the Controller UI in an application context.
The area includes pages for analyzing slow response times, errors and exceptions, and health rule violations. It also provides access to war rooms, areas of the UI dedicated to troubleshooting a specific problem.
Restricting Memory Amounts
| Agent | Memory Restriction Functionality |
|---|---|
| Analytics Agent | This functionality is configurable; see Enable the Standalone Analytics Agent. |
| C/C++ SDK Agent | This functionality is not configurable. |
| Database Agent | This functionality is configurable; see Install the Database Agent. |
| Go SDK Agent | This functionality is not configurable. |
| IIB Agent | This functionality is not configurable. |
| Java Agent | This functionality is not configurable. |
| Machine Agent | This functionality is configurable; see Machine Agent Requirements and Supported Environments. |
| .NET Agent | This functionality is not configurable. |
| Network Agent | This functionality is not configurable. |
| Node.js Agent | This functionality is not configurable. |
| PHP Agent | This functionality is not configurable. |
| Python Agent | This functionality is not configurable. |
Additional Help
If slow response time persists even after you've completed the steps outlined above, you may need to perform deeper diagnostics. If you cannot find the information you need in the Splunk AppDynamics documentation, consider posting a note about your problem in the Community Discussion Boards.
Slow Response Times
You may receive a notification based on a health rule violation, or see performance indicators in flow maps or transaction scorecards that indicate slow response times. When you do, the following guidelines provide a strategy for troubleshooting and diagnosis.
The Slow Response Times Dashboard in the Troubleshoot menu lists the Business Transactions most responsible for slow Average Response Time (ART). From that dashboard, you can select a time range centered around a spike in ART that also captures more normal-looking time spans. This information is also available through the Top Business Transactions tab on the By Contribution to App Average Response Time tile.
Step 1: Check for slow or stalled business transactions.
Splunk AppDynamics uses transaction thresholds to detect slow, very slow, and stalled transactions. Follow these steps to determine whether you should complete Step 2 or move on to Step 3.
Do you have slow or stalled transactions?
- Make sure that the time frame selected in the Controller UI encompasses the time when the performance issue occurred. If it's a continuing condition, you can keep the time frame relatively brief. Use the Time Range dropdown.
- Click Troubleshoot > Slow Response Times.
- Click the Slow Transactions tab if it is not already selected.
- Do you see one or more slow transaction snapshots on this page?
Step 2: Drill into slow or stalled transactions to determine the root cause.
- From the left menu, navigate to Troubleshoot > Slow Response Times.
- In the lower pane of the Slow Transactions tab, click the Exe Time (ms) column to sort the transactions from slowest to fastest.
- Select a snapshot from the list and click Details.
- Review the Potential Issues list to see the longest-running method and SQL calls in the transaction.
- Click a potential issue and select Drill Down into Call Graph, or click Drill Down in the transaction flow map pane to see the complete set of call graph segments retained for this transaction.
- View the Time (ms) column to see how long this method execution takes relative to the transaction execution time.
- Click the HTTP link in the last column on the right to display the information details pane.
- Note the Class, Method, and Line Number (if available) represented by the execution segment. This information provides a starting point for troubleshooting this code issue.
Step 3: Check for slow database or remote service calls on the backend.
Splunk AppDynamics collects metrics about the performance of business transaction calls to the databases and remote servers from the instrumented app servers. You can drill down to the root cause of slow database and remote service calls.
- Click Troubleshoot > Slow Response Times, then click the Slowest DB & Remote Service Calls tab.
- Select a call from the list and click the View Snapshots link to show a list of Correlated Snapshots.
- Click the Exe Time (ms) column to sort the transactions from slowest to fastest.
- Select a transaction and click Drill Down.
- Select a potential issue from the Potential Issues list.
- Click Drill Down into Call Graph to go directly to that point in the call graph, or click Drill Down in the flow map pane to see the complete set of call graph segments retained for this transaction.
- Review the Time (ms) column and select the transaction with the longest execution time for this method relative to the transaction execution time.
- Select the DB & Remote Service Calls tab.
- Do you see one or more slow calls on the SQL Calls or Remote Service Calls tabs?
Step 4: Drill into SQL or Remote Service Calls to determine the root cause.
- Slow database call – click the database call to gain information about the call.
- If you have Correlated snapshots between Java applications and Oracle databases – drill down into the Oracle database on the Transaction Snapshot to view database details captured during the snapshot.
- If you have Splunk AppDynamics for Databases – right-click the database on the Application, Tier, Node or Backend Flow Map, and choose Link to Splunk AppDynamics for Databases. You can use Splunk AppDynamics for Databases to diagnose database issues.
- If you have Database Monitoring – right-click the database on the Application, Tier, Node, or Backend Flow Map, and choose View to review database problems.
- On the SQL Calls tab of the transaction snapshot, sort the SQL calls by Avg. Time (ms).
- On the Remote Service Calls tab, sort the queries by Avg. Time (ms).
- Select the slow call.
- Click Drill Down into Downstream Call to gain further insight into the methods of the service call.
- Sort the methods by the Time (ms) column.
- Select a slow method.
- Click Details.
Step 5: Does the problem affect all nodes in the slow tier?
Step 6: Check to see whether the problem affects most business transactions.
Additional Help
If you've tried to diagnose the problem using the previous steps and haven't found the cause, see the additional information for your specific agent.
Errors and Exceptions
Splunk AppDynamics Application Intelligence Platform captures and presents information on business transaction errors in the monitored environment.
At a high-level, a business transaction is considered to have an error if its normal processing has been affected by a code-level exception or error event, including custom error events based on methods you specify.
View Error and Exception Information
The Controller UI presents information on errors and exceptions in various places in the UI, including in transaction snapshots, metrics, and dashboards.
The informational popups for tiers in flow maps have an error tab that displays error rate metrics for the tier:
On the application and tier flow maps, the error rate is for all business transactions. On the business transaction flow map, errors apply only to the current business transaction.
The Metric Browser includes Error metrics:
The Errors page shows all error transactions. The page contains two tabs, one for transaction errors and one for exceptions.
The tabs show information on the rate of errors or exceptions, and lets you drill down to the error or exception for more information, as shown:
Business Transaction Error
All transaction errors that have been detected according to the configured error detection rules in the selected time frame of the Controller UI appear in the Error Transactions tabs of the Errors page.
By default, Splunk AppDynamics considers a business transaction to be in error if it detects one of the following types of events in the context of the transaction:
- An unhandled exception or error. An exception that is thrown and never caught or caught after the business transaction terminates results in a transaction error, and the exception is presented in Splunk AppDynamics. An exception that is thrown and caught within the context of the business transaction is not considered a transaction error and the exception is not captured in Splunk AppDynamics.
- An exception caught in an exit call, such as a web service or database call.
- An HTTP error response, such as a status code 404 or 500 response.
- A custom-configured error method and error message.
Error detection configuration is described in Error Detection.
Errors that occur on a downstream tier that are not propagated to the originating tier do not result in a business transaction error. If the originating client receives a 200 success response, for example, the business transaction is not considered an error. The error contained within the downstream tier does count against the Error Per Minute metric for the continuing segment.
When a business transaction experiences an error, it is counted only as an error transaction. It is not also counted as a slow, very slow or stalled transaction, even if the transaction was also slow or stalled.
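The handled-versus-unhandled distinction above can be illustrated with a minimal sketch. The class and method names are hypothetical, chosen only to show the two shapes of code:

```java
public class ErrorDetectionExample {
    // An exception caught within the transaction's context: Splunk AppDynamics
    // does NOT count this as a business transaction error, and the exception
    // is not captured.
    static String handledLookup(String key) {
        try {
            if (key == null) {
                throw new IllegalArgumentException("key must not be null");
            }
            return "value-for-" + key;
        } catch (IllegalArgumentException e) {
            return "default-value"; // handled inside the transaction
        }
    }

    // An exception that escapes the transaction entry point uncaught: the
    // transaction is counted as an error and the exception is presented.
    static String unhandledLookup(String key) {
        if (key == null) {
            throw new IllegalArgumentException("key must not be null"); // never caught
        }
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        System.out.println(handledLookup(null));  // prints "default-value"
        System.out.println(unhandledLookup("a")); // prints "value-for-a"
    }
}
```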
Code Exceptions in Splunk AppDynamics
Code exceptions are a common cause of business transaction errors. The Exceptions tab in the Errors page shows an aggregated view of the exceptions across all transactions. For purposes of this view, Splunk AppDynamics considers the following types of incidents to be exceptions:
- Any exception logged with a severity of Error or Fatal (using Log4j, java.util.logging, Log4Net/NLog, or another supported logger). This applies even if the exception occurs outside the context of a business transaction, in which case the exception type is specified as Application Server.
- HTTP errors that do not occur in the context of a Business Transaction.
- Error page redirects.
Exceptions that are thrown and handled within a business transaction are not captured by Splunk AppDynamics and do not appear in the Exceptions tab.
When troubleshooting errors, notice that the number of business transaction errors does not necessarily correspond to the number of exceptions in a given time frame. A single transaction that counts as an error transaction can correspond to multiple exceptions. For example, as the transaction traverses tiers, it can generate an exception on each one. Troubleshooting an error typically involves finding the exception closest to the originating point of the error.
If a stack trace for the exception is available, you can access it from the Exception tab in the Controller UI. A stack trace is available for a given exception if the exception was passed in the log call. For example, a logging call in the form of logger.log(Level.ERROR, String msg, Throwable e) would include a stack trace, whereas a call in the form of logger.log(Level.ERROR, String msg) would not.
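The stack-trace rule above can be demonstrated with java.util.logging (where SEVERE corresponds to Error severity). This sketch attaches a collecting handler so you can see which log record carries a Throwable; the logger name and messages are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class StackTraceLogging {
    // Collects log records so we can inspect whether a Throwable was attached.
    static final List<LogRecord> RECORDS = new ArrayList<>();

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);
        logger.addHandler(new Handler() {
            @Override public void publish(LogRecord record) { RECORDS.add(record); }
            @Override public void flush() {}
            @Override public void close() {}
        });

        Exception boom = new IllegalStateException("boom");

        // Throwable passed in the log call: a stack trace is available.
        logger.log(Level.SEVERE, "failed with exception", boom);
        // No Throwable passed: no stack trace is available.
        logger.log(Level.SEVERE, "failed without exception");

        System.out.println(RECORDS.get(0).getThrown() != null); // prints true
        System.out.println(RECORDS.get(1).getThrown() != null); // prints false
    }
}
```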
Agent Errors
Error and Exception Limits
Splunk AppDynamics limits the number of registered error types (based on error-logging events, exception chains, and so on) to 4000. It maintains statistics only for registered error types.
Reaching the limit generates the CONTROLLER_ERROR_ADD_REG_LIMIT_REACHED event. While it is possible to increase the limit, we recommend instead refining the default error detection rules so that errors you are not interested in capturing are ignored, reducing the number of error registrations.
For more information on configuring log errors and exceptions to ignore, see Error Detection.
Configure Errors and Exceptions
Splunk AppDynamics automatically recognizes errors and exceptions for many common frameworks. You can customize the default error detection behavior as needed, for example, if you use your own custom error framework. See Error Detection.
Slow Response Times for .NET
You may become aware that your application's response time is slow in any of these ways:
- You receive an alert: If you have received an email alert from Splunk AppDynamics that was configured through the use of health rules and policies, the email message provides a number of details about the problem that triggered the alert. See information about Email Notifications in Notification Actions. If the problem is related to slow response time, see Initial Troubleshooting Steps.
- You view the Application Dashboard for a business application and see slow response times.
- A user reported slow response time that relates to a particular business transaction, for example, an internal tester reports "Searching for a hotel is slow".
Initial Troubleshooting Steps
In some cases, the source of your problem might be easily diagnosed by choosing Troubleshoot > Slow Response Times in the left navigation pane. See Slow Response Times.
.NET Resource Troubleshooting
If you've tried to diagnose the problem using those techniques and haven't found the problem, use the following troubleshooting approaches to find other ways to determine the root cause of the issue.
Step 1 - CPU saturated?
Is the CPU of the CLR saturated?
- Display the Tier Flow Map.
- Click the Nodes tab, and then click the Hardware tab.
- Sort by CPU % (current).
If the CPU % is 90 or higher, the answer is Yes. Otherwise, the answer is No.
Yes – Go to Step 2.
No – Review various metrics in the Metric Browser to pinpoint the problem.
In the left navigation pane, open the Metric Browser. Review these metrics in particular:
- ASP.NET -> Application Restarts
- ASP.NET -> Request Wait Time
- ASP.NET -> Requests Queued
- CLR -> Locks and Threads -> Current Logical Threads
- CLR -> Locks and Threads -> Current Physical Threads
- IIS -> Number of worker processes
- IIS -> Application pools -> <Business application name> -> CPU%
- IIS -> Application pools -> <Business application name> -> Number of worker processes
- IIS -> Application pools -> <Business application name> -> Working Set
You have isolated the problem and don't need to continue with the rest of the steps below.
Step 2 - Significant garbage collection activity?
Is there significant garbage collection activity?
Yes – Go to Step 3.
No – Use your standard tools to produce memory dumps; review these to locate the source of the problem.
You have isolated the problem and don't need to continue with the rest of the steps below.
Step 3 - Memory leak?
Is there a memory leak?
Yes – Use your standard tools for troubleshooting memory problems. You can also review ASP.NET metrics in the Metric Browser.
No – Use your standard tools to produce memory dumps; review these to locate the source of the problem.
Whether you answered Yes or No, you have isolated the problem.
Java Resource Issues
These troubleshooting guidelines may help you determine the root cause of many Java-related issues.
Step 1. CPU saturated?
Is the CPU of the JVM saturated?
Step 2. Significant garbage collection activity?
Is there significant garbage collection activity?
Step 3. Memory leak?
Is there a memory leak?
Step 4. Resource leak?
Is there a resource leak?
Java Memory Leaks
Automatic Leak Detection
You can access Automatic Leak Detection on the Memory tab of the Node Dashboard. Automatic Leak Detection is disabled by default because it increases overhead on the JVM. You should enable leak detection mode only when you suspect a memory leak problem. Turn off Automatic Leak Detection after you identify the cause for the leak.
Automatic Leak Detection uses On Demand Capture Sessions to capture actively used collections (any class that implements the JDK Map or Collection interface) during the capture period. The default capture period is 10 minutes.
Splunk AppDynamics tracks every Java collection that meets the following criteria:
- The collection has been alive for at least 30 minutes.
- The collection has at least 1000 elements.
- The collection Deep Size is at least 5 MB. The agent calculates Deep Size by traversing recursive object graphs of all the objects in the collection.
The following node properties define the defaults for leak detection criteria:
- minimum-age-for-evaluation-in-minutes
- minimum-number-of-elements-in-collection-to-deep-size
- minimum-size-for-evaluation-in-mb
See App Agent Node Properties.
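For example, to widen the capture criteria you could lower these thresholds. The values below are purely illustrative; set the properties as node properties through the Controller UI as described in App Agent Node Properties:

```
minimum-age-for-evaluation-in-minutes=15
minimum-number-of-elements-in-collection-to-deep-size=500
minimum-size-for-evaluation-in-mb=2
```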
The Java Agent tracks the collection and identifies potential leaks using a linear regression model. You can identify the root cause of the leak by tracking frequent access to the collection over a period of time.
After it qualifies a collection, Splunk AppDynamics monitors the collection size for a long-term growth trend. Positive growth indicates the collection is the potential source of a memory leak.
After Splunk AppDynamics identifies a leaking collection, the Java Agent automatically triggers diagnostics every 30 minutes. The diagnostics capture a shallow content dump and activity traces of the code path and business transactions that access the collection. You can drill down into any leaking collection monitored by the agent, to manually trigger Content Summary Capture and Access Tracking sessions.
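The long-term growth check can be pictured as a least-squares slope over periodic size samples. This is a simplified sketch of the idea, not the agent's actual model:

```java
public class LeakTrend {
    // Returns the least-squares slope of collection size over sample index.
    // A persistently positive slope suggests long-term growth (a leak candidate);
    // a flat or negative slope suggests normal fluctuation.
    static double growthSlope(int[] sizes) {
        int n = sizes.length;
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (int s : sizes) meanY += s;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (sizes[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    public static void main(String[] args) {
        int[] leaking = {1000, 1400, 1900, 2300, 2800}; // steady growth
        int[] stable  = {1000, 1200, 1000, 1100, 1000}; // fluctuating, flat
        System.out.println(growthSlope(leaking) > 0); // prints true
        System.out.println(growthSlope(stable) > 0);  // prints false
    }
}
```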
You can also monitor memory leaks for custom memory structures. Typically custom memory structures are used as caching solutions. In a distributed environment, caching can easily become a prime source of memory leaks. It is therefore important to manage and track memory statistics for these memory structures. To do this, you must first configure custom memory structures. See Custom Memory Structures for Java.
Workflow to Troubleshoot Memory Leaks
You can use this workflow to troubleshoot memory leaks on JVMs that have been identified with a potential memory leak problem:
- Monitor memory for potential JVM memory leaks.
- Enable automatic leak detection.
- Start an on demand capture session.
- Detect and troubleshoot leaking conditions.
Monitor Memory for Potential JVM Leaks
Use the Node dashboard to identify the memory leak. A possible memory leak is indicated by a growing trend in the heap as well as the old/tenured generation memory pool.
An object is automatically marked as a potentially leaking object when it shows a positive and steep growth slope.
The Automatic Memory Leak dashboard shows:
- Collection Size—The number of elements in a collection.
- Potentially Leaking—Potentially leaking collections display as red. You should start diagnostic sessions on potentially leaking objects.
- Status—Indicates if a diagnostic session has been started on an object.
- Collection Size Trend—A positive and steep growth slope indicates a potential memory leak.
If no captured collections display, ensure that you have the correct configuration for detecting potential memory leaks.
Enable Memory Leak Detection
Memory leak detection is available through the Automatic Leak Detection feature. Once the Automatic Leak Detection feature is turned on and a capture session has been started, Splunk AppDynamics tracks all frequently used collections. Therefore, using this mode results in higher overhead.
- Turn on Automatic Leak Detection mode only when a memory leak problem is identified.
- Click Start On Demand Capture Session to start monitoring frequently used collections and detect leaking collections.
- After you identify and resolve the leak, turn the capture session and the leak detection modes off.
- Start diagnosis on one individual collection at a time to achieve optimum performance.
Troubleshoot Memory Leaks
After detecting a potential memory leak, troubleshooting the leak involves performing these three actions:
Select the Collection Object to Monitor
On the Automatic Leak Detection dashboard, right-click the class name and click Drill Down.
For performance reasons, start the troubleshooting session on a single collection object at a time.
Use Content Inspection
Content Inspection identifies which part of the application the collection belongs to so that you can start troubleshooting. It lets you monitor histograms of all the elements in a particular collection.
Enable Automatic Leak Detection, start an On Demand Capture Session, select the object you want to troubleshoot, and then follow the steps listed below:
- Click the Content Inspection tab.
- Click Start Content Summary Capture Session to start the content inspection session.
- Enter the session duration. Allow at least 1 – 2 minutes for data generation.
- Click Refresh to retrieve the session data.
- Click on the snapshot to view details about an individual session.
Use Access Tracking
Access Tracking shows the actual code paths and business transactions accessing the collection object.
As described above in Workflow to Troubleshoot Memory Leaks, enable Automatic Leak Detection, start an On Demand Capture Session, select the object you want to troubleshoot, and then follow the steps listed below:
- Select the Access Tracking tab.
- Click Start Access Tracking Session to start the tracking session.
- Enter the session duration. Allow at least 1-2 minutes for data generation.
- Click Refresh to retrieve session data.
- Click the snapshot to view details about an individual session.
Java Memory Thrash
Memory thrash is caused when a large number of temporary objects are created in very short intervals. Although these objects are temporary and are eventually cleaned up, the garbage collection mechanism may struggle to keep up with the rate of object creation. This may cause application performance problems. Monitoring the time spent in garbage collection can provide insight into performance issues, including memory thrash.
For example, an increase in the number of spikes for major collections either slows down a JVM or indicates potential memory thrash. Use object instance tracking to isolate the root cause of the memory thrash. To configure and enable object instance tracking, see Object Instance Tracking for Java.
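Memory thrash is often caused by code like the first method below, which allocates a fresh temporary object on every iteration; the second variant reuses one buffer. This is a simplified illustration of the allocation pattern, not agent-specific code:

```java
public class ThrashExample {
    // Allocates a new String on every iteration: a stream of short-lived
    // objects for the garbage collector to clean up (the classic thrash pattern).
    static String concatInLoop(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s = s + i; // each + creates a new temporary String
        }
        return s;
    }

    // Reuses a single StringBuilder: far fewer temporary allocations.
    static String buildInLoop(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(concatInLoop(5)); // prints "01234"
        System.out.println(buildInLoop(5));  // prints "01234"
    }
}
```

Both methods produce the same result; only the allocation behavior differs, which is exactly what Object Instance Tracking surfaces.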
Splunk AppDynamics automatically tracks object instances for the top 20 core Java (system) classes and the top 20 application classes.
The Object Instance Tracking subtab provides the number of instances for a particular class and graphs the count trend of those objects in the JVM. It also provides the shallow memory size (the memory footprint of the object and the primitives it contains) used by all the instances.
Analyze Memory Thrash
Once a memory thrash problem is identified in a particular collection, start the diagnostic session by drilling down into the suspected problematic class.
Select the class name to monitor and click Drill Down at the top of the Object Instance Tracking dashboard or right-click the class name and select the Drill Down option.
After the drill down action is triggered, data collection for object instances is performed every minute. This data collection is considered a diagnostic session, and the Object Instance Tracking dashboard for that class is updated with an icon to indicate that a diagnostic session is in progress.
The Object Instance Tracking dashboard indicates possible cases of memory thrash. The prime indicators of memory thrash problems on the dashboard are:
- Current Instance Count: A high number indicates the possible allocation of a large number of temporary objects.
- Shallow Size: The approximate memory used by all instances of a class. A large shallow size signals potential memory thrash.
- Instance Count Trend: A sawtooth pattern is an immediate indication of memory thrash.
If you suspect you have a memory thrash problem at this point, verify that this is the case. See Verify Memory Thrash.
Verify Memory Thrash
Select the class name to monitor and click Drill Down at the top of the Object Instance Tracking dashboard. On the Object Instance Tracking window, click Show Major Garbage Collections.
If the instance count does not vary with the garbage collection cycle, it is an indication of a potential leak and not a memory thrash problem. See Java Memory Leaks.
Troubleshoot Java Memory Thrash Using Allocation Tracking
Allocation Tracking tracks all the code paths and business transactions that are allocating instances of a particular class. It detects the code paths and business transactions that are creating and discarding instances.
To use allocation tracking:
- Using the Drill Down option, trigger a diagnostic session.
- Click the Allocation Tracking tab.
- Click Start Allocation Tracking Session to start tracking code paths and business transactions.
- Enter the session duration and allow at least 1 to 2 minutes for data generation.
- Click Refresh to retrieve the session data.
- Click a session to view its details.
- Use the information presented in the Code Paths and Business Transaction panels to identify the origin of the memory thrash problem.
Monitor Java Object Instances
If the application uses a JRE (rather than a JDK), use these steps to enable object instance tracking:
- Ensure the tools.jar file is in the jre/lib/ext directory.
- On the Node Dashboard, click the Memory tab.
- On the Memory tab, click the Object Instance Tracking subtab.
- Click On and then OK.
Code Deadlocks for Java
By default, the Java Agent detects code deadlocks. You can find deadlocks and see their details using the Events list or the REST API.
Code Deadlocks and Their Causes
In multi-threaded development environments, it is common to use more than a single lock. However, deadlocks sometimes occur. Here are some possible causes:
- The order in which locks are acquired is inconsistent across threads
- The context in which they are being called (for example, from within a callback) is not correct
- Two threads wait for each other to signal an event
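The lock-ordering cause can be sketched as follows: two threads that acquire the same pair of locks in opposite orders can deadlock, while threads that always acquire them in one global order cannot. A minimal illustration (class and field names are hypothetical):

```java
public class LockOrdering {
    static final Object LOCK_A = new Object();
    static final Object LOCK_B = new Object();
    static int counter = 0;

    // Deadlock-prone shape (shown as a comment only; do not run these
    // two acquisition orders concurrently):
    //   thread 1: synchronized (LOCK_A) { synchronized (LOCK_B) { ... } }
    //   thread 2: synchronized (LOCK_B) { synchronized (LOCK_A) { ... } }

    // Safe shape: every thread takes LOCK_A before LOCK_B.
    static void incrementSafely() {
        synchronized (LOCK_A) {
            synchronized (LOCK_B) {
                counter++;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) incrementSafely(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) incrementSafely(); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // prints 2000
    }
}
```

Fixing a detected code deadlock usually means converging on one such global acquisition order, or replacing nested locks with a single lock or a java.util.concurrent construct.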
Finding Deadlocks Using the Events List
Select Code Problems (or just Code Deadlock) in the Filter By Event Type list to see code deadlocks in the Events list.
To examine a code deadlock, double-click the deadlock event in the events list and then click the Code Deadlock Summary tab. Details about the deadlock are in the Details tab. See Monitor Events.
Find Deadlocks Using the REST API
You can detect a DEADLOCK event-type using the Splunk AppDynamics REST API. For details, see the example Retrieve event data.
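As a sketch of such a query, the following assembles a typical events URL filtered to DEADLOCK events; the controller host and application name are placeholders, and the exact parameter set is documented in the Events API reference:

```java
public class DeadlockEventsUrl {
    // Builds an events query URL for DEADLOCK events over the last N minutes.
    // The controller base URL and application name are placeholders for your
    // environment.
    static String buildUrl(String controller, String app, int minutes) {
        return controller + "/controller/rest/applications/" + app
                + "/events?event-types=DEADLOCK"
                + "&severities=INFO,WARN,ERROR"
                + "&time-range-type=BEFORE_NOW"
                + "&duration-in-mins=" + minutes;
    }

    public static void main(String[] args) {
        // Fetch the resulting URL with any HTTP client using your Controller
        // credentials, for example:
        //   curl -u user@account:password "<url>"
        System.out.println(buildUrl("https://controller.example.com:8090", "MyApp", 60));
    }
}
```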
Thread Contention
Thread contention arises when two or more threads attempt to access the same resource at the same time. This page describes how Splunk AppDynamics helps you diagnose and resolve thread contention issues.
Performance Issues Resulting from Thread Contention
Multithreaded programming techniques are common in applications that require asynchronous processing. Although each thread has its own call stack in such applications, threads may need to access shared resources, such as a lock, cache, or counter. See Enabling Thread Correlation.
While synchronization techniques can help to prevent interference between threads in such scenarios, they may nevertheless compete for access to shared resources. This can result in application performance degradation or even data integrity issues.
Splunk AppDynamics can help you identify and resolve problems relating to thread contention in business transactions and service endpoints. See Trace Multithreaded Transactions for Java.
Thread Contention Detection
Splunk AppDynamics detects thread contention based on the thread state of the instrumented application.
It identifies these blocked or waiting states in the JVM:
- Acquiring a lock (MONITOR_WAIT)
- Waiting for a condition (CONDOR_WAIT)
- Sleeping (OBJECT_WAIT)
- A blocking I/O operation
The OBJECT_WAIT state covers calls such as Thread.sleep, Object.wait, Thread.join, LockSupport.parkNanos, LockSupport.parkUntil, and LockSupport.park.
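The JVM exposes these thread states through java.lang.management, which is one way to observe a waiting thread directly. This is a standalone sketch of the underlying JVM facility, not the agent's internal mechanism:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadStateProbe {
    // Starts a thread that parks itself in Object.wait(), observes its state
    // through ThreadMXBean, then wakes it and returns the observed state.
    static Thread.State observeWaitingState() throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try { lock.wait(); } catch (InterruptedException ignored) {}
            }
        }, "waiter");
        waiter.start();

        // Poll until the waiter has actually entered Object.wait().
        while (waiter.getState() != Thread.State.WAITING) Thread.sleep(10);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(waiter.getId());
        Thread.State observed = info.getThreadState();

        synchronized (lock) { lock.notifyAll(); } // release the waiter
        waiter.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observeWaitingState()); // prints WAITING
    }
}
```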
The Controller alerts you to possible thread contention problems in the Potential Issues pane of the Business Transaction Flow Map. From there, you can use the browser to access additional information about blocked and waiting threads in business transactions or service endpoints, and determine the cause of the performance problem.
The following sections explain how you use the browser to surface contention information for business transaction and service endpoints.
Thread Contention in Transaction Snapshots
To view information about thread contention:
- In the transaction snapshot navigation page, look for items labeled as Thread Contention issues in the Potential Issues pane. The time column indicates blocked or wait time.
- To display more information about the blocked method, click the thread contention item and select Drill Down into Call Graph. The call graph shows the following information relevant to thread contention:
- In the Call Graph header, Wait Time and Block Time indicate aggregate measures for the thread in one segment of the business transaction.
- In the Call Graph header, Node specifies the name of the node hosting the contending threads, PojoNode in the example above.
- The Percent% column shows the amount of time spent in the method as a percentage of overall time for the thread.
- The Thread State column indicates the degree of thread contention issues for the method. Gray means no problems; yellow to red shading signals the severity of contention problems. (When you hover over the bar, a breakdown of the elements that make up the thread state is shown. This includes Block time and Wait time by default. To include Cpu Time in the Thread State detail, Dev mode must be enabled.)
- Right-click on any method with a thread state that indicates block or wait times and select View Details. The Thread Contention details pane appears.
The Thread Contention details pane displays the name of the blocked method in the top left corner and adds the following information in the Thread Contention table:
| Element | Meaning |
|---|---|
| Blocking Thread | The thread holding a lock on the blocking object. |
| Blocking Object | The object that the blocked thread is waiting to access. |
| Block Time | The amount of time spent waiting to access the object. |
| Line Number | The line number in the blocked method where the blocking object is being accessed. |

With respect to the example above, run is attempting to access a locked object at line 114. The order in which blocking threads are shown in the table is not significant; it does not imply a call order or time sequence.
Thread Contention in Service Endpoints
You can view thread contention information for service endpoint methods in Splunk AppDynamics. Call graphs identify service endpoint methods with a service endpoint icon.
Select More > Service Endpoints from the menu bar to view thread contention information by service endpoint.
Export Contention Information
- The Summary pane includes Block Time data: the block time specified is the sum of all block times for the blocked methods shown in the Call Graph pane.
- The Call Graph pane lists block time by method.
Event Loop Blocking in Node.js
You can use process snapshots to examine Node.js event loop activity and identify functions with high CPU times that are blocking the event loop.
Latency in Node.js Event Loops
The event loop of a Node.js process is a single thread that polls for incoming connections and executes all application code. When a Node.js request makes a call to an external database, remote service or the filesystem, the event loop automatically directs the application's control flow to some other task, including other connections or callbacks.
CPU-intensive operations block the event loop, preventing it from handling incoming requests or finishing existing requests. A CPU-intensive operation in one business transaction may cause slowness in other business transactions.
Process Snapshots in Splunk AppDynamics
A process snapshot describes an instance of a CPU process on an instrumented Node.js node. It generates a process-wide flame graph for a Node.js process over a configurable time range.
Process snapshots provide visibility into the Node.js event loop across all business transactions for the duration of the process snapshot. Process snapshots are useful when the main troubleshooting tools (such as business transaction snapshots) are inconclusive because the source of latency is a CPU-intensive operation in another business transaction. You can use lists of process snapshots to identify which functions have high CPU times. From the list, you can select and examine process snapshots to identify exactly which functions in your code are blocking the CPU.
For a given Node.js node or tier, you can access the list of process snapshots from the Process Snapshots tab of the node or tier dashboard. You can filter the process snapshot list to display only the snapshots that you are interested in. You can filter by execution time, whether the snapshot is archived, and the GUID of the request. If you access the list from the tier dashboard, you can also filter by node.
For more information on how process snapshots are generated and how to configure them, see Manage Node.js Process Snapshots.
To learn how process snapshots and business transaction snapshots are created, see Process Snapshots and Business Transaction Snapshots.
Process snapshots persist for 14 days unless you archive them, in which case they are retained indefinitely.
A process snapshot contains these tabs:
- Overview
- Flame Graph
- Call Graph
- Allocation Call Graph
- Hot Spots
Overview
Summarizes the snapshot. Contents vary based on the available information.
Usually contains at least the total execution time, tier and node of the process, timestamp, slowest method and request GUID.
Flame Graph
Provides a visualization of each stack frame's frequency on the CPU over the duration of a process snapshot. A frame's vertical position relative to the bottom-most frame depicts its call-stack depth.
The flame graph contains the same information as the call graph, but allows you to quickly spot methods that are consuming more CPU resources relative to others.
The width of a stack frame on the top edge of the flame graph indicates how frequently the corresponding method was sampled running on the CPU.
To identify long-running CPU executions, look for long horizontal cells on the top edge of the flame graph.
A healthy Node.js process has minimal CPU-blocking activity; correspondingly, its flame graph has few long horizontal cells along the top edge. See The Flame Graph.
Call Graph
Shows the total execution time and the percentage of the total execution time of each method on the process's call stack. The numbers at the ends of the methods are the line numbers in the source code. You can filter out methods below a certain time to simplify the graph and isolate the trouble spots.
The Time and Percentage columns identify which calls take the longest time to execute.
To see more information about a call, select the call and click Details.
Allocation Call Graph
Available only for process snapshots that are collected manually. See Manage Node.js Process Snapshots.
Shows the amount and percentage of the memory allocated and not freed by each method on the process's call stack during the process snapshot. You can use the Method Size slider to configure how much memory a method must allocate to be displayed in the allocation call graph. You can also filter out methods that consume less than a certain amount of memory to simplify the graph and isolate the trouble spots.
The Size and Percentage columns identify which calls consume the most memory.
The agent cannot report allocations made prior to the beginning of the allocation snapshot.
The allocation reported in the snapshot is the memory that is still referenced when the snapshot ends: memory allocated during the snapshot period minus memory freed during the snapshot period.
For more information about a call, select the call and click Details.
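The "allocated minus freed" accounting can be illustrated with a plain JavaScript sketch (this is ordinary application code, not agent API; the names are hypothetical). Short-lived allocations that become unreachable before the snapshot ends are garbage-collected and do not appear in the allocation call graph, while objects still referenced at snapshot end do:

```javascript
// Illustrative sketch: only memory still referenced when the snapshot
// ends is reported in the allocation call graph.
const cache = []; // long-lived: objects pushed here stay referenced

function processRequest(id) {
  const scratch = new Array(10000).fill(id); // allocated during the snapshot,
                                             // unreachable after return: freed, not reported
  const summary = { id, first: scratch[0] }; // small object that survives
  cache.push(summary);                       // still referenced: reported
  return summary;
}

processRequest(1);
```

Under this model, `processRequest` would be charged only for the retained `summary` objects, not for the much larger `scratch` arrays it allocated and released.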
Hot Spots
This tab displays the calls by execution time, with the most expensive calls at the top. To see the invocation trace of a single call in the lower panel, select the call in the upper panel.
Use the Method Time slider in the upper right corner to configure how slow a call must be to be considered a hot spot.
Manage Node.js Process Snapshots
This page describes how process snapshots are generated and viewed.
Automatic Process Snapshot Generation
When a business transaction snapshot is triggered by periodic collection or by a diagnostic session, a ten-second process snapshot is automatically started. By default, the agent starts no more than two process snapshots per minute automatically, but this behavior is configurable.
You can also start process snapshots manually on demand. See Collect Process Snapshots Manually.
Configure Automatic Collection
You can configure automatic process snapshot collection using these settings:
- processSnapshotCountResetPeriodSeconds: Frequency, in seconds, at which the automatic process snapshot count is reset to 0. The default is 60 seconds.
- maxProcessSnapshotsPerPeriod: Number of automatic process snapshots allowed in processSnapshotCountResetPeriodSeconds seconds. The default is 2 snapshots.
- autoSnapshotDurationSeconds: Duration of an automatically generated process snapshot. The default is 10 seconds.
To configure these settings, add them to the require statement in your application source code as described in Install the Node.js Agent. Then stop and restart the application.
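As a sketch, assuming the agent is loaded with the usual `require("appdynamics").profile(...)` call described in Install the Node.js Agent, the snapshot settings can be added alongside your existing connection settings. All connection values below are placeholders for your environment:

```javascript
// Sketch only: connection values are placeholders for your environment.
require("appdynamics").profile({
  controllerHostName: "controller.example.com", // placeholder
  controllerPort: 8090,                         // placeholder
  accountName: "my-account",                    // placeholder
  accountAccessKey: "my-access-key",            // placeholder
  applicationName: "MyApp",
  tierName: "web",
  nodeName: "web-1",

  // Automatic process snapshot settings (values shown are the defaults):
  processSnapshotCountResetPeriodSeconds: 60, // reset the snapshot count every 60 s
  maxProcessSnapshotsPerPeriod: 2,            // at most 2 automatic snapshots per period
  autoSnapshotDurationSeconds: 10             // each automatic snapshot lasts 10 s
});
```

Remember to stop and restart the application after changing these settings, as noted above.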
Collect Process Snapshots Manually
If you want to generate some process snapshots now, you can start them manually.
- Navigate to the dashboard for the tier or node for which you want to collect process snapshots.
- Click the Process Snapshots tab.
- Click Collect Process Snapshots.
- If you are in the Tier dashboard, select the node for which you want to collect snapshots from the Node dropdown. If you are in the Node dashboard, you can only set up snapshot collection for that node.
- Enter how many seconds you want to collect process snapshots for this node. The maximum is 60 seconds.
- Click Create.
Process Snapshots and Business Transaction Snapshots
This page explains the relationship between transaction snapshots and process snapshots created by the Node.js Agent.
V8 Sampler
Node.js is built on the V8 JavaScript engine, which includes a code sampler.
The Node.js Agent uses the V8 sampler to create process-wide process snapshots, which contain call graphs of the methods on the Node.js process's call stack.
Call Graph Data in Snapshots
Call graph data displays in business transaction snapshots as well as process snapshots.
When you view a business transaction snapshot, the displayed call graph specific to the transaction instance is derived from the concurrent process snapshot call graph.
When you view a process snapshot, the complete call graph of all the business transactions executed while the process snapshot was captured is displayed.
The call graph in a business transaction snapshot displays a view of the data from a concurrent process snapshot that is filtered to display only time in methods attributable to the specific business transaction. It is a subset of the concurrent process snapshot call graph.
For this reason, you might see an execution time for a method in a business transaction call graph that is less than the execution time for the same method in the concurrent process snapshot call graph. This would indicate that some calls to that method were made outside the context of the business transaction instance captured by the transaction snapshot.
The summary tab of a transaction snapshot includes a link to the process snapshot that was taken during the time covered by the transaction snapshot.
Business Transaction Snapshots Trigger Process Snapshots
To provide call graph data associated with business transaction snapshots, the agent starts a ten-second process snapshot whenever it starts a business transaction snapshot triggered by periodic collection or a diagnostic session, provided no process snapshot is already in progress for the current process. Process snapshots do not overlap. Periodic collection means that business transaction snapshots are collected at periodic intervals; the default interval is ten minutes, but it is configurable. A diagnostic session means that either the agent has detected a pattern of possible performance issues and automatically started capturing transaction snapshots, or a user has manually started a diagnostic session for the same reason.
Concurrent Business Transaction and Process Snapshots
The result presented is a process snapshot that ran concurrently with a business transaction. How well the two snapshots line up depends on the relative durations and start times of the transaction and the process snapshots.
In the scenario sketched below, all of the five-second blue transaction's calls and most of the 10-second green transaction's calls are captured by a 10-second process snapshot, but only about half of the 14-second orange transaction snapshot's calls are captured.
If you find that your business transactions are running longer than your process snapshots, you can increase the default length of a process snapshot with the autoSnapshotDurationSeconds setting.