
This page explains how VPI performance information is gathered, how to interpret it, and how you can reproduce the results. For performance comparisons between VPI and other Computer Vision libraries, please consult the Performance Comparison page.

In each table, the target Jetson device and the number of parallel algorithm invocations can be selected, i.e., the number of parallel VPIStream instances executing the algorithm. This allows analyzing how the average running time of each algorithm invocation changes with the number of parallel algorithm executions. Say one VPIStream is set up to apply a box filter to an image. This operation alone takes a certain amount of time. Now suppose the program must process several images in parallel, each one in its own VPIStream. How does the average running time of each box filter operation change with the number of parallel streams? This is answered by changing the stream count in the performance table. A minimal multi-stream setup is sketched below.
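The following sketch shows what running one box filter per parallel stream looks like, assuming the VPI C API as of VPI 2.x (vpiStreamCreate, vpiImageCreate, vpiSubmitBoxFilter); the image size, format, and the choice of the CUDA backend are arbitrary, and error checking is omitted. Exact signatures may vary between VPI releases, so consult the installed headers.

```cpp
// Minimal sketch: one box filter per parallel VPIStream (VPI C API).
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/BoxFilter.h>

int main()
{
    constexpr int N = 4; // number of parallel streams
    VPIStream streams[N];
    VPIImage  inputs[N], outputs[N];

    for (int i = 0; i < N; ++i)
    {
        // Streams and buffers are created beforehand, enabled only for the
        // benchmarked backend, matching the benchmark setup on this page.
        vpiStreamCreate(VPI_BACKEND_CUDA, &streams[i]);
        vpiImageCreate(1920, 1080, VPI_IMAGE_FORMAT_U8,
                       VPI_BACKEND_CUDA | VPI_EXCLUSIVE_STREAM_ACCESS, &inputs[i]);
        vpiImageCreate(1920, 1080, VPI_IMAGE_FORMAT_U8,
                       VPI_BACKEND_CUDA | VPI_EXCLUSIVE_STREAM_ACCESS, &outputs[i]);
    }

    // Submit one 3x3 box filter per stream; submissions are asynchronous,
    // so all N invocations can execute in parallel.
    for (int i = 0; i < N; ++i)
        vpiSubmitBoxFilter(streams[i], VPI_BACKEND_CUDA, inputs[i], outputs[i],
                           3, 3, VPI_BORDER_ZERO);

    // Wait for all streams to finish before measuring or reusing buffers.
    for (int i = 0; i < N; ++i)
        vpiStreamSync(streams[i]);

    for (int i = 0; i < N; ++i)
    {
        vpiImageDestroy(inputs[i]);
        vpiImageDestroy(outputs[i]);
        vpiStreamDestroy(streams[i]);
    }
    return 0;
}
```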
For most backends and algorithms, the average running time of each invocation increases linearly with the number of streams. A notable exception is the PVA backend. Since NVIDIA® Jetson Xavier™ devices have two independent PVA processors, each with two parallel vector processors, the PVA backend is fully utilized only when four or more parallel VPIStream instances are in use. Hence executing one to four parallel PVA algorithm instances won't increase the elapsed running time of each instance.

Although VPI can be used on x86 architectures with discrete GPUs, the large number of configurations with different performance characteristics makes it difficult to draw a useful comparison between them across the different algorithms VPI supports. For this reason, benchmarking is restricted to Jetson devices.

All algorithm elapsed time measurements are shown with their confidence intervals, such as \(212.4 \pm 0.4\) ms or \(0.024 \pm 0.002\) ms. These intervals represent a confidence level of 99.73% (\(\pm 3\sigma\)) that the true elapsed time lies inside them. This assumes that the measurements are drawn from a normal distribution.
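For reference, the 99.73% figure is the standard three-sigma coverage of a normal distribution:

\[
P(\mu - 3\sigma \le X \le \mu + 3\sigma) = \Phi(3) - \Phi(-3) \approx 0.9973,
\]

where \(\Phi\) is the standard normal cumulative distribution function.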
The confidence intervals must be taken into account when comparing measurements. For example, \(110 \pm 5\) ms and \(100 \pm 20\) ms represent similar elapsed times, since the confidence intervals \((105, 115)\) and \((80, 120)\) overlap; it cannot be said that the first measurement is effectively higher (i.e., slower) than the second. On the other hand, \(110 \pm 5\) ms and \(100 \pm 1\) ms represent different elapsed times with high confidence, as there is no overlap between \((105, 115)\) and \((99, 101)\); the first measurement can be considered higher than the second.
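This overlap rule is simple enough to express in a few lines. The helper below is a hypothetical illustration, not part of VPI:

```cpp
#include <cmath>
#include <cstdio>

// Two measurements m1 +- d1 and m2 +- d2 have overlapping confidence
// intervals exactly when the distance between the means does not exceed
// the sum of the interval half-widths.
static bool intervalsOverlap(double m1, double d1, double m2, double d2)
{
    return std::fabs(m1 - m2) <= d1 + d2;
}

int main()
{
    // The examples from the text: 110 +- 5 vs 100 +- 20 overlap
    // (inconclusive), 110 +- 5 vs 100 +- 1 do not (first is higher).
    std::printf("%d\n", intervalsOverlap(110, 5, 100, 20)); // prints 1
    std::printf("%d\n", intervalsOverlap(110, 5, 100, 1));  // prints 0
    return 0;
}
```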
In order to make a meaningful elapsed time comparison across different backends, the corresponding processors must be fully utilized. As an example, consider the Convolution performance table. When comparing CPU and PVA on an NVIDIA® Jetson AGX Xavier™ with one stream, the CPU is faster: it has eight cores and each algorithm invocation fully utilizes all of them, whereas PVA uses only one vector processor of one PVA processor, i.e., 1/4th of the installed PVA capacity. Changing the number of parallel streams to four, the PVA elapsed time increases just a bit, while the CPU elapsed time increases roughly 4x. The conclusion is that the PVA backend scales better than the other backends as more parallel streams are added, until the processor is saturated. On top of that, it is also more energy efficient.
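As a rough illustrative model (an assumption for intuition, not a formula from the VPI documentation): if a backend can execute \(C\) algorithm instances concurrently, the per-invocation elapsed time with \(S\) parallel streams behaves approximately as

\[
t(S) \approx \max\!\left(1, \frac{S}{C}\right) t(1).
\]

On Jetson AGX Xavier, one CPU invocation already saturates all eight cores, so \(C = 1\) and \(t(4) \approx 4\,t(1)\); the PVA backend offers four parallel vector processors, so \(C = 4\) and \(t(4) \approx t(1)\), matching the behavior described above.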
The benchmark procedure used to measure the performance numbers is described in detail below; this information helps explain what context the performance numbers refer to. A simplified sketch of the measurement loop follows the list.

1. All payloads and input and output memory buffers are created beforehand.
2. Memories have only the benchmarked backend enabled, along with the VPI_EXCLUSIVE_STREAM_ACCESS flag on.
3. One second of warm-up time is spent running the algorithm in a loop.
4. The algorithm is run in batches, measuring its average running time within each batch. The number of calls in a batch varies with the approximate running time (the faster the algorithm, the larger the batch, with a maximum of 100 calls). This is done to exclude the time spent performing the measurement itself from the algorithm runtime.
5. Step 4 is performed for at least 5 s, making sure it is done at least 10 times.
6. From the average running times of all batches, the 5% lowest and highest values are excluded. From the resulting set, the median is taken. This is the value used as the final run time for the algorithm.
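Below is a minimal sketch of such a measurement loop. The stand-in runAlgorithm(), the batch-size heuristic, and its ~10 ms batch target are assumptions for illustration; a real run would replace runAlgorithm() with a submit-and-sync pair, and the VPI Benchmarking sample referenced below demonstrates the actual procedure.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for one synchronous algorithm invocation,
// e.g. a vpiSubmit* call followed by vpiStreamSync.
static void runAlgorithm()
{
    volatile int sink = 0;
    for (int i = 0; i < 100000; ++i)
        sink += i;
}

// Average running time of one batch of `calls` invocations. Batching
// amortizes the cost of taking the timestamps themselves (step 4).
static double timeBatch(int calls)
{
    auto start = Clock::now();
    for (int i = 0; i < calls; ++i)
        runAlgorithm();
    std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count() / calls;
}

int main()
{
    // Step 3: one second of warm-up.
    auto warmupEnd = Clock::now() + std::chrono::seconds(1);
    while (Clock::now() < warmupEnd)
        runAlgorithm();

    // Step 4: faster algorithms get larger batches, capped at 100 calls.
    double pilot = std::max(timeBatch(1), 1e-9);
    int calls = std::min(100, std::max(1, static_cast<int>(0.01 / pilot)));

    // Step 5: run batches for at least 5 s and at least 10 times.
    std::vector<double> batchAvgs;
    auto end = Clock::now() + std::chrono::seconds(5);
    while (Clock::now() < end || batchAvgs.size() < 10)
        batchAvgs.push_back(timeBatch(calls));

    // Step 6: drop the 5% lowest/highest batch averages, take the median.
    std::sort(batchAvgs.begin(), batchAvgs.end());
    size_t cut = batchAvgs.size() / 20;
    std::vector<double> trimmed(batchAvgs.begin() + cut, batchAvgs.end() - cut);
    double finalTime = trimmed[trimmed.size() / 2];

    std::printf("final run time: %.6f s per call\n", finalTime);
    return 0;
}
```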
The Benchmarking sample demonstrates how a simplified version of the above is implemented.
