Embedded · 20 March 2024

Running YOLOv8 at 15 FPS on a Jetson Nano: Lessons from Edge Inference

Edge inference is not about squeezing a model onto hardware. It's about rethinking the whole pipeline. Notes from optimizing my bachelor's thesis project.


My bachelor’s thesis was a portable obstacle detection device for visually impaired users. The goal was simple: run a real-time object detector on a small edge device, translate detections into directional audio cues, and make it work outdoors without needing any external infrastructure.

The hard constraint was the hardware. NVIDIA Jetson Nano, 4GB RAM, 128 CUDA cores. That’s what I had, and that’s what the device needed to run on.

Getting from “theoretically possible” to 15 FPS in production took about three months of iteration. Here’s what I learned.

Why 15 FPS Is the Real Target

The first question I had to answer wasn’t which model to use. It was: what frame rate actually matters for navigation assistance?

At 5 FPS, the system feels laggy. Objects appear and vanish before the audio cue can be processed and acted on. At 10 FPS, it’s usable but occasionally misleading when the user is walking at a normal pace. At 15 FPS, the detections feel continuous enough that users can build spatial awareness from them.

I settled on 15 FPS as the minimum viable frame rate before writing a single line of inference code. That target shaped every architectural decision that followed.

Why YOLOv8 Over Other Options

YOLOv8 Nano (the smallest variant) was the obvious starting point. YOLOv8n has 3.2M parameters and runs at well over 80 FPS on a modern desktop GPU. On a Jetson Nano with TensorRT, the realistic expectation was somewhere between 10 and 25 FPS, depending on input resolution and quantisation.

I evaluated three alternatives before committing:

  • MobileNet-SSD: Faster, but COCO accuracy was noticeably worse on small objects, which matters when detecting doorknobs, steps, and kerb edges
  • EfficientDet-Lite: Good accuracy, but the ONNX export path for Jetson at the time was poorly documented and I lost three days trying to get it working
  • TensorFlow Lite MobileDet: Simpler deployment, but no CUDA acceleration on Jetson without a custom delegate

YOLOv8n with TensorRT export gave me the best accuracy-to-speed tradeoff on this specific hardware. The Ultralytics export pipeline also just worked, which mattered more than I expected.

The Optimization Pipeline

Fresh out of training, the model runs at about 6-7 FPS on the Jetson Nano in FP32. That’s not usable.

Getting to 15 FPS required three changes:

TensorRT conversion. Exporting to TensorRT FP16 cut inference time roughly in half. The export command is straightforward with the Ultralytics CLI, but the first time you run it, it takes about 20 minutes to compile the engine. Build it once, cache it, reuse it.
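
For reference, here's roughly what that step looks like with the Ultralytics Python API (a minimal sketch; the weights path and device index are placeholders):

```python
from ultralytics import YOLO

# Load the trained PyTorch weights (path is a placeholder)
model = YOLO("yolov8n.pt")

# Export a TensorRT engine in FP16; this is the ~20-minute compile step.
# The resulting .engine file is cached next to the weights and can be reused.
# imgsz=416 matches the reduced input resolution discussed below.
model.export(format="engine", half=True, imgsz=416, device=0)
```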

Input resolution reduction. YOLOv8n defaults to 640x640. At 416x416, accuracy drops slightly but inference time drops significantly. For navigation assistance, where you care more about “is there something in front of me” than “what exactly is it,” 416x416 was the right tradeoff.
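
A rough way to measure the speed side of that tradeoff is to time the PyTorch model at both resolutions. (A TensorRT engine is compiled for a fixed input size, so comparing engines means building one per resolution.) A sketch, with a dummy frame standing in for camera input:

```python
import time

import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # dummy frame in place of a camera capture

for size in (640, 416):
    model.predict(frame, imgsz=size, verbose=False)  # warm-up run
    start = time.time()
    for _ in range(50):
        model.predict(frame, imgsz=size, verbose=False)
    fps = 50 / (time.time() - start)
    print(f"{size}x{size}: {fps:.1f} FPS")
```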

Skipping frames for audio processing. The audio feedback pipeline (converting detections to spatial audio cues) runs on the CPU. Keeping it synchronised with every inference frame created a bottleneck. Running inference on every frame but only updating audio every other frame freed up CPU headroom and pushed the effective throughput past 15 FPS.
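
In outline, the decoupling looked like the loop below. `update_spatial_audio` is a stand-in for the thesis audio pipeline, not a real library call:

```python
import cv2
from ultralytics import YOLO

def update_spatial_audio(boxes):
    """Stand-in for the CPU-bound detection-to-audio-cue pipeline."""
    pass

model = YOLO("yolov8n.engine")  # cached TensorRT engine from the export step
cap = cv2.VideoCapture(0)
frame_idx = 0

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model.predict(frame, verbose=False)  # inference runs on every frame
    if frame_idx % 2 == 0:
        update_spatial_audio(results[0].boxes)  # audio updates only every other frame
    frame_idx += 1

cap.release()
```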

The Part Nobody Warns You About: Thermal Throttling

The Jetson Nano runs hot. Without active cooling, it throttles after about 8-10 minutes of continuous inference, dropping from 15 FPS to 9-10 FPS.

This was a problem because the device is meant for walking, which means continuous use for 20-30 minutes at a time.

Fix: a small 5V fan mounted on the heatsink, controlled by a GPIO pin and a temperature threshold script. Above 65°C the fan kicks in, below 55°C it turns off. Simple, and it completely eliminated throttling across all test sessions.
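
The threshold script is only a few lines. A sketch along these lines, assuming the Jetson.GPIO library and the standard sysfs thermal zone; the pin number is a placeholder for whatever drives the fan transistor:

```python
import time

import Jetson.GPIO as GPIO  # NVIDIA's GPIO library for Jetson boards

FAN_PIN = 12                      # placeholder board pin driving the fan transistor
ON_TEMP, OFF_TEMP = 65.0, 55.0    # hysteresis thresholds in degrees C

def cpu_temp_c():
    # Jetson exposes the temperature in millidegrees via sysfs
    with open("/sys/devices/virtual/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0

GPIO.setmode(GPIO.BOARD)
GPIO.setup(FAN_PIN, GPIO.OUT, initial=GPIO.LOW)

fan_on = False
try:
    while True:
        temp = cpu_temp_c()
        if temp >= ON_TEMP and not fan_on:
            GPIO.output(FAN_PIN, GPIO.HIGH)   # above 65°C: fan on
            fan_on = True
        elif temp <= OFF_TEMP and fan_on:
            GPIO.output(FAN_PIN, GPIO.LOW)    # back below 55°C: fan off
            fan_on = False
        time.sleep(5)
finally:
    GPIO.cleanup()
```

The hysteresis gap between the two thresholds is what keeps the fan from rapidly cycling on and off around a single trigger temperature.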

This took me two weeks to diagnose because I was benchmarking on a desk in a cool room. Field testing revealed the issue immediately.

What the Accuracy Numbers Actually Mean

On the COCO validation set, YOLOv8n at 416x416 with FP16 quantisation scores around 46 mAP. That number means very little for this application.
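
For anyone wanting to reproduce a number like that, the Ultralytics validation API makes the measurement a one-liner. A sketch, assuming the stock COCO dataset config:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.engine")  # or the .pt weights; an engine must be built at imgsz=416
metrics = model.val(data="coco.yaml", imgsz=416, half=True)
print(metrics.box.map50, metrics.box.map)  # mAP@0.5 and mAP@0.5:0.95
```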

What mattered in practice:

  • Large obstacles (people, walls, furniture, parked cars): near-perfect detection rate at any useful distance
  • Medium objects (chairs, bins, bollards): reliable detection within 3 metres
  • Small hazards (kerb edges, steps, cables on the floor): inconsistent, especially with motion blur

The device works well for indoor navigation and open outdoor environments. It doesn’t reliably catch every small ground-level hazard, which is an honest limitation I included in the thesis.

A depth camera (RealSense or similar) would solve the distance estimation problem and likely improve small-object relevance. It was out of scope for the hardware budget, but it’s the obvious next iteration.

What I’d Do Differently

Two things.

First: test in real environments from day one, not a controlled lab. The throttling issue, lighting variation, and motion blur all appeared immediately in field testing; catching them early would have saved weeks.

Second: the audio feedback design deserved more attention than the model optimisation. Users adapted to the model’s limitations quickly. What they actually struggled with was interpreting overlapping audio cues when multiple objects were detected simultaneously. Good ML on poor UX still produces a poor device.

The system ran reliably at 15 FPS for the final thesis demonstration and user testing sessions. A solid result for the hardware constraints, and a useful lesson in what “good enough” means when the user’s safety depends on it.

#YOLOv8 #JetsonNano #EdgeML #ObjectDetection #TensorFlowLite #Python #EmbeddedAI
