How e-con Systems Empowers Autonomous Robots with ROS 2 Perception-Driven Object Tracking

True robot autonomy means a robot can perceive and interpret its environment, react to what it finds, and act without human intervention. That capability is what makes object tracking and real-time response possible in dynamic environments.

At CES 2025, e-con Systems presented a ROS 2-based object tracking and navigation demo, where a rover receives a command, finds a real-world object, navigates, aligns, and delivers a captured image.

Watch the full demo on YouTube

Using a ROS 2 Humble foundation, this project was developed and demonstrated on our mobile rover powered by e-con Systems’ CV72eSOM and paired with an AR0341 camera and RPLIDAR S2E. It brought together real-time object detection, LiDAR-camera sensor fusion, waypoint navigation, and precise image capture into a single, cohesive autonomy stack.

The entire system runs on embedded hardware, without any off-board compute. Moreover, the captured image can be forwarded to an LLM for scene understanding or used by any downstream application.

In this blog, you’ll learn about the importance of autonomous object tracking, how e-con Systems’ ROS 2-based object tracking and navigation system works, and the key challenges we solved along the way.

What Are the Practical Benefits of Autonomous Object Tracking?

Autonomous object tracking delivers practical benefits, not just technical ones. Here is what it offers robot deployment teams.

Natural language command interface

Operators issue simple, human-readable commands such as going to a location, finding an object, or describing a scene, without needing to understand the robot’s underlying navigation or perception stack. This dramatically lowers the barrier to operating autonomous robots in real environments, putting powerful capabilities into the hands of anyone on the floor.

Perception-guided navigation

The robot actively locates objects using live sensor data instead of following static coordinates. This lets it adapt to dynamic spaces where fixed-waypoint methods fall short.

Image capture for downstream integration

Upon reaching and aligning with a detected object, the robot captures an image and delivers it directly to the specified destination. Downstream applications can then use the image for scene understanding or inspection.

Embedded-ready on resource-constrained hardware

The stack runs on the CV72eSOM’s embedded hardware and uses its NPU for object detection. No external GPU or compute is required, which makes it well suited to industrial and field deployments.

Scalable command architecture

A structured command protocol makes it straightforward to extend the system to new locations, object categories, and mission types. The robot’s capabilities can grow alongside operational requirements, making it a long-term platform rather than a point solution.

What’s Inside e-con Systems’ Object Tracking & Navigation Solution?

The autonomous object tracking system is built around five core components that work in concert, namely:

  • Structured command interface
  • YOLO-World object detection
  • LiDAR-camera sensor fusion
  • Goal calculation and waypoint navigation
  • Image capture and delivery

Every one of these components runs as a modular ROS 2 node, independently configurable and replaceable, communicating through well-defined topics and services.

The key technologies powering this system are:

  • ROS 2 Humble
  • Nav2 Navigation Stack with AMCL Localization
  • YOLO-World Object Detection
  • RPLIDAR S2E
  • Ambarella CV72 NPU for on-device object detection inference
  • e-con Systems CV72eSOM (Ambarella CV72)
  • AR0341 Camera
  • Custom ROS 2 Python and C++ Nodes

ROS 2 Object Tracking and Navigation: Understanding the Workflow

Commanding the robot

The system is driven by a structured command protocol with three types that cover the full mission lifecycle.

The Go To command sends the robot to a named location using a pre-mapped waypoint. The Find command instructs the robot to search for a named object, navigate to it, and align itself in front of it. The Describe command triggers an image capture at the target location and delivers it to a specified path, ready for an LLM or any downstream system.

An emergency stop signal can halt all motion instantly and release it again when safe to continue, giving operators a reliable safety interrupt without disrupting mission state.
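
To make the three command types concrete, here is a minimal sketch of how such a protocol could be parsed. The text format ("find chair", "describe /path/to/image.jpg"), the class names, and the `parse_command` helper are illustrative assumptions, not the demo's actual wire format or code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CommandType(Enum):
    GO_TO = "go_to"
    FIND = "find"
    DESCRIBE = "describe"


@dataclass(frozen=True)
class Command:
    kind: CommandType
    target: str                        # location name or object name
    output_path: Optional[str] = None  # only set for DESCRIBE


def parse_command(text: str) -> Command:
    """Parse a line of the hypothetical form '<command> <argument>'."""
    parts = text.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"expected '<command> <argument>', got {text!r}")
    kind = CommandType(parts[0].lower())  # raises ValueError if unknown
    argument = parts[1].strip()
    if kind is CommandType.DESCRIBE:
        # Assumption: DESCRIBE carries the path the image is delivered to.
        return Command(kind, target="current_view", output_path=argument)
    return Command(kind, target=argument)
```

Keeping the command types in a single enum is one way to make the protocol easy to extend: a new mission type becomes a new enum member plus a handler.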

Searching for and detecting the target

When a Find command is issued, the rover navigates to a central location and performs a rotational scan, running detection at each heading. Once the target is detected, it switches to approach mode and plans a path to the object.

Object detection runs entirely on the CV72eSOM using its NPU with a YOLO-World model, with no GPU or off-board inference involved. The target object can be updated at runtime without restarting the model, so the robot can be redirected to a new search target mid-mission without any downtime.
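
One way to plan the rotational scan is to pick headings so that neighboring camera views overlap slightly and together cover a full turn. The sketch below assumes this approach; the field-of-view and overlap values used with it are placeholders, since the demo's actual scan step is not stated here.

```python
import math
from typing import List


def scan_headings(hfov_deg: float, overlap_deg: float = 10.0) -> List[float]:
    """Evenly spaced headings (degrees) whose views cover 360 degrees.

    hfov_deg is the camera's horizontal field of view and overlap_deg is
    the minimum overlap between neighboring views. Both are assumptions.
    """
    step = hfov_deg - overlap_deg
    if step <= 0:
        raise ValueError("overlap must be smaller than the field of view")
    count = math.ceil(360.0 / step)
    return [i * 360.0 / count for i in range(count)]
```

For example, a hypothetical 70 degree field of view with 10 degrees of overlap yields six headings, 60 degrees apart.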

Knowing where the object is

Detecting an object in the camera is only half the problem. The robot also needs to know how far away it is and in which direction to move. Rather than relying on a depth camera, the system fuses the camera bounding box with data from the 2D LiDAR already on board for navigation.

The object’s image position is mapped to a corresponding LiDAR angle, and distance is extracted directly from the scan. By aggregating multiple readings, the system achieves a consistently stable position estimate and commits confidently to navigation goals.

This filtering step makes approach behavior more reliable in environments with variable lighting or visual clutter.
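
The mapping from image position to LiDAR distance can be sketched as follows. This is a simplified illustration that assumes a pinhole camera model, a camera and LiDAR that share the same forward axis, and scan fields named as in the ROS 2 `sensor_msgs/LaserScan` message (`angle_min`, `angle_increment`, `ranges`). The function names are hypothetical.

```python
import math
from statistics import median
from typing import Optional, Sequence


def pixel_to_bearing(u_px: float, image_width: int, hfov_rad: float) -> float:
    """Bearing (rad) of image column u_px; positive is left of the optical axis."""
    focal_px = (image_width / 2.0) / math.tan(hfov_rad / 2.0)
    return math.atan2(image_width / 2.0 - u_px, focal_px)


def range_at_bearing(bearing: float, ranges: Sequence[float],
                     angle_min: float, angle_increment: float,
                     window: int = 2) -> Optional[float]:
    """Median of the valid ranges in a small window around the closest beam."""
    index = round((bearing - angle_min) / angle_increment)
    if not 0 <= index < len(ranges):
        return None  # bearing lies outside the scan
    lo = max(index - window, 0)
    hi = min(index + window + 1, len(ranges))
    # Drop beams with no return (inf/NaN) or non-positive readings.
    valid = [r for r in ranges[lo:hi] if math.isfinite(r) and r > 0.0]
    return median(valid) if valid else None
```

In use, the bounding box center column goes into `pixel_to_bearing`, and the resulting bearing is passed to `range_at_bearing` together with the latest scan. Reading a small window of beams rather than a single one is a design choice that tolerates an occasional missing return.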

Navigating, aligning, and capturing

After the object is located, the robot computes an exact goal to position itself directly in front, maintaining a safe distance. Nav2 executes the route. On arrival, the robot performs visual alignment, centering the object and making precise corrections before capturing the image.

This two-phase approach, coarse navigation followed by fine visual alignment, mirrors the strategy used in the autonomous docking system and delivers reliable results without requiring precision path tuning from Nav2.
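
As an illustration of the coarse navigation phase, the goal pose can be derived from the rover's pose and the fused bearing and distance. The sketch below is an assumption about how such a goal might be computed; the `standoff` default is a placeholder, and in a real system the result would typically be sent to Nav2 as a goal pose.

```python
import math
from typing import Tuple


def approach_goal(robot_x: float, robot_y: float, robot_yaw: float,
                  bearing: float, distance: float,
                  standoff: float = 0.8) -> Tuple[float, float, float]:
    """Goal (x, y, yaw) in the map frame, 'standoff' metres short of the object.

    bearing is the object's direction relative to the rover's heading (rad),
    distance is the LiDAR range to the object (m).
    """
    heading = robot_yaw + bearing
    travel = max(distance - standoff, 0.0)  # never drive past the object
    goal_x = robot_x + travel * math.cos(heading)
    goal_y = robot_y + travel * math.sin(heading)
    # Face the object on arrival; wrap the yaw to (-pi, pi].
    goal_yaw = math.atan2(math.sin(heading), math.cos(heading))
    return goal_x, goal_y, goal_yaw
```

The goal only needs to be approximately right, because the visual alignment phase corrects the final heading.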

Key Challenges and How We Solved Them

Live autonomy demos expose the parts of a system that are easy to miss in a lab. For this CES 2025 demo, the rover had to run perception, navigation, fusion, safety handling, and image capture on the embedded platform while reacting to real exhibition conditions. The main engineering work came from making those steps run together in a predictable sequence.

1) Running full perception and navigation on a single embedded device

The challenge

Object detection, ROS 2 navigation, and LiDAR fusion had to run at the same time on the CV72eSOM. If one process consumed too much CPU or memory, the rest of the autonomy stack would suffer. The system needed enough headroom for perception, navigation, and control to keep running together during the mission.

Our solution

The rover’s hardware nodes were optimized to reduce CPU usage. The YOLO-World model was tuned for the CV72 NPU, balancing frame rate and detection accuracy within device limits. A socket bridge decoupled ML inference from the ROS 2 stack, which kept perception separate from navigation logic and reduced resource contention.
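
A socket bridge of this kind can be as simple as newline-delimited JSON sent from the inference process to a ROS 2 node. The sketch below shows that idea only; the message fields and function names are assumptions, since the demo's actual wire format is not described here.

```python
import json
import socket
from typing import Iterator, Sequence


def send_detection(sock: socket.socket, label: str,
                   box: Sequence[int], score: float) -> None:
    """Inference side: send one detection as a single JSON line."""
    message = {"label": label, "box": list(box), "score": score}
    sock.sendall((json.dumps(message) + "\n").encode("utf-8"))


def read_detections(sock: socket.socket) -> Iterator[dict]:
    """ROS 2 side: yield detections until the sender closes the socket."""
    buffer = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return  # peer closed the connection
        buffer += chunk
        # A chunk may hold several messages or a partial one.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line:
                yield json.loads(line)
```

Because the two sides share only a byte stream, the detector can be restarted, profiled, or replaced without touching the navigation nodes.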

Business impact
  • Full object detection, ROS 2 navigation, and LiDAR fusion run on the device
  • Field operation becomes simpler because the rover runs independent of off-board compute
  • Infrastructure cost is lower in locations where cloud connectivity may vary

2) Accurate depth estimation using existing sensors

The challenge

The camera detects the object in the image, but the rover also needs distance and direction before it can move toward the target. A monocular camera alone gives image position rather than dependable depth. Adding a depth camera or stereo rig would increase hardware cost, weight, and integration work.

Our solution

The system fuses the AR0341 camera with the 2D LiDAR already on board for navigation. The detected object’s position in the image is mapped to the corresponding LiDAR angle, and the distance is read from the scan at that angle. Before tuning the fusion logic, the team characterized the LiDAR’s angular reference, scan direction, and offset relative to the camera.

Business impact
  • Object localization uses sensors already present on the rover
  • Hardware cost, weight, and mechanical integration work remain lower
  • The fusion setup can be configured for different camera and LiDAR mounting arrangements

3) Stable navigation goals despite noisy detection data

The challenge

A single detection frame can be affected by motion blur, changing light, visual clutter, or a short occlusion. If the rover acts on that frame immediately, it may compute an unstable goal and approach the wrong point.

Our solution

The system collects multiple bounding box readings before committing to a navigation goal. It then derives a stable position estimate from those readings. This filtering step absorbs short-lived detection noise and gives Nav2 a more dependable goal for approach.
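
The multi-sample idea can be sketched with a median, which is one common way to suppress short-lived outliers. The use of a median, the class name, and the sample count below are assumptions; the exact estimator used in the demo is not specified here.

```python
from statistics import median
from typing import List, Optional, Tuple


class PositionEstimator:
    """Collects (bearing, distance) samples and reports a median estimate."""

    def __init__(self, min_samples: int = 5) -> None:
        self.min_samples = min_samples
        self._samples: List[Tuple[float, float]] = []

    def add(self, bearing: float, distance: float) -> None:
        self._samples.append((bearing, distance))

    def estimate(self) -> Optional[Tuple[float, float]]:
        """Return (bearing, distance), or None until enough samples exist."""
        if len(self._samples) < self.min_samples:
            return None
        bearings = [b for b, _ in self._samples]
        distances = [d for _, d in self._samples]
        return median(bearings), median(distances)
```

With five samples, a single frame that reports a wildly wrong distance or bearing does not move the estimate, so the goal handed to Nav2 stays put.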

Business impact
  • Approach behavior remains consistent in variable lighting and cluttered environments
  • Failed navigation attempts caused by detection jitter are reduced
  • Operators see more predictable rover behavior during search and approach

4) Final visual alignment before image capture

The challenge

Nav2 brings the rover to the goal area, but the arrival heading may still place the object off-center in the camera frame. For the image capture step, the rover needs to face the object properly before capturing the target.

Our solution

After Nav2 completes the path, a visual alignment loop takes over. The rover checks where the detected object appears in the camera frame and makes small heading corrections until the object is centered. If the object cannot be centered within one full rotation, the maneuver is aborted so the rover does not rotate indefinitely.
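
A bounded alignment loop of this kind can be sketched as a simple proportional controller with a rotation budget. The callbacks, gain, tolerance, and step size below are hypothetical values chosen for illustration, not the demo's tuning.

```python
import math
from typing import Callable, Optional


def align_to_target(get_bearing: Callable[[], Optional[float]],
                    rotate_by: Callable[[float], None],
                    tolerance: float = 0.03, gain: float = 0.5,
                    max_step: float = 0.3,
                    max_rotation: float = 2.0 * math.pi) -> bool:
    """Rotate until the target is centered; give up after one full turn.

    get_bearing returns the target's bearing from the image center (rad),
    or None when the target is not detected. rotate_by turns the rover.
    """
    rotated = 0.0
    while rotated < max_rotation:
        bearing = get_bearing()
        if bearing is None:
            step = max_step  # target not visible: keep turning to search
        elif abs(bearing) <= tolerance:
            return True      # centered, ready to capture
        else:
            step = max(-max_step, min(max_step, gain * bearing))
        rotate_by(step)
        rotated += abs(step)
    return False             # rotation budget exhausted
```

Tracking the total rotation, rather than the number of iterations, is what ties the abort condition to "one full rotation" regardless of step size.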

Business impact
  • Target images are framed for downstream use
  • Final alignment is handled through camera feedback rather than tighter Nav2 heading tuning
  • The rover avoids endless rotation when alignment fails

5) Stop and resume during an active navigation goal

The challenge

In real deployments, operators may need to pause a rover mid-mission because a person enters its path, a safety concern appears, or mission priority changes. The system has to cancel the active Nav2 goal, preserve the mission context, and resume from the rover’s current position. A weak stop-resume flow can leave the rover stuck, confused about state, or repeating completed work.

Our solution

Stop and resume are handled as part of the command lifecycle. When a stop signal arrives, the current Nav2 goal is canceled in a controlled way, and the mission state is preserved. When the resume signal arrives, the goal is reinstated from the rover’s current position, so the mission continues from the right point.
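
The stop and resume lifecycle can be modeled as a small state machine that cancels the active goal but keeps it in memory. The sketch below is an illustration under that assumption; `send_goal` and `cancel_goal` are hypothetical callbacks standing in for whatever talks to Nav2.

```python
from enum import Enum, auto
from typing import Callable, Optional, Tuple

Goal = Tuple[float, float, float]  # x, y, yaw in the map frame


class MissionState(Enum):
    IDLE = auto()
    NAVIGATING = auto()
    PAUSED = auto()
    DONE = auto()


class Mission:
    def __init__(self, send_goal: Callable[[Goal], None],
                 cancel_goal: Callable[[], None]) -> None:
        self._send_goal = send_goal
        self._cancel_goal = cancel_goal
        self.state = MissionState.IDLE
        self.goal: Optional[Goal] = None

    def start(self, goal: Goal) -> None:
        self.goal = goal
        self._send_goal(goal)
        self.state = MissionState.NAVIGATING

    def stop(self) -> None:
        if self.state is MissionState.NAVIGATING:
            self._cancel_goal()
            self.state = MissionState.PAUSED  # self.goal is kept on purpose

    def resume(self) -> None:
        if self.state is MissionState.PAUSED and self.goal is not None:
            self._send_goal(self.goal)  # re-planned from the current pose
            self.state = MissionState.NAVIGATING

    def on_goal_reached(self) -> None:
        if self.state is MissionState.NAVIGATING:
            self.state = MissionState.DONE
```

Because the stored goal is re-sent on resume, the planner starts from wherever the rover currently is, and a stop issued after the mission is done has no effect.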

Business impact
  • Operators can intervene at any point during a mission
  • Mission progress is preserved through stop and resume
  • Teams can respond to safety events while the broader mission continues

6) LiDAR angular reference and sensor calibration

The challenge

LiDAR-camera fusion depends on the correct relationship between the camera view and the LiDAR scan. If the LiDAR reference direction, scan rotation, or angular offset is wrong, depth readings will be mapped to the wrong part of the image.

Our solution

The LiDAR’s reference direction, scan rotation, and angular offset relative to the camera were characterized on the actual hardware. The camera’s real field of view was also validated on the physical unit, since lens characteristics in the field can differ from datasheet values in ways that affect depth accuracy.
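
Two of these checks can be expressed with short formulas. The sketch below assumes a pinhole camera model and a flat target of known width placed at a known distance, centered and facing the camera. It is an illustrative procedure with hypothetical function names, not the calibration routine used for the demo.

```python
import math


def measured_hfov(image_width_px: int, target_width_px: float,
                  target_width_m: float, distance_m: float) -> float:
    """Horizontal field of view (rad) from a target of known size."""
    # Pinhole model: width in pixels = focal_px * width_m / distance_m
    focal_px = target_width_px * distance_m / target_width_m
    return 2.0 * math.atan((image_width_px / 2.0) / focal_px)


def angle_offset(camera_bearing: float, lidar_bearing: float) -> float:
    """Offset to add to a camera bearing to get the LiDAR angle, in (-pi, pi]."""
    diff = lidar_bearing - camera_bearing
    return math.atan2(math.sin(diff), math.cos(diff))
```

For example, a 1 m wide target at 1 m that spans 640 of 1280 pixels implies a 90 degree field of view. The offset is obtained by observing the same object in both sensors and comparing the two bearings.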

Business impact
  • Fusion accuracy is established before field operation
  • Systematic depth errors are reduced before deployment
  • The calibration process can be repeated across different hardware units

How e-con Systems’ Demo Has Elevated Autonomous Robot Perception

This ROS 2 object tracking and navigation system demonstrates what becomes possible when perception, navigation, and precise image capture are brought together on a single embedded platform. The rover takes a simple command, finds a real-world object, navigates to it, aligns itself, and delivers a well-framed image of the target, all autonomously and on-device, with no operator intervention once the command is issued.

Building on the same foundation as the autonomous docking system, this project pushes the autonomy stack from energy management into active world perception. Robots can now not only manage themselves but also actively understand and report on the environments they operate in, a meaningful step toward truly independent autonomous agents.

Furthermore, this demo was showcased in a live exhibition environment with dynamic lighting, dense crowds, uneven flooring, and other real-world conditions that expose exactly the kinds of failure modes a lab cannot simulate.

Every design decision, including the isolated detection architecture, multi-sample position estimation, and visual alignment approach, was guided by the need for reliable performance in real deployments rather than controlled lab conditions.

Want to know more?

Watch the complete demo here

Want to Learn More About Autonomous Robot Perception?

If you would like to know more about this demo or explore how this technology could apply to your use case, please fill out the form below, and our team will be in touch.

https://www.e-consystems.com/Request-form.asp

e-con Systems: A Leader of Embedded Vision Innovation

Since 2003, e-con Systems has been designing, developing, and manufacturing custom OEM and ODM camera solutions. We have been at the forefront of vision innovation, serving emerging needs in industries such as autonomous mobility.

Please use the Camera Selector to view our camera solutions based on your unique application needs.

You can also write to us at camerasolutions@e-consystems.com if you need expert insights on the type of vision solution that your application deserves.

FAQs

What is perception-driven object tracking in autonomous robots?

Perception-driven object tracking enables a robot to find a target object, navigate toward it, align itself, and capture an image using live sensor data and onboard processing.

How does e-con Systems’ ROS 2 object tracking system work?

The system uses Go To, Find, and Describe commands. The robot moves to a location, searches for the object, estimates its position, navigates toward it, aligns with it, and captures the target image.

Which technologies power the object tracking and navigation workflow?

The system uses ROS 2 Humble, Nav2 with AMCL Localization, YOLO-World object detection, RPLIDAR S2E, Ambarella CV72 NPU, CV72eSOM, an AR0341 camera, and custom ROS 2 nodes.

How does the robot estimate object distance?

It maps the detected object’s camera position to a matching LiDAR angle and reads the distance from the LiDAR scan. Multiple readings help create a stable navigation goal.

What happens after the robot captures the image?

The image is delivered to a specified path, where it can be used by an LLM for scene understanding, logged for inspection, or sent to another downstream application.
