Labeling Toy Aircraft in 3D space using an ONNX model and Windows ML on a HoloLens


Back in November I wrote about a POC that recognizes and labels objects in 3D space, using a Custom Vision Object Recognition project. Back then, as I wrote in my previous post, you could only use this kind of project by uploading the images you wanted analyzed to the model in the cloud. In the meantime, Custom Vision Object Recognition models can be downloaded in various formats - and one of them is ONNX, which can be used in Windows ML. And thus, the model can run on a HoloLens to do AI-powered object recognition.

Which is exactly what I am going to show you. In essence, the app still does the same as in November, but now it does not use the cloud anymore - the model is trained and created in the cloud, but can be executed on an edge device (in this case a HoloLens).

The main actors

These are basically still the same:

  • CameraCapture watches for an air tap, and takes a picture of where you look
  • ObjectRecognizer receives the picture and feeds it to the 'AI', which is now a local process
  • ObjectLabeler shoots for the spatial map and places labels.

As I said - the app is basically still the same as the previous version, only now it uses a local ONNX file.

Setting up the project

Basically you create a standard empty HoloLens project with the MRTK and configure it as you always do. Be sure to enable Camera capabilities, of course.

Then you simply download the ONNX file from your model. The procedure is described in my previous post. Then you need to place the model file (model.onnx) into a folder "StreamingAssets" in the Unity project. This procedure is described in more detail in this post by Sebastian Bovo of the AppConsult team. He uses a different kind of model, but the workflow is exactly the same.

Be sure to adapt the ObjectDetection.cs file as I described in my previous post.

Functional changes to the original project

Like I said, the differences between this project and the online version are for the most part inconsequential. Functionally only one thing changed: instead of showing the picture that it took prior to starting the (online) model, the app now sounds a click when you air tap to start the recognition process, and then sounds either a pringg or a buzz, indicating that the recognition process succeeded (i.e. found at least one toy aircraft) or failed (i.e. did not find a toy aircraft), respectively.

Technical changes to the original project

  • The ObjectDetection file, downloaded from Custom Vision and adapted for use in Unity, has been added to the project
  • CustomVisonResult, containing all the JSON serialization code to deal with the online model, is deleted. The ObjectDetection file contains all classes we need
  • In all classes I have adapted the namespace from "CustomVison" *cough* to "CustomVision" (sorry, typo ;) ).
  • The ObjectDetection file uses the root class PredictionModel instead of Prediction, so that has been adapted in all files that use it. The affected classes are:
    • ObjectRecognitionResultMessage
    • ObjectLabeler
    • ObjectRecognizer
    • PredictionExtensions
  • Both CameraCapture and ObjectLabeler have sound properties and play sound on appropriate events
  • ObjectRecognizer has been extensively changed to use the local model. This I will describe in detail

Object recognition - the Windows ML way

The first part of the ObjectRecognizer initializes the model

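using System.Collections.Generic;
using System.IO;
using System.Linq;
using UnityEngine;
#if UNITY_WSA && !UNITY_EDITOR
// These namespaces are only needed, and only guaranteed to exist, in the UWP build
using System.Threading.Tasks;
using Windows.Graphics.Imaging;
using Windows.Media;
#endif

public class ObjectRecognizer : MonoBehaviour
{
    private ObjectDetection _objectDetection;

    private bool _isInitialized;

    private void Start()
    {
        // The opening of this call is missing from the listing;
        // Messenger.Instance.AddListener<PhotoCaptureMessage> is an assumption
        Messenger.Instance.AddListener<PhotoCaptureMessage>(
          p=> RecognizeObjects(p.Image, p.CameraResolution, p.CameraTransform));
#if UNITY_WSA && !UNITY_EDITOR
        _objectDetection = new ObjectDetection(new[]{"aircraft"}, 20, 0.5f, 0.3f);
        Debug.Log("Initializing...");
        _objectDetection.Init("ms-appx:///Data/StreamingAssets/model.onnx").ContinueWith(p =>
        {
            // Start cannot await Init, so the continuation marks the model as ready
            Debug.Log("Initializing ready");
            _isInitialized = true;
        });
#endif
    }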

Notice here, too, the liberal use of preprocessor directives, just like in my previous post. In the Start method we create a model from the ONNX file that's in StreamingAssets, using the Init method I added to ObjectDetection. Since we can't make the Start method awaitable, the ContinueWith needs to finish the initialization.
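The ContinueWith pattern can also be looked at in isolation. What follows is a minimal sketch in plain C#, not the actual ObjectRecognizer: ModelHost and InitModelAsync are hypothetical names, and Task.Delay merely stands in for loading the ONNX file. The sketch also assumes the continuation may run on a thread-pool thread, which is why the flag is marked volatile, and it returns the Task from Start only so that the caller can observe completion.

```csharp
using System;
using System.Threading.Tasks;

public class ModelHost
{
    // Written by the continuation, read by whoever checks for readiness.
    private volatile bool _isInitialized;

    public bool IsInitialized => _isInitialized;

    // Hypothetical stand-in for an asynchronous model load.
    private static Task InitModelAsync() => Task.Delay(50);

    // Start itself is not async; the continuation finishes the setup.
    public Task Start()
    {
        return InitModelAsync().ContinueWith(t =>
        {
            if (t.IsFaulted)
            {
                // A failed load leaves the flag false, but at least gets reported.
                Console.WriteLine("Initialization failed: " +
                                  t.Exception?.GetBaseException().Message);
                return;
            }
            _isInitialized = true;
        });
    }
}
```

Checking the task for a fault inside the continuation is an addition of this sketch. Without it, a model that fails to load would simply never report ready, and the reason would go unnoticed.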

As you can see, the arrival of a PhotoCapture message from the CameraCapture behavior fires off RecognizeObjects, just like in the previous app.

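public virtual void RecognizeObjects(IList<byte> image, 
                                     Resolution cameraResolution, 
                                     Transform cameraTransform)
{
    if (_isInitialized)
    {
#if UNITY_WSA && !UNITY_EDITOR
        // Fire and forget: the returned Task is intentionally not awaited
        RecognizeObjectsAsync(image, cameraResolution, cameraTransform);
#endif
    }
}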


But unlike the previous app, it does not fire off a Unity coroutine, but a private async method:

#if UNITY_WSA && !UNITY_EDITOR
private async Task RecognizeObjectsAsync(IList<byte> image, Resolution cameraResolution, Transform cameraTransform)
{
    using (var stream = new MemoryStream(image.ToArray()))
    {
        // Decode the captured photo into a BGRA8 software bitmap
        var decoder = await BitmapDecoder.CreateAsync(stream.AsRandomAccessStream());
        var sfbmp = await decoder.GetSoftwareBitmapAsync();
        sfbmp = SoftwareBitmap.Convert(sfbmp, BitmapPixelFormat.Bgra8, 
                                       BitmapAlphaMode.Premultiplied);
        // Wrap the bitmap in a VideoFrame, which is what PredictImageAsync accepts
        var picture = VideoFrame.CreateWithSoftwareBitmap(sfbmp);
        var prediction = await _objectDetection.PredictImageAsync(picture);
        ProcessPredictions(prediction, cameraResolution, cameraTransform);
    }
}
#endif

About 70% of this method converts the raw bytes of the image into something the ObjectDetection class's PredictImageAsync can handle. I owe many thanks to this post in the Unity forums and this post on the MSDN blog site by my friend Matteo Pagani for helping me piece this together. The conversion is needed because I am a stubborn idiot - I want to take a picture instead of using a frame from the video recorder, but then you have to convert the photo to a video frame.

The second-to-last line of code actually calls PredictImageAsync - essentially a black box for the app - and then the predictions are processed more or less like before:

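private void ProcessPredictions(IList<PredictionModel>predictions, 
                                Resolution cameraResolution, Transform cameraTransform)
{
    var acceptablePredications = predictions.Where(p => p.Probability >= 0.7).ToList();
    // The call wrapping this message is missing from the listing;
    // Messenger.Instance.Broadcast is an assumption
    Messenger.Instance.Broadcast(
       new ObjectRecognitionResultMessage(acceptablePredications, cameraResolution, 
                                          cameraTransform));
}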

Everything with a probability lower than 70% is culled, and the rest is sent along to the messenger, where the ObjectLabeler picks it up again and starts shooting for the Spatial Map in the center of all rectangles in the predictions to find out where the actual object may be in space.
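The culling and the "center of the rectangle" step can be tried outside Unity as well. This is only a sketch under assumptions: Detection, Box, Point2 and DetectionFilter are hypothetical stand-ins, not the classes from ObjectDetection.cs, and the box is assumed to be expressed as left, top, width and height in coordinates relative to the picture.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the prediction classes used in the app.
public class Box { public float Left, Top, Width, Height; }

public class Detection
{
    public string TagName;
    public float Probability;
    public Box BoundingBox;
}

public struct Point2 { public float X, Y; }

public static class DetectionFilter
{
    // Keeps only detections at or above the threshold (70% by default).
    public static List<Detection> Cull(IEnumerable<Detection> detections,
                                       float threshold = 0.7f)
    {
        return detections.Where(d => d.Probability >= threshold).ToList();
    }

    // Center of the box, in the same coordinate system as the box itself.
    public static Point2 Center(Box box)
    {
        return new Point2
        {
            X = box.Left + box.Width / 2f,
            Y = box.Top + box.Height / 2f
        };
    }
}
```

The threshold is a float here so that it is compared with the float probability without a conversion to double. The center point is what a labeler would then translate into a direction to cast into the spatial map.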


I have had some fun experimenting with this, and the conclusions are clear:

  • For a simple model like this, even with a fast internet connection, using a local model instead of a cloud-based model is way faster
  • Yet - the hit rate is notably lower - the cloud model is definitely more 'intelligent'. I suppose improvements to Windows ML will fix that in the near future. Also, the AI coprocessor in the next release of HoloLens will undoubtedly contribute to both speed and accuracy.
  • With 74 pictures of a few model airplanes, almost all on the same background, my model is not nearly well enough equipped to recognize random planes in random environments. This highlights a bit the crux of machine learning - you will need data, more data and even more than that.
  • This method of training models in the cloud and executing them locally provides exciting new - and very usable - features for Mixed Reality devices.

Using Windows ML in edge devices is not hard, and on a HoloLens it is only marginally harder because you have to circumvent a few differences between full UWP and Unity, and be aware of differences between C# 4.0 and C# 7.0. This can easily be addressed, as I showed before.

The complete project can be found here (branch WinML) - since it now operates without a cloud model it is actually runnable by everyone. I wonder if you can actually get it to recognize model planes you may have around. I've got it to recognize model planes up to about 1.5 meters.