How do we cope with the huge demands on bandwidth and device computational power that high-quality virtual reality will bring as video resolutions grow ever higher? Researchers from Ericsson Research, the Royal Institute of Technology (KTH) in Stockholm, Sweden, and Tobii have demonstrated how bandwidth and device computational requirements can be dramatically reduced by combining smart eye tracking with low-latency 5G networks and distributed cloud. This allows network service providers to deliver excellent experiences at a fraction of the network burden.
What are the barriers to overcome if VR is to really take off and reach mass market adoption?
Here are a few things from our wish list:
1. For a truly immersive experience, much higher resolution will be needed than what is currently available in even high-end VR devices. At the same time, the cost of devices must go down significantly.
2. We need to cut the cord from the tethered computer required today for high-quality VR graphics processing and instead use energy-efficient battery-powered lightweight wearables.
3. It is absolutely necessary to find cost-efficient ways for network service providers to stream high-quality VR content to massive numbers of mobile users.
But how can all this be realized? One obvious way to meet all three challenges is to utilize the fact that high-resolution video quality is only needed in the user’s field of view, known as the viewport. Why waste network bandwidth and device compute resources on the parts that the users are less likely to see? Viewport-aware streaming and rendering based on users’ head movements is a popular and promising track currently being explored.
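To make the viewport idea concrete, here is a minimal sketch (our own illustration, not code from the prototype) of how a streaming client might pick which columns of an equirectangular 360-degree video overlap the viewport; the 8-column tile grid and 90-degree horizontal field of view are assumed values:

```python
def viewport_tiles(yaw_deg, h_fov_deg=90, num_cols=8):
    """Return the tile-column indices covered by the viewport.

    The 360-degree panorama is split into `num_cols` equal columns;
    a column is kept if any part of it overlaps the horizontal FOV
    centered on the head yaw.
    """
    col_width = 360.0 / num_cols
    half_fov = h_fov_deg / 2.0
    cols = []
    for c in range(num_cols):
        center = c * col_width + col_width / 2.0
        # Smallest angular distance between column center and head yaw,
        # wrapping around the 360-degree seam.
        diff = abs((center - yaw_deg + 180.0) % 360.0 - 180.0)
        if diff <= half_fov + col_width / 2.0:
            cols.append(c)
    return cols
```

Only the columns this function returns need to be delivered in high quality; with these assumed numbers, roughly half the panorama can be left at low quality at any moment.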
In a research collaboration between the Royal Institute of Technology (KTH), Tobii and Ericsson, we have taken this optimization approach to the next level by demonstrating “foveated streaming”, a novel solution that limits high video quality to the fraction of the viewport in the immediate proximity of the user’s fixation point.
This solution is enabled by next-generation eye trackers embedded in the head-mounted display (HMD), whose gaze information is used to identify the portion of content to be selectively streamed at high quality from a cloud server to the end device.
This yields significant savings in both network bandwidth and device computational power. A big challenge is that the loop from the eye tracker to the content server and back to the HMD needs to be really fast. Thankfully, 5G provides exactly this property: low-latency wide-area access! Add to that a foveal server hosted at a distributed cloud edge of the network service provider’s network and a fast video coding mechanism, and there you go.
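As a rough illustration of why this loop has to be tight, consider a back-of-the-envelope latency budget. All figures below are assumed round numbers for illustration, not measurements from the prototype:

```python
# Hypothetical gaze-to-update loop budget, in milliseconds.
budget_ms = {
    "eye_tracker_sampling": 8,   # e.g. a ~120 Hz tracker
    "uplink_5g": 2,              # low-latency radio uplink
    "server_tile_selection": 5,  # pick high-quality tiles for next picture
    "downlink_5g": 2,
    "decode_and_render": 11,     # roughly one frame at 90 fps
}
total = sum(budget_ms.values())
print(f"gaze-to-update loop: {total} ms")
```

With assumed numbers like these, the whole loop stays within a few tens of milliseconds, which is the regime where the user does not notice the quality region trailing their gaze.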
Eye tracking and foveated rendering in user devices
Foveated rendering is an enabler for the PC and VR industry, as it can reduce rendering-capacity requirements or enable higher-resolution displays without adding more GPU resources. By knowing where the user’s gaze is focused, only a limited area of the display is rendered at high quality, while the rest of the display is rendered in such a way that the user will not notice the reduction in resolution. Throughout the computer graphics industry, a lot of effort is going into integrating foveated rendering support into rendering technologies and game engines.
Central to the functionality is eye tracking technology.
In late 2018 and into 2019, a number of commercially available headsets will come out with Tobii Eye Tracking technology. This will enable devices with better foveated rendering and will further improve the user experience through interaction enhancements, social interaction and eye contact in virtual environments.
Recent announcements by SoC provider Qualcomm about a reference implementation of Tobii Eye Tracking in the Snapdragon 845 chipset serve as a testament to the future adoption rate of eye tracking tech in the standalone VR and AR markets.
So far, the benefit of using eye tracking for foveated rendering has been confined to the VR device side. But as eye tracking technology becomes more widely available in standalone VR devices, it will become viable to stretch the concept of foveated rendering from the device into the network to cope with the massive bandwidth demand that popular use of mobile VR streaming will bring.
Foveated streaming over 5G
We have built a prototype with Tobii eye tracking integrated in an HTC Vive HMD, which transmits the eye-gaze signal over an Ericsson 5G end-to-end network prototype (radio + core network) to a local “foveated cloud server”, where we select in real time which parts of the video to deliver to the user device in high quality.
So, in principle, we have moved the role of foveated rendering from the device to this foveated cloud server by developing a novel streaming solution based on HEVC video coding. In the prototype we demonstrate bandwidth reductions of up to 85 percent compared with transmitting the entire video in full resolution. Feedback from network service providers trying out our demo at Mobile World Congress in February was enormously encouraging: most visitors spotted little or no difference in perceived quality despite the huge difference in bandwidth consumption.
Figure 1. Content adapted at a distributed cloud server, over a low-latency network, and based on real-time eye-gaze measurements
Figure 2. High-quality foreground tiles selected exactly at the spot where the user is looking. Low-resolution background in the rest of the viewport. Screenshot from demo.
Figure 3. Demonstration at Mobile World Congress
Codec configuration and fast adaptation
The prototype uses the HEVC video coding format, and the content is encoded into three different streams. For the first stream, the VR content is partitioned into spatial regions by using the Tiles tool supported in HEVC. Each region, or tile, is made available in high quality on the server. The second stream is a low-resolution, low-bitrate representation of the video. This second stream does not use the Tiles tool. Finally, the third stream is a so-called all-Intra stream in which all tiles are encoded in high quality without prediction from previous pictures.
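The three representations could be described schematically as follows. This is a hypothetical configuration sketch: the grid dimensions, stream names and field names are our own illustration, not the prototype’s actual settings:

```python
# Hypothetical description of the three encoded representations
# held on the foveated server.
streams = {
    "high_tiled": {          # stream 1: high quality, HEVC Tiles tool
        "tiles": (8, 4),     # assumed 8x4 spatial tile grid
        "quality": "high",
        "inter_prediction": True,   # normal inter-picture prediction
    },
    "low_full": {            # stream 2: low-res fallback for the whole video
        "tiles": None,       # the Tiles tool is not used here
        "quality": "low",
        "inter_prediction": True,
    },
    "high_all_intra": {      # stream 3: all-Intra, used to switch tiles in
        "tiles": (8, 4),
        "quality": "high",
        "inter_prediction": False,  # every picture decodable on its own
    },
}
```

The key design point is the third stream: because its pictures need no reference to earlier pictures, a tile can start being delivered from it at any picture boundary.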
During media consumption, the gaze direction is continuously captured from the user and sent to the server. An area around the gaze direction is calculated and the tiles that fall within this area are forwarded from the first stream and sent to the user in high quality. The second stream is sent in parallel and the end device decodes both the second stream as well as the high-quality tiles. The decoded high-quality tiles replace the corresponding area in the decoded second stream before the video picture is rendered to the user.
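The tile-selection step can be sketched as follows. This is our own minimal illustration; the grid size, radius and normalized gaze coordinates are assumptions, not the prototype’s parameters:

```python
import math

def foveal_tiles(gaze_x, gaze_y, radius=0.15, grid=(8, 4)):
    """Return (col, row) tiles whose center lies within `radius`
    of the gaze point; coordinates are normalized to [0, 1]."""
    cols, rows = grid
    selected = []
    for r in range(rows):
        for c in range(cols):
            cx = (c + 0.5) / cols
            cy = (r + 0.5) / rows
            # Keep the tile if its center is inside the foveal circle.
            if math.hypot(cx - gaze_x, cy - gaze_y) <= radius:
                selected.append((c, r))
    return selected
```

With these assumed numbers, a central gaze selects only four of the 32 tiles for high-quality delivery; everything else is covered by the low-resolution second stream.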
When the user shifts their gaze to a new direction, the server updates the high-quality area. Transmission of tile positions that were previously not sent in high quality starts with the corresponding tiles of the next picture from the all-Intra stream. This means there is at most one picture of delay at the server before the high-quality area is updated. Moreover, this novel approach can be implemented on top of standard streaming protocols, like DASH.
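The switch-in logic for newly entered tiles can be sketched roughly like this (our own illustration of the idea, with assumed stream names):

```python
def pick_tile_sources(prev_set, new_set):
    """Map each tile in the new high-quality area to the stream
    it should be taken from for the next picture."""
    sources = {}
    for tile in new_set:
        if tile in prev_set:
            # Tile was already high quality: its prediction chain
            # is intact, keep serving the regular tiled stream.
            sources[tile] = "high_tiled"
        else:
            # Tile just entered the area: serve an intra-coded
            # picture so the decoder needs no prior reference.
            sources[tile] = "high_all_intra"
    return sources
```

After one picture from the all-Intra stream, a newly entered tile can fall back to the regular high-quality tiled stream, which is why the update delay at the server is bounded by a single picture.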
Trade-off between latency and bandwidth savings
How low does the latency need to be, then? Well, one beauty of the concept is that there is no strict limit. Higher latency can, to some degree, be compensated for by increasing the size of the high-quality area of the video at the expense of more bandwidth. Simply put, the lower the transmission latency, the more bandwidth you can save for the same perceived quality. Alternatively, you can choose to deliver even higher quality where the user is looking, at a moderate increase in bandwidth. Either way, the lower the latency, the higher the gains in quality and bandwidth savings.
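The trade-off can be captured in a back-of-the-envelope formula: the high-quality radius must cover the foveal region plus however far the gaze can travel during one round trip. The numbers below are assumed round figures for illustration; the article does not give the prototype’s actual parameters:

```python
def hq_radius_deg(latency_ms, foveal_deg=5.0, gaze_speed_dps=300.0):
    """High-quality radius in degrees of visual angle: the foveal
    region plus the worst-case gaze travel during the loop latency."""
    return foveal_deg + gaze_speed_dps * (latency_ms / 1000.0)
```

At 10 ms of loop latency the margin is 3 degrees on top of the 5-degree foveal region; at 50 ms it grows to 15 degrees, so the high-quality area (and the bandwidth it costs, which grows roughly with the square of the radius) inflates quickly as latency rises.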
Figure 4. Radii of high-quality foreground adapted to network conditions.
Other potential applications
In our prototype we focused on streamed pre-recorded 360 VR video as a use case. But there are, of course, numerous other possible applications like live-video feeds, video conferencing, video for remotely controlled vehicles and AR-assisted operators in factories, just to mention a few.
Lastly, we are also looking into how aggregated eye-gaze data can be used for further gains and savings in VR/AR streaming as well as content provisioning.