Video streaming quality measurements
When end users watch a streaming video, how do you know what quality they get? Or whether they are pleased or annoyed with the service? The new ITU-T P.1203 standard uses machine learning to help measure the video quality, and Ericsson is one of its main contributors.
Video streaming quality and end-user perception
Most of today's internet video is consumed over non-guaranteed connections, both wired and wireless, and the dominant delivery methods are adaptive HTTP-based streaming, such as DASH and HLS. The streaming server offers several versions of the same video, each of them encoded with a different resolution and bitrate. The streaming client within the phone, tablet or PC dynamically selects between them during playout.
If the available network bitrate is high, the streaming client selects the highest-quality version available, and if the network bitrate suddenly drops, the client moves to a lower-quality version until conditions improve and a higher quality can be used again. The main purpose is to avoid stalling, i.e. situations where the client must pause playout to refill its video buffer, as such stalling and re-buffering is highly annoying (more on that can be found in this Ericsson Mobility Report).
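The selection logic described above can be sketched in a few lines. This is a minimal, rate-based illustration only; the bitrate ladder and safety margin are hypothetical, and real streaming clients also weigh buffer level, throughput history and other signals.

```python
# Hypothetical ladder of (vertical resolution, bitrate in kbit/s) renditions
# offered by the streaming server; the values are illustrative.
LADDER = [(360, 700), (480, 1500), (720, 3000), (1080, 6000)]

def select_rendition(measured_kbps, safety=0.8):
    """Pick the highest rendition whose bitrate fits the measured
    network throughput, scaled by a safety margin to reduce stalls."""
    budget = measured_kbps * safety
    best = LADDER[0]  # always fall back to the lowest quality
    for rendition in LADDER:
        if rendition[1] <= budget:
            best = rendition
    return best

print(select_rendition(4000))  # ample throughput -> (720, 3000)
print(select_rendition(800))   # throughput drop  -> (360, 700)
```

When throughput drops, the function steps down the ladder rather than risking a stall, which is exactly the trade-off the rest of this article is about.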
This adaptive behavior also makes it more difficult to understand the final quality experienced by end users. While intermittently switching to lower-quality versions of the video does decrease the risk of re-buffering, it also introduces a quality variation over time that can itself be irritating. And in the worst-case scenario, one or more re-buffering events may still occur, adding to the end user's total quality impression.
Below, you will see an example session for a 60-second video, where the quality varies over time, and it also includes a three-second re-buffering in the middle. While there are parts with low quality, the last third of the video is delivered with high quality. So, if end users rate this video at the end, how much would they remember of earlier problems? And how does quality variation in the first half influence the final score?
During the development of the P.1203 standard, a large number of similar video sessions were shown to test panels in different parts of the world. Each session was different in length, content, adaptation and re-buffering behavior. Each video was watched and scored by at least 24 test participants and the average score defined the true video quality. This true quality is often called the mean opinion score (MOS), and this is what the P.1203 standard estimates.
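The MOS is simply the arithmetic mean of the individual ratings. As a small illustration with a made-up panel of 24 participants (the scores below are invented, not from the P.1203 training data):

```python
# Hypothetical ratings from 24 test participants on a 5-point scale.
panel_scores = [3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 3, 2] * 2  # 24 scores

# The mean opinion score (MOS) is the average of all individual scores.
mos = sum(panel_scores) / len(panel_scores)
print(round(mos, 2))  # -> 2.83
```

A score like 2.83 sits between "fair" and "good" on the scale, which is the kind of ground-truth value the P.1203 models are trained to reproduce.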
You can actually make your own judgement and compare it to the P.1203 standard’s score (which will be revealed at the end of the article). Watch the video below – preferably in full-screen format – and use the following scale when deciding your final score:
5 = Excellent
4 = Good
3 = Fair
2 = Poor
1 = Bad
The P.1203 standard
The P.1203 standard is built on a modular concept, with separate modules for estimating short-term audio and video quality, and an integration module estimating the final session quality due to adaptation and re-buffering, as shown below.
The short-term Pv and Pa modules continuously estimate short-term audio and video quality scores for one-second pieces of content, so for a 60-second video there will be 60 audio scores, and 60 video scores. These Pv and Pa modules are specific for each type of codec. The video module currently supports H.264/AVC up to HD (1920x1080) and there is ongoing work in ITU-T to also support H.265/HEVC and VP9, up to UHD resolution (3840x2160).
The Pv and Pa modules operate in up to four different modes, depending on how detailed the available input is from the parameter extraction. For the least-complex mode the main inputs are related to resolution, bitrate and framerate, while the most-complex mode performs advanced analysis of the video payload.
The short-term scores from those modules are fed into the Pq long-term integration module, together with information about any re-buffering, and the final session quality score for the total video session is then estimated. The Pq module also produces diagnostic outputs so that underlying causes for the score can be analyzed. The module is not mode- or codec-dependent and thus common for all cases.
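The data flow between the modules can be sketched as follows. To be clear, this is a toy stand-in, not the actual P.1203 algorithms: the per-second combination, the rebuffering penalty factor and the `toy_pq` function are all invented for illustration.

```python
def toy_pq(video_scores, audio_scores, rebuffer_seconds):
    """Toy aggregation of per-second Pv/Pa scores into a session score.
    The real P.1203 Pq model is far more elaborate (memory effects,
    oscillation handling, etc.); this only mirrors the data flow."""
    assert len(video_scores) == len(audio_scores)
    # Combine audio and video per second (made-up rule: take the worse one).
    per_second = [min(v, a) for v, a in zip(video_scores, audio_scores)]
    base = sum(per_second) / len(per_second)
    penalty = 0.3 * rebuffer_seconds  # made-up penalty per stalled second
    return max(1.0, min(5.0, base - penalty))  # clamp to the 1..5 scale

# A 60-second session like the example: low quality early, high quality
# late, and a 3-second stall in the middle.
video = [2.0] * 20 + [3.0] * 20 + [4.5] * 20
audio = [4.0] * 60
print(round(toy_pq(video, audio, rebuffer_seconds=3), 2))  # -> 2.1
```

Even this crude sketch shows why per-second scores alone are not enough: the same average short-term quality can yield very different session scores once stalling is taken into account.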
While the Pv and Pa modules are developed with traditional analytical methods and implemented as a series of mathematical functions, the Pq module is more advanced. It is divided into two separate estimation algorithms: one using a traditional functional approach and one based on machine-learning concepts.
The functional variant models human perception, including effects from quality oscillations, deep quality dips, repeated quality or buffering artifacts, as well as human memory effects. As with the Pv and Pa modules, these effects are described by mathematical functions, which are combined so that the total perceived end-user quality is estimated.
Machine learning and algorithms
Machine learning is a method in which the solution to a problem is found not through manual, human-driven analysis but by partly self-learning computer algorithms. It is well suited to problems where the relationship between the input and the output is complex, such as the estimation of the Pq score.
During the design and training phase these algorithms automatically identify how characteristics of the input data (Pv/Pa scores and buffering parameters) are reflected in changes to the output data (the test panel MOS scores). The algorithm then builds a black-box "machine" which implements the final machine-trained solution, and estimates the end-user score.
The final Pq MOS score is a weighted average of the output from these two algorithms – the traditional functional one and the one based on machine learning. A further advantage of using two totally different Pq algorithms is that these have statistically independent estimation errors. When averaging such independent estimations, the resulting estimation error becomes smaller.
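The statistical effect described above is easy to demonstrate with synthetic data. The sketch below is not P.1203 itself; it simply simulates two estimators with equal, independent Gaussian errors and shows that their average has a smaller error than either one alone.

```python
import random

random.seed(42)

# Simulated estimation errors of two independent models (synthetic data;
# the 0.5 standard deviation is an arbitrary, illustrative choice).
err_functional = [random.gauss(0, 0.5) for _ in range(10_000)]
err_ml = [random.gauss(0, 0.5) for _ in range(10_000)]

def rmse(errors):
    """Root-mean-square error of a list of estimation errors."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

# Error of the averaged estimator.
err_avg = [(a + b) / 2 for a, b in zip(err_functional, err_ml)]

print(rmse(err_functional) > rmse(err_avg))  # True: averaging shrinks the error
```

For independent errors with equal variance, the average's error variance is half that of either estimator, so its RMSE shrinks by a factor of roughly the square root of two.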
How did you do against the P.1203?
Going back to the example video, what did you score? The P.1203 standard scores it as 2.66, roughly the result you would get if one third of the test panel scored a 2 and two thirds scored a 3. If you scored lower or higher, you might be a particularly critical or a particularly relaxed viewer. In either case, don't worry: it is natural that different people have different opinions. The P.1203 standard, however, always estimates the quality as perceived by the average viewer.