Towards faster, smoother mobile video conferencing | Swinburne University, Sarawak, Malaysia

By Phang Chui Lian

A picture contains a fixed number of rows and columns of pixels. Pixels are the smallest units in a picture. For a picture of a given size, more pixels generally mean finer detail. For instance, a Sony Cyber-shot DSC-TX100V digital camera produces 16.2-megapixel pictures while the camera in the Samsung Galaxy S4 produces 13-megapixel pictures. This means that a picture from the Sony camera is made up of about 16.2 million pixels while a picture from the Samsung smartphone is made up of about 13 million pixels.

A picture that is not moving is called a still picture while a picture that is moving is called a video. A meeting is more interactive when both the speech and the moving picture (video) are sent. This type of meeting is called video conferencing. In this era of technological advancement, video conferencing is widely used in many fields, such as education and business. It is very useful because it enables users in two or more locations to meet face-to-face in a virtual (online) world. Not only can the users see each other “live”, but they can also meet at any time, without the hassle of travelling to the same place. With the recent advancement and popularity of smartphones, people are starting to hold video conferencing sessions on mobile phones. Since mobile phones can be carried easily from one place to another, mobile video conferencing is very convenient, practical and highly sought after.

However, a video consists of a huge number of pixels, and current mobile communication networks can transfer only a limited number of pixels every second. In other words, the bandwidth of the networks is limited. Hence, if every pixel is sent, transmission takes a long time. This is not suitable for video conferencing, which involves two-way interactive communication in which the video and the speech have to be delivered almost instantly.
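
The arithmetic behind this limit can be sketched in a few lines of Python. All figures below (frame size, colour depth, frame rate and link speed) are assumptions chosen for illustration; none of them come from a particular phone or network.

```python
# Hypothetical figures chosen for illustration only.
WIDTH, HEIGHT = 640, 480            # pixels per frame (assumed)
BITS_PER_PIXEL = 24                 # 8 bits each for red, green and blue (assumed)
FRAMES_PER_SECOND = 30              # assumed frame rate
LINK_BITS_PER_SECOND = 2_000_000    # assumed 2 Mbit/s mobile link


def raw_video_bits_per_second(width, height, bits_per_pixel, fps):
    """Bits needed to send one second of uncompressed video."""
    return width * height * bits_per_pixel * fps


def seconds_to_send_one_second(video_bps, link_bps):
    """Time the link needs to carry one second of video."""
    return video_bps / link_bps


video_bps = raw_video_bits_per_second(
    WIDTH, HEIGHT, BITS_PER_PIXEL, FRAMES_PER_SECOND
)
delay = seconds_to_send_one_second(video_bps, LINK_BITS_PER_SECOND)

print(f"Uncompressed video: {video_bps / 1e6:.1f} Mbit/s")
print(f"Time to send one second of video: {delay:.1f} s")
```

Under these assumed figures, one second of uncompressed video is about 221 million bits, so the assumed link would need roughly 110 seconds to deliver it.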

Therefore, it is necessary to reduce the amount of pixel data that has to be sent without significantly affecting the quality of the video. The technique for doing so is called compression. There are many types of compression techniques, some of which are commonly known, for example JPEG (compression of still pictures), MPEG (compression of video), and MP3 (compression of audio such as music).

In current techniques, a picture is divided into many parts (blocks) and compression is done block by block. Although this approach works well in general, it is not good enough for mobile video conferencing. In fact, current mobile video conferencing sessions suffer from problems such as discontinuity or lagging and lack of clarity. Moreover, the amount of data to be sent is still quite large for a mobile phone. Therefore, a better technique is needed for smoother and faster video transmission over a mobile network. For this purpose, a 3D model-based compression technique could be considered.

In model-based compression, a 3D human face model that describes facial features rather than the whole face is used. The model is modified to suit a particular user’s face. For example, if the user’s eyebrows are higher than those of the generic model, the points that describe the eyebrows are moved. Since only some features move most of the time, we only need to compress and send information about the movement of those features. For instance, if the user’s lips are closed and then open by 2 cm, only the information about the movement of the lips is compressed and sent. As a result, much less information is involved. When the amount of information is reduced, transmission is faster and the video is smoother and clearer.
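
The idea of sending only the features that moved can be illustrated with a minimal Python sketch. The four-point face model, the point names, the coordinates and the `moved_points` helper are all hypothetical, made up for this example; a real face model would use many more points.

```python
def moved_points(previous, current, threshold=0.5):
    """Return only the feature points that moved by more than `threshold`.

    `previous` and `current` map a point name to an (x, y, z) position.
    The result maps a point name to its (dx, dy, dz) displacement.
    """
    update = {}
    for name, (x, y, z) in current.items():
        px, py, pz = previous[name]
        dx, dy, dz = x - px, y - py, z - pz
        if max(abs(dx), abs(dy), abs(dz)) > threshold:
            update[name] = (dx, dy, dz)
    return update


# Hypothetical face model with four points, positions in millimetres.
closed_mouth = {
    "left_eyebrow": (-30.0, 40.0, 10.0),
    "right_eyebrow": (30.0, 40.0, 10.0),
    "upper_lip": (0.0, -30.0, 15.0),
    "lower_lip": (0.0, -32.0, 15.0),
}

# The lips open by 20 mm (2 cm); in this toy example only the lower lip moves.
open_mouth = dict(closed_mouth, lower_lip=(0.0, -52.0, 15.0))

update = moved_points(closed_mouth, open_mouth)
print(update)  # {'lower_lip': (0.0, -20.0, 0.0)}
```

Only one displacement is left to send, instead of every pixel of the new frame.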

A typical mobile video conferencing system using a 3D model-based video codec (compressor-decompressor) works as follows. On every mobile phone, a 3D human face model is available. Information about a user’s facial features, including the eyebrows, eyes, nose, mouth, face contour, and so on, is used to modify the 3D human face model so that it becomes the user’s own face model. The model of each participant is then sent to all other participants. Before the video from a participant is sent, the video from the participant’s camera is first analysed to estimate the changes in the features of his/her face.

Then, information about the movement of facial features is obtained. This information is compressed and sent through the mobile network to the other participants. On the other participants’ mobile phones, the reverse of compression, called decompression, is performed on the information received. This information is then applied to the sender’s face model. Finally, a video that is similar to the original video is produced. This process repeats every time a video is sent from one mobile phone to another.
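
The send and receive steps described above can be sketched as a round trip in Python. This is only an illustration under stated assumptions: the general-purpose `zlib` compressor from the Python standard library stands in for the codec's own compression method, and the point names, the record layout and the functions `encode_update`, `decode_update` and `apply_update` are hypothetical.

```python
import struct
import zlib

# Hypothetical fixed order of feature points shared by all phones.
POINT_NAMES = ["left_eyebrow", "right_eyebrow", "upper_lip", "lower_lip"]

# One record per moved point: point index, then dx, dy, dz as 32-bit floats.
RECORD = struct.Struct("<Hfff")


def encode_update(update):
    """Sender side: pack the displacements and compress them."""
    raw = b"".join(
        RECORD.pack(POINT_NAMES.index(name), *delta)
        for name, delta in update.items()
    )
    return zlib.compress(raw)


def decode_update(payload):
    """Receiver side: decompress and unpack the displacements."""
    raw = zlib.decompress(payload)
    return {
        POINT_NAMES[index]: (dx, dy, dz)
        for index, dx, dy, dz in RECORD.iter_unpack(raw)
    }


def apply_update(model, update):
    """Move points of the sender's face model by the received displacements."""
    new_model = dict(model)
    for name, (dx, dy, dz) in update.items():
        x, y, z = new_model[name]
        new_model[name] = (x + dx, y + dy, z + dz)
    return new_model


# The receiver already holds the sender's face model (sent once at the start).
sender_model = {
    "left_eyebrow": (-30.0, 40.0, 10.0),
    "right_eyebrow": (30.0, 40.0, 10.0),
    "upper_lip": (0.0, -30.0, 15.0),
    "lower_lip": (0.0, -32.0, 15.0),
}

payload = encode_update({"lower_lip": (0.0, -20.0, 0.0)})  # sender
received = decode_update(payload)                           # receiver
updated_model = apply_update(sender_model, received)
print(updated_model["lower_lip"])
```

The 32-bit floats keep each record small at the cost of some precision, which is one possible trade-off rather than a requirement of the technique.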

Besides being faster, mobile video conferencing using model-based video compression is safer. This is because the user’s face model is sent only at the beginning. After that, only the compressed information about the movements of the facial features is sent through the mobile telecommunication network. However, this technique can only be implemented on more technologically advanced mobile phones.

Phang Chui Lian is a lecturer with the Faculty of Engineering, Computing and Science at Swinburne University of Technology Sarawak Campus. She can be contacted at [email protected]