tripskillo.blogg.se

Facerig avatar

To improve the performance of the encoder without reducing accuracy or increasing jitter, we selectively used unpadded convolutions to decrease the feature map size. We utilize real images without annotations in an unsupervised consistency loss (L_c), similar to prior work. This encourages landmark predictions to be equivariant under different image transformations, improving landmark location consistency between frames without requiring landmark labels for a subset of the training images. In addition, a regularization term on the acceleration (L_acc) is added to reduce FACS-weight jitter; it encourages overall smoothness of dynamic expressions, with its weight kept low to preserve responsiveness.
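One way to picture the acceleration regularizer L_acc is as the mean squared second-order finite difference of the predicted FACS-weight sequence over time. The sketch below is a minimal NumPy illustration under that assumption; the function name and array shapes are our own, not from the paper:

```python
import numpy as np

def acceleration_loss(facs_seq):
    """L_acc sketch: mean squared second-order finite difference over time.

    facs_seq: (T, num_facs) array of per-frame predicted FACS weights.
    Zero for constant-velocity motion; large for frame-to-frame jitter.
    """
    accel = facs_seq[2:] - 2.0 * facs_seq[1:-1] + facs_seq[:-2]
    return float(np.mean(accel ** 2))
```

A linearly ramping weight incurs no penalty, while a weight that alternates every frame is penalized heavily, which is exactly the jitter this term is meant to suppress.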


To train our deep learning network, we linearly combine several different loss terms to regress landmarks and FACS weights: for landmarks, the RMSE of the regressed positions (L_lmks); for FACS weights, the MSE (L_facs). For FACS weights, we additionally reduce jitter using temporal losses over synthetic animation sequences. A velocity loss (L_v), inspired by prior work, is the MSE between the target and predicted velocities. The synthetic animation sequences were created by our interdisciplinary team of artists and engineers. A normalized rig, used for all the different identities (face meshes), was set up by our artist and was exercised and rendered automatically using animation files containing FACS weights. These animation files were generated using classic computer vision algorithms running on face-calisthenics video sequences, supplemented with hand-animated sequences for extreme facial expressions that were missing from the calisthenics videos.
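The linear combination of loss terms can be sketched as follows. This is a hypothetical NumPy version assuming standard definitions (RMSE for landmarks, MSE for FACS weights, first differences for velocity, second differences for acceleration); the blend weights `w_*` are illustrative placeholders, not values from the paper:

```python
import numpy as np

def total_loss(pred_lmks, gt_lmks, pred_facs, gt_facs,
               w_lmks=1.0, w_facs=1.0, w_v=0.5, w_acc=0.1):
    """Hypothetical linear combination of the loss terms described above.

    pred_lmks/gt_lmks: (N, 2) landmark positions.
    pred_facs/gt_facs: (T, num_facs) FACS-weight sequences.
    """
    # L_lmks: RMSE of the regressed landmark positions
    l_lmks = np.sqrt(np.mean((pred_lmks - gt_lmks) ** 2))
    # L_facs: MSE of the FACS weights
    l_facs = np.mean((pred_facs - gt_facs) ** 2)
    # L_v: MSE between target and predicted velocities (first differences)
    l_v = np.mean((np.diff(pred_facs, axis=0) - np.diff(gt_facs, axis=0)) ** 2)
    # L_acc: regularizer on the predicted acceleration (second differences)
    l_acc = np.mean(np.diff(pred_facs, n=2, axis=0) ** 2)
    return float(w_lmks * l_lmks + w_facs * l_facs + w_v * l_v + w_acc * l_acc)
```

Note that L_v compares predicted velocities against target velocities, while L_acc is a pure regularizer on the prediction alone, which is why its weight is kept low to avoid damping legitimate fast expressions.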


We initially train the model for only landmark regression, using both real and synthetic images. After a certain number of steps we start adding synthetic sequences to learn the weights for the temporal FACS regression subnetwork. This subnetwork, trained alongside the landmark regressor, uses causal convolutions: these convolutions operate on features over time, as opposed to the convolutions in the encoder, which operate only on spatial features. This allows the model to learn temporal aspects of facial animations and makes it less sensitive to inconsistencies such as jitter. The setup also allows us to augment the FACS weights learned from synthetic animation sequences with real images that capture the subtleties of facial expression.
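The key property of a causal convolution is that the output at frame t depends only on frames up to and including t, never on future frames, which is what makes it usable in a real-time tracker. A minimal single-channel NumPy sketch of this idea (left-padding with zeros so no future information leaks in; the function name is our own):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution over time: output[t] depends only on x[:t+1].

    x: (T,) feature sequence. kernel: (K,) taps, with kernel[0] applied to
    the current frame, kernel[1] to the previous frame, and so on.
    """
    K = len(kernel)
    # Left-pad with K-1 zeros so every window ends at the current frame.
    padded = np.concatenate([np.zeros(K - 1), x])
    return np.array([np.dot(padded[t:t + K], kernel[::-1])
                     for t in range(len(x))])
```

With `kernel = [1, 0]` the output reproduces the current frame; with `kernel = [0, 1]` it reproduces the previous frame, confirming that no sample ever looks ahead. In a deep-learning framework the same effect is typically achieved by left-padding before a standard 1-D convolution.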
