iFACE

Infant-Facial-Analysis-for-Classification-of-Emotion

Infant facial expressions · ResNet18-based classifier · Landmark and triangulation extensions · Transparent development pipeline

Project Overview

The iFACE project is part of the broader CUBE-SD research program and aims to improve the automated recognition of infant facial expressions.

The aim of this project is to develop an open-source and freely available tool for the automatic recognition of infant facial expressions across seven categories: happiness, anger, disgust, surprise, fear, sadness, and neutral.

This project stems from the observation that, although behavioural indicators of surprise are relatively easy to identify in standard screen-based violation-of-expectation paradigms, they are much more difficult to assess in social contexts such as social learning. In such situations, infants may look in different directions, making gaze-based measures harder to interpret. See this preprint for a more detailed discussion of why better tools are needed to characterise surprise in social contexts.

A central motivation of the project is that most facial-expression recognition systems were originally developed for adults, whereas infant faces differ in morphology, proportions, and expression dynamics. iFACE therefore explores methods that are both technically effective and scientifically interpretable.

The current implementation relies primarily on a convolutional neural network for image classification. Rather than training a model from scratch, we leveraged a pre-trained ResNet18 architecture, originally trained on large-scale image recognition datasets, and fine-tuned it on infant facial expression data. In parallel, exploratory work using facial landmarks and Delaunay triangulation is being conducted to provide a more interpretable representation of infant facial geometry.

The current version of the algorithm is based on the dataset developed by Webb et al. (2017).

The main libraries used in the current pipeline include PyTorch, OpenCV, and face_alignment. The results presented here remain preliminary, and the algorithm still requires further development and validation.

The source code is not yet publicly available, as it is intended to be released alongside the associated publication. However, if you are interested in contributing to the project, you are very welcome to get in touch. Depending on the nature and extent of the contribution, collaborators may be acknowledged or included as co-authors on future outputs.

  • 7 emotion categories in the classification workflow
  • ResNet18: pre-trained CNN backbone used in the main pipeline
  • 80% / 20% train / validation split
  • GPU-ready: compatible with GPU-accelerated training and inference
Training hardware: model training was conducted on a high-performance workstation equipped with an Intel Core i9-14900HX CPU (24 cores, 32 threads) and an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB VRAM. This configuration supports efficient deep learning training for medium-scale computer vision models.
Methodological note: the notebook describes a main CNN classification pipeline, a multi-task extension involving Action Units, and a later landmark-based triangulation extension. These components are not all at the same stage of validation, so they are presented here as different parts of an evolving research workflow.

Methods

The current workflow starts from an infant facial-expression image dataset organised into seven emotional categories: anger, disgust, fear, happy, neutral, sad, and surprise.

1. Dataset organisation

Images are stored by emotion class and loaded through a custom PyTorch dataset. This dataset class is responsible for reading the images, assigning labels from the folder structure, and preparing them for training.

The model therefore learns from labelled infant-face images rather than from manually coded features alone.
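A minimal sketch of such a dataset class is shown below. The class name, folder layout, and emotion folder names are assumptions for illustration, not the notebook's exact code.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

# Assumed layout: one sub-folder per emotion, e.g. data/happy/img_001.png
EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

class InfantExpressionDataset(Dataset):
    """Reads infant-face images and derives labels from the folder structure."""

    def __init__(self, root_dir, transform=None):
        self.transform = transform
        self.samples = []
        for label, emotion in enumerate(EMOTIONS):
            folder = os.path.join(root_dir, emotion)
            for name in sorted(os.listdir(folder)):
                self.samples.append((os.path.join(folder, name), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```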

2. Train / validation split

The main training procedure uses an 80% / 20% split between training and validation data. When possible, the split is performed with stratification so that class proportions are preserved across subsets.

  • Training set: used to update model weights
  • Validation set: used to evaluate generalisation
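One way to obtain a stratified 80% / 20% split, sketched here with scikit-learn and the hypothetical dataset class above (not necessarily the implementation used in the notebook):

```python
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# One integer label per sample, taken from the dataset sketched above
labels = [label for _, label in dataset.samples]

train_idx, val_idx = train_test_split(
    list(range(len(labels))),
    test_size=0.2,       # 20% of the images go to validation
    stratify=labels,     # preserve class proportions in both subsets
    random_state=42,
)

train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)
```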

3. Image preprocessing and augmentation

Before entering the model, images are resized and normalised. During training, several augmentation steps are used to improve robustness, including horizontal flipping, small rotations, and brightness / contrast jitter.

These transformations are intended to reduce overfitting and help the classifier handle modest visual variability.
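A typical torchvision transform stack implementing these steps might look as follows; the exact parameter values are illustrative rather than taken from the notebook.

```python
from torchvision import transforms

# Normalisation statistics commonly used with ImageNet-pretrained backbones
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),                          # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness / contrast jitter
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```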

4. Main classifier

The core model is based on a pre-trained ResNet18, adapted for seven-way infant expression classification. Using a pre-trained network allows the system to start from visual features learned on large image datasets and then specialise them for the infant-expression task.
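In torchvision, adapting a pre-trained ResNet18 to a seven-way output essentially amounts to replacing its final fully connected layer, as in this minimal sketch (assuming a recent torchvision release; older versions use pretrained=True instead of the weights argument):

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # anger, disgust, fear, happy, neutral, sad, surprise

# Start from ImageNet-pretrained weights, then replace the final layer
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```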

Figure: General overview of the iFACE workflow. The project combines a main CNN-based image classifier with later extensions aimed at improving interpretability and infant-specific facial representation.

Training configuration

The notebook defines a training procedure using configurable hyperparameters such as batch size, number of epochs, and learning rate. The batch size is automatically adjusted depending on whether a GPU is available.
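A hedged sketch of such a configuration; the concrete values below are placeholders, not the notebook's settings.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = {
    "epochs": 20,
    "learning_rate": 1e-4,
    # Larger batches when a GPU is available, smaller ones on CPU
    "batch_size": 64 if device.type == "cuda" else 16,
}
```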

GPU compatibility

The pipeline is explicitly designed to run on either GPU or CPU. When CUDA is available, the model and tensors are moved to the GPU to speed up optimisation and allow larger batches.
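In PyTorch this typically reduces to moving the model once and then moving each batch inside the training loop, for example (reusing the hypothetical config dictionary above and assuming a standard DataLoader named train_loader):

```python
import torch.nn.functional as F
from torch.optim import Adam

model = model.to(device)
optimizer = Adam(model.parameters(), lr=config["learning_rate"])

for images, labels in train_loader:
    # Each batch follows the model onto the GPU (or stays on CPU)
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
```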

Exploratory multi-task extension

The notebook also includes a multi-task version intended to predict both emotion categories and Action Units (AUs). In its current form, this section is best understood as exploratory development rather than as a fully validated final model.
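As an illustration of the idea rather than the notebook's exact architecture, a multi-task model can share the ResNet18 backbone and attach separate heads for emotions and AUs; the class name and the number of AUs below are assumptions.

```python
import torch.nn as nn
from torchvision import models

class MultiTaskInfantNet(nn.Module):
    """Shared ResNet18 backbone with one emotion head and one AU head (illustrative)."""

    def __init__(self, num_emotions=7, num_aus=12):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()  # keep everything up to the pooled features
        self.backbone = backbone
        self.emotion_head = nn.Linear(in_features, num_emotions)   # 7-way emotion logits
        self.au_head = nn.Linear(in_features, num_aus)             # multi-label AU logits

    def forward(self, x):
        features = self.backbone(x)
        return self.emotion_head(features), self.au_head(features)
```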

Landmark-based extension

A separate methodological branch investigates facial landmarks and Delaunay triangulation to build a more geometric and interpretable representation of infant faces. This extension is conceptually important because it may help identify which facial regions contribute most to the classification process.

Why landmarks and triangulation were introduced

The main CNN can classify images directly from pixels, but it does not by itself provide a transparent facial geometry. The landmark-based extension was therefore introduced to better characterise facial structure, support interpretability, and explore whether infant-specific geometric information could improve performance or help explain classification errors.
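A minimal sketch of the geometric step, assuming 68-point (x, y) landmarks such as those returned by face_alignment's get_landmarks():

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Build a Delaunay triangulation over 2D facial landmarks.

    landmarks: array of shape (68, 2) with (x, y) coordinates, e.g. one face
    returned by face_alignment's get_landmarks(). Returns triangles as
    triplets of landmark indices.
    """
    return Delaunay(landmarks).simplices
```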

Interpretation point: in the current material, the CNN classification pipeline is the most operational part of the algorithm. The landmark and triangulation steps should be presented as a complementary methodological development, not as something already fully integrated into the final classifier.

Results and Pipeline Evolution

The figures below illustrate the progressive development of the iFACE pipeline: initial image-based classification, examination of confusion patterns, checks for overfitting, and later work on landmark-guided facial representation.

Landmarking and geometric representation

Figure: Example of facial landmark localisation on an infant face. These points provide a structured representation of facial geometry and make it easier to study the spatial configuration of brows, eyes, nose, and mouth.
Figure: Delaunay triangulation built from facial landmarks. This geometric representation is useful for analysing local facial deformations and for exploring more interpretable infant-specific descriptors.

Initial classification stage

Figure: Early qualitative classification examples. These outputs suggest that the model can identify some infant expressions correctly at the image level, including surprise in at least some cases.
Figure: Early confusion matrix from the first classification stage. The results indicate that some categories are discriminable, but also reveal substantial confusion between visually close expressions.
A key challenge visible in the early matrices is that surprise is not perfectly separated from neighbouring categories such as fear. This is scientifically important because these expressions may share features such as widened eyes or open mouth configurations.

Overfitting assessment

Figure: Later-stage confusion matrix with very strong diagonal values. Such results can be encouraging, but they also motivate closer examination of whether performance is stable on unseen data.
Figure: Overfitting report used to assess whether strong apparent performance truly reflects generalisable learning. In this workflow, model development is not limited to maximising accuracy; it also involves checking whether results remain credible beyond the training data.
Step 1: Training the classification head
In the first stage, the model starts from a pre-trained visual backbone and learns to separate infant facial expressions using a task-specific classification layer. This is a common transfer-learning strategy: the system keeps previously learned visual knowledge while adapting the final decision layer to the new emotion labels.
Step 2: Fine-tuning the network
In the second stage, the optimisation can be extended to more of the network so that the visual representation itself becomes better adapted to infant faces. This stage is important because infant morphology differs from adult faces, and a more specific representation may improve classification while still requiring careful monitoring for overfitting.
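A hedged sketch of this two-stage strategy, reusing the model sketched earlier; which layers are frozen and the learning rates are illustrative choices, not the notebook's.

```python
from torch.optim import Adam

# Stage 1: freeze the pre-trained backbone and train only the classification head
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
head_optimizer = Adam(model.fc.parameters(), lr=1e-3)

# Stage 2: unfreeze the backbone and fine-tune end-to-end with a smaller learning rate
for param in model.parameters():
    param.requires_grad = True
finetune_optimizer = Adam(model.parameters(), lr=1e-5)
```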

Refined stage after adjustment

Figure: Qualitative examples from a later version of the pipeline after refinement. At this stage, the focus is not only on correct predictions, but on obtaining results that are more stable and more believable across images.
Figure: Final confusion matrix from a later development stage. The model still captures meaningful expression-related structure, but some categories remain difficult to separate fully, especially when facial configurations overlap.

Main strength

The project already demonstrates that infant facial expressions can be approached with a transfer-learning pipeline based on a modern CNN architecture, rather than relying only on adult-oriented tools.

Main challenge

The classification problem remains difficult because several infant expressions share partially overlapping facial cues. This makes confusion analysis just as important as global accuracy.

What the figures really show

The material documents an iterative research process: build a model, evaluate errors, inspect possible overfitting, test refinements, and develop more interpretable facial representations.

Current stage of the project

iFACE should be understood as a promising but still developing infant-specific emotion-recognition framework. Its value lies both in current performance and in the methodological tools it is building for future improvement.

Contribute

We welcome contributions to the iFACE project. Contributions may include improving the codebase, discussing infant-adapted facial descriptors, helping with annotation strategies, or contributing to more robust and transparent model evaluation.

The project may be of particular interest to researchers working on infant development, affective computing, computer vision, facial landmarks, reproducible research, and interpretable machine learning.

For more information, please contact: romain.di-stasi@outlook.com.