Building a social robot that is able to interact naturally with people is a challenging task, which becomes even more ambitious if the robot's interlocutors are children involved in crowded scenarios such as a classroom or a museum. In such scenarios, the main concern is enabling the robot to track the subjects' social and affective states and to modulate its behaviour on the basis of the engagement and the emotional state of its interlocutors. To reach this goal, the robot needs to gather visual and auditory data, but also to acquire physiological signals, which are fundamental for understanding the interlocutors' psycho-physiological state. To this end, several Human-Robot Interaction (HRI) frameworks have been proposed in recent years, although most of them rely on wearable sensors. However, wearable equipment is not the best acquisition technology for crowded multi-party environments, for practical reasons (e.g., every subject must be equipped with the acquisition devices before the experiment). Furthermore, wearable sensors, even when designed to be minimally intrusive, add an extra factor to the HRI scenario, introducing a measurement bias due to psychological stress. To overcome these limitations, in this work we present an unobtrusive method to acquire both visual and physiological signals from multiple subjects involved in HRI. The system integrates the acquired data and associates them with unique subject IDs. The implemented system has been tested with the FACE humanoid in order to assess the technical features of the integrated devices and algorithms. Preliminary tests demonstrated that the developed system can extend the perception capabilities of FACE, giving it a sort of "sixth sense" that will improve the robot's empathic and behavioural capabilities.
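
As a minimal illustration of the data-association step mentioned above (integrating the acquired data and attaching them to unique subject IDs), the following Python sketch shows one plausible way such an association could work: tracked face positions define per-subject records, and each contactless physiological sample is attached to the spatially nearest tracked subject. All class names, fields, and thresholds below are hypothetical assumptions introduced for illustration, not the actual implementation described in the paper.

```python
# Illustrative sketch (not the authors' implementation) of fusing visual and
# contactless physiological samples from multiple subjects under unique IDs.
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class FaceSample:
    """A face detection from the vision pipeline (assumed format)."""
    timestamp: float
    position: Tuple[float, float]  # face centre in image coordinates

@dataclass
class PhysioSample:
    """A contactless physiological reading (assumed format)."""
    timestamp: float
    position: Tuple[float, float]  # spatial origin of the reading
    heart_rate: float              # beats per minute

@dataclass
class SubjectTrack:
    """Per-subject record keyed by a unique ID."""
    subject_id: int
    last_position: Tuple[float, float]
    physio: List[PhysioSample] = field(default_factory=list)

def _distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])

class SubjectRegistry:
    """Assigns unique IDs to detected subjects and attaches physiological
    samples to the nearest tracked subject (greedy nearest neighbour)."""

    def __init__(self, max_dist: float = 60.0):
        self.max_dist = max_dist  # association gate in pixels (assumed value)
        self.tracks: Dict[int, SubjectTrack] = {}
        self._next_id = 0

    def update_faces(self, faces: List[FaceSample]) -> None:
        """Match each detected face to an existing track or open a new one."""
        for face in faces:
            nearest = self._nearest_track(face.position)
            if nearest is None:
                self.tracks[self._next_id] = SubjectTrack(
                    subject_id=self._next_id, last_position=face.position)
                self._next_id += 1
            else:
                nearest.last_position = face.position

    def attach_physio(self, sample: PhysioSample) -> Optional[int]:
        """Associate a physiological sample with the closest subject;
        returns the subject ID, or None if no track is close enough."""
        nearest = self._nearest_track(sample.position)
        if nearest is None:
            return None
        nearest.physio.append(sample)
        return nearest.subject_id

    def _nearest_track(self, pos: Tuple[float, float]) -> Optional[SubjectTrack]:
        best, best_d = None, self.max_dist
        for track in self.tracks.values():
            d = _distance(track.last_position, pos)
            if d < best_d:
                best, best_d = track, d
        return best

# Example: two subjects in view; a heart-rate reading is attached to the
# subject whose tracked face is spatially closest.
registry = SubjectRegistry()
registry.update_faces([FaceSample(0.0, (100.0, 120.0)),
                       FaceSample(0.0, (400.0, 110.0))])
sid = registry.attach_physio(PhysioSample(0.1, (395.0, 115.0), heart_rate=82.0))
print(sid, registry.tracks[sid].physio[0].heart_rate)  # -> 1 82.0
```

A greedy nearest-neighbour gate is only the simplest plausible choice; a real crowded multi-party deployment would likely require a proper multi-target tracker (e.g., globally optimal assignment with temporal smoothing) to keep subject IDs stable when people move close together.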