Learning self-supervised representations of audiovisual human-centric data