We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space that can indifferently embed either modality. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
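The architecture above can be outlined as follows. This is a minimal sketch, not the authors' implementation: the choice of two PoS tags (verb/noun), all feature dimensions, the linear projections, and the margin value are illustrative assumptions.

```python
# Sketch of disentangled part-of-speech (PoS) embeddings feeding an
# integrated multi-modal retrieval space. All sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

POS_TAGS = ["verb", "noun"]                  # assumed PoS split
D_TEXT, D_VIDEO, D_POS, D_OUT = 300, 512, 128, 256

# One (text, video) projection pair per PoS tag: each pair defines a
# separate multi-modal embedding space specialised to that PoS.
pos_proj = {p: (rng.standard_normal((D_TEXT, D_POS)) * 0.01,
                rng.standard_normal((D_VIDEO, D_POS)) * 0.01)
            for p in POS_TAGS}

# Integrated space: maps the concatenated PoS embeddings to the
# final space where action retrieval is performed.
W_int = rng.standard_normal((D_POS * len(POS_TAGS), D_OUT)) * 0.01

def l2norm(x):
    return x / (np.linalg.norm(x) + 1e-8)

def embed(text_feats, video_feat):
    """text_feats: dict PoS -> feature of the caption words with that PoS."""
    t_parts, v_parts = [], []
    for p in POS_TAGS:
        Wt, Wv = pos_proj[p]
        t_parts.append(l2norm(text_feats[p] @ Wt))   # PoS-aware text emb.
        v_parts.append(l2norm(video_feat @ Wv))      # PoS-aware video emb.
    # Project into the integrated multi-modal space.
    t = l2norm(np.concatenate(t_parts) @ W_int)
    v = l2norm(np.concatenate(v_parts) @ W_int)
    return t, v

def triplet_loss(anchor, pos, neg, margin=0.2):
    # A margin loss of this form could be applied both inside each PoS
    # space (PoS-aware) and on the integrated space (PoS-agnostic),
    # with all projections trained jointly on the summed losses.
    return max(0.0, margin + anchor @ neg - anchor @ pos)
```

During joint training, one PoS-aware loss would be computed per PoS space and a PoS-agnostic loss on the integrated space, so each specialised space offers a distinct view of the same caption–clip pair.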
We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.