Learning visual representations with caption annotations - Naver Labs Europe