Querying a database using an example image is a simple and intuitive interface to retrieve information from a database of images. Consequently, instance-level image retrieval has been heavily studied in the computer vision community over the last decade. One problem that has been overlooked, though, is the retrieval of similar visual scenes, where the retrieved images do not exhibit the same object instance as the query image but share the same semantics. In this paper, we define the task of semantic image retrieval, and show through a user study that, despite its subjective nature, it is consistently implemented across a pool of human annotators. Our study also shows that region-level captions constitute a good proxy for semantic similarity. Following this observation, we leverage human captions to learn a global image representation that is compact and still performs well at the semantic retrieval task. We also show that we can jointly train visual and textual embeddings that allow querying with both images and text (even though the database has no textual annotations) and performing arithmetic operations in this joint embedding space. As a by-product of the learning, the network can be used to visualize which regions contributed the most to the similarity between two images, allowing us to interpret our semantic retrieval results in a visual and intuitive way.
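To make the joint-embedding idea concrete, the sketch below illustrates how retrieval and embedding arithmetic could work once visual and textual embeddings live in a shared space. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins, and all names (`database`, `query_image`, `query_text`) are hypothetical.

```python
import numpy as np

def normalize(v):
    # L2-normalize so that dot products become cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical pre-computed embeddings in a shared visual-textual space
# (random placeholders; in practice these come from the trained network).
rng = np.random.default_rng(0)
database = normalize(rng.normal(size=(1000, 128)))   # image embeddings
query_image = normalize(rng.normal(size=128))        # embedded query image
query_text = normalize(rng.normal(size=128))         # embedded query text

# Arithmetic in the joint space: combine an image query with a text query
# by summing the two unit vectors and re-normalizing.
combined = normalize(query_image + query_text)

# Retrieve the nearest database images by cosine similarity; note the
# database itself needs no textual annotation.
scores = database @ combined
top5 = np.argsort(-scores)[:5]
```

With a single shared metric (cosine similarity), the same nearest-neighbor search serves image-only, text-only, and combined image+text queries.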