We propose a new deep learning architecture for the tasks of semantic segmentation and depth prediction from RGB-D images. We revise the state of the art based on the RGB and depth feature fusion, where both modalities are assumed to be available at train and test time. We propose a new architecture where the feature fusion is replaced with a common deep representation. Combined with an encoder decoder network for feature map extraction, the architecture can jointly learn models for semantic segmentation and depth estimation based on their common representation. This representation, inspired by multi-view learning, offers several important advantages, such as using one modality available at test time to reconstruct the missing modality. In the RGB-D case, this enables cross-modality scenarios, such as using depth data for semantic segmentation and the RGB images for depth estimation. We demonstrate the effectiveness of the proposed network on two publicly available RGB-D datasets. The experimental results show that the proposed method works well in both semantic segmentation and depth estimation tasks.