A Bag-of-Pages Approach to Unordered Multi-Page Document Classification - Naver Labs Europe

We are interested in the problem of classifying documents containing multiple unordered pages. For this purpose, we propose a novel bag-of-pages document representation. Offline, we learn page clusters on a training set. To represent a new document, one assigns every page to a cluster and counts the proportion of pages assigned to each cluster. This leads to a histogram representation which can then be fed to any discriminative classifier. We consider several refinements of this initial approach: 1/ clusters learned in a supervised manner; 2/ soft-assignment of pages to clusters; and 3/ going beyond simple counting. We show on two challenging datasets that the proposed approach outperforms a simple baseline system.
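
The basic representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that pages are already described by numeric feature vectors, that cluster centroids were learned offline (for example with k-means), and that hard assignment uses Euclidean distance. The names `assign_page`, `bag_of_pages`, `centroids`, and `document` are hypothetical and chosen for illustration only.

```python
from math import dist


def assign_page(page, centroids):
    """Return the index of the centroid nearest to the page feature vector."""
    return min(range(len(centroids)), key=lambda k: dist(page, centroids[k]))


def bag_of_pages(pages, centroids):
    """Return the proportion of pages assigned to each cluster.

    Page order has no effect on the result, which is what makes the
    representation suitable for unordered documents.
    """
    counts = [0] * len(centroids)
    for page in pages:
        counts[assign_page(page, centroids)] += 1
    total = len(pages)
    if total == 0:
        return [0.0] * len(centroids)
    return [c / total for c in counts]


# Assumed toy data: 2-D page features and three pre-learned centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
document = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]

histogram = bag_of_pages(document, centroids)
print(histogram)  # [0.25, 0.5, 0.25]
```

The resulting histogram has one entry per cluster and sums to 1, so documents with different page counts map to vectors of the same length that any discriminative classifier can consume. A soft-assignment variant would replace the single increment with fractional weights spread over several clusters.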

NAVER LABS Europe