A Bag-of-Pages Approach to Unordered Multi-Page Document Classification

Published by NAVER LABS Europe on 23 August 2010

International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23-26 August 2010

We are interested in the problem of classifying documents containing multiple unordered pages. For this purpose, we propose a novel bag-of-pages document representation. Offline, we learn a set of page clusters on a training set. To represent a new document, one assigns every page to a cluster and counts the proportion of pages assigned to each cluster. This leads to a histogram representation which can then be fed to any discriminative classifier. We consider several refinements of this initial approach: 1/ clusters learned in a supervised manner; 2/ soft-assignment of pages to clusters; and 3/ going beyond simple counting. We show on two challenging datasets that the proposed approach outperforms a simple baseline system.
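The representation step described above can be illustrated with a minimal sketch. It assumes each page has already been converted to a numeric feature vector and that cluster centroids were learned offline (for example with k-means). The function names, the toy data, and the softmax-style soft assignment with a `temperature` parameter are illustrative assumptions, not details taken from the paper.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_histogram(pages, centroids):
    """Proportion of pages assigned to each cluster (nearest centroid).

    Assumes `pages` is non-empty.
    """
    counts = [0.0] * len(centroids)
    for page in pages:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(page, centroids[k]))
        counts[nearest] += 1.0
    return [c / len(pages) for c in counts]

def soft_histogram(pages, centroids, temperature=1.0):
    """Average of per-page soft assignments.

    Assumption: each page's weights are a softmax over negative squared
    distances; the paper may use a different soft-assignment scheme.
    """
    hist = [0.0] * len(centroids)
    for page in pages:
        dists = [squared_distance(page, c) for c in centroids]
        d_min = min(dists)  # subtract the minimum for numerical stability
        weights = [math.exp(-(d - d_min) / temperature) for d in dists]
        total = sum(weights)
        for k, w in enumerate(weights):
            hist[k] += w / total
    return [h / len(pages) for h in hist]

# Hypothetical toy data: two clusters, one four-page document.
centroids = [[0.0, 0.0], [10.0, 10.0]]
document = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [0.5, 0.5]]

hard = hard_histogram(document, centroids)
soft = soft_histogram(document, centroids)
```

Because the histogram only counts pages per cluster, shuffling the pages of `document` leaves it unchanged, which is what makes the representation suitable for unordered pages. The resulting vector can then be passed to a discriminative classifier of one's choice.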
