Reconstructing Textual Documents from n-grams

Published by NAVER LABS Europe at 3 August 2015

Matthias Gallé, Matias Tealdi

21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10-13 August, 2015.

2015-006.pdf

We analyze the problem of reconstructing documents when we only have access to the n-grams and their counts from the original documents and a xed n. Formally, we are interested in recovering the longest contiguous substrings of whose presence in the original documents we are certain.
We map this problem on a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycles gives a plausible reconstruction.
We de ne two rules that reduce this graph in such a way of preserving all possible reconstruction, while at the same time increasing the length of the edge labels. From a theoretical perspective we prove that the iterative application of these rules gives an irreducible graph equivalent to the original one. We then apply this on the data from the Gutenberg project to measure the number and size of the obtained longest substrings. Moreoever, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low frequent n-grams has little impact. Instead, we propose another method consisting in adding strategically ctitious n-grams and show that a noised corpus like that is much harder to reconstruct while increasing only little the perplexity of a language model obtained through it.

NAVER FRANCE Gender Equality 2024

All

Publications

Blog

News

Code & Data

Careers

People

ACTION

Providing embodied agents with sequential decision-making capabilities to safely execute complex tasks in dynamic environments.

INTERACTION

Equip robots to interact safely with humans, other robots and systems.

VISION

Perception to help robots understand and interact with the environment.

NAVER FRANCE Gender Equality 2023

Action

Reconstructing Textual Documents from n-grams

All

Publications

Blog

News

Code & Data

Careers

People

Cookie settings