Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific.

Like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. It is moreover an important problem, because any errors occurring early in the pipeline affect further analysis negatively.

We believe that regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules? Can we use unsupervised feature learning to reduce feature engineering effort and boost performance? Can we use the same method across languages? Can we combine word and sentence boundary detection into one task?

The development and implementation of elephant is described in the paper Kilian Evang, Valerio Basile, Grzegorz ChrupaƂa and Johan Bos (2013): Elephant: Sequence Labeling for Word and Sentence Segmentation. Proceedings of the EMNLP 2013: Conference on Empirical Methods in Natural Language Processing, Seattle, United States [PDF] [BibTeX] [Poster]