Word Ninja: A Probabilistic Approach to Word Boundary Detection

Handling text without proper word boundaries can be a significant challenge in data processing. Manual separation and pattern-based methods can be slow and complex.

Word Ninja uses probabilistic language models to quickly and accurately separate concatenated words.

import wordninja 

wordninja.split("honeyinthejar")
#> ['honey', 'in', 'the', 'jar']

wordninja.split("ihavetwoapples")
#> ['i', 'have', 'two', 'apples']

wordninja.split("aratherblusterday")
#> ['a', 'rather', 'bluster', 'day']

Link to Word Ninja.

Leave a Comment

Your email address will not be published. Required fields are marked *

Related Posts

Scroll to Top