Text cleaning is a critical preprocessing step in natural language processing (NLP), data normalisation, SEO slug generation, search indexing, and database entry sanitisation. This tool provides a configurable set of cleaning operations including accent removal, punctuation stripping, non-ASCII filtering, and HTML tag removal.
The tool uses Unicode Normalisation Form D (NFD) to decompose characters like é into e + ́ (the base letter followed by U+0301, the combining acute accent). It then removes every nonspacing combining mark (Unicode category Mn), leaving only the base letters.
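The decompose-then-filter approach described above can be sketched in Python with the standard unicodedata module. This is a minimal illustration, not the tool's actual code, and the function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents by NFD decomposition followed by Mn filtering."""
    # NFD splits precomposed characters: "é" becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character except nonspacing marks (category Mn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("Crème Brûlée"))  # Creme Brulee
```

Letters that have no canonical decomposition, such as ø or ß, pass through this sketch unchanged, because NFD does not split them into a base letter plus a mark.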
The HTML stripper removes tags (anything between < and >) but keeps the text content between tags. It also decodes common HTML entities: &amp; → &, &lt; → <, &nbsp; → space.
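A rough Python sketch of this behaviour follows. It assumes a simple regular expression is enough to match tags and handles only the three entities listed above. The names TAG_RE, ENTITIES, and strip_html are hypothetical, and the tool's own implementation may differ.

```python
import re

# Matches anything between < and >, including the brackets themselves.
TAG_RE = re.compile(r"<[^>]*>")

# Entities from the description above. "&amp;" is decoded last so that
# input such as "&amp;lt;" yields the literal text "&lt;" rather than "<".
ENTITIES = (
    ("&lt;", "<"),
    ("&nbsp;", " "),
    ("&amp;", "&"),
)


def strip_html(text: str) -> str:
    """Remove tags, keep the text between them, then decode entities."""
    # Strip tags first so a decoded "<" is never mistaken for a tag start.
    result = TAG_RE.sub("", text)
    for entity, replacement in ENTITIES:
        result = result.replace(entity, replacement)
    return result


print(strip_html("<p>Fish &amp; chips</p>"))  # Fish & chips
```

A regular expression like this is adequate for stripping simple markup, but it is not an HTML parser: a > inside an attribute value or a comment would end the match early.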