Php – How to remove Word markup crap when inserting to a form


I'm building a CMS in PHP and one dread I have is that the users will have to fill the data in from existing Word (and Excel, but nevermind that) documents. Now, I've seen what happens when they carelessly copy and paste from Word to a textarea: the database got filled with crap markup.

Now, I could certainly strip all markup myself, but I'd have to start learning about it first. So I ask you: have you tested some functionality – plugins of the usual suspects (tinyMCE, FCKeditor, etc) that helps here? Bonus for the least intrusive solution.

Best Solution

Sadly most of the HTML editor controls I've used either:

  1. Have a button to strip out various elements of mark up (word, html, script, etc)
  2. Strip out all markup on paste via JavaScript.

If you leave it to a button, then generally the non-technical users will forget to press it because they don't (some would say "shouldn't have to") care about it :(

With a bit of playing around with Regular Expressions (now you have another problem ;)) you could do something similar to 2 but just for word xml.