An extension to my previous question:
Text cleaning and replacement: delete \n from a text in Java
I am cleaning this incoming text, which comes from a database with irregular text. That means there's no standard or rules. Some contain HTML entities like &reg, &trade, &lt, and others come in this form: &#8221, &#8211, etc. Other times I just get the HTML tags with < and >.
I am using String.replace() to replace the entities with the characters they stand for (this should be fine since I'm using UTF-8, right?), and replaceAll() to remove the HTML tags with a regular expression.
Other than one call to the replace() function for each replacement, and compiling the HTML tags regular expression, is there any recommendation to make this replacement efficient?
My first suggestion is to measure the performance of the simplest way of doing it (which is probably multiple replace/replaceAll calls). Yes, it's potentially inefficient. Quite often the simplest way of doing this is inefficient. You need to ask yourself: how much do you care?
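For reference, the naive version I have in mind looks something like the sketch below. The class name `NaiveCleaner`, the tag pattern, and the handful of entity mappings are all assumptions for illustration, not your actual replacement list. Tags are stripped before entities are decoded, so that a decoded `<` isn't mistaken for the start of a tag.

```java
import java.util.regex.Pattern;

// Hypothetical baseline: one precompiled pattern for tags,
// then one String.replace call per entity.
final class NaiveCleaner {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // This pattern is a deliberately crude assumption, not a real HTML parser.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    static String clean(String input) {
        // Remove anything that looks like a tag first.
        String s = TAGS.matcher(input).replaceAll("");
        // Literal (non-regex) replacements; the mapping is illustrative only.
        s = s.replace("&reg;", "\u00AE");    // registered sign
        s = s.replace("&trade;", "\u2122");  // trade mark sign
        s = s.replace("&#8221;", "\u201D");  // right double quotation mark
        s = s.replace("&#8211;", "\u2013");  // en dash
        s = s.replace("&lt;", "<");          // decoded last, after tag removal
        return s;
    }
}
```

Calling `NaiveCleaner.clean(text)` on each row gives you a baseline to measure before trying anything cleverer.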
Do you have sample data and a threshold at which point the performance is acceptable? If you don't, that's the first port of call. Then test the naive implementation, and see whether it really is a problem. (Bear in mind that string replacement is almost certainly only part of what you're doing. As you're fetching the text from a database to start with, that may well end up being the bottleneck.)
Once you've determined that the replacement really is the bottleneck, it's worth performing some tests to see which bits of the replacement are causing the biggest problem - it sounds like you're doing several different kinds of replacement. The more you can narrow it down, the better: you may find that the real bottleneck in the simplest code is caused by something which is easy to make efficient in a reasonably simple way, whereas trying to optimise everything would be a lot harder.
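To narrow it down, you could time each kind of replacement separately against your sample data. Here's a rough sketch; `StepTimer`, `timeSteps`, and the step names are made up for the example, and a hand-rolled `System.nanoTime()` loop like this is only a crude indication (a proper harness such as JMH would give more trustworthy numbers).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper: times each named replacement step on the same sample.
final class StepTimer {
    static Map<String, Long> timeSteps(Map<String, UnaryOperator<String>> steps,
                                       String sample, int iterations) {
        Map<String, Long> nanos = new LinkedHashMap<>();
        for (Map.Entry<String, UnaryOperator<String>> e : steps.entrySet()) {
            UnaryOperator<String> step = e.getValue();
            int sink = 0;
            // Warm-up pass so the JIT has a chance to compile the step.
            for (int i = 0; i < iterations; i++) {
                sink += step.apply(sample).length();
            }
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                sink += step.apply(sample).length();
            }
            long elapsed = System.nanoTime() - start;
            // Use the accumulated value so the work can't be optimised away.
            if (sink == Integer.MIN_VALUE) {
                System.out.println(sink);
            }
            nanos.put(e.getKey(), elapsed);
        }
        return nanos;
    }
}
```

You'd register one entry per kind of replacement (for example a "tags" step wrapping the regex and an "entities" step wrapping the replace calls), run it over representative text, and only optimise whichever step dominates.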