JavaScript – how to tokenize/parse string literals from JavaScript source code

c# javascript parsing

I am working on a program in C# that needs to load some JavaScript code, parse it and do some processing to the string literals found in the code (such as overwrite them with something else).

My problem is that I'm having a difficult time devising an elegant way to actually find the string literals in the JavaScript code in the first place.

For example, take a look at the sample JavaScript code below. Do you see how even Stack Overflow's code highlighter is able to pick out string literals in the code and make them red in color?

I basically want to do the same thing, except that instead of coloring the string literals I will do some processing on them and possibly replace each one with an entirely different string literal.

var dp = {
    sh :                    // dp.sh
    {
        Utils   : {},       // dp.sh.Utils
        Brushes : {},       // dp.sh.Brushes
        Strings : {},
        Version : '1.3.0'
    }
};

dp.sh.Strings = {
    AboutDialog : '<html><head><title>About...</title></head><body class="dp-about"><table cellspacing="0"><tr><td class="copy"><p class="title">dp.SyntaxHighlighter</div><div class="para">Version: {V}</p><p><a href="http://www.dreamprojections.com/syntaxhighlighter/?ref=about" target="_blank">http://www.dreamprojections.com/SyntaxHighlighter</a></p>&copy;2004-2005 Alex Gorbatchev. All right reserved.</td></tr><tr><td class="footer"><input type="button" class="close" value="OK" onClick="window.close()"/></td></tr></table></body></html>',
    
    // tools
    ExpandCode : '+ expand code',
    ViewPlain : 'view plain',
    Print : 'print',
    CopyToClipboard : 'copy to clipboard',
    About : '?',
    
    CopiedToClipboard : 'The code is in your clipboard now.'
};

dp.test1 = 'some test blah blah blah' +  someFunction()  + 'asdfasdfsdf';
dp.test2 = 'some test blah blah blah' +  'xxxxx'  + 'asdfasdfsdf';
dp.test3 = 'some test blah blah blah' +  "XXXXsdf "" \" \' ' sdfdff "" \" \' ' asdfASDaSD FASDF SDF'  + 'asdfasdfsdf";

dp.SyntaxHighlighter = dp.sh;

I have tried scanning the code for quotes, but it gets complicated when the string literal contains escape characters. The other solution I was considering is a regular expression, but I am not strong enough with regular expressions, and I'm not even sure that is the avenue I should be pursuing.

I would like to see what Stack Overflow thinks. Thanks a bunch!

Best Answer

The article "Regexes in Depth: Advanced Quoted String Matching" has some good examples of how to do this with a regex.

One of the approaches is this:

(["'])(?:(?!\1)[^\\]|\\.)*\1
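Reading the pattern piece by piece: (["']) captures the opening quote, (?:(?!\1)[^\\]|\\.)* then consumes either any character that is neither the closing quote nor a backslash, or a backslash together with the character it escapes, and the final \1 requires the same quote character that opened the literal. As a quick way to see what it matches, here is a minimal sketch in JavaScript; the sample line and the variable names are made up for illustration.

```javascript
// The same pattern as a JavaScript regex literal; the g flag finds every match.
const stringLiteral = /(["'])(?:(?!\1)[^\\]|\\.)*\1/g;

// String.raw keeps the backslashes exactly as they would appear in a source file.
const source = String.raw`dp.test2 = 'it\'s here' + "say \"hi\"" + name;`;

// Each element is one complete literal, including its surrounding quotes.
const literals = source.match(stringLiteral);
console.log(literals);
```

The escaped quotes do not end the match early, because each backslash is consumed together with the character that follows it. The pattern itself is unchanged in .NET.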

You could use it as follows:

// Requires: using System.Text.RegularExpressions;
string modifiedJavascriptText = Regex.Replace(
   javascriptText,
   @"([""'])(?:(?!\1)[^\\]|\\.)*\1", // "" is an escaped quote inside a verbatim (@) string
   m => m.Value.ToUpper()            // called once for each matched string literal
);

In this case, every string literal is converted to upper case.
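One thing to keep in mind is that a regex cannot tell a quote in code from a quote inside a comment, such as the apostrophe in "// it's". If that matters for your input, the quote-scanning approach from the question can be made to work by skipping comments and stepping over backslash escapes. The following is only a rough sketch in JavaScript, not a full tokenizer: replaceStringLiterals is a hypothetical helper name, and it deliberately ignores regex literals and template literals. The same loop translates directly to C# with a StringBuilder.

```javascript
// Hypothetical helper: walks the source once, copies comments through
// untouched, and passes each quoted string literal to `transform`.
function replaceStringLiterals(source, transform) {
  let out = '';
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];
    if (ch === '/' && next === '/') {
      // Line comment: copy up to (not including) the end of the line.
      let end = source.indexOf('\n', i);
      if (end === -1) end = source.length;
      out += source.slice(i, end);
      i = end;
    } else if (ch === '/' && next === '*') {
      // Block comment: copy through the closing */.
      let end = source.indexOf('*/', i + 2);
      end = end === -1 ? source.length : end + 2;
      out += source.slice(i, end);
      i = end;
    } else if (ch === '"' || ch === "'") {
      // String literal: scan forward to the matching unescaped quote.
      let j = i + 1;
      while (j < source.length && source[j] !== ch) {
        if (source[j] === '\\') j++; // step over the escaped character
        j++;
      }
      const end = Math.min(j + 1, source.length);
      out += transform(source.slice(i, end));
      i = end;
    } else {
      // Ordinary code: copy one character.
      out += ch;
      i++;
    }
  }
  return out;
}

// Usage: upper-case every literal, leaving the comment alone.
const input = [
  String.raw`var a = 'don\'t'; // it's a comment`,
  'var b = "x";',
].join('\n');
const output = replaceStringLiterals(input, (literal) => literal.toUpperCase());
console.log(output);
```

The callback receives the literal with its quotes, just as m.Value does in the Regex.Replace call, so the same processing logic can be reused in either approach.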