C# – Finding methods in source code using regular expressions

c# regex

I have a program which looks in source code, locates methods, and performs some calculations on the code inside of each method. I am trying to use regular expressions to do this, but this is my first time using them in C# and I am having difficulty testing the results.

If I use this regular expression to find the method signature:

((private)|(public)|(sealed)|(protected)|(virtual)|(internal))+([a-z]|[A-Z]|[0-9]|[\s])*([\()([a-z]|[A-Z]|[0-9]|[\s])*([\)|\{]+)

and then split the source code on that pattern, storing the results in an array of strings:

string[] MethodSignatureCollection = regularExpression.Split(SourceAsString);

would this get me what I want, i.e. a list of methods including the code inside each of them?

Best Solution

I would strongly suggest using Reflection (if it is appropriate) or CSharpCodeProvider.Parse(...) (as recommended by rstevens) instead of a regular expression.
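As a rough illustration of the Reflection route, the sketch below lists the methods declared on a type. Note that Reflection works on compiled types, not on source text, and it will not hand you the source of a method body, which is why it only fits "if it is appropriate". The names MethodLister, DeclaredMethodNames and Sample are made up for this example.

```csharp
using System;
using System.Linq;
using System.Reflection;

public static class MethodLister
{
    // Returns the names of the methods declared directly on the given type
    // (public and non-public, instance and static), skipping inherited ones.
    public static string[] DeclaredMethodNames(Type type)
    {
        const BindingFlags flags =
            BindingFlags.Public | BindingFlags.NonPublic |
            BindingFlags.Instance | BindingFlags.Static |
            BindingFlags.DeclaredOnly;

        return type.GetMethods(flags)
                   .Where(m => !m.IsSpecialName)   // drop property/event accessors and operators
                   .Select(m => m.Name)
                   .OrderBy(n => n, StringComparer.Ordinal)
                   .ToArray();
    }
}

// Hypothetical sample type, used only to illustrate the call.
public class Sample
{
    public void Foo() { }
    private int _Bar() { return 0; }
    public int Count { get; set; }
}
```

Calling MethodLister.DeclaredMethodNames(typeof(Sample)) should report Foo and _Bar, without the property accessors for Count.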

It can be very difficult to write a regular expression that works in all cases.

Here are some cases you'd have to handle:

public /* comment */ void Foo(...)      // Comments can be everywhere
string foo = "public void Foo(...){}";  // Don't match signatures in strings 
private __fooClass _Foo()               // Underscores are ugly, but legal
private void @while()                   // Identifier escaping
public override void Foo(...)           // Have to recognize overrides
void Foo()                              // Defaults to private
void IDisposable.Dispose()              // Explicit implementation

public // More comments                 // Signatures can span lines
    void Foo(...)

private void                            // Attributes
   Foo([Description("Foo")] string foo) 

#if(DEBUG)                              // Don't forget the pre-processor
    private
#else
    public
#endif
    int Foo() { }
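To make a few of these cases concrete, here is a small sketch using a deliberately naive pattern. The pattern and the class name NaiveSignatureDemo are my own illustration, not the regex from the question; the point is only to show how quickly a "reasonable looking" pattern produces false positives and false negatives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveSignatureDemo
{
    // Deliberately naive pattern (an assumption for this sketch):
    // an access modifier, a return type, a name, and a parameter list.
    private static readonly Regex Naive = new Regex(
        @"\b(?:public|private|protected|internal)\s+\w+\s+\w+\s*\([^)]*\)");

    // True if the naive pattern finds something that looks like a signature.
    public static bool LooksLikeSignature(string source)
    {
        return Naive.IsMatch(source);
    }
}
```

This pattern accepts "public void Foo()", but it also accepts the signature hidden inside the string literal above (a false positive), and it rejects the override, comment, @while and explicit interface cases (false negatives). Each fix tends to make the pattern longer and break something else.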

Notes:

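  • The Split approach discards the text that it matches, so you will in fact lose the "signatures" that you are splitting on. (In .NET, Regex.Split does put the text of any capturing groups into the result array, but with a pattern like yours that gives you fragments such as a single modifier mixed in with the other pieces, not whole signatures.)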
  • Don't forget that signatures can have commas in them.
  • {...} blocks can be nested, and your current regexp could consume more { characters than it should.
  • There is a lot of other stuff (preprocessor commands, using statements, properties, comments, enum definitions, attributes) that can show up in code, so just because something is between two method signatures does not make it part of a method body.
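If you stay with regular expressions anyway, the first and third notes can be sketched roughly as follows: use Regex.Matches rather than Split so the signatures are kept, and count brace depth by hand to find where a body ends. The signature pattern and the names SplitVersusMatches and BraceScanner are hypothetical, and the scanner is simplified: it does not skip braces inside strings, char literals or comments.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVersusMatches
{
    // Hypothetical, deliberately simple signature pattern (illustration only).
    private static readonly Regex Signature =
        new Regex(@"(public|private)\s+void\s+\w+\(\)");

    // Split drops the matched text; only the captured group text is kept.
    public static string[] SplitOnSignatures(string source)
    {
        return Signature.Split(source);
    }

    // Matches keeps each full signature (and its position via Match.Index).
    public static string[] MatchSignatures(string source)
    {
        return Signature.Matches(source).Cast<Match>()
                        .Select(m => m.Value).ToArray();
    }
}

public static class BraceScanner
{
    // Returns the index of the '}' that closes the '{' at openIndex,
    // or -1 if the braces are unbalanced.
    // Simplification (assumption of this sketch): braces inside string
    // literals, char literals and comments are NOT skipped.
    public static int FindMatchingBrace(string source, int openIndex)
    {
        if (openIndex < 0 || openIndex >= source.Length || source[openIndex] != '{')
            throw new ArgumentException("openIndex must point at a '{'.", "openIndex");

        int depth = 0;
        for (int i = openIndex; i < source.Length; i++)
        {
            if (source[i] == '{') depth++;
            else if (source[i] == '}' && --depth == 0) return i;
        }
        return -1;
    }
}
```

A real implementation would still have to deal with every case listed above, which is why a proper parser is the safer choice.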