How I removed 50% of the code


Title: Azuleyo tiles somewhere in Portugal, Source: own resources, Authors: Agnieszka and Michał Komorowscy

My last 2 posts were about problems with using Roslyn. Nonetheless, even if I sometime hate it, I'm still using it so the time has come to show some practical example of using Roslyn. Recently, I've been working on the task that can be summed up as: Take this ugly code and do something with it. i.e. more or less the refactoring task.

Now I'll give you some intuition of what I have to deal with. The code that I have to refactor was generated automatically based on XML schema. These were actually DTO classes used to communicate with the external service. Here are some statistics:
  • 28.7 thousands lines of code in 23 files.
  • 2200 classes and 920 enums.
  • Many classes and enums seems to me identical or very similar.
Pretty nice, isn't it? I started with reading this code and calculating some statistics. I did it by using CSharpSyntaxWalker to visit each class and each enum. Classes/enums declarations are represented by ClassDeclarationSyntax and EnumDeclarationSyntaxt nodes in the syntax tree so it was enough to override the following 2 methods:

public class MyWalker: CSharpSyntaxWalker
   public override void VisitClassDeclaration(ClassDeclarationSyntax node) { /* ... */}
   public override void VisitEnumDeclaration(EnumDeclarationSyntax node) { /* ... */}

My walker simply visited all classes and enums and saved them for future processing. Additionally classes and enums were grouped by their names. As this stage I detected that they are actually only 620 unique classes and 340 unique enums. The next step was to verify if grouped classes/enums are the same. Two classes are the same if they:
  • Have the same name.
  • Have the same attributes and these attributes have the same arguments. Here, I decided to ignore some meaningless attributes like GeneratedCodeAttribute.
  • Have the same number of properties.
  • And these properties have the same names and return types.
  • And these properties have the same attributes and these attributes have the same arguments. Here again I ignored some attributes.
All these information can be retrieved from Roslyn syntax tree or from the semantic model. For enums situation is similar but we have to check names and values of enum members instead of properties.

If all classes (enums) within a group are the same they can be replaced by only 1 class. However, there is one tricky thing as to the return type. Let's assume that we have 2 properties i.e. Namespace1.Class.Prop and Namsepace2.Class.Prop. Both these properties return the type with the same name i.e. SomeClass. Now, it's important to check if all classes with the name SomeClass are the same according to the above definition. Only then we can merge Namespace1.Class.Prop and Namespace2.Class.Prop. In other words I had a graph, potenially with cycles, of dependencies to analyse.

I modelled this graph in the easy way. I used a dictionary to store groups of classes (enums) where the key was a name of a class or an enum. Additionally, each class/enum has list of related classes/enum i.e. a base class, return types of properties... After these analysis I was able to merge 950 classes and 490 enums.

However, it is not everything. In the next step I used CSharpSyntaxRewriter class to perform the following additional refactorings:
  • Remove meaningless attributes.
  • Add Enum suffix to enums.
  • Remove the backing fields for properties and introduce auto-properties.
  • Add using directives and remove fully qualified names of types.
  • ...
At the end I write merged classes and enums to individual files. The final effect is really good i.e.:
  • 13.5 thousands lines (-53%) of code in 1670 files i.e. 1 class/enum per file.
  • 1250 classes (-43%) and 430 enums (-53%).
What is the best it's not everything. I'm thinking about performing more sophisticated refactorings. For example even if two or more classes have different properties they can me merged into one class with all these properties providing that there is no conflict in names.

If you have any question about details of my approach just let me know.

*The picture at the beginning of the post comes from own resources and shows Azuleyo tiles somewhere in Portugal.


Andrei Ignat said...

Could you show code for MyWalker ?

Michał Komorowski said...

@Andrei Ignat - The full code of MyWalker is not very short, besides it used some other classes etc. In other words, I think that it's too long to include it here and unfortunately I cannot publish it on GitHub (or somewhere else). However, if you have any specific questions just let me know and I'll try to help.

Post a Comment