Microsoft Open XML SDK Code Quality Study

image1.png

My acquaintance with the Open XML SDK began when I needed a library for creating Word documents with some reporting. After working with the Word API for over 7 years, I wanted to try something new and more convenient. This is how I found out that Microsoft has an alternative solution. By tradition, we pre-check programs and libraries used in the company using the PVS-Studio analyzer.

Introduction

Office Open XML, also known as OpenXML or OOXML, is an XML-based format for office documents, including word processing documents, spreadsheets, presentations, and diagrams, shapes, and other graphics. The specification was developed by Microsoft and adopted by ECMA International in 2006. In June 2014, Microsoft released the Open XML SDK in open source. The sources are now available at Github under the MIT licenses.

To find errors in the source code of the library, we used PVS-Studio… It is a tool for detecting bugs and potential vulnerabilities in the source code of programs written in C, C ++, C # and Java. Works in 64-bit systems on Windows, Linux and macOS.

The project is small enough and there were few warnings. But the choice of the title picture was based precisely on the results. There are a lot of useless conditional operators in the code. It seems to me that if you refactor all such places in the code, then the volume will be noticeably reduced. As a result, the readability of the code will also increase.

Why Word API and not Open XML SDK

As you might have guessed from the title, I continued to use the Word API. This method has a lot of disadvantages:

  • Old awkward API;
  • Microsoft Office must be installed;
  • The need to distribute a distribution kit with Office libraries;
  • Dependence of the Word API on the system locale settings;
  • Low speed of work.

With the locale in general, an amusing incident happened. Windows has a dozen regional settings. On one of the server computers there was a mess of the USA and UK locales. Because of this, Word documents were created, where instead of the dollar symbol there was a ruble, and pounds were not displayed at all. Improving the operating system settings fixed the problem.

Enumerating all this, I wondered again why I still use it …

But no, I like the Word API better so far, and here’s why.

OOXML looks like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document ....>
  <w:body>
    <w:p w:rsidR="00E22EB6"
         w:rsidRDefault="00E22EB6">
      <w:r>
        <w:t>This is a paragraph.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00E22EB6"
         w:rsidRDefault="00E22EB6">
      <w:r>
        <w:t>This is another paragraph.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

Where (Word Run) – not a sentence, and not even a word, but any piece of text that has attributes that are different from the adjacent text fragments.

It is programmed with something like this:

Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AppendChild(new Text(txt));

The document has a specific internal structure, and in the code you need to create the same elements. The Open XML SDK, in my opinion, does not have enough abstract data access layer. Creating a document using the Word API will be clearer and shorter. Especially when it comes to tables and other complex data structures.

In turn, the Open XML SDK solves a wide range of tasks. With it, you can create documents not only for Word, but also for Excel and PowerPoint. This library is probably more suitable for some tasks, but I decided to stay on the Word API for now. In any case, it will not be possible to completely abandon it, because for internal needs, we are developing a plugin for Word, and there it is possible to use only the Word API.

Two values ​​for string

V3008 The ‘_rawOuterXml’ variable is assigned values ​​twice successively. Perhaps this is a mistake. Check lines: 164, 161. OpenXmlElement.cs 164

internal string RawOuterXml
{
    get => _rawOuterXml;

    set
    {
        if (string.IsNullOrEmpty(value))
        {
            _rawOuterXml = string.Empty;
        }

        _rawOuterXml = value;
    }
}

A type string can have 2 types of values: null and a text value. It is definitely safer to use textual meaning, but both approaches are valid. Here in this project the meaning null use is unacceptable and is overwritten to string.Empty… at least that’s how it was intended. But due to an error in RawOuterXml can still be written null, and then refer to this field, getting NullReferenceException

V3022 Expression ‘namespaceUri! = Null’ is always true. OpenXmlElement.cs 497

public OpenXmlAttribute GetAttribute(string localName, string namespaceUri)
{
    ....
    if (namespaceUri == null)
    {
        // treat null string as empty.
        namespaceUri = string.Empty;
    }
    ....
    if (HasAttributes)
    {
        if (namespaceUri != null)  // <=
        {
            ....
        }
        ....
    }
    ....
}

This snippet uses the same approach, the author of the code did not make any major mistake, but still smells of bad refactoring. Most likely, one check can be removed here, which will reduce the width of the code, and, therefore, increase its readability.

About the compactness of the code

image2.png

V3009 It’s odd that this method always returns one and the same value of ‘”.xml”‘. CustomXmlPartTypeInfo.cs 31

internal static string GetTargetExtension(CustomXmlPartType partType)
{
    switch (partType)
    {
        case CustomXmlPartType.AdditionalCharacteristics:
            return ".xml";

        case CustomXmlPartType.Bibliography:
            return ".xml";

        case CustomXmlPartType.CustomXml:
            return ".xml";

        case CustomXmlPartType.InkContent:
            return ".xml";

        default:
            return ".xml";
    }
}

I don’t know if there is any typo here, or if the author of the code wrote “nice” code in his opinion. I am sure that there is no point in returning so many values ​​of the same type from a function, and the code can be greatly reduced.

This is not the only such place. Here are a couple more of these warnings:

  • V3009 It’s odd that this method always returns one and the same value of ‘”.xml”‘. CustomPropertyPartTypeInfo.cs 25
  • V3009 It’s odd that this method always returns one and the same value of ‘”.bin”‘. EmbeddedControlPersistenceBinaryDataPartTypeInfo.cs 22

It would be interesting to hear why write like that.

V3139 Two or more case-branches perform the same actions. OpenXmlPartReader.cs 560

private void InnerSkip()
{
    Debug.Assert(_xmlReader != null);

    switch (_elementState)
    {
        case ElementState.Null:
            ThrowIfNull();
            break;

        case ElementState.EOF:
            return;

        case ElementState.Start:
            _xmlReader.Skip();
            _elementStack.Pop();
            GetElementInformation();
            return;

        case ElementState.End:
        case ElementState.MiscNode:
            // cursor is end element, pop stack
            _xmlReader.Skip();
            _elementStack.Pop();
            GetElementInformation();
            return;
        ....
    }
    ....
}

There are fewer questions about this code. Most likely, identical cases can be combined and the code will become shorter and more obvious.

A few more such places:

  • V3139 Two or more case-branches perform the same actions. OpenXmlMiscNode.cs 312
  • V3139 Two or more case-branches perform the same actions. CustomPropertyPartTypeInfo.cs 30
  • V3139 Two or more case-branches perform the same actions. CustomXmlPartTypeInfo.cs 15
  • V3139 Two or more case-branches perform the same actions. OpenXmlElement.cs 1803

Those Always true / false

Now is the time to provide some code examples that determined my choice of title image.

Warning 1

V3022 Expression ‘Complete ()’ is always false. ParticleCollection.cs 243

private bool IsComplete => Current is null ||
                           Current == _collection._element.FirstChild;

public bool MoveNext()
{
    ....
    if (IsComplete)
    {
        return Complete();
    }

    if (....)
    {
        return Complete();
    }

    return IsComplete ? Complete() : true;
}

Property IsComplete is used 2 times, and it is easy to understand from the code that its value will not change. Thus, at the end of the function, one can simply return the second value of the ternary operator – true

Warning 2

V3022 Expression ‘_elementStack.Count> 0’ is always true. OpenXmlDomReader.cs 501

private readonly Stack<OpenXmlElement> _elementStack;

private bool MoveToNextSibling()
{
    ....
    if (_elementStack.Count == 0)
    {
        _elementState = ElementState.EOF;
        return false;
    }
    ....
    if (_elementStack.Count > 0) // <=
    {
        _elementState = ElementState.End;
    }
    else
    {
        // no more element, EOF
        _elementState = ElementState.EOF;
    }
    ....
}

Obviously, if on the stack_elementStack not 0 elements, then there are more of them. The code can be shortened by at least 8 lines.

Warning 3

V3022 Expression ‘rootElement == null’ is always false. OpenXmlPartReader.cs 746

private static OpenXmlElement CreateElement(string namespaceUri, string name)
{
    if (string.IsNullOrEmpty(name))
    {
        throw new ArgumentException(....);
    }

    if (NamespaceIdMap.TryGetNamespaceId(namespaceUri, out byte nsId)
        && ElementLookup.Parts.Create(nsId, name) is OpenXmlElement element)
    {
        return element;
    }

    return new OpenXmlUnknownElement();
}

private bool ReadRoot()
{
  ....
  var rootElement = CreateElement(....);

  if (rootElement == null) // <=
  {
      throw new InvalidDataException(....);
  }
  ....
}

Function CreateElement cannot return null… If the company made a rule to write methods for creating xml-nodes that either return a valid object or throw an exception, then users of such functions can not abuse additional checks.

Warning 4

V3022 Expression ‘nameProvider’ is always not null. The operator ‘?.’ is excessive. OpenXmlSimpleTypeExtensions.cs 50

public static XmlQualifiedName GetSimpleTypeQualifiedName(....)
{
    foreach (var validator in validators)
    {
        if (validator is INameProvider nameProvider &&
            nameProvider?.QName is XmlQualifiedName qname) // <=
        {
            return qname;
        }
    }

    return type.GetSimpleTypeQualifiedName();
}

The is operator has the following pattern:

expr is type varname

If expression expr takes on the meaning true, then the object varname will be valid and do not need to be compared again with nullas done in this code snippet.

Warning 5

V3022 Expression ‘extension == “.xlsx” || extension == “.xlsm” ‘is always false. PresentationDocument.cs 246

public static PresentationDocument CreateFromTemplate(string path)
{
    ....
    string extension = Path.GetExtension(path);
    if (extension != ".pptx" && extension != ".pptm" &&
        extension != ".potx" && extension != ".potm")
    {
        throw new ArgumentException("...." + path, nameof(path));
    }

    using (PresentationDocument template = PresentationDocument.Open(....)
    {
        PresentationDocument document = (PresentationDocument)template.Clone();

        if (extension == ".xlsx" || extension == ".xlsm")
        {
            return document;
        }
        ....
    }
    ....
}

An interesting code turned out. First, the author eliminated all documents with extensions not .pptx, .pptm,.potx and.potm, and then decided to check that among them there is no .xlsx and .xlsm… Function PresentationDocument – definitely a victim of refactoring.

Warning 7

V3022 Expression ‘OpenSettings.MarkupCompatibilityProcessSettings == null’ is always false. OpenXmlPackage.cs 661

public MarkupCompatibilityProcessSettings MarkupCompatibilityProcessSettings
{
    get
    {
        if (_mcSettings is null)
        {
            _mcSettings = new MarkupCompatibilityProcessSettings(....);
        }

        return _mcSettings;
    }

    set
    {
        _mcSettings = value;
    }
}

public MarkupCompatibilityProcessSettings MarkupCompatibilityProcessSettings
{
    get
    {
        if (OpenSettings.MarkupCompatibilityProcessSettings == null) // <=
        {
            return new MarkupCompatibilityProcessSettings(....);
        }
        else
        {
            return OpenSettings.MarkupCompatibilityProcessSettings;
        }
    }
}

Property MarkupCompatibilityProcessSettings never returns null… If in the getter it turns out that the class field has a value null, then the object is overwritten with a new one. Also, note that this is not a recursive call to one property, but properties of the same name from different classes. Perhaps some confusion has led to the writing of unnecessary checks.

Other warnings

Warning 1

V3080 Possible null dereference. Consider inspecting ‘previousSibling’. OpenXmlCompositeElement.cs 380

public OpenXmlElement PreviousSibling()
{
    if (!(Parent is OpenXmlCompositeElement parent))
    {
        return null;
    }
    ....
}

public override T InsertBefore<T>(T newChild, OpenXmlElement referenceChild)
{
    ....
    OpenXmlElement previousSibling = nextNode.PreviousSibling();
    prevNode.Next = nextNode;
    previousSibling.Next = prevNode;    // <=
    ....
}

And here is an example where additional verification is just not enough. Method PreviousSibling just can return the value null, and the result of this function is immediately used without checking.

2 more dangerous places:

  • V3080 Possible null dereference. Consider inspecting ‘prevNode’. OpenXmlCompositeElement.cs 489
  • V3080 Possible null dereference. Consider inspecting ‘prevNode’. OpenXmlCompositeElement.cs 497

Warning 2

V3093 The ‘&’ operator evaluates both operands. Perhaps a short-circuit ‘&&’ operator should be used instead. UniqueAttributeValueConstraint.cs 60

public override ValidationErrorInfo ValidateCore(ValidationContext context)
{
    ....
    foreach (var e in root.Descendants(....))
    {
        if (e != element & e.GetType() == elementType) // <=
        {
            var eValue = e.ParsedState.Attributes[_attribute];

            if (eValue.HasValue && _comparer.Equals(....))
            {
                return true;
            }
        }
    }
    ....
}

Some people like to use the ‘&’ operator in logical expressions where they shouldn’t. In the case of this operator, the second operand is evaluated first, regardless of the result of the first. This is not a very serious mistake here, but such sloppy code after refactoring can lead to potential exceptions. NullReferenceException

Warning 3

V3097 Possible exception: type marked by [Serializable] contains non-serializable members not marked by [NonSerialized]… OpenXmlPackageValidationEventArgs.cs 15

[Serializable]
[Obsolete(ObsoleteAttributeMessages.ObsoleteV1ValidationFunctionality, false)]
[EditorBrowsable(EditorBrowsableState.Never)]
public sealed class OpenXmlPackageValidationEventArgs : EventArgs
{
    private string _message;

    [NonSerialized]
    private readonly object _sender;

    [NonSerialized]
    private OpenXmlPart _subPart;

    [NonSerialized]
    private OpenXmlPart _part;

    ....

    internal DataPartReferenceRelationship
        DataPartReferenceRelationship { get; set; } // <=
}

Serializing a class OpenXmlPackageValidationEventArgs may fail because you forgot to mark one of the properties as non-serializable. Or you need to modify the return type of this property to be serializable, otherwise an exception may occur at runtime.

Conclusion

We at the company love Microsoft projects and technologies. In the section where we list Open Source projects, verified using PVS-Studio, we even allocated a separate section for Microsoft. There have already accumulated 21 projects, about which 26 articles have been written. This is the 27th.

I’m sure you are wondering if there is Microsoft among our customers. The answer is yes! But let’s not forget that this is a huge corporation leading development around the world. There are definitely divisions that already use PVS-Studio in their projects, but there are even more of those that do not use! And our experience with open source projects shows that they clearly lack a good tool for finding bugs;).

More news from the news who is interested in analyzing code in C ++, C # and Java: we have recently added support for the standard OWASP and actively increasing its coverage.

If you want to share this article with an English-speaking audience, please use the translation link: Svyatoslav Razmyslov. Analyzing the Code Quality of Microsoft’s Open XML SDK.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *