
Parsing a Microsoft Word Document Using C#

Sometimes you need to extract information from a Microsoft Office Word (.docx) document, which can be a challenge because few methods are available for it. In this article I will talk about a new NuGet package called DocX, which helps in extracting many kinds of information and is very easy to use.

It even lets you extract specific information such as formatting, and it lets you find occurrences of specific text in the document.

In this article we will extract the following information:

  • Author of the document
  • Specific text like a word or a phrase
  • Complete text of any paragraph
  • Titles and Headings
  • Text formatted in bold, or any other style applied to text

Let’s dive into the solution:

  1. Install the package from the NuGet Package Manager Console: Install-Package DocX

  2. Load the document, either from a stream or directly from disk.
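Both ways of loading can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the file name sample.docx and the helper names ReadFromDisk and ReadFromStream are hypothetical, and the namespace depends on the package version you installed (recent releases use Xceed.Words.NET, older ones use Novacode).

```csharp
using System;
using System.IO;
using Xceed.Words.NET; // assumption: older DocX releases need "using Novacode;" instead

public static class LoadExample
{
    // Loads a document straight from a file path and returns its plain text.
    public static string ReadFromDisk(string path)
    {
        using (var document = DocX.Load(path))
        {
            return document.Text;
        }
    }

    // Loads a document from any readable stream (file, upload, memory).
    public static string ReadFromStream(Stream stream)
    {
        using (var document = DocX.Load(stream))
        {
            return document.Text;
        }
    }

    public static void Main(string[] args)
    {
        // Hypothetical default path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.docx";

        Console.WriteLine(ReadFromDisk(path));

        using (var stream = File.OpenRead(path))
        {
            Console.WriteLine(ReadFromStream(stream));
        }
    }
}
```

Wrapping the DocX object in a using block releases the underlying package as soon as you are done reading.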

  3. File properties such as the author name, last-modified time, and creation time are available from the CoreProperties property of the DocX object created above.

The document author name can be extracted from the same collection.
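As a rough illustration, the sketch below lists every core property and then looks up the author. It assumes CoreProperties behaves as a string-to-string dictionary and that the author is stored under the key dc:creator, which is the element name used in the .docx core properties part; the helper GetAuthor and the file name sample.docx are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Xceed.Words.NET; // assumption: older DocX releases need "using Novacode;" instead

public static class AuthorExample
{
    // Returns the author from a core-properties dictionary, or null if it is not set.
    // Assumption: the author is stored under the key "dc:creator".
    public static string GetAuthor(IDictionary<string, string> coreProperties)
    {
        string author;
        return coreProperties.TryGetValue("dc:creator", out author) ? author : null;
    }

    public static void Main(string[] args)
    {
        // Hypothetical default path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.docx";

        using (var document = DocX.Load(path))
        {
            // Print every core property so you can see which keys your file contains.
            foreach (var property in document.CoreProperties)
            {
                Console.WriteLine(property.Key + " = " + property.Value);
            }

            Console.WriteLine("Author: " + (GetAuthor(document.CoreProperties) ?? "(not set)"));
        }
    }
}
```

Using TryGetValue rather than the indexer avoids an exception when a document has no author recorded.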

  4. Specific text can be found using the FindAll() method. It takes the text to search for as a mandatory parameter, along with an optional regular-expression options parameter (RegexOptions) that controls how matching is done.

It returns a List<int> containing the index of each match in the document. If the returned list is empty, the searched text was not found.
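A search along these lines might look like the sketch below. The helper CountMatches, the file name sample.docx, and the phrase "invoice" are hypothetical; RegexOptions.IgnoreCase is used here only as an example of the optional parameter.

```csharp
using System;
using System.Text.RegularExpressions;
using Xceed.Words.NET; // assumption: older DocX releases need "using Novacode;" instead

public static class FindExample
{
    // Returns how many times the phrase occurs in the document.
    public static int CountMatches(DocX document, string phrase, bool ignoreCase)
    {
        var indexes = ignoreCase
            ? document.FindAll(phrase, RegexOptions.IgnoreCase)
            : document.FindAll(phrase);
        return indexes.Count;
    }

    public static void Main(string[] args)
    {
        // Hypothetical defaults; pass your own file and phrase as arguments.
        string path = args.Length > 0 ? args[0] : "sample.docx";
        string phrase = args.Length > 1 ? args[1] : "invoice";

        using (var document = DocX.Load(path))
        {
            // Each entry is the character index at which a match starts.
            foreach (int index in document.FindAll(phrase, RegexOptions.IgnoreCase))
            {
                Console.WriteLine("Found \"" + phrase + "\" at index " + index);
            }

            if (CountMatches(document, phrase, true) == 0)
            {
                Console.WriteLine("\"" + phrase + "\" was not found.");
            }
        }
    }
}
```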

  5. All the paragraphs of the document are available from the Paragraphs property.

Each Paragraph object exposes several properties, including the Text and MagicText properties used in the steps below.

To get the complete text of any paragraph, you just need to read the value of its Text property.
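Put together, reading the text of every paragraph could look something like this. It is a sketch under the same assumptions as before: the helper GetParagraphTexts and the file name sample.docx are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Xceed.Words.NET; // assumption: older DocX releases need "using Novacode;" instead

public static class ParagraphTextExample
{
    // Collects the plain text of every paragraph, in document order.
    public static List<string> GetParagraphTexts(DocX document)
    {
        var texts = new List<string>();
        foreach (var paragraph in document.Paragraphs)
        {
            texts.Add(paragraph.Text);
        }
        return texts;
    }

    public static void Main(string[] args)
    {
        // Hypothetical default path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.docx";

        using (var document = DocX.Load(path))
        {
            int number = 1;
            foreach (string text in GetParagraphTexts(document))
            {
                Console.WriteLine(number + ": " + text);
                number++;
            }
        }
    }
}
```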

You can also get specific paragraphs (say, the title paragraph) by filtering the Paragraphs collection with a LINQ condition.
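One possible condition is to filter on the paragraph style, as in the sketch below. It assumes the paragraph exposes its style through a StyleName property and that the document uses Word's built-in English style IDs ("Title", "Heading1", "Heading2" and so on). Both are assumptions: style IDs vary between documents, and newer releases of the library may prefer a different property. The helper GetTitlesAndHeadings and the file name sample.docx are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Xceed.Words.NET; // assumption: older DocX releases need "using Novacode;" instead

public static class HeadingExample
{
    // Returns the text of paragraphs styled as a title or a heading.
    // Assumption: built-in English style IDs such as "Title" and "Heading1".
    public static List<string> GetTitlesAndHeadings(DocX document)
    {
        return document.Paragraphs
            .Where(p => p.StyleName == "Title"
                     || (p.StyleName ?? string.Empty).StartsWith("Heading"))
            .Select(p => p.Text)
            .ToList();
    }

    public static void Main(string[] args)
    {
        // Hypothetical default path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.docx";

        using (var document = DocX.Load(path))
        {
            foreach (string heading in GetTitlesAndHeadings(document))
            {
                Console.WriteLine(heading);
            }
        }
    }
}
```

If nothing is returned, print the StyleName of each paragraph first to see which style IDs your document actually uses.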

  6. You can access the style and other formatting information of any text using the MagicText property.

MagicText can be read from any paragraph, for example the first one. Each entry in it pairs a run of text with the formatting applied to that run.

Using the above technique, we can find all the bold text (or any other specific formatting) inside the document by iterating first through each paragraph and then through each MagicText entry inside that paragraph.
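The nested iteration might be sketched as follows. It assumes each MagicText entry exposes lowercase text and formatting members and that the formatting has a Bold flag, which is how the library has exposed them in the versions I am aware of; check this against the version you installed. The helper FindBoldText and the file name sample.docx are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Xceed.Words.NET; // assumption: older DocX releases need "using Novacode;" instead

public static class BoldTextExample
{
    // Collects every run of text whose formatting is marked as bold.
    public static List<string> FindBoldText(DocX document)
    {
        var boldRuns = new List<string>();
        foreach (var paragraph in document.Paragraphs)
        {
            var runs = paragraph.MagicText;
            if (runs == null)
            {
                continue; // nothing readable in this paragraph
            }

            foreach (var run in runs)
            {
                // Assumption: lowercase "formatting" and "text" members on each entry.
                if (run.formatting != null && run.formatting.Bold == true)
                {
                    boldRuns.Add(run.text);
                }
            }
        }
        return boldRuns;
    }

    public static void Main(string[] args)
    {
        // Hypothetical default path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.docx";

        using (var document = DocX.Load(path))
        {
            foreach (string text in FindBoldText(document))
            {
                Console.WriteLine("Bold: " + text);
            }
        }
    }
}
```

The same loop can test other formatting attributes instead of Bold, depending on what you need to find.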

Various other methods are also available to get further information about the text, such as indentation, hyperlinks, and tables, which can be explored as needed.

Note:

The document should be properly formatted in order to extract the information using the above tips.
