Sometimes you need to extract information from a Microsoft Office Word (.docx) document, which can be a challenge because few methods are available. In this article I will talk about a NuGet package called DocX, which helps in extracting many kinds of information and is very easy to use.
It even allows you to extract specific information such as formatting, and to find occurrences of specific text in the document.
In this article we will extract the following information:
- Author of the document
- Specific text like a word or a phrase
- Complete text of any paragraph
- Titles and Headings
- Bold text, or text with any other style applied
Let’s dive into the solution:
- Load the document either from a stream or directly from disk.
- File properties such as the author name and the last-modified and created times are available from the CoreProperties property of the DocX object created above. The following image shows the CoreProperties fields.
The document author name can be extracted as follows:
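As a minimal sketch of the two steps above, the following loads a document from disk and from a stream, then reads the author. It assumes the DocX NuGet package is installed, that its types live in the `Xceed.Words.NET` namespace (older releases used `Novacode`), and that the author is stored under the `dc:creator` key of CoreProperties. The file name `sample.docx` is hypothetical.

```csharp
using System;
using System.IO;
using Xceed.Words.NET; // assumption: older DocX releases use "Novacode" instead

class AuthorExample
{
    static void Main()
    {
        // Hypothetical path, used only for illustration.
        const string path = "sample.docx";

        // Option 1: load directly from disk.
        using (var document = DocX.Load(path))
        {
            // CoreProperties maps property names to string values.
            // "dc:creator" is assumed to be the key that holds the author.
            string author;
            if (document.CoreProperties.TryGetValue("dc:creator", out author))
                Console.WriteLine("Author: " + author);
            else
                Console.WriteLine("No author is recorded in this document.");
        }

        // Option 2: load from a stream.
        using (var stream = File.OpenRead(path))
        using (var document = DocX.Load(stream))
        {
            // List every core property so the available keys can be inspected.
            foreach (var property in document.CoreProperties)
                Console.WriteLine(property.Key + " = " + property.Value);
        }
    }
}
```

Using `TryGetValue` rather than the indexer avoids an exception when a document has no author recorded.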
- Specific text can be found using the FindAll() method. It takes the text to search for as a mandatory parameter, along with an optional RegexOptions parameter that controls how matching is done.
It returns a List<int> containing the index of each match in the document. If the returned list is empty, the searched text was not found.
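A short sketch of such a search might look like the following. The search term `invoice` and the file name are hypothetical, and the `Xceed.Words.NET` namespace is an assumption that depends on the DocX version in use.

```csharp
using System;
using System.Text.RegularExpressions;
using Xceed.Words.NET; // assumption: older DocX releases use "Novacode" instead

class FindExample
{
    static void Main()
    {
        // Hypothetical file name and search term.
        using (var document = DocX.Load("sample.docx"))
        {
            // The second argument is optional; here it makes the search case-insensitive.
            var matches = document.FindAll("invoice", RegexOptions.IgnoreCase);

            if (matches.Count == 0)
            {
                Console.WriteLine("Text not found.");
            }
            else
            {
                foreach (int index in matches)
                    Console.WriteLine("Match at index " + index);
            }
        }
    }
}
```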
- We can get all the paragraphs from the Paragraphs property:
Each Paragraph item has the following properties:
To get the complete text of any paragraph, you just need to read its Text property:
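For illustration, the loop below walks the Paragraphs collection and prints the Text of each entry. It is only a sketch: the file name is hypothetical and the namespace is assumed as before.

```csharp
using System;
using Xceed.Words.NET; // assumption: older DocX releases use "Novacode" instead

class ParagraphTextExample
{
    static void Main()
    {
        using (var document = DocX.Load("sample.docx")) // hypothetical file
        {
            int number = 1;
            foreach (var paragraph in document.Paragraphs)
            {
                // Text holds the plain text of the paragraph without formatting.
                Console.WriteLine(number + ": " + paragraph.Text);
                number++;
            }
        }
    }
}
```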
You can also select specific paragraphs (say, the title paragraph) by filtering the Paragraphs collection on a condition.
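One way to express such a condition is a LINQ filter on the paragraph's StyleName property. This is a sketch under assumptions: it presumes the document uses Word's built-in styles, whose names are taken here to be `Title` for the title and to start with `Heading` for headings. Documents with custom or localized styles may need different values.

```csharp
using System;
using System.Linq;
using Xceed.Words.NET; // assumption: older DocX releases use "Novacode" instead

class TitleExample
{
    static void Main()
    {
        using (var document = DocX.Load("sample.docx")) // hypothetical file
        {
            // Assumed style name for the built-in title style.
            var titles = document.Paragraphs
                .Where(p => p.StyleName == "Title");

            // Assumed prefix for the built-in heading styles.
            var headings = document.Paragraphs
                .Where(p => p.StyleName != null && p.StyleName.StartsWith("Heading"));

            foreach (var title in titles)
                Console.WriteLine("Title: " + title.Text);

            foreach (var heading in headings)
                Console.WriteLine(heading.StyleName + ": " + heading.Text);
        }
    }
}
```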
- You can access the style and other formatting information of any piece of text using the MagicText property.
The following code demonstrates the use of MagicText on the first paragraph:
Using the above technique, we can find all the bold text (or any other specific formatting) in the document by iterating first through each paragraph and then through each MagicText entry in that paragraph.
There are various other methods available for getting further information about the text, such as indentation, hyperlinks, and tables, which can be explored as needed.
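A sketch of that idea follows. It assumes each MagicText entry exposes the run's text in a `text` field and its formatting in a `formatting` field, and that either MagicText or `formatting` can be null when nothing is available, so both are checked.

```csharp
using System;
using System.Linq;
using Xceed.Words.NET; // assumption: older DocX releases use "Novacode" instead

class MagicTextExample
{
    static void Main()
    {
        using (var document = DocX.Load("sample.docx")) // hypothetical file
        {
            var first = document.Paragraphs.FirstOrDefault();
            if (first == null || first.MagicText == null)
            {
                Console.WriteLine("No formatted text available.");
                return;
            }

            // Each entry is a run of text that shares the same formatting.
            foreach (var run in first.MagicText)
            {
                bool isBold = run.formatting != null && run.formatting.Bold == true;
                bool isItalic = run.formatting != null && run.formatting.Italic == true;
                Console.WriteLine("\"" + run.text + "\" bold=" + isBold + " italic=" + isItalic);
            }
        }
    }
}
```

Comparing with `== true` is deliberate: it compiles whether the library exposes these flags as `bool` or as nullable `bool?`.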
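The nested iteration described above could be sketched as follows. The field names `text` and `formatting` and the null checks are the same assumptions as for a single paragraph, and the file name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Xceed.Words.NET; // assumption: older DocX releases use "Novacode" instead

class BoldTextExample
{
    static void Main()
    {
        var boldFragments = new List<string>();

        using (var document = DocX.Load("sample.docx")) // hypothetical file
        {
            // Outer loop: every paragraph in the document.
            foreach (var paragraph in document.Paragraphs)
            {
                var runs = paragraph.MagicText;
                if (runs == null)
                    continue; // nothing to inspect in this paragraph

                // Inner loop: every formatted run inside the paragraph.
                foreach (var run in runs)
                {
                    if (run.formatting != null && run.formatting.Bold == true)
                        boldFragments.Add(run.text);
                }
            }
        }

        foreach (string fragment in boldFragments)
            Console.WriteLine(fragment);
    }
}
```

Swapping the `Bold` check for another formatting property, such as `Italic`, would collect text with that style instead.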
The document should be properly formatted for the above techniques to work.