1. Overview

In this tutorial, we’ll replace a pattern in various locations of a Word document. We’ll work with both .doc and .docx files.

2. The Apache POI Library

The Apache POI library provides Java APIs for manipulating various file formats used by Microsoft Office applications, such as Excel spreadsheets, Word documents, and PowerPoint presentations. It lets us read, write, and modify such files programmatically.

To edit .docx files, we’ll add the latest version of poi-ooxml to our pom.xml:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.5</version>
</dependency>

We’ll also need the latest version of poi-scratchpad to deal with .doc files:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.2.5</version>
</dependency>

3. File Handling

We want to create example files, read them, replace some text in them, and then write the resulting file. Let’s cover everything that concerns file handling first.

3.1. Example Files

Let’s create a Word document. We’ll want to replace the word Baeldung in it with the word Hello. Thus, we’ll write Baeldung in multiple locations of the file, namely in a table, in various document sections, and in paragraphs. We also want to use diverse formatting styles, including one occurrence with a format change inside the word. We’ll use the same document, saved once as a .doc file and once as a .docx file:

 

[Image: original document]

3.2. Reading the Input File

First, we need to read the file. We’ll put it in the resources folder to make it available in the classpath. This way, we’ll get an InputStream. For a .doc document, we’ll create a POIFSFileSystem object based on this InputStream. Lastly, we can retrieve the HWPFDocument object we’ll modify. We’ll use a try-with-resources so that the InputStream and POIFSFileSystem objects are closed automatically. However, as we’ll make modifications to the HWPFDocument, we’ll close it manually:

public void replaceText() throws IOException {
    String filePath = getClass().getClassLoader()
      .getResource("baeldung.doc")
      .getPath();
    try (InputStream inputStream = new FileInputStream(filePath);
         POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream)) {
        HWPFDocument doc = new HWPFDocument(fileSystem);
        // replace text in doc and save changes
        doc.close();
    }
}

When dealing with a .docx document, it’s slightly more straightforward, as we can directly derive an XWPFDocument object from the InputStream:

public void replaceText() throws IOException {
    String filePath = getClass().getClassLoader()
      .getResource("baeldung.docx")
      .getPath();
    try (InputStream inputStream = new FileInputStream(filePath)) {
        XWPFDocument doc = new XWPFDocument(inputStream);
        // replace text in doc and save changes
        doc.close();
    }
}
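
Since the two formats need different entry points, it can help to check which container a file really uses before choosing one. The following is a minimal sketch that only relies on the JDK: legacy .doc files are stored in an OLE2 compound file, while .docx files are ZIP packages, and both start with a well-known signature. The class WordFormatSniffer and its Container enum are hypothetical names, not part of Apache POI, and the check only identifies the container, not whether the content is actually a Word document:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of Apache POI): guesses the container format from the first bytes
class WordFormatSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature, the container used by legacy .doc files
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    // ZIP local file header signature, the container used by .docx files
    private static final byte[] ZIP_SIGNATURE = { 'P', 'K', 0x03, 0x04 };

    static Container detect(byte[] header) {
        if (startsWith(header, OLE2_SIGNATURE)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_SIGNATURE)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    static Container detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[OLE2_SIGNATURE.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until the buffer is full or EOF
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return detect(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With such a check, an OLE2 result would lead to the HWPFDocument path and a ZIP result to the XWPFDocument path. Other Office formats share the same containers, so this is only a first filter.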

3.3. Writing the Output File

We’ll write the output document into the same file we read. Since its path was resolved through the class loader, the modified file will be located in the target folder. The HWPFDocument and XWPFDocument classes both expose a write() method to write the document to an OutputStream. For instance, for a .doc document, it all boils down to:

private void saveFile(String filePath, HWPFDocument doc) throws IOException {
    try (FileOutputStream out = new FileOutputStream(filePath)) {
        doc.write(out);
    }
}
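
Overwriting the input file in place means a failure halfway through the write could leave a truncated document. One possible variation, sketched below with JDK classes only, is to write to a temporary file in the same folder and move it over the target once the write has succeeded. SafeFileWriter and DocumentWriter are hypothetical names introduced for this illustration. The callback takes an OutputStream, so a method reference to the write(OutputStream) method of either document class should fit it:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: writes to a temporary file first, then replaces the target
class SafeFileWriter {

    // Hypothetical callback type; any method writing a document to an OutputStream matches it
    @FunctionalInterface
    interface DocumentWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    static void save(Path target, DocumentWriter writer) throws IOException {
        // Create the temporary file next to the target so the final move stays on one file system
        Path directory = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(directory, "word-", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                writer.writeTo(out);
            }
            // Only reached when the write succeeded: swap the new content in
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up the leftover file after a failure
            Files.deleteIfExists(temp);
        }
    }
}
```

Assuming a doc variable holding one of the two document types, a call could look like SafeFileWriter.save(Paths.get(filePath), doc::write). The original file stays untouched if write() throws.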

4. Replacing Text in a .docx Document

Let’s try to replace the occurrences of the word Baeldung in the .docx document and see what challenges we face in the process.

4.1. Naive Implementation

We’ve already parsed the document into an XWPFDocument object. An XWPFDocument is divided into various paragraphs. The paragraphs inside the body of the file are available directly. However, to access the ones inside a table, we need to loop over all the rows and cells of the tables. Leaving the writing of the method replaceTextInParagraph() for later, here is how we’ll apply it repeatedly to all the paragraphs:

private XWPFDocument replaceText(XWPFDocument doc, String originalText, String updatedText) {
    replaceTextInParagraphs(doc.getParagraphs(), originalText, updatedText);
    for (XWPFTable tbl : doc.getTables()) {
        for (XWPFTableRow row : tbl.getRows()) {
            for (XWPFTableCell cell : row.getTableCells()) {
                replaceTextInParagraphs(cell.getParagraphs(), originalText, updatedText);
            }
        }
    }
    return doc;
}

private void replaceTextInParagraphs(List<XWPFParagraph> paragraphs, String originalText, String updatedText) {
    paragraphs.forEach(paragraph -> replaceTextInParagraph(paragraph, originalText, updatedText));
}

In Apache POI, paragraphs are divided into XWPFRun objects. As a first attempt, let’s iterate over all runs: if we detect the text we want to replace inside a run, we’ll update the content of the run:

private void replaceTextInParagraph(XWPFParagraph paragraph, String originalText, String updatedText) {
    List<XWPFRun> runs = paragraph.getRuns();
    for (XWPFRun run : runs) {
        String text = run.getText(0);
        if (text != null && text.contains(originalText)) {
            String updatedRunText = text.replace(originalText, updatedText);
            run.setText(updatedRunText, 0);
        }
    }
}

To conclude, we’ll update replaceText() to include all the steps:

public void replaceText() throws IOException {
    String filePath = getClass().getClassLoader()
      .getResource("baeldung-copy.docx")
      .getPath();
    try (InputStream inputStream = new FileInputStream(filePath)) {
        XWPFDocument doc = new XWPFDocument(inputStream);
        doc = replaceText(doc, "Baeldung", "Hello");
        saveFile(filePath, doc);
        doc.close();
    }
}

Let’s now run this code, for instance, through a unit test. We can have a look at a screenshot of the updated document:

 

[Image: updated .docx document after the naive replacement]

4.2. Limitations

As we can see in the screenshot, most occurrences of the word Baeldung have been replaced with the word Hello. However, two occurrences of Baeldung remain.

Let’s take a closer look at what an XWPFRun is. Each run represents a continuous sequence of text with a common set of formatting properties. The formatting properties include font style, size, color, boldness, italics, underlining, etc. Whenever there is a format change, there is a new run. This is why the occurrence with mixed formatting in the table is not replaced: its content is spread over multiple runs.

However, the bottom blue Baeldung occurrence wasn’t replaced either. Indeed, Apache POI doesn’t guarantee that characters with the same formatting properties are part of the same run. In a nutshell, the naive implementation is good enough for the simplest cases. It is worth using in such cases because it doesn’t require any trade-off. However, if we’re confronted with this limitation, we’ll need to move toward another solution.
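
To make the limitation concrete, here is a small plain-Java model that doesn’t use any POI types. Each String stands in for the text of one run, and the class name RunSplitDemo and the sample strings are made up for this illustration. It contrasts a run-by-run replacement with a search over the joined text:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model: each String stands in for the text of one run (no POI types involved)
class RunSplitDemo {

    // Mirrors the naive approach: every run is inspected in isolation
    static List<String> replacePerRun(List<String> runs, String original, String updated) {
        List<String> result = new ArrayList<>();
        for (String run : runs) {
            result.add(run.replace(original, updated));
        }
        return result;
    }

    // Mirrors a paragraph-level search: the run texts are joined before searching
    static String replaceInJoinedText(List<String> runs, String original, String updated) {
        return String.join("", runs).replace(original, updated);
    }

    public static void main(String[] args) {
        // The word is split across two runs, as happens after a format change inside it
        List<String> runs = Arrays.asList("Welcome to ", "Bael", "dung", "!");

        System.out.println(replacePerRun(runs, "Baeldung", "Hello"));
        // prints [Welcome to , Bael, dung, !]

        System.out.println(replaceInJoinedText(runs, "Baeldung", "Hello"));
        // prints Welcome to Hello!
    }
}
```

No single run contains the whole word, so the run-by-run version leaves the text unchanged, while the joined version finds it at the cost of losing the run boundaries.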

4.3. Dealing With Text Spread Over Multiple Character Runs

For the sake of simplicity, we’ll make the following assumption: it’s acceptable for us to lose the formatting of a paragraph when we find the word Baeldung inside it. Thus, we can remove all existing runs inside the paragraph and replace them with a single new one. Let’s rewrite replaceTextInParagraph():

private void replaceTextInParagraph(XWPFParagraph paragraph, String originalText, String updatedText) {
    String paragraphText = paragraph.getParagraphText();
    if (paragraphText.contains(originalText)) {
        String updatedParagraphText = paragraphText.replace(originalText, updatedText);
        while (paragraph.getRuns().size() > 0) {
            paragraph.removeRun(0);
        }
        XWPFRun newRun = paragraph.createRun();
        newRun.setText(updatedParagraphText);
    }
}

Let’s have a look at the result file:

[Image: updated .docx document after the full replacement]

As expected, every occurrence is now replaced. However, most of the formatting is lost. Only the formatting of the last occurrence survives, which suggests that Apache POI handles the formatting properties differently in this case.

As a last remark, let’s note that depending on our use case, we could also decide to keep some formatting of the original paragraph. We’d then need to iterate over all the runs and keep or update properties as we like.
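
One way to keep formatting is to put the replacement into the run where the match starts and to cut the matched characters out of the following runs. The sketch below shows the idea on a plain-Java model rather than on POI objects: the Run class with its text and style fields is a hypothetical stand-in for XWPFRun. With POI, the text would presumably be read and written through getText(0) and setText(text, 0), as in the naive implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a paragraph; Run is a hypothetical stand-in for XWPFRun
class StyledRunReplacer {

    static final class Run {
        final String style;
        String text;

        Run(String text, String style) {
            this.text = text;
            this.style = style;
        }
    }

    // Replaces every occurrence; the replacement inherits the style of the run where the match starts
    static void replaceAll(List<Run> runs, String original, String updated) {
        if (original.isEmpty()) {
            return;
        }
        int from = 0;
        while (true) {
            String joined = join(runs);
            int start = joined.indexOf(original, from);
            if (start < 0) {
                return;
            }
            int end = start + original.length();
            int offset = 0;
            boolean inserted = false;
            for (Run run : runs) {
                int runStart = offset;
                int runEnd = offset + run.text.length();
                offset = runEnd;
                // Part of this run that overlaps the match [start, end), relative to the run
                int cutFrom = Math.max(start, runStart) - runStart;
                int cutTo = Math.min(end, runEnd) - runStart;
                if (cutFrom >= cutTo) {
                    continue; // no overlap with the match
                }
                String replacement = inserted ? "" : updated;
                run.text = run.text.substring(0, cutFrom) + replacement + run.text.substring(cutTo);
                inserted = true;
            }
            // Resume after the inserted text so the replacement itself is never re-matched
            from = start + updated.length();
        }
    }

    private static String join(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Say ", "plain"));
        runs.add(new Run("Bael", "bold"));
        runs.add(new Run("dung", "italic"));
        replaceAll(runs, "Baeldung", "Hello");
        for (Run run : runs) {
            System.out.println(run.style + ": \"" + run.text + "\"");
        }
        // prints:
        // plain: "Say "
        // bold: "Hello"
        // italic: ""
    }
}
```

This variant keeps every run and its style, and only the runs touched by a match change. The emptied runs are left in place here; removing them would be a separate decision.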

5. Replacing Text in a .doc Document

Things are much more straightforward for .doc files. We can access a Range object covering the whole document and then modify its content via its replaceText() method:

private HWPFDocument replaceText(HWPFDocument doc, String originalText, String updatedText) {
    Range range = doc.getRange();
    range.replaceText(originalText, updatedText);
    return doc;
}

Running this code leads to the following updated file:

 

[Image: updated .doc document after the replacement]

As we can see, the replacement took place all over the file. We can also notice that the default behavior for text spread over multiple runs is to keep the formatting of the first run.

6. Conclusion

In this article, we replaced a pattern in a Word document. In a .doc document, it was pretty straightforward. However, in a .docx document, we ran into some limitations with the naive implementation. We showed one way of overcoming them by making a simplifying assumption.

As always, the code is available over on GitHub.
