Batch convert Word docs to HTML/PDF with PowerShell

My project has a lot of static web content that is first drafted as Word documents from various sources. These Word documents eventually need to be published to the web application in either HTML or PDF format. Understandably, our BA Julio got tired of converting them one by one. With hundreds of documents, doing it by hand would be a nightmare, so he came to me for help. It turned out to be a simple task that takes only a short PowerShell script. Salute to the power of programming!!

Let’s get started. Copy the code below and save it as converter.ps1, for example.

[CmdletBinding()]
Param(
  
   [Parameter(Mandatory=$True)]
   [string]$documents_path 
)

$targetPdfPath = "${documents_path}\pdf"

if (!(Test-Path $targetPdfPath -PathType Container)) {
    New-Item -ItemType Directory -Force -Path $targetPdfPath
}

$word_app = New-Object -ComObject Word.Application

# This filter will find .doc as well as .docx documents

$count = 0
Get-ChildItem -Path $documents_path -Filter *.doc? | ForEach-Object {
    $count++;

    $document = $word_app.Documents.Open($_.FullName)

    $pdf_filename = "${targetPdfPath}\$($_.BaseName).pdf"
    
    Write-Host "Converting ${pdf_filename} ..." 

    # 17 is the WdSaveFormat value for wdFormatPDF
    $document.SaveAs([ref] $pdf_filename, [ref] 17)

    $document.Close()

}

Write-Host "Finished converting ${count} Word documents to PDF under ${targetPdfPath}" -BackgroundColor "Green" -ForegroundColor "Black";


$targetHtmlPath = "${documents_path}\html"

if (!(Test-Path $targetHtmlPath -PathType Container)) {
    New-Item -ItemType Directory -Force -Path $targetHtmlPath
}

# Reuse the Word instance created above; starting a second one would leave the first running after Quit()

# This filter will find .doc as well as .docx documents

$count1 = 0
Get-ChildItem -Path $documents_path -Filter *.doc? | ForEach-Object {
    $count1++;

    $document = $word_app.Documents.Open($_.FullName)

    $html_filename = "${targetHtmlPath}\$($_.BaseName).html"
    
    Write-Host "Converting ${html_filename} ..." 
    
    # 10 is the WdSaveFormat value for wdFormatFilteredHTML; using the number avoids depending on the Office interop assembly being loaded
    $saveFormat = 10

    $document.SaveAs([ref] $html_filename, [ref] $saveFormat)

    $document.Close()

}

Write-Host "Finished converting ${count1} Word documents to HTML under ${targetHtmlPath}" -BackgroundColor "Green" -ForegroundColor "Black";

$word_app.Quit()
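If a conversion throws halfway through, the script above never reaches Quit() and a hidden WINWORD.EXE process can be left behind. Below is a minimal sketch of a more defensive pattern, not part of the original script. The helper name Invoke-WithWord is hypothetical. It assumes that hiding the window and setting DisplayAlerts to 0 (wdAlertsNone) is acceptable for your documents; this reduces, but may not eliminate, dialogs from Word.

```powershell
# Hypothetical helper (not in the original script): runs a script block
# against a hidden Word instance and always shuts Word down afterwards,
# even if the script block throws.
function Invoke-WithWord {
    param(
        [Parameter(Mandatory = $true)]
        [scriptblock]$Action
    )

    $word = New-Object -ComObject Word.Application
    try {
        $word.Visible = $false      # keep the Word window off screen
        $word.DisplayAlerts = 0     # 0 = wdAlertsNone: suppress most prompts
        & $Action $word             # pass the Word instance to the caller's code
    }
    finally {
        $word.Quit()
        # Release the COM reference so the Word process can exit
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
    }
}

# Usage sketch (needs Word installed; loop body mirrors the script above):
# Invoke-WithWord {
#     param($word)
#     Get-ChildItem -Path $documents_path -Filter *.doc? | ForEach-Object {
#         $document = $word.Documents.Open($_.FullName)
#         $pdf_filename = "${documents_path}\pdf\$($_.BaseName).pdf"
#         $document.SaveAs([ref] $pdf_filename, [ref] 17)
#         $document.Close()
#     }
# }
```

The try/finally block is the important part: Quit() runs whether or not a document fails to open or save.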

Now copy the docs you would like to convert to a temporary folder, for example:

C:\temp\msdocs

Follow these steps to convert the documents to HTML and PDF. This assumes you are on a Windows system.

Step 1: Click the Windows icon at the bottom left corner and search for PowerShell.

Step 2: Choose Run as administrator to open PowerShell.

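Step 3: At the prompt, run the following command so that locally created scripts are allowed to run. (The Restricted policy blocks all scripts, so it will not work here.)

Set-ExecutionPolicy RemoteSigned

If the above command didn’t work, try this command

Set-ExecutionPolicy Unrestricted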

Step 4: Input “Y” when prompted

Step 5: Use cd to navigate to the script folder, which is the folder where you put your converter.ps1.

Step 6: Run the following command, passing the path of your documents folder as the argument

./converter.ps1 "C:\temp\msdocs"

C:\temp\msdocs is the folder from the example above; use your own path if you copied your documents somewhere else.

Step 7: The script creates a pdf folder and an html folder under the documents folder. Watch the console output in case the script appears to hang. That can mean the Word application is showing a popup that needs a response. Once you respond to it in Word, the script continues until it finishes.

Step 8: Go to the msdocs folder and check that no files are missing.

Note that the console outputs the number of Word documents converted to PDF and the number converted to HTML. Compare the two counts with each other, and with the number of source documents, to catch any missed conversion.
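The check in Step 8 can also be scripted. The following is a minimal sketch, not part of the original converter: the function name Get-MissingConversion and its parameters are hypothetical. It assumes the folder layout created by the script (a pdf or html subfolder under the documents folder) and only looks at .doc and .docx files.

```powershell
# Hypothetical helper (not in the original script): lists Word documents
# that have no matching output file in the given subfolder.
function Get-MissingConversion {
    param(
        [Parameter(Mandatory = $true)][string]$DocumentsPath,
        [Parameter(Mandatory = $true)][string]$Subfolder,   # "pdf" or "html"
        [Parameter(Mandatory = $true)][string]$Extension    # ".pdf" or ".html"
    )

    $outputDir = Join-Path $DocumentsPath $Subfolder

    Get-ChildItem -Path $DocumentsPath -File |
        # Assumption: only .doc and .docx sources are of interest
        Where-Object { $_.Extension -in '.doc', '.docx' } |
        # Keep only the sources whose expected output file does not exist
        Where-Object {
            $expected = Join-Path $outputDir ($_.BaseName + $Extension)
            -not (Test-Path -LiteralPath $expected)
        } |
        Select-Object -ExpandProperty Name
}
```

For the example folder, Get-MissingConversion -DocumentsPath 'C:\temp\msdocs' -Subfolder 'pdf' -Extension '.pdf' prints the names of any documents without a PDF; no output means every document has one. Run it again with 'html' and '.html' to check the HTML output.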
