Going with the Flow: Part 4 - Parsing attachments with Powershell

Going with the Flow: Part 4 - Parsing attachments with Powershell

Now we're taking the training wheels off and getting into the tricky stuff. In order to add the details to our calendar, we need a way to parse the PDF attachment we saved to Azure and extract the necessary information for the calendar event.

Steps

To convert the PDF to text, I used and open source .NET library called iTextSharp. I downloaded this library and imported it into my local Powershell session, so I can create the code that i'll use to get my bill information.

From the .zip downloaded from iTextSharp GitHub repo, I extracted itextsharp-all-5.5.10.zip\itextsharp.xfaworker-dll-5.5.10.zip\itextsharp.dll to C:\stuff\myBillParser\itextsharp.dll.

In the same directory, I put an sample bill attachment saved from my email.

I then created a Powershell script that imports the iTextSharp library, and has a simple function (Get-PDFText) that takes the input of a PDF file, and spits out the contents in plain text.

# Import iTextSharp library
Add-Type -Path "C:\Stuff\myBillParser\itextsharp.dll"

function Get-PdfText
{
    [CmdletBinding()]
    [OutputType([string])]
    param (
        [Parameter(Mandatory = $true)]
        [string]
        $Path
    )
 
    $Path = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($Path)
 
    try
    {
        $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $Path
    }
    catch
    {
        throw
    }
 
    $stringBuilder = New-Object System.Text.StringBuilder
 
    for ($page = 1; $page -le $reader.NumberOfPages; $page++)
    {
        $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page)
        $null = $stringBuilder.AppendLine($text) 
    }
 
    $reader.Close()
 
    return $stringBuilder.ToString()
}

Running the function and giving it the path to the PDF gives me the desired results:
Get-PdfText -Path C:\Stuff\myBillParser\myBill.pdf

Now that I have the PDF converted to text, I wrote a small function that uses RegEx to parse the text for only the values I want. In my case, I captured the following info:

  • Billing start date
  • Billing end date
  • Due date
  • New charges on the bill

Once I have the values, I convert that to JSON and output the results. This makes it very easy to send our payload of extracted values from our Azure Runbook to our Flow using a simple HTTP POST to a webhook.

My new Get-BillValues function looks something like this:

function Get-BillValues {
# Convert bill PDF to text
$tmp = Get-PdfText -Path 'C:\Stuff\myBillParser\myBill.pdf'

# Get the dollar amount due for this bill
$newCharges = [Regex]::Match($tmp, "(Total new charges in this bill \$)(\d+\.\d+)").captures.groups[2].Value

# Get the date the bill is due
[DateTime]$dueDate = [Regex]::Match($tmp, "(TOTAL DUE)\n(\d+ \w+ \d+)").Captures.Groups[2].Value

# Get the billing period start and end dates
$billingPeriod = [Regex]::Match($tmp, "(BILLING PERIOD)\n(\d+ \w+) \- (\d+ \w+)")
[DateTime]$billingStart = $billingPeriod.Captures.Groups[2].Value
[DateTime]$billingEnd = $billingPeriod.Captures.Groups[3].Value

$obj = New-Object -TypeName PSObject
$obj | Add-Member -MemberType NoteProperty -Name Billing_Start -Value $billingStart
$obj | Add-Member -MemberType NoteProperty -Name Billing_End -Value $billingEnd
$obj | Add-Member -MemberType NoteProperty -Name Due_Date -Value $dueDate
$obj | Add-Member -MemberType NoteProperty -Name New_Charges -Value $newCharges

return $obj | ConvertTo-Json
}

Run the Get-BillValues function and boom!

What we learned

  • How to use Powershell and the itextsharp library to convert a PDF to plain text.
  • How to use Powershell and regex to extract key information.

Next steps

Next up we're going take our Powershell script and load it into an Azure Runbook, so we can execute the code on demand from our Flow.

Continue to Step 5: Working with Azure Runbooks

Related Article