How to parse fields from an AcroForm PDF on the basis of the labels in the actual PDF using iText 7?

I have a filled-in US tax form to parse. Using iText's AcroForm API, I am able to fetch all the field names it contains with the following code:


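using System;
using System.Collections.Generic;
using iText.Forms;
using iText.Forms.Fields;
using iText.Kernel.Pdf;

public static void Main()
{
    // src is the path to the filled-in PDF
    PdfReader pdfReader = new PdfReader(src);
    PdfDocument pdfDocument = new PdfDocument(pdfReader);

    // false: do not create an AcroForm if the document has none
    var form = PdfAcroForm.GetAcroForm(pdfDocument, false);
    if (form != null)
    {
        IDictionary<string, PdfFormField> formfields = form.GetFormFields();

        // Print every fully qualified field name with its current value
        foreach (var field in formfields)
        {
            Console.WriteLine(field.Value.GetFieldName() + ": " + field.Value.GetValueAsString());
        }
    }

    pdfDocument.Close();
    Console.ReadLine();
}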

The output in console is as follows:


topmostSubform[0]:
topmostSubform[0].Page1[0]:
topmostSubform[0].Page1[0].FilingStatus[0]:
topmostSubform[0].Page1[0].FilingStatus[0].c1_01[0]: 3
topmostSubform[0].Page1[0].FilingStatus[0].c1_01[1]: 3
topmostSubform[0].Page1[0].FilingStatus[0].c1_01[2]: 3
topmostSubform[0].Page1[0].FilingStatus[0].c1_01[3]: 3
topmostSubform[0].Page1[0].FilingStatus[0].c1_01[4]: 3
topmostSubform[0].Page1[0].FilingStatus[0].f1_01[0]: Sarah expat
topmostSubform[0].Page1[0].f1_02[0]: Brian
topmostSubform[0].Page1[0].f1_03[0]: Expat
topmostSubform[0].Page1[0].YourSocial_ReadOrderControl[0]:
topmostSubform[0].Page1[0].YourSocial_ReadOrderControl[0].f1_04[0]: 239402830
topmostSubform[0].Page1[0].YourSocial_ReadOrderControl[0].f1_05[0]: Sarah
topmostSubform[0].Page1[0].YourSocial_ReadOrderControl[0].f1_06[0]: Expat
topmostSubform[0].Page1[0].ReadOrderControl[0]:
topmostSubform[0].Page1[0].ReadOrderControl[0].f1_07[0]: 234745745
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0]:
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_08[0]: 100 Main Road
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_09[0]: 456
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_10[0]: Bangkok
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_11[0]: Thailand
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_12[0]: Bangkok
topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_13[0]: 78945312
topmostSubform[0].Page1[0].ReadOrderControl[1]:
topmostSubform[0].Page1[0].ReadOrderControl[1].PresidentialElection[0]:
topmostSubform[0].Page1[0].ReadOrderControl[1].PresidentialElection[0].c1_02[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].PresidentialElection[0].c1_03[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].StandardDeduction[0]:
topmostSubform[0].Page1[0].ReadOrderControl[1].StandardDeduction[0].c1_04[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].StandardDeduction[0].c1_05[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].StandardDeduction[0].c1_06[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].AgeBlindness[0]:
topmostSubform[0].Page1[0].ReadOrderControl[1].AgeBlindness[0].c1_07[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].AgeBlindness[0].c1_08[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].AgeBlindness[0].c1_09[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl[1].AgeBlindness[0].c1_10[0]: 1
topmostSubform[0].Page1[0].IfMoreThanFour[0]:
topmostSubform[0].Page1[0].IfMoreThanFour[0].c1_11[0]: 1
topmostSubform[0].Page1[0].Dependents[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row1[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row1[0].f1_14[0]: Kevin                                               Expat
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row1[0].f1_15[0]: 123871928
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row1[0].f1_16[0]: Son
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row1[0].c1_12[0]: 1
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row1[0].c1_13[0]: 2
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row2[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row2[0].f1_17[0]: Audra                                              Expat
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row2[0].f1_18[0]: 123718237
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row2[0].f1_19[0]: Daughter
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row2[0].c1_14[0]: 1
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row2[0].c1_15[0]: 2
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row3[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row3[0].f1_20[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row3[0].f1_21[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row3[0].f1_22[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row3[0].c1_16[0]: 1
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row3[0].c1_17[0]: 2
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row4[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row4[0].f1_23[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row4[0].f1_24[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row4[0].f1_25[0]:
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row4[0].c1_18[0]: 1
topmostSubform[0].Page1[0].Dependents[0].Table_Dependents[0].Row4[0].c1_19[0]: 2
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0]:
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_26[0]: 123,828.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_27[0]: 5,589.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_28[0]: 1,000.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_29[0]: 4,789.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_30[0]: 1,485.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_31[0]: 1,200.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_32[0]: 41,885.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_33[0]: 1,300.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_34[0]: 45,895.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_35[0]: 1,965.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_36[0]: 50,145.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].c1_20[0]: 1
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_37[0]: 7,894.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_38[0]: 8,654.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_39[0]: 50,928.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_40[0]: 8,956.
topmostSubform[0].Page1[0].ReadOrderControl_Lns1-8b[0].f1_41[0]: 50,299.
topmostSubform[0].Page1[0].f1_42[0]: 24,000.
topmostSubform[0].Page1[0].f1_43[0]: 25,895.
topmostSubform[0].Page1[0].f1_44[0]: 47,568.
topmostSubform[0].Page1[0].f1_45[0]: 26,299.
topmostSubform[0].Page2[0]:
topmostSubform[0].Page2[0].Lines12a-12b_ReadOrder[0]:
topmostSubform[0].Page2[0].Lines12a-12b_ReadOrder[0].c2_01[0]:
topmostSubform[0].Page2[0].Lines12a-12b_ReadOrder[0].c2_02[0]:
topmostSubform[0].Page2[0].Lines12a-12b_ReadOrder[0].c2_03[0]: 1
topmostSubform[0].Page2[0].Lines12a-12b_ReadOrder[0].f2_01[0]: 5,786
topmostSubform[0].Page2[0].Lines12a-12b_ReadOrder[0].f2_02[0]:
topmostSubform[0].Page2[0].f2_03[0]: 5,786.
topmostSubform[0].Page2[0].Lines13a-13b_ReadOrder[0]:
topmostSubform[0].Page2[0].Lines13a-13b_ReadOrder[0].f2_04[0]: 4,000.
topmostSubform[0].Page2[0].f2_05[0]: 4,301.
topmostSubform[0].Page2[0].f2_06[0]: 1,485.
topmostSubform[0].Page2[0].f2_07[0]: 1,229.
topmostSubform[0].Page2[0].f2_08[0]: 2,714.
topmostSubform[0].Page2[0].f2_09[0]: 4,856.
topmostSubform[0].Page2[0].Line18_ReadOrder[0]:
topmostSubform[0].Page2[0].Line18_ReadOrder[0].f2_10[0]: 85,475.
topmostSubform[0].Page2[0].Line18_ReadOrder[0].f2_11[0]: 95,478
topmostSubform[0].Page2[0].Line18_ReadOrder[0].f2_12[0]: 20,000.
topmostSubform[0].Page2[0].Line18_ReadOrder[0].f2_13[0]: 89,540.
topmostSubform[0].Page2[0].f2_14[0]: 45,569.
topmostSubform[0].Page2[0].f2_15[0]: 47,520.
topmostSubform[0].Page2[0].f2_16[0]: 54,565.
topmostSubform[0].Page2[0].c2_04[0]: 1
topmostSubform[0].Page2[0].f2_17[0]: 4,000.
topmostSubform[0].Page2[0].RoutingNo[0]:
topmostSubform[0].Page2[0].RoutingNo[0].f2_18[0]: 123456778
topmostSubform[0].Page2[0].c2_05[0]: 1
topmostSubform[0].Page2[0].c2_05[1]: 1
topmostSubform[0].Page2[0].AccountNo[0]:
topmostSubform[0].Page2[0].AccountNo[0].f2_19[0]: 23455789900112586
topmostSubform[0].Page2[0].f2_20[0]: 85,000.
topmostSubform[0].Page2[0].f2_21[0]: 2,714.
topmostSubform[0].Page2[0].f2_22[0]: 55,000.
topmostSubform[0].Page2[0].ThirdPartyDesignee[0]:
topmostSubform[0].Page2[0].ThirdPartyDesignee[0].c2_06[0]: 1
topmostSubform[0].Page2[0].ThirdPartyDesignee[0].c2_06[1]: 1
topmostSubform[0].Page2[0].ThirdPartyDesignee[0].f2_23[0]: XYZ
topmostSubform[0].Page2[0].ThirdPartyDesignee[0].f2_24[0]: 78954231321
topmostSubform[0].Page2[0].ThirdPartyDesignee[0].f2_25[0]: 12131
topmostSubform[0].Page2[0].Signatures[0]:
topmostSubform[0].Page2[0].Signatures[0].f2_26[0]: SD
topmostSubform[0].Page2[0].Signatures[0].f2_27[0]: 123456
topmostSubform[0].Page2[0].Signatures[0].f2_28[0]: SD
topmostSubform[0].Page2[0].Signatures[0].f2_29[0]: 324324
topmostSubform[0].Page2[0].Signatures[0].f2_30[0]: 784561358922
topmostSubform[0].Page2[0].Signatures[0].f2_31[0]: brianexpat@gmail.com
topmostSubform[0].Page2[0].PaidPreparer[0]:
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0]:
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].f2_32[0]: SSDS
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].f2_33[0]: 5454745
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].CheckIf[0]:
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].CheckIf[0].c2_07[0]: 1
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].CheckIf[0].c2_07[1]: 1
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].f2_34[0]: SDSDE
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].f2_35[0]: 8787922313
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].f2_36[0]: SDSDSD
topmostSubform[0].Page2[0].PaidPreparer[0].Preparer[0].f2_37[0]: 646464

What I expect to have parsed is something like this:


...
...
"First Name": "Sarah",
"Last Name":"Expat"
...
...

The issue is that the field names in the actual output are cryptic and bear no relation to the labels shown on the form for the values they hold.

For example, take topmostSubform[0].Page1[0].FilingStatus[0].f1_01[0]: Sarah expat. On the form this value is actually a name, but nothing in the output indicates that f1_01[0] is the name field. One option is a manual mapping, where I hand-write code that ties each field to its label on the form. However, the structure, positioning, and labels of these forms change from year to year, so a manual mapping would have to be redone for every form for every year, which is cumbersome.

Therefore, is it possible to derive the relationship between the labels and their values automatically from the parsed AcroForm data, without manual mapping? I also looked into converting the AcroForm to an XFA form and generating XML from it, but the issue is the same: there is no mapping between the labels and their adjacent values.

Edit: Manually determining the mapping is an option, of course, but as mentioned earlier, we would have to do it for each form for each year, which is time-consuming. Since the AcroForm also contains XFA underneath, does the IRS provide some sort of mapping definition file? Or is there an automatic way to work out the relationship between the data and its label on the form?

1 answer

  • answered 2020-09-24 07:57 Jan Slabon

    This is not a complete answer, but I guess it could give some ideas on how to solve such a problem:

    The document is a static XFA form, which has both an AcroForm and an XFA package embedded.

    Both are logically bound to each other. So if you are able to map an AcroForm field to its template node in the XFA package, you can use the <caption> element to gather information about the "label". For example, what you expect as "Last Name" is bound to the field topmostSubform[0].Page1[0].f1_03[0], which can be mapped to this XML definition:

    [Screenshot of the field's XML definition with its <caption> element; image not preserved]

    But the field you used as an example (topmostSubform[0].Page1[0].FilingStatus[0].f1_01[0]) sadly has no caption at all, so you may try to use the <assist> element there:

    [Screenshot of the field's XML definition with its <assist> element; image not preserved]

    In the end you still have to check these mappings, because they may also change in the future, but they may at least give you some hints.

    I have no idea if and how you can access this mapping in iText.
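
The caption/assist idea described in this answer can be sketched in C# roughly as follows. This is a minimal sketch under stated assumptions, not something checked against the real IRS template: it assumes the XFA template XML is already available as an XElement, that only subform, exclGroup and field elements contribute to the field names, and that same-named siblings are numbered in document order. The class XfaLabelSketch, the method MapLabels, and the sample XML (including its label texts) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of iText: pairs AcroForm-style field names
// with the label text found in an XFA template.
public static class XfaLabelSketch
{
    // Returns "topmostSubform[0].Page1[0]..." -> label for every labelled node.
    public static Dictionary<string, string> MapLabels(XElement template)
    {
        var labels = new Dictionary<string, string>();
        Walk(template, "", new Dictionary<string, int>(), labels);
        return labels;
    }

    private static void Walk(XElement parent, string prefix,
        Dictionary<string, int> seen, Dictionary<string, string> labels)
    {
        foreach (XElement node in parent.Elements())
        {
            string kind = node.Name.LocalName; // ignore the XFA namespace version
            if (kind != "subform" && kind != "exclGroup" && kind != "field") continue;

            string name = (string)node.Attribute("name");
            if (name == null)
            {
                // Assumption: unnamed containers add nothing to the path.
                Walk(node, prefix, seen, labels);
                continue;
            }

            // Same-named siblings are numbered in document order: x[0], x[1], ...
            seen.TryGetValue(name, out int index);
            seen[name] = index + 1;
            string path = (prefix.Length == 0 ? "" : prefix + ".") + name + "[" + index + "]";

            // Prefer the visible caption, then fall back to the accessibility text.
            string label = TextAt(node, "caption", "value", "text")
                ?? TextAt(node, "assist", "speak")
                ?? TextAt(node, "assist", "toolTip");
            if (label != null) labels[path] = label;

            if (kind != "field") Walk(node, path, new Dictionary<string, int>(), labels);
        }
    }

    // Follows child elements by local name and returns the trimmed text, or null.
    private static string TextAt(XElement start, params string[] steps)
    {
        XElement current = start;
        foreach (string step in steps)
        {
            current = current?.Elements().FirstOrDefault(e => e.Name.LocalName == step);
        }
        string text = current?.Value.Trim();
        return string.IsNullOrEmpty(text) ? null : text;
    }

    public static void Main()
    {
        // Invented sample; the real template and its label texts will differ.
        XElement template = XElement.Parse(@"
<template xmlns='http://www.xfa.org/schema/xfa-template/3.0/'>
  <subform name='topmostSubform'>
    <subform name='Page1'>
      <subform name='FilingStatus'>
        <field name='f1_01'><assist><speak>Sample speak text</speak></assist></field>
      </subform>
      <field name='f1_03'><caption><value><text>Last name</text></value></caption></field>
    </subform>
  </subform>
</template>");

        foreach (KeyValuePair<string, string> entry in MapLabels(template))
        {
            Console.WriteLine(entry.Key + " -> " + entry.Value);
        }
    }
}
```

The resulting dictionary could then be joined with the AcroForm values by field name, so that each value is printed next to its label instead of a name such as f1_03[0]. How the template XML is obtained is left open here; in iText 7 for .NET it may be reachable through form.GetXfaForm() and its DOM document, but treat that as an assumption to verify against the iText version in use. Real templates may also contain elements this sketch ignores (for example area or draw nodes), so the generated paths should be compared with the names returned by GetFormFields() before relying on them.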