Automating PDF form digitization with LLMs

Context

While working on partnership development for our compliance platform, we identified an opportunity to demonstrate value to a potential enterprise partner. They had comprehensive compliance templates (QM10 and QM20) available only as PDFs. Instead of traditional sales outreach, I proposed building their templates directly on our platform as a proof of concept.

Note: While the specific partner and compliance details are kept confidential, this case study focuses on the technical approach to solving a common enterprise challenge: converting unstructured PDF forms into interactive digital workflows.

What I Built

I developed an automated pipeline that could:

  1. Extract structured data from complex PDF compliance forms
  2. Transform the data into a standardized format
  3. Automatically generate digital workflows through our platform's GraphQL API
  4. Create a complete, interactive compliance assessment ready for immediate use

The end result was a fully functional digital version of their compliance workflow that we could demonstrate during partnership discussions.

Technical Breakdown

Stack & Tools

Key Architecture Decisions

  1. LLM-Based Data Extraction

After initially exploring traditional PDF parsing libraries, I pivoted to using OpenAI's GPT for data extraction. Here's why:

// Example of the structured data format we achieved
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Sample of extracted and structured data
const extractedData = [
  {
    title: "1.2 Informatiebeveiligingsbeleid en bestuurlijke goedkeuring",
    informationText: "Het management van de organisatie dient...",
    goal: "Voorkomen dat er informatiebeveiligingsincidenten...",
    booleanQuestions: [
      "Heeft de organisatie een gedetailleerd informatiebeveiligingsbeleid...?",
      // More questions...
    ],
    mappingIndication: [
      "ISO 27001: A.5.1 – Information security policies",
      // More mappings...
    ],
  },
  // More sections...
];
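
Model output is free-form text, so it helps to check it against the `ComplianceSection` shape before anything is sent to the API. The following is a minimal sketch, not the original implementation: `parseSections` and `isComplianceSection` are hypothetical helper names, and the sketch assumes the model was asked to return a JSON array.

```typescript
// Shape copied from the ComplianceSection interface shown above.
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Type guard: true only when every expected key is present with the right type.
function isComplianceSection(value: unknown): value is ComplianceSection {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.title === "string" &&
    typeof candidate.informationText === "string" &&
    typeof candidate.goal === "string" &&
    isStringArray(candidate.booleanQuestions) &&
    isStringArray(candidate.mappingIndication)
  );
}

// Hypothetical helper: parses raw model output (assumed to be a JSON array)
// and throws on the first element that does not match the expected shape.
function parseSections(rawModelOutput: string): ComplianceSection[] {
  const parsed: unknown = JSON.parse(rawModelOutput);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of sections");
  }
  return parsed.map((item, index) => {
    if (!isComplianceSection(item)) {
      throw new Error(`Section at index ${index} does not match the expected shape`);
    }
    return item;
  });
}
```

Failing fast here keeps a malformed section from producing a half-built form further down the pipeline.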

This approach provided superior results compared to traditional PDF parsing because:

  2. Automated Form Generation Pipeline

async function main() {
  // Note: `baseUrl`, `tenant`, and `formId` are defined elsewhere in the
  // script and are not shown in this excerpt.

  // Authentication
  await signIn(baseUrl, process.env.USERNAME, process.env.PASSWORD);

  // Create base form structure
  const formCollection = await createFormCollection({
    tenantId: tenant.id,
    data: { name: "QM20 Assessment Form" },
  });

  // Process each section from extracted data, one at a time
  for (const section of extractedData) {
    const formSection = await createFormSection({
      tenantId: tenant.id,
      formId,
      data: {
        title: section.title,
        description: section.informationText,
      },
    });

    // Create dynamic form fields for this section
    await createFormFields(tenant.id, formSection.id, section);
  }
}
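
The body of `createFormFields` is not shown, so here is a sketch of how a section could be turned into a list of fields to create. It is an illustration under assumptions: the `FieldType` values, the `FieldSpec` shape, and the `buildFieldSpecs` helper are hypothetical, and the real platform may model fields differently.

```typescript
// Shape copied from the ComplianceSection interface shown earlier.
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Hypothetical field types; the platform's actual type names are not known.
type FieldType = "BOOLEAN" | "TEXT";

// Hypothetical description of one field to create.
interface FieldSpec {
  type: FieldType;
  initialTitle: string;
}

// Pure function: one BOOLEAN field per question, plus a single TEXT field
// that lists the framework mappings when the section has any.
function buildFieldSpecs(section: ComplianceSection): FieldSpec[] {
  const specs: FieldSpec[] = section.booleanQuestions.map((question) => ({
    type: "BOOLEAN",
    initialTitle: question,
  }));
  if (section.mappingIndication.length > 0) {
    specs.push({
      type: "TEXT",
      initialTitle: `Mappings: ${section.mappingIndication.join("; ")}`,
    });
  }
  return specs;
}
```

Keeping this step free of API calls makes the mapping easy to test before any form is created.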

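The pipeline handles authentication, creation of the form collection, and creation of each section and its fields from the extracted data.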

  3. Error Handling and Validation

const createFormField = async ({
  tenantId,
  formSectionId,
  type,
  initialTitle,
}) => {
  try {
    const field = await createField(/* ... */);
    // Use the tenantId passed to this function rather than an outer `tenant`
    await updateFormField(tenantId, field.id, {
      richDescription: "",
      richTitle: initialTitle,
      disabled: false,
      metadata: {
        validation: {
          required: false,
          formats: ["PDF", "DOC", "DOCX" /* ... */],
        },
      },
    });
    return field;
  } catch (error) {
    // Log which field failed, then rethrow so the caller can stop the run
    console.error(`Failed to create field: ${initialTitle}`, error);
    throw error;
  }
};
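
To show how the `validation` metadata above could be consumed, here is a small sketch that checks an uploaded file name against the `required` flag and the allowed `formats`. This is an assumption about how the platform might apply the metadata, not a description of its behavior, and `validateUpload` is a hypothetical helper.

```typescript
// Mirrors the validation metadata shape used in the snippet above.
interface FieldValidation {
  required: boolean;
  formats: string[];
}

// Hypothetical helper: returns a list of human-readable problems,
// or an empty list when the upload passes.
function validateUpload(fileName: string | null, validation: FieldValidation): string[] {
  const errors: string[] = [];

  // No file supplied: only an error when the field is required.
  if (fileName === null || fileName.trim() === "") {
    if (validation.required) errors.push("A file is required.");
    return errors;
  }

  // Compare the upper-cased extension with the allowed formats.
  const dotIndex = fileName.lastIndexOf(".");
  const extension = dotIndex === -1 ? "" : fileName.slice(dotIndex + 1).toUpperCase();
  if (!validation.formats.includes(extension)) {
    errors.push(
      `Unsupported format "${extension}". Allowed: ${validation.formats.join(", ")}`,
    );
  }
  return errors;
}
```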

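Built-in safeguards include a try/catch around each field creation, an error log that names the failing field before rethrowing, and validation metadata (a required flag and a list of allowed file formats) on every field.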

What I Learned

  1. PDF Data Extraction Strategy

The initial approach using PDF parsing libraries like pdf-parse or pdf2json proved challenging due to:

LLMs provided a more elegant solution by:
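
As an illustration of the LLM side, the sketch below builds an extraction prompt that names the keys of the `ComplianceSection` interface. The prompt wording and the `buildExtractionPrompt` helper are assumptions rather than the prompt actually used, and the call to the model itself is left out.

```typescript
// Keys taken from the ComplianceSection interface shown earlier.
const SECTION_KEYS = [
  "title",
  "informationText",
  "goal",
  "booleanQuestions",
  "mappingIndication",
] as const;

// Hypothetical helper: wraps raw PDF text in instructions that ask the
// model for a JSON array matching the expected section shape.
function buildExtractionPrompt(pdfText: string): string {
  return [
    "Extract every compliance section from the text below.",
    "Respond with a JSON array only. Each element must have exactly these keys:",
    SECTION_KEYS.map((key) => `- ${key}`).join("\n"),
    "booleanQuestions and mappingIndication are arrays of strings; the other keys are strings.",
    "Keep the original language of the document and do not invent content.",
    "--- TEXT ---",
    pdfText.trim(),
  ].join("\n");
}
```

Naming the keys in the prompt keeps the request and the downstream TypeScript types in step, which makes mismatches easier to spot.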

  2. GraphQL API Orchestration

Managing multiple dependent API calls required careful orchestration:
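
One way to picture the dependency chain is a function that passes each created ID on to the next call. The sketch below is illustrative only: the `FormApi` interface is hypothetical, its method names echo the pipeline shown earlier but the signatures are assumptions, and it assumes the collection ID is what a section is attached to.

```typescript
interface Created {
  id: string;
}

// Hypothetical client interface; the real GraphQL operations may differ.
interface FormApi {
  createFormCollection(name: string): Promise<Created>;
  createFormSection(formId: string, title: string): Promise<Created>;
  createFormField(sectionId: string, title: string): Promise<Created>;
}

interface SectionInput {
  title: string;
  booleanQuestions: string[];
}

// Runs the calls strictly in order, because each call needs the ID
// returned by the one before it. Returns the IDs of the created fields.
async function buildForm(
  api: FormApi,
  name: string,
  sections: SectionInput[],
): Promise<string[]> {
  const createdFieldIds: string[] = [];
  const collection = await api.createFormCollection(name);
  for (const section of sections) {
    // Assumption: sections attach directly to the collection ID.
    const formSection = await api.createFormSection(collection.id, section.title);
    for (const question of section.booleanQuestions) {
      const field = await api.createFormField(formSection.id, question);
      createdFieldIds.push(field.id);
    }
  }
  return createdFieldIds;
}
```

Injecting the client as a parameter means the ordering logic can be exercised with a fake `FormApi` before it touches a real tenant.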

  3. Business Process Automation

The project highlighted how technical solutions can directly impact business development:

What's Next?

  1. Scalability Improvements

    • Batch processing for multiple PDFs
    • Parallel processing where possible
    • Caching for improved performance
  2. Enhanced Extraction

    • Support for more complex PDF layouts
    • Additional compliance template types
    • Multi-language support
  3. Integration Enhancements

    • Automated testing for generated forms
    • Version control for templates
    • Change tracking and diff generation
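
For the batch and parallel processing items above, a bounded-concurrency helper is one possible starting point. This is a sketch of a general pattern, not something the current pipeline does. `mapWithConcurrency` is a hypothetical name, and a suitable limit would depend on the API's rate limits, which are not stated here.

```typescript
// Runs `worker` over all items with at most `limit` calls in flight,
// and returns results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let nextIndex = 0;

  // Each runner repeatedly claims the next unprocessed index. Claiming is
  // safe without locks because JavaScript runs this code on one thread.
  const runWorker = async (): Promise<void> => {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index] as T, index);
    }
  };

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runWorker());
  await Promise.all(runners);
  return results;
}
```

Each PDF could be one item. Calls that depend on each other within a single form would still need to run in sequence.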

Key Takeaways

I learned the power of combining modern AI tools with traditional automation to solve real business challenges. By thinking creatively about PDF data extraction and leveraging LLMs, I turned what could have been weeks of manual work into an automated process.

The solution saved time and resources immediately and created a repeatable pattern for future partner onboarding. Most importantly, it turned a traditional sales approach into a value-first demonstration that resonated with our potential partner.


Note: This case study focuses on the technical implementation while respecting confidentiality around specific partner details and compliance requirements.