Automating PDF form digitization with LLMs
Context
While working on partnership development for our compliance platform, we identified an opportunity to demonstrate value to a potential enterprise partner. They had comprehensive compliance templates (QM10 and QM20) available only as PDFs. Instead of traditional sales outreach, I proposed building their templates directly on our platform as a proof of concept.
Note: While the specific partner and compliance details are kept confidential, this case study focuses on the technical approach to solving a common enterprise challenge: converting unstructured PDF forms into interactive digital workflows.
What I Built
I developed an automated pipeline that could:
- Extract structured data from complex PDF compliance forms
- Transform the data into a standardized format
- Automatically generate digital workflows through our platform's GraphQL API
- Create a complete, interactive compliance assessment ready for immediate use
The end result was a fully-functional digital version of their compliance workflow that we could demonstrate during partnership discussions.
Technical Breakdown
Stack & Tools
- TypeScript/Node.js: Core implementation
- OpenAI GPT API: PDF content extraction and structuring
- GraphQL: Platform API integration
- PDF Processing: Initial exploration with PDF parsing libraries
Key Architecture Decisions
1. LLM-Based Data Extraction
After initially exploring traditional PDF parsing libraries, I pivoted to using OpenAI's GPT for data extraction. Here's why:
```typescript
// Example of the structured data format we achieved
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Sample of extracted and structured data (the source template is Dutch,
// so the extracted strings are preserved verbatim)
const extractedData: ComplianceSection[] = [
  {
    title: "1.2 Informatiebeveiligingsbeleid en bestuurlijke goedkeuring",
    informationText: "Het management van de organisatie dient...",
    goal: "Voorkomen dat er informatiebeveiligingsincidenten...",
    booleanQuestions: [
      "Heeft de organisatie een gedetailleerd informatiebeveiligingsbeleid...?",
      // More questions...
    ],
    mappingIndication: [
      "ISO 27001: A.5.1 – Information security policies",
      // More mappings...
    ],
  },
  // More sections...
];
```
This approach provided superior results compared to traditional PDF parsing because:
- PDFs contained complex formatting and tables
- LLM could understand context and relationships between elements
- Structured output was more reliable and required less cleanup
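Whatever the LLM returns still needs runtime validation before it drives form creation. The original pipeline's validation step isn't reproduced in this write-up; the `parseComplianceSections` helper below is a minimal sketch of my own, assuming the model is asked to return a JSON array matching the `ComplianceSection` shape:

```typescript
// Hypothetical runtime validator for the LLM's JSON output; not part of the
// original pipeline, just one way to guard against malformed responses.
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function parseComplianceSections(raw: string): ComplianceSection[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error("Expected a JSON array of sections");
  }
  return data.map((s, i) => {
    if (typeof s !== "object" || s === null) {
      throw new Error(`Section ${i} is not an object`);
    }
    const sec = s as Record<string, unknown>;
    if (
      typeof sec.title !== "string" ||
      typeof sec.informationText !== "string" ||
      typeof sec.goal !== "string" ||
      !isStringArray(sec.booleanQuestions) ||
      !isStringArray(sec.mappingIndication)
    ) {
      throw new Error(`Section ${i} does not match ComplianceSection`);
    }
    return sec as unknown as ComplianceSection;
  });
}
```

Rejecting malformed sections early keeps bad data out of the downstream GraphQL calls, where a partial failure is much harder to roll back.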
2. Automated Form Generation Pipeline
```typescript
async function main() {
  // Authentication (assuming signIn resolves with the active tenant/session)
  const { tenant } = await signIn(baseUrl, process.env.USERNAME, process.env.PASSWORD);

  // Create base form structure
  const formCollection = await createFormCollection({
    tenantId: tenant.id,
    data: { name: "QM20 Assessment Form" },
  });

  // Process each section from extracted data
  for (const section of extractedData) {
    const formSection = await createFormSection({
      tenantId: tenant.id,
      formId: formCollection.id,
      data: {
        title: section.title,
        description: section.informationText,
      },
    });

    // Create dynamic form fields
    await createFormFields(tenant.id, formSection.id, section);
  }
}
```
The pipeline handles:
- Authentication and session management
- Hierarchical form creation (collections → sections → fields)
- Dynamic field generation based on question types
- Metadata and mapping preservation
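As a sketch of the "dynamic field generation" step, the mapping from an extracted section to field definitions can be kept as a pure function, which makes it testable without touching the API. The `FieldDefinition` shape and the `"BOOLEAN"`/`"TEXT"` type names below are assumptions for illustration, not the platform's actual GraphQL enums:

```typescript
// Hypothetical mapping from an extracted section to platform field
// definitions; the type names are illustrative, not the real API's.
type FieldDefinition = { type: "BOOLEAN" | "TEXT"; title: string };

function sectionToFields(section: {
  booleanQuestions: string[];
  goal: string;
}): FieldDefinition[] {
  // One yes/no field per extracted compliance question...
  const fields: FieldDefinition[] = section.booleanQuestions.map((q) => ({
    type: "BOOLEAN" as const,
    title: q,
  }));
  // ...plus a free-text field for notes against the section goal
  fields.push({ type: "TEXT", title: `Notes: ${section.goal}` });
  return fields;
}
```

Separating this mapping from the API calls also makes it easy to support new question types later without touching the orchestration code.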
3. Error Handling and Validation
```typescript
const createFormField = async ({
  tenantId,
  formSectionId,
  type,
  initialTitle,
}) => {
  try {
    const field = await createField(/* ... */);
    await updateFormField(tenantId, field.id, {
      richDescription: "",
      richTitle: initialTitle,
      disabled: false,
      metadata: {
        validation: {
          required: false,
          formats: ["PDF", "DOC", "DOCX" /* ... */],
        },
      },
    });
    return field;
  } catch (error) {
    console.error(`Failed to create field: ${initialTitle}`);
    throw error;
  }
};
```
Built-in safeguards include:
- Proper error handling for API calls
- Field validation rules
- File format restrictions
- Rich text support for complex content
What I Learned
1. PDF Data Extraction Strategy

The initial approach, using PDF parsing libraries such as `pdf-parse` or `pdf2json`, proved challenging due to:
- Inconsistent text extraction
- Loss of formatting and structure
- Difficulty handling tables and layouts
LLMs provided a more elegant solution by:
- Understanding document context
- Maintaining relationships between elements
- Producing clean, structured output
2. GraphQL API Orchestration
Managing multiple dependent API calls required careful orchestration:
- Sequential processing for proper parent-child relationships
- Error handling with appropriate rollbacks
- Rate limiting consideration
- Progress tracking for long-running operations
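The retry and rate-limiting concerns above can be sketched as a small helper. The attempt counts and backoff values here are illustrative choices, not taken from the original code:

```typescript
// Illustrative retry helper with exponential backoff between attempts,
// useful when sequential GraphQL calls hit transient errors or rate limits.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  backoffMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off 1x, 2x, 4x... before the next attempt
      if (i < attempts - 1) await sleep(backoffMs * 2 ** i);
    }
  }
  throw lastError;
}
```

Wrapping each form-creation call in `withRetry` keeps transient failures from aborting a long sequential run, while still surfacing persistent errors for rollback.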
3. Business Process Automation
The project highlighted how technical solutions can directly impact business development:
- Reduced sales cycle by providing immediate value
- Demonstrated platform capabilities effectively
- Saved significant manual work for the CSM team
- Created reusable automation patterns
What's Next?
1. Scalability Improvements
- Batch processing for multiple PDFs
- Parallel processing where possible
- Caching for improved performance
2. Enhanced Extraction
- Support for more complex PDF layouts
- Additional compliance template types
- Multi-language support
3. Integration Enhancements
- Automated testing for generated forms
- Version control for templates
- Change tracking and diff generation
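The batch and parallel processing ideas under "Scalability Improvements" could start from a simple concurrency limiter like the sketch below. This helper is hypothetical (a library such as `p-limit` provides the same pattern); the point is to cap in-flight API or LLM calls while still processing PDFs in parallel:

```typescript
// Illustrative concurrency limiter: runs `worker` over `items` with at most
// `limit` calls in flight, preserving the order of results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function run(): Promise<void> {
    // Each runner pulls the next unclaimed index until none remain
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, run),
  );
  return results;
}
```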
Key Takeaways
I learned the power of combining modern AI tools with traditional automation to solve real business challenges. By thinking creatively about PDF data extraction and leveraging LLMs, I turned what could have been weeks of manual work into an automated process.
The solution saved immediate time and resources and created a repeatable pattern for future partner onboarding. Most importantly, it transformed a traditional sales approach into a value-first demonstration that resonated with our potential partner.
Note: This case study focuses on the technical implementation while respecting confidentiality around specific partner details and compliance requirements.