Automating PDF form digitization with LLMs
Context
While working on partnership development for our compliance platform, we identified an opportunity to demonstrate value to a potential enterprise partner. They had comprehensive compliance templates (QM10 and QM20) available only as PDFs. Instead of traditional sales outreach, I proposed building their templates directly on our platform as a proof of concept.
Note: While the specific partner and compliance details are kept confidential, this case study focuses on the technical approach to solving a common enterprise challenge: converting unstructured PDF forms into interactive digital workflows.
What I Built
I developed an automated pipeline that could:
- Extract structured data from complex PDF compliance forms
- Transform the data into a standardized format
- Automatically generate digital workflows through our platform's GraphQL API
- Create a complete, interactive compliance assessment ready for immediate use
The end result was a fully-functional digital version of their compliance workflow that we could demonstrate during partnership discussions.
Technical Breakdown
Stack & Tools
- TypeScript/Node.js: Core implementation
- OpenAI GPT API: PDF content extraction and structuring
- GraphQL: Platform API integration
- PDF Processing: Initial exploration with PDF parsing libraries
Key Architecture Decisions
1. LLM-Based Data Extraction
After initially exploring traditional PDF parsing libraries, I pivoted to using OpenAI's GPT for data extraction. Here's why:
```typescript
// Example of the structured data format we achieved
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Sample of extracted and structured data (the source template is Dutch,
// so the extracted strings are preserved verbatim)
const extractedData: ComplianceSection[] = [
  {
    title: "1.2 Informatiebeveiligingsbeleid en bestuurlijke goedkeuring",
    informationText: "Het management van de organisatie dient...",
    goal: "Voorkomen dat er informatiebeveiligingsincidenten...",
    booleanQuestions: [
      "Heeft de organisatie een gedetailleerd informatiebeveiligingsbeleid...?",
      // More questions...
    ],
    mappingIndication: [
      "ISO 27001: A.5.1 – Information security policies",
      // More mappings...
    ],
  },
  // More sections...
];
```
This approach provided superior results compared to traditional PDF parsing because:
- PDFs contained complex formatting and tables
- LLM could understand context and relationships between elements
- Structured output was more reliable and required less cleanup
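Whatever the LLM returns still needs runtime validation before it drives form creation. The original pipeline's validation step isn't reproduced in this write-up; the `parseComplianceSections` helper below is a minimal sketch of my own, assuming the model is asked to return a JSON array matching the `ComplianceSection` shape:

```typescript
// Hypothetical runtime validator for the LLM's JSON output; not part of the
// original pipeline, just one way to guard against malformed responses.
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function parseComplianceSections(raw: string): ComplianceSection[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error("Expected a JSON array of sections");
  }
  return data.map((s, i) => {
    if (typeof s !== "object" || s === null) {
      throw new Error(`Section ${i} is not an object`);
    }
    const sec = s as Record<string, unknown>;
    if (
      typeof sec.title !== "string" ||
      typeof sec.informationText !== "string" ||
      typeof sec.goal !== "string" ||
      !isStringArray(sec.booleanQuestions) ||
      !isStringArray(sec.mappingIndication)
    ) {
      throw new Error(`Section ${i} does not match ComplianceSection`);
    }
    return sec as unknown as ComplianceSection;
  });
}
```

Rejecting malformed sections early keeps bad data out of the downstream GraphQL calls, where a partial failure is much harder to roll back.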
2. Automated Form Generation Pipeline
```typescript
async function main() {
  // Authentication (assuming signIn resolves with the active tenant/session)
  const { tenant } = await signIn(baseUrl, process.env.USERNAME, process.env.PASSWORD);

  // Create base form structure
  const formCollection = await createFormCollection({
    tenantId: tenant.id,
    data: { name: "QM20 Assessment Form" },
  });

  // Process each section from extracted data
  for (const section of extractedData) {
    const formSection = await createFormSection({
      tenantId: tenant.id,
      formId: formCollection.id,
      data: {
        title: section.title,
        description: section.informationText,
      },
    });

    // Create dynamic form fields
    await createFormFields(tenant.id, formSection.id, section);
  }
}
```
The pipeline handles:
- Authentication and session management
- Hierarchical form creation (collections → sections → fields)
- Dynamic field generation based on question types
- Metadata and mapping preservation
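As a sketch of the "dynamic field generation" step, the mapping from an extracted section to field definitions can be kept as a pure function, which makes it testable without touching the API. The `FieldDefinition` shape and the `"BOOLEAN"`/`"TEXT"` type names below are assumptions for illustration, not the platform's actual GraphQL enums:

```typescript
// Hypothetical mapping from an extracted section to platform field
// definitions; the type names are illustrative, not the real API's.
type FieldDefinition = { type: "BOOLEAN" | "TEXT"; title: string };

function sectionToFields(section: {
  booleanQuestions: string[];
  goal: string;
}): FieldDefinition[] {
  // One yes/no field per extracted compliance question...
  const fields: FieldDefinition[] = section.booleanQuestions.map((q) => ({
    type: "BOOLEAN" as const,
    title: q,
  }));
  // ...plus a free-text field for notes against the section goal
  fields.push({ type: "TEXT", title: `Notes: ${section.goal}` });
  return fields;
}
```

Separating this mapping from the API calls also makes it easy to support new question types later without touching the orchestration code.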
3. Error Handling and Validation
```typescript
const createFormField = async ({
  tenantId,
  formSectionId,
  type,
  initialTitle,
}) => {
  try {
    const field = await createField(/* ... */);
    await updateFormField(tenantId, field.id, {
      richDescription: "",
      richTitle: initialTitle,
      disabled: false,
      metadata: {
        validation: {
          required: false,
          formats: ["PDF", "DOC", "DOCX" /* ... */],
        },
      },
    });
    return field;
  } catch (error) {
    console.error(`Failed to create field: ${initialTitle}`);
    throw error;
  }
};
```
Built-in safeguards include:
- Proper error handling for API calls
- Field validation rules
- File format restrictions
- Rich text support for complex content
What I Learned
1. PDF Data Extraction Strategy

The initial approach, using PDF parsing libraries such as `pdf-parse` or `pdf2json`, proved challenging due to:
- Inconsistent text extraction
- Loss of formatting and structure
- Difficulty handling tables and layouts
LLMs provided a more elegant solution by:
- Understanding document context
- Maintaining relationships between elements
- Producing clean, structured output
2. GraphQL API Orchestration
Managing multiple dependent API calls required careful orchestration:
- Sequential processing for proper parent-child relationships
- Error handling with appropriate rollbacks
- Rate limiting consideration
- Progress tracking for long-running operations
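The retry and rate-limiting concerns above can be sketched as a small helper. The attempt counts and backoff values here are illustrative choices, not taken from the original code:

```typescript
// Illustrative retry helper with exponential backoff between attempts,
// useful when sequential GraphQL calls hit transient errors or rate limits.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  backoffMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off 1x, 2x, 4x... before the next attempt
      if (i < attempts - 1) await sleep(backoffMs * 2 ** i);
    }
  }
  throw lastError;
}
```

Wrapping each form-creation call in `withRetry` keeps transient failures from aborting a long sequential run, while still surfacing persistent errors for rollback.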
3. Business Process Automation
The project highlighted how technical solutions can directly impact business development:
- Reduced sales cycle by providing immediate value
- Demonstrated platform capabilities effectively
- Saved significant manual work for the CSM team
- Created reusable automation patterns
What's Next?
1. Scalability Improvements
- Batch processing for multiple PDFs
- Parallel processing where possible
- Caching for improved performance
2. Enhanced Extraction
- Support for more complex PDF layouts
- Additional compliance template types
- Multi-language support
3. Integration Enhancements
- Automated testing for generated forms
- Version control for templates
- Change tracking and diff generation
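The batch and parallel processing ideas under "Scalability Improvements" could start from a simple concurrency limiter like the sketch below. This helper is hypothetical (a library such as `p-limit` provides the same pattern); the point is to cap in-flight API or LLM calls while still processing PDFs in parallel:

```typescript
// Illustrative concurrency limiter: runs `worker` over `items` with at most
// `limit` calls in flight, preserving the order of results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function run(): Promise<void> {
    // Each runner pulls the next unclaimed index until none remain
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, run),
  );
  return results;
}
```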
Key Takeaways
I learned the power of combining modern AI tools with traditional automation to solve real business challenges. By thinking creatively about PDF data extraction and leveraging LLMs, I turned what could have been weeks of manual work into an automated process.
The solution saved immediate time and resources and created a repeatable pattern for future partner onboarding. Most importantly, it transformed a traditional sales approach into a value-first demonstration that resonated with our potential partner.
Note: This case study focuses on the technical implementation while respecting confidentiality around specific partner details and compliance requirements.