Core Concepts
Morphik’s rules system currently allows for two key operations during document ingestion:
- Metadata Extraction: Pull structured information from documents into searchable metadata.
- Content Transformation: Modify document content during ingestion (redaction, summarization, etc.).
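As a sketch of what these two operations might look like at ingestion time, the rule payloads below are illustrative only — the key names (`type`, `schema`, `prompt`) are assumptions for demonstration, not Morphik’s confirmed API:

```python
# Hypothetical rule definitions for document ingestion.
# Key names and structure are illustrative, not Morphik's exact API.

metadata_rule = {
    "type": "metadata_extraction",
    # Schema describing the fields to pull into searchable metadata
    "schema": {
        "title": {"type": "string", "description": "Document title"},
        "author": {"type": "string", "description": "Primary author"},
    },
}

transform_rule = {
    "type": "natural_language",
    # Plain-language instruction applied to the document content
    "prompt": "Remove all personally identifiable information.",
}

# Rules are supplied together at ingestion time and applied in order.
rules = [metadata_rule, transform_rule]
```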
Architecture Overview
The rules engine works by:
- Accepting rule definitions during document ingestion
- Converting each rule to the appropriate model class
- Sequentially applying rules to document content
- Using an LLM to perform extractions or transformations
- Storing both the extracted metadata and modified content
Rule Types
MetadataExtractionRule
This rule type extracts structured data from document content according to a schema. It’s perfect for converting unstructured documents into structured, queryable data.

NaturalLanguageRule
This rule type transforms document content according to natural language instructions. It’s perfect for redaction, summarization, formatting changes, etc.

Technical Implementation
Under the hood, the rules processing system leverages large language models (LLMs) to perform both metadata extraction and content transformation. The system is designed to be modular and configurable.

LLM Integration
Morphik supports multiple LLM providers:
- OpenAI: For high-quality results with cloud-based processing
- Ollama: For self-hosted, private LLM deployments
Providers are configured in the morphik.toml file using the registered models approach:
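A configuration along these lines might register models and point the rules engine at one of them. The section and key names below are illustrative assumptions, not Morphik’s confirmed configuration schema:

```toml
# Illustrative sketch only — section and key names are assumptions.

[registered_models]
# A cloud model for high-quality extraction
openai_extraction = { model_name = "gpt-4" }
# A self-hosted Ollama model for private processing
ollama_local = { model_name = "llama3", api_base = "http://localhost:11434" }

[rules]
model = "openai_extraction"  # which registered model the rules engine uses
batch_size = 4096            # how much content is sent to the LLM per call
```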
Rules Processing Logic
When a document with rules is ingested:
- The document is parsed and text is extracted
- The rules are validated and converted to model classes
- For each rule:
- The appropriate prompt is constructed based on rule type
- The prompt and document content are sent to the LLM
- The LLM’s response is parsed (JSON for metadata, plain text for transformed content)
- Results are stored according to rule type
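The per-rule loop above can be sketched in a few lines. This is an illustrative reimplementation, not Morphik’s actual code; the rule shapes, prompt wording, and the `llm` callable are all assumptions:

```python
import json

def apply_rules(content, rules, llm):
    """Sequentially apply rules to content via an LLM callable.

    Illustrative sketch only: rule shapes and prompts are assumptions,
    not Morphik's actual implementation.
    """
    metadata = {}
    for rule in rules:
        if rule["type"] == "metadata_extraction":
            # Metadata rules ask the LLM for JSON matching a schema
            prompt = (
                "Extract JSON matching this schema:\n"
                + json.dumps(rule["schema"]) + "\n\n" + content
            )
            metadata.update(json.loads(llm(prompt)))
        else:
            # Natural-language rules rewrite the content itself
            content = llm(rule["prompt"] + "\n\n" + content)
    return metadata, content

# Stub LLM for demonstration: returns canned JSON for extraction,
# uppercases the content for transformation.
def stub_llm(prompt):
    if prompt.startswith("Extract JSON"):
        return '{"title": "Example"}'
    return prompt.split("\n\n", 1)[1].upper()

meta, text = apply_rules(
    "hello world",
    [{"type": "metadata_extraction", "schema": {"title": {"type": "string"}}},
     {"type": "natural_language", "prompt": "Uppercase everything."}],
    stub_llm,
)
```

Note that the transformation rule sees the content as it stands after any earlier rules, which is why rule ordering matters (see Rule Sequencing below).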
Performance Considerations
For large documents, Morphik automatically chunks the content and processes rules in batches. This ensures efficient handling of documents of any size. The batch_size configuration in morphik.toml determines how content is split up before being passed to the LLM.
Larger batch sizes may improve throughput but require more memory, and very large batches can produce unreliable results. For complex rules or larger documents, you may need to tune this setting based on your hardware capabilities and the latency requirements of your application.
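A minimal sketch of the batching idea, assuming a simple character-count split (Morphik’s actual chunker may split on token or sentence boundaries instead):

```python
def chunk_content(text, batch_size):
    """Split content into fixed-size chunks before sending to the LLM.

    Illustrative only: a real chunker would likely respect token or
    sentence boundaries rather than raw character counts.
    """
    return [text[i:i + batch_size] for i in range(0, len(text), batch_size)]

chunks = chunk_content("a" * 10, batch_size=4)  # 3 chunks: 4 + 4 + 2 chars
```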
Example Use Cases
Resume Processing System
Let’s explore a complete example for processing resumes.

Medical Document Processing with PII Redaction
Medical documents often contain sensitive information that needs redaction while preserving clinical value.

Optimizing Rule Performance
Prompt Engineering
The effectiveness of your rules depends significantly on the quality of prompts and schemas:
- Be Specific: Clearly define what you want extracted or transformed
- Provide Examples: For complex schemas, include examples in the prompt
- Limit Scope: Focus each rule on a specific task rather than trying to do too much at once
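To make “be specific” concrete, compare a vague field definition with one that names the expected format and includes an example. The schema shape here is illustrative:

```python
# Vague: the model must guess what "date" means and how to format it
vague_schema = {"date": {"type": "string"}}

# Specific: a description and example leave little room for ambiguity
specific_schema = {
    "date": {
        "type": "string",
        "description": "Publication date in ISO 8601 format",
        "examples": ["2024-03-15"],
    }
}
```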
Rule Sequencing
Rules are processed in sequence, so order matters:
- Extract First, Transform Later: Generally, extract metadata before transforming content
- Chunking Awareness: For very large documents, be aware that rules are applied to each chunk separately
- Rule Complexity: Split complex operations into multiple simpler rules
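Putting the sequencing guidance together, a rules list would place extraction before transformation so metadata is captured from the original, unredacted text. The rule shapes are illustrative:

```python
# Extraction runs first, so metadata reflects the original content;
# redaction then rewrites the text afterwards.
rules = [
    {"type": "metadata_extraction",
     "schema": {"patient_id": {"type": "string"}}},
    {"type": "natural_language",
     "prompt": "Redact all patient identifiers."},
]
```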
LLM Selection
Different tasks may benefit from different LLMs:
- Metadata Extraction: Benefits from models with strong JSON capabilities (like GPT-4)
- Simple Redaction: Can work well with smaller, faster models. For redaction, you may not want to send data across the network, so local models might be more suitable.
- Complex Transformation: May require more sophisticated models.
Conclusion
Morphik’s rules-based ingestion provides a powerful, flexible system for extracting structured data from unstructured documents and transforming content during ingestion. This capability can be applied to numerous use cases, from resume processing to medical record management to legal document analysis. The system’s architecture balances flexibility, performance, and ease of use:
- Client-Side Simplicity: Define rules using simple schemas and natural language
- Server-Side Power: Leverages LLMs to handle the complex extraction and transformation
- Configurable: Adapt to different deployment scenarios and performance requirements