Concatenated PDFs: A Simple Trick That Confuses Anti-Malware Engines and AI Systems

OPSWAT

The Hidden Danger Inside a Trusted File Format

PDFs are among the most universally trusted and widely used document formats in enterprise environments. They are exchanged daily across email, file-sharing platforms, and collaboration tools. Precisely because of that trust, they have become one of the most consistently abused vectors for phishing campaigns, malware delivery, and social engineering attacks.

According to Check Point Research, 22% of file-based cyberattacks leverage PDFs as the delivery mechanism, and 68% of all cyberattacks originate from the inbox. What is less widely understood is that PDFs are not simply containers for visible content. They are structured documents with defined internal architecture, and the way that architecture is parsed varies across readers, security tools, and AI systems.

This variability is not a bug. It is a design characteristic, and sophisticated threat actors have learned to exploit it in ways that require no vulnerability, no exploit kit, and no advanced tooling.

Understanding PDF Structure

To understand how a concatenation attack works, it is necessary to first understand how PDF parsers read a document.

When a PDF reader opens a file, it follows a defined sequence: it locates the last end-of-file marker, reads the startxref pointer, uses it to locate the cross-reference (xref) table and trailer, and then reconstructs the document by resolving object offsets. This design is intentional, allowing readers to instantly locate objects in large documents without scanning the entire file.

Figure 1 — Standard PDF document structure: Header, Body, Cross-Reference Table, and Trailer
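The lookup sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular reader is implemented: the function name `last_startxref` and the sample bytes are assumptions made for the example, and the sample is a structural skeleton rather than a renderable PDF.

```python
def last_startxref(data: bytes) -> int:
    """Return the xref offset recorded before the final %%EOF marker."""
    eof = data.rfind(b"%%EOF")                # step 1: last end-of-file marker
    if eof == -1:
        raise ValueError("no %%EOF marker found")
    kw = data.rfind(b"startxref", 0, eof)     # step 2: startxref keyword before it
    if kw == -1:
        raise ValueError("no startxref keyword found")
    fields = data[kw + len(b"startxref"):eof].split()
    if not fields:
        raise ValueError("startxref has no offset")
    return int(fields[0])                     # step 3: byte offset of the xref section


# Hypothetical skeleton: the placeholder body is sized so that "xref" starts at byte 23.
sample = (
    b"%PDF-1.4\n...objects...\n"
    b"xref\n0 1\ntrailer\n<< /Size 1 >>\nstartxref\n23\n%%EOF\n"
)
offset = last_startxref(sample)
print(offset, sample[offset:offset + 4])
```

A real parser would go on to read the xref entries and the trailer at that offset; the sketch stops once the offset is found.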

The PDF specification also defines a mechanism called Incremental Updates, which allows documents to be modified without rewriting the entire file. Changes are appended to the end of the document, and each update adds new objects, a new xref table, a new trailer, and a new end-of-file marker.

Figure 2 — PDF Incremental Updates: each revision appends its own xref section, trailer, and EOF marker

Because of this design, a valid PDF may legitimately contain multiple xref tables, multiple trailers, and multiple end-of-file markers. Most modern parsers handle this structure correctly. But this same structural flexibility also creates a measurable opportunity for manipulation.
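To make the incremental-update layout concrete, the sketch below lists every `startxref` offset in a file, one per revision. The helper name `revision_offsets` and the skeleton bytes are assumptions for illustration. The regular expression is a simplification and would also match marker text that happened to appear inside a stream.

```python
import re

STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def revision_offsets(data: bytes) -> list[int]:
    """Return the xref offset announced by each revision, oldest first."""
    return [int(m.group(1)) for m in STARTXREF.finditer(data)]


# Hypothetical skeleton of a file that was saved once and then updated once.
original = (
    b"%PDF-1.7\n...objects...\n"
    b"xref\n...\ntrailer\n<< /Size 4 >>\nstartxref\n23\n%%EOF\n"
)
body = b"...new objects...\n"
xref_at = len(original) + len(body)  # where the appended xref section begins
update = (
    body
    + b"xref\n...\ntrailer\n<< /Size 5 /Prev 23 >>\nstartxref\n"
    + str(xref_at).encode()
    + b"\n%%EOF\n"
)
updated = original + update

print(revision_offsets(updated))   # two revisions
print(updated.count(b"%PDF-"))     # still a single header
```

Note that the updated file still has exactly one `%PDF-` header, and its second trailer points back to the earlier xref section through `/Prev`. Both properties are absent when two unrelated documents are simply joined.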

The Concatenation Technique

During internal security research, OPSWAT discovered that concatenating two entirely separate PDFs into a single file produces a document that different parsers interpret in fundamentally different ways. The resulting file contains two independent document structures, each with its own header, xref table, trailer, and end-of-file marker. What began as a structural curiosity revealed a meaningful and reproducible evasion technique that had gone largely unexamined.

This is conceptually similar to parser exploitation techniques already observed with archive files, where structural ambiguity is used to obscure malicious content from security tools. In the case of PDFs, the consequences extend further: not only do security scanners disagree on what the file contains, but the version users ultimately see in their PDF reader may be entirely different from the version that was inspected.

Figure 3 — Concatenated PDF Technique
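At the byte level, the construction is plain concatenation. The sketch below uses two hypothetical structural skeletons (not renderable PDFs) to show what the combined file contains. `concatenate` is a name chosen for this example, and the article does not say whether any offsets were adjusted in OPSWAT's test files.

```python
def concatenate(first: bytes, second: bytes) -> bytes:
    """Join two complete PDF byte strings, keeping each structure intact."""
    separator = b"" if first.endswith(b"\n") else b"\n"
    return first + separator + second


# Hypothetical skeletons; lines starting with "%" are PDF comments.
first_doc = b"%PDF-1.4\n% section A: rectangle\nxref\ntrailer\nstartxref\n32\n%%EOF\n"
second_doc = b"%PDF-1.4\n% section B: circle\nxref\ntrailer\nstartxref\n29\n%%EOF\n"

combined = concatenate(first_doc, second_doc)
print(combined.count(b"%PDF-"), combined.count(b"%%EOF"))  # 2 2
```

Unlike an incremental update, the combined file carries two headers, and the second `startxref` value still describes an offset within the second document on its own rather than within the combined file.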

Because different PDF readers apply different parsing strategies, the same concatenated file can display entirely different content depending on which application opens it.

Different Applications, Different Content

A proof of concept was built from two PDF sections: the first contains instructions to draw a rectangle, and the second contains instructions to draw a circle.

Common PDF readers, including Adobe Reader, Foxit Reader, Chrome, and Microsoft Edge, locate the last startxref pointer in the file, which references the structure of the appended (second) document. They render the circle instruction.

Figure 4 — Adobe Reader displays the content of the second (appended) document

Microsoft Word and Teams Preview apply a different parsing strategy and resolve the first document structure. They render the rectangle instruction, which the user cannot see in Adobe Reader.

Figure 5 — Microsoft Word and Teams Preview display the content of the first (hidden) document
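The two behaviors can be modeled as two resolution strategies applied to the same bytes. This is a simplified model under stated assumptions: the article does not document the internals of Word or Teams Preview, so the "first" strategy here (stop at the first end-of-file marker) is only one plausible way a parser could end up with the first structure. The function name `section_seen` and the skeleton file are hypothetical.

```python
def section_seen(data: bytes, strategy: str) -> bytes:
    """Return the part of a concatenated file that a parser of the given kind resolves."""
    if strategy == "last":
        # Model of readers that work backward from the end of the file.
        start = data.rfind(b"%PDF-")
        if start == -1:
            raise ValueError("no PDF header found")
        return data[start:]
    if strategy == "first":
        # Model of parsers that read forward and stop at the first %%EOF.
        end = data.find(b"%%EOF")
        if end == -1:
            raise ValueError("no %%EOF marker found")
        return data[: end + len(b"%%EOF")]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical skeleton of the rectangle/circle proof of concept.
poc = b"%PDF-1.4\n% draw rectangle\n%%EOF\n%PDF-1.4\n% draw circle\n%%EOF\n"

print(section_seen(poc, "last"))   # contains only the circle section
print(section_seen(poc, "first"))  # contains only the rectangle section
```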

Measured Impact on Antivirus Detection

The security implications of this structural ambiguity were validated through direct testing using the OPSWAT MetaDefender® platform, which aggregates results from multiple antivirus engines.

Step 1: Original Phishing PDF

A PDF containing phishing content and malicious hyperlinks was submitted to 34 antivirus engines. Eight engines correctly identified the malicious content.

Figure 6 — Original phishing PDF: 8 out of 34 antivirus engines detected malicious content

Step 2: Concatenated PDF with a Clean Prepended Document

A clean, blank PDF was prepended before the phishing PDF to produce a concatenated document. The combined file was submitted to the same 34 engines.

Figure 7 — Concatenated PDF: detection dropped to 5 out of 34 engines. Three engines were evaded by structural manipulation.

Detection dropped to 5 out of 34 engines. Three antivirus engines no longer identified the threat. The most probable explanation is that those engines processed only the first document structure in the file, which contained the clean PDF, and did not traverse the second structure where the malicious content resided.
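Expressed as rates, the two results look like this. The figures are taken directly from the test above; only the variable names are introduced for the example.

```python
ENGINES = 34
detected_original = 8       # engines flagging the original phishing PDF
detected_concatenated = 5   # engines flagging the concatenated PDF

rate_original = detected_original / ENGINES
rate_concatenated = detected_concatenated / ENGINES
evaded = detected_original - detected_concatenated
relative_drop = evaded / detected_original

print(f"{rate_original:.1%}")      # 23.5%
print(f"{rate_concatenated:.1%}")  # 14.7%
print(f"{relative_drop:.1%}")      # 37.5%
```

In other words, prepending a blank document was enough to lose more than a third of the engines that had caught the original file.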

From the user’s perspective, however, the risk was completely unchanged. When the concatenated file was opened in Adobe Reader, the phishing page was rendered exactly as the attacker intended.

Figure 8 — Adobe Reader renders the phishing page from the concatenated PDF. The user is exposed to the same threat regardless of what security engines inspect.

How AI Systems Interpret Concatenated Documents

As AI-powered document processing becomes embedded in enterprise workflows, this structural ambiguity introduces a distinct category of risk beyond conventional malware delivery. Organizations increasingly rely on large language models to analyze documents, extract information, and support decision-making. If those systems interpret a different version of a document than the one a human user sees, the consequences extend well beyond a missed phishing link.

Testing with the same concatenated PDF demonstrated that major AI platforms interpret the file according to the same parser-dependent logic observed in traditional reader applications.

GPT: Interprets the First Section

GPT resolved the first document structure in the file and extracted the content from the hidden prepended section. It read and acted on the rectangle instruction, which is not the content visible to a user opening the file in Adobe Reader.

Figure 9 — GPT interprets the first (hidden) document structure, extracting content invisible to users in Adobe Reader

Gemini and Claude: Interpret the Second (Visible) Section

Both Gemini and Claude resolved the second document structure and extracted the content consistent with what users see in Adobe Reader. While this is the expected behavior from a user experience standpoint, it demonstrates that AI systems are subject to the same structural parsing differences as conventional readers.

Figure 10 — Gemini correctly reads the second (visible) document structure
Figure 11 — Claude also reads the second (visible) document structure, consistent with what users see

This discrepancy has direct implications for several high-priority risk scenarios:

  • Prompt injection: An attacker embeds covert instructions in the hidden first section of a concatenated PDF. A user sees a normal document. An AI system that parses the first structure receives commands that override its intended behavior, without any visible indicator to the user or reviewer.
  • Training data poisoning: Documents used to fine-tune or augment AI models may carry a hidden section that introduces adversarial content into the training corpus without triggering detection.
  • Compliance and audit failures: AI systems used for document review, contract analysis, or regulatory reporting may process a version of a document that differs materially from the version reviewed by human counsel or compliance staff, creating a silent governance gap.

For Legal and Corporate Counsel, Privacy Officers, and Compliance teams, the scenario in which an AI system acts on content that no human reviewed and no security tool flagged is not theoretical. The concatenation technique makes it trivially achievable.

How OPSWAT Addresses the Concatenated PDF Attack

Deep CDR™ Technology: File Sanitization That Eliminates the Threat Before It Arrives

OPSWAT Deep CDR™ Technology treats every file as potentially malicious. Rather than attempting to detect specific malicious patterns, Deep CDR™ Technology deconstructs each file, validates its internal structure against official format specifications, removes all elements that do not conform or that fall outside defined policy, and regenerates a clean, fully usable file. This approach addresses the concatenated PDF attack at its structural root.

Deep CDR™ Technology prevents this attack technique with its File Structure Verification capability. When processing a concatenated PDF, Deep CDR™ Technology identifies the structural anomaly: the presence of multiple independent document structures, multiple xref tables, multiple trailers, and multiple end-of-file markers in a configuration that does not conform to a valid single PDF document. It then removes the conflicting elements and reconstructs the document from the verified, safe content layer only.
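As a rough illustration of the kind of structural signal involved, the sketch below counts headers and end-of-file markers and flags files with more than one header. This is not OPSWAT's implementation, which the article does not describe at code level. The names `structure_report` and `looks_concatenated` are hypothetical, and the heuristic is deliberately naive: a PDF that embeds another PDF as an uncompressed attachment would also trip it.

```python
import re

HEADER = re.compile(rb"%PDF-\d\.\d")


def structure_report(data: bytes) -> dict[str, int]:
    """Count the structural markers that matter for this check."""
    return {
        "headers": len(HEADER.findall(data)),
        "startxref": data.count(b"startxref"),
        "eof_markers": data.count(b"%%EOF"),
    }


def looks_concatenated(data: bytes) -> bool:
    """Heuristic: incremental updates keep one header; joined documents have several."""
    return structure_report(data)["headers"] > 1


# Hypothetical skeletons for the two cases.
incremental = b"%PDF-1.7\n...\nstartxref\n9\n%%EOF\n...\nstartxref\n40\n%%EOF\n"
concatenated = b"%PDF-1.7\n...\nstartxref\n9\n%%EOF\n%PDF-1.7\n...\nstartxref\n9\n%%EOF\n"

print(looks_concatenated(incremental))   # False
print(looks_concatenated(concatenated))  # True
```

Both samples contain two `startxref` keywords and two `%%EOF` markers, which is why those counts alone cannot separate a legitimate incremental update from a concatenated file.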

What Deep CDR™ Technology Actually Removes

The following screenshot from MetaDefender shows the Deep CDR™ Technology analysis result for the concatenated phishing PDF. With Deep CDR™ Technology configured and applied, the system identified and acted on each element that violated the expected file structure or security policy. 

Figure 12 — Deep CDR™ Technology analysis result: 2 hyperlinks removed, 1 image sanitized, 3 unused objects removed from the concatenated PDF

As shown, Deep CDR™ Technology took the following actions on the concatenated PDF:

  • Removed 2 hyperlinks: the malicious phishing links embedded in the document were stripped before the file reached the user.
  • Sanitized 1 image: the embedded image, which was used as visual bait in the phishing lure, was sanitized.
  • Removed 3 unused objects: the orphaned objects from the hidden first document structure, which no longer belonged to any valid document layer, were identified and removed.

The resulting output is a structurally clean PDF that preserves business-relevant content and passes file format specification checks. Critically, what the user receives, what AV engines scan, and what any downstream AI system processes are identical: a single, verified document with no hidden structure, no malicious links, and no out-of-policy objects.

Flexible Sanitization Mode

In environments where usability must be maintained alongside security, Deep CDR™ Technology operates in Flexible Sanitization Mode. The system does not block the file. Instead, it performs structural reconstruction: the conflicting document sections are removed, all active and potentially malicious objects are stripped, and a clean, policy-compliant PDF is regenerated and delivered to the user. The user experience is preserved while the attack surface is eliminated.

Sanitization Details Report

Every file processed by Deep CDR™ Technology produces a forensic sanitization report documenting which objects were identified, what action was taken, and why. As illustrated in Figure 12, this report provides a full audit trail of every structural anomaly and policy violation addressed. For Compliance Officers, Privacy Officers, and Legal Counsel, this report is the documented proof that files entering the environment were processed against a consistent, verifiable security policy, and that any deviation from expected file structure was recorded and remediated.

Adaptive Sandbox: Structure-Aware Analysis That Leaves No Blind Spots

While Deep CDR™ Technology mitigates the risk by sanitizing and rebuilding the document, OPSWAT Adaptive Sandbox (Aether) approaches the problem from a fundamentally different angle: it performs deep behavioral analysis of every possible document structure within the file. Where Deep CDR™ Technology removes the threat before a file reaches a user, Adaptive Sandbox detonates the file in a controlled environment and observes exactly what it was designed to do.

In the case of concatenated PDFs, Adaptive Sandbox does not rely on a single parser interpretation. Instead, it performs structure-aware analysis to identify that the file actually contains multiple valid PDF documents appended together. This directly prevents attackers from hiding malicious content behind parser inconsistencies. The analysis proceeds in three stages:

1. Extract: Each embedded PDF document is individually extracted from the concatenated structure. No document layer is treated as an authoritative one. Every section present in the binary stream is identified and isolated for independent inspection.

2. Analyze: Each extracted document is analyzed independently in a controlled emulated environment. Adaptive Sandbox executes the content, monitors runtime behavior, and surfaces any malicious activity including network callbacks, script execution, payload drops, and attempts to exploit the rendering application, regardless of which document layer the behavior originates from.

3. Correlate: The results of each independent analysis are correlated back to the original file, producing a unified verdict that reflects the true behavioral intent of the complete concatenated document. Indicators of Compromise extracted from each layer are consolidated into a single forensic report, supporting threat intelligence, incident response, and SOC workflows.

Figure 13 — Deep analysis of a concatenated PDF with Adaptive Sandbox
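The extract, analyze, and correlate stages can be outlined as a small pipeline. This is a sketch of the control flow only, under stated assumptions: every name in it is hypothetical, and the `analyze` step is a static token check standing in for the behavioral emulation the article describes, which cannot be reproduced in a few lines.

```python
import re
from dataclasses import dataclass

HEADER = re.compile(rb"%PDF-\d\.\d")
# Hypothetical static indicators; a real sandbox observes runtime behavior instead.
SUSPICIOUS_TOKENS = (b"/URI", b"/JavaScript", b"/Launch")


@dataclass
class LayerResult:
    index: int
    indicators: list[bytes]


def extract_documents(data: bytes) -> list[bytes]:
    """Stage 1: split the byte stream at every PDF header."""
    starts = [m.start() for m in HEADER.finditer(data)]
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


def analyze(index: int, doc: bytes) -> LayerResult:
    """Stage 2 (stand-in): inspect one layer independently of the others."""
    return LayerResult(index, [t for t in SUSPICIOUS_TOKENS if t in doc])


def correlate(results: list[LayerResult]) -> dict:
    """Stage 3: fold per-layer findings into one verdict for the whole file."""
    flagged = [r.index for r in results if r.indicators]
    return {"layers": len(results), "flagged_layers": flagged, "malicious": bool(flagged)}


# Hypothetical skeleton: a clean first layer followed by a layer carrying a link.
sample = (
    b"%PDF-1.4\n% clean blank page\n%%EOF\n"
    b"%PDF-1.4\n<< /S /URI /URI (http://phish.example/login) >>\n%%EOF\n"
)
verdict = correlate([analyze(i, d) for i, d in enumerate(extract_documents(sample))])
print(verdict)
```

The point of the design is that no layer is skipped: the verdict for the file is malicious if any single layer is, regardless of which layer a given reader would display.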

The result is a complete analytical picture with no blind spots. Every embedded document is analyzed. Every object chain is inspected. There is no room for parser tricks. An attacker cannot rely on one application seeing a clean layer while a malicious layer goes unexamined, because Adaptive Sandbox does not make that distinction. It examines everything.

Layered Detection for Complete Defense

Deep CDR™ Technology and Adaptive Sandbox address the concatenated PDF threat from opposite directions, and together they leave no viable attack path. Deep CDR™ Technology removes the threat before the file is delivered: the user receives a structurally clean document with no hidden sections, no malicious links, and no out-of-policy objects. Adaptive Sandbox reveals the intent of the threat before or alongside delivery: every document layer is executed, every behavior is observed, and every Indicator of Compromise is extracted and recorded.

For organizations operating in high-risk environments, this combination is particularly powerful. Deep CDR™ Technology ensures that documents reaching users cannot execute hidden logic. Adaptive Sandbox ensures that the behavioral intent of every document, including every layer of a concatenated file, is understood. Neither technology requires prior knowledge of the specific attack technique to be effective. Both operate on the structure of the file and the behavior of its content, not on known signatures or threat intelligence feeds.

Conclusion

The concatenated PDF attack technique illustrates a category of threat that detection-based security was not designed to address. There is no malware signature to find. There is no exploit to detect. There is only a structural arrangement of a legitimate file format that causes different systems to see different things.

For IT Managers and Directors, the operational implication is clear: scanning tools currently deployed may be evaluating a different version of a document than the one users open.

For Compliance and Risk Officers, the implication is a governance gap: the audit trail for file security may not reflect the actual content delivered.

For C-Suite Executives, financial exposure is significant, with the average cost of a successful phishing breach now exceeding $4.88 million and attacks that evade standard controls among the most expensive to remediate.

For Legal and Corporate Counsel and Privacy Officers, AI systems acting on hidden document content without human review or security visibility represent an emerging and material risk.

OPSWAT Deep CDR™ Technology and Adaptive Sandbox close this gap from both directions. Deep CDR™ Technology eliminates the structural conditions that allow such threats to exist: by verifying file structure, removing all hidden and conflicting document sections, and regenerating a clean, verified output, it ensures every file entering the environment carries exactly the content that was inspected. Adaptive Sandbox ensures that nothing goes unexamined: by performing structure-aware analysis across every embedded document layer, executing each independently, and correlating results back to the original file, it exposes the behavioral intent of threats that no parser trick can conceal. Together, these technologies ensure that what users receive is safe, and that what attackers designed the file to do is fully understood.
