PDF File Analysis Cheatsheet

Triage in 30 seconds

sha256sum sample
file sample
trid sample
exiftool sample
strings -a -n 6 sample | less
strings -el sample | less
binwalk sample
ent sample

Goal:
- identify exact format
- find obvious URLs / PowerShell / JS
- detect embedded blobs / high entropy
- decide PDF vs Office vs script vs PE workflow
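
The hash and entropy steps above can also be scripted when the sample is already in memory. A minimal sketch, assuming hypothetical helper names (shannon_entropy, triage); values close to 8 bits per byte usually point at compressed or encrypted content, but the threshold is a judgment call:

```python
import hashlib
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def triage(data: bytes) -> dict:
    """Collect the hash, size, and entropy of a sample held in memory."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "entropy": round(shannon_entropy(data), 3),
    }

if __name__ == "__main__":
    # Hypothetical in-memory sample; in practice read the file with open(path, "rb").
    print(triage(b"%PDF-1.7\n" + bytes(range(256)) * 4))
```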

Main decision tree

PDF?
  → pdfid.py → pdf-parser.py → extract JS / objects / attachments

Office / OLE / OOXML / RTF?
  → oleid → olevba → oledump.py → extract macro / payload

Encoded script / PowerShell?
  → grep / strings / Base64 hunt → decode UTF-16LE if needed

Embedded binary?
  → binwalk -e / foremost → rerun file triage on extracted files
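
The decision tree can be approximated by checking leading bytes. A rough sketch, assuming a hypothetical classify() helper; it only looks at well-known magic values and does not replace file or trid:

```python
def classify(data: bytes) -> str:
    """Guess which workflow a sample belongs to from its leading bytes."""
    head = data[:1024]
    if b"%PDF-" in head:
        return "pdf"              # header may be preceded by junk bytes
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole"              # legacy .doc / .xls / .ppt
    if head.startswith(b"PK\x03\x04"):
        return "zip-or-ooxml"     # confirm via [Content_Types].xml
    if head.lstrip().startswith(b"{\\rt"):
        return "rtf"              # also catches truncated {\rt headers
    if head.startswith(b"MZ"):
        return "pe"
    return "unknown-or-script"
```

Anything returned as "unknown-or-script" still needs the strings / Base64 hunt.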

Very common indicators

- powershell -enc / EncodedCommand
- long Base64 blobs
- cmd /c, wscript, mshta, rundll32
- suspicious URLs / IPs
- auto-executing macros
- PDF /OpenAction, /JS, /Launch
- embedded files or object streams

Fast PDF workflow

pdfid.py sample.pdf
pdf-parser.py -a sample.pdf
pdf-parser.py --search javascript sample.pdf
pdf-parser.py --search OpenAction sample.pdf
pdf-parser.py --search EmbeddedFile sample.pdf

Then inspect suspicious objects:
pdf-parser.py -o <id> -f sample.pdf
pdf-parser.py -o <id> -f -d dumped.bin sample.pdf
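
When pdf-parser.py is not at hand, Flate-compressed streams can be inflated with a few lines of Python. This is only a sketch under simplifying assumptions: inflate_streams is a hypothetical name, /Length and filter chains are ignored, and only plain zlib (FlateDecode) data is handled:

```python
import re
import zlib

# Stream body sits between the "stream" keyword and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # decompressobj keeps trailing bytes in unused_data and returns
            # partial output for truncated streams instead of raising.
            results.append(zlib.decompressobj().decompress(body))
        except zlib.error:
            continue  # not Flate-encoded, or wrapped in another filter
    return results
```

Streams that fail here may use ASCIIHexDecode, ASCII85Decode, or a chain of filters; fall back to pdf-parser.py -f for those.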

What to look for

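/JS            key holding the JavaScript source (inline or in a stream)
/JavaScript    action type that runs JavaScript
/OpenAction    action run when the document opens
/AA            additional actions fired by page / field / document events
/Launch        launches an external program or file
/EmbeddedFile  embedded attachment
/ObjStm        object stream; can hide objects from keyword scans
/XFA           XML Forms Architecture; can carry scripts, sometimes abused
/URI           outbound URLs / links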
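
PDF names may be obfuscated with #xx hex escapes (for example /J#61vaScript), so a plain grep can miss them. A small sketch of a keyword counter that undoes the escapes first; count_keywords and normalize are hypothetical names, and objects packed inside compressed streams will not be seen:

```python
import re

KEYWORDS = ["/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch",
            "/EmbeddedFile", "/ObjStm", "/XFA", "/URI"]

# A name runs from "/" up to whitespace or a PDF delimiter character.
NAME_RE = re.compile(rb"/[^\s/<>\[\]()%{}]+")
HEX_RE = re.compile(rb"#([0-9A-Fa-f]{2})")

def normalize(name: bytes) -> str:
    """Undo #xx hex escapes so /J#61vaScript compares equal to /JavaScript."""
    decoded = HEX_RE.sub(lambda m: bytes([int(m.group(1), 16)]), name)
    return decoded.decode("latin-1")

def count_keywords(pdf_bytes: bytes) -> dict:
    """Count occurrences of each keyword among the names in the raw file."""
    counts = {k: 0 for k in KEYWORDS}
    for raw in NAME_RE.findall(pdf_bytes):
        name = normalize(raw)
        if name in counts:
            counts[name] += 1
    return counts
```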

Useful helpers

peepdf           # richer PDF analysis, JS / suspicious objects
mutool show      # inspect xref / objects
mutool extract   # embedded images / fonts (not attachments)
pdfdetach        # extract embedded files
qpdf --qdf --object-streams=disable sample.pdf out.pdf
js-beautify dumped.js

Recognize the format

OLE / legacy
- .doc / .xls / .ppt
- compound binary format
- use oletools / oledump heavily

OOXML
- .docx / .xlsx / .pptx
- actually ZIP containers
- inspect with unzip -l or 7z l

RTF
- plain-ish text container
- can embed OLE objects / exploit blobs
- use rtfobj (oletools) to list / extract embedded objects

Core Office workflow

oleid sample.doc
olevba --decode sample.doc
oledump.py sample.doc
mraptor sample.doc
msodde sample.doc

For OOXML:
unzip -l sample.docx
7z x sample.docx -oout
find out -type f | sort

High-value indicators

- AutoOpen, Document_Open, Workbook_Open
- Shell, CreateObject, WScript.Shell
- URLDownloadToFile, XMLHTTP, WinHttp
- PowerShell, cmd, mshta, rundll32
- DDE fields / remote templates
- suspicious relationships in OOXML
- embedded OLE packages / attachments

Files worth checking

[Content_Types].xml
_rels/.rels
word/_rels/document.xml.rels
word/document.xml
word/vbaProject.bin
word/embeddings/
docProps/

Look for:
- remote templates
- weird external relationships
- embedded objects
- vbaProject.bin presence

Quick commands

unzip -l sample.docx
unzip -p sample.docx word/_rels/document.xml.rels
unzip -p sample.docx word/document.xml | head
find out -type f \( -iname '*.rels' -o -iname '*.xml' \) | sort
grep -RniE 'https?://|template|oleObject|external' out/
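
Remote templates and other external links show up as relationships with TargetMode="External". A minimal sketch that lists them without unpacking to disk; external_relationships is a hypothetical helper, and for hostile input a hardened XML parser such as defusedxml may be a safer choice than the standard library one used here:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def external_relationships(ooxml_bytes: bytes) -> list:
    """List (part, type, target) for each relationship marked External."""
    found = []
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode", "").lower() == "external":
                    # Keep only the last path segment of the type URI.
                    rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
                    found.append((name, rel_type, rel.get("Target", "")))
    return found
```

An attachedTemplate or oleObject entry pointing at a URL is worth a closer look; ordinary hyperlinks are usually noise.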

Common patterns

powershell -enc
powershell -encodedcommand
FromBase64String
IEX
DownloadString
Invoke-WebRequest
Net.WebClient

Search:
strings -a file | grep -iE 'powershell|encodedcommand|frombase64string|iex'
strings -el file | grep -iE 'powershell|encodedcommand|frombase64string|iex'
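
The two strings passes can be folded into one check that looks for each keyword in both ASCII and UTF-16LE form. A sketch only; find_keywords is a hypothetical name and short tokens such as iex will produce false positives:

```python
KEYWORDS = ["powershell", "encodedcommand", "frombase64string", "iex"]

def find_keywords(data: bytes) -> dict:
    """Map each keyword to the encodings ("ascii", "utf-16le") it appears in."""
    hits = {}
    lowered = data.lower()  # bytes.lower() only folds ASCII letters
    for word in KEYWORDS:
        for label, needle in (("ascii", word.encode("ascii")),
                              ("utf-16le", word.encode("utf-16-le"))):
            if needle in lowered:
                hits.setdefault(word, []).append(label)
    return hits
```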

Decode workflow

# Normal Base64
echo 'BASE64' | base64 -d

# PowerShell -enc payloads are Base64 over UTF-16LE text
echo 'BASE64' | base64 -d | iconv -f UTF-16LE -t UTF-8

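# Gzip / zlib after Base64 is also common
python3 - <<'PY'
import base64, gzip, zlib
raw = base64.b64decode('...')
# gzip data starts with 1f 8b; otherwise assume a zlib stream
out = gzip.decompress(raw) if raw[:2] == b'\x1f\x8b' else zlib.decompress(raw)
print(out.decode('utf-8', errors='replace'))
PY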
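
The same decode steps as reusable functions, for cases where the blob arrives wrapped across lines or with its padding stripped. A minimal sketch; decode_encoded_command and gunzip_if_needed are hypothetical names:

```python
import base64
import gzip

def decode_encoded_command(b64_text: str) -> str:
    """Decode a PowerShell -EncodedCommand argument (Base64 over UTF-16LE)."""
    cleaned = "".join(b64_text.split())        # drop whitespace / line wraps
    cleaned += "=" * (-len(cleaned) % 4)       # restore stripped padding
    return base64.b64decode(cleaned).decode("utf-16-le", errors="replace")

def gunzip_if_needed(raw: bytes) -> bytes:
    """Inflate the data when it starts with the gzip magic bytes 1f 8b."""
    return gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw

if __name__ == "__main__":
    print(decode_encoded_command("ZABpAHIA"))  # prints: dir
```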

Base64 hunting

grep -aEo '[A-Za-z0-9+/]{20,}={0,2}' sample
strings -a sample | grep -E '^[A-Za-z0-9+/]{20,}={0,2}$'
binwalk -e sample
foremost -i sample -o out/

For short noisy dumps:
- reassemble blobs split across packets or wrapped lines first
- then test Base64 candidates
- then try a UTF-16LE decode of the result
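
Candidate testing can be automated: pull out Base64-looking runs, decode them, and keep only those that turn into readable text as UTF-8 or UTF-16LE. A sketch under stated assumptions; hunt_base64, the 20-character minimum, and the 0.9 readability threshold are illustrative choices, and unpadded or mis-bounded blobs are skipped:

```python
import base64
import binascii
import re

B64_RE = re.compile(rb"[A-Za-z0-9+/]{20,}={0,2}")

def ascii_ratio(text: str) -> float:
    """Fraction of characters that are printable ASCII or common whitespace."""
    if not text:
        return 0.0
    ok = sum(32 <= ord(ch) < 127 or ch in "\r\n\t" for ch in text)
    return ok / len(text)

def hunt_base64(data: bytes, threshold: float = 0.9) -> list:
    """Return (blob, encoding, text) for blobs that decode to readable text."""
    results = []
    for blob in B64_RE.findall(data):
        if len(blob) % 4:
            continue  # wrong length for padded Base64
        try:
            raw = base64.b64decode(blob, validate=True)
        except binascii.Error:
            continue
        # UTF-8 first: UTF-16LE text decodes as UTF-8 with NULs and fails the ratio.
        for encoding in ("utf-8", "utf-16-le"):
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError:
                continue
            if ascii_ratio(text) >= threshold:
                results.append((blob.decode("ascii"), encoding, text))
                break
    return results
```

Blobs that decode to binary rather than text are dropped here; check those separately for gzip, zlib, or PE headers.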

One complete workflow

1. sha256sum / file / trid / exiftool
2. strings -a and strings -el
3. Decide:
   - PDF → pdfid.py + pdf-parser.py
   - Office → oleid + olevba + oledump.py
   - OOXML → unzip and inspect .rels, XML, vbaProject.bin
4. Search for:
   - PowerShell
   - Base64 blobs
   - URLs / domains
   - embedded objects / files
5. Extract suspicious content
6. Decode / beautify / rescan extracted payloads
7. Only then consider dynamic execution in an isolated lab