PDF / Office File Analysis Cheatsheet (Forensics)
PDF · Office · macro · JS
01 Start Here
Triage in 30 seconds
sha256sum sample
file sample
trid sample
exiftool sample
strings -a -n 6 sample | less
strings -el sample | less
binwalk sample
ent sample
Goal:
- identify exact format
- find obvious URLs / PowerShell / JS
- detect embedded blobs / high entropy
- decide PDF vs Office vs script vs PE workflow
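The hashing and entropy steps above can also be scripted. The following is a minimal sketch, not a replacement for `sha256sum` or `ent`; the function names, the 4096-byte window, and the 7.2 bits/byte cutoff are illustrative assumptions, not values taken from any tool.

```python
import hashlib
import math
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the sample as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_windows(data: bytes, size: int = 4096, threshold: float = 7.2):
    """Yield (offset, entropy) for fixed-size windows at or above the threshold.

    Window size and threshold are assumed defaults for illustration only.
    """
    for offset in range(0, len(data), size):
        h = shannon_entropy(data[offset:offset + size])
        if h >= threshold:
            yield offset, h
```

Windows close to 8 bits/byte usually mean compressed or encrypted data, which is where an embedded blob is most likely to sit.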
Main decision tree
PDF?
→ pdfid.py → pdf-parser.py → extract JS / objects / attachments
Office / OLE / OOXML / RTF?
→ oleid → olevba → oledump.py → extract macro / payload
Encoded script / PowerShell?
→ grep / strings / Base64 hunt → decode UTF-16LE if needed
Embedded binary?
→ binwalk -e / foremost → rerun file triage on extracted files
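The decision tree can be approximated by checking magic bytes before any tool is run. This is a rough sketch under stated assumptions: `guess_format` and `NEXT_STEP` are hypothetical names, only a handful of signatures are covered, and `file` / `trid` remain the authoritative identifiers.

```python
# Hypothetical mapping from detected format to the tool chain in the tree above.
NEXT_STEP = {
    "pdf": "pdfid.py -> pdf-parser.py",
    "ole": "oleid -> olevba -> oledump.py",
    "zip": "unzip -l, then inspect .rels / vbaProject.bin if OOXML",
    "rtf": "treat as Office / RTF container",
    "pe": "rerun file triage, then PE workflow",
    "unknown": "strings / Base64 hunt / binwalk -e",
}


def guess_format(data: bytes) -> str:
    """Classify a sample by its leading bytes (coarse, signature-based)."""
    head = data[:1024]
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole"  # OLE2 compound file: legacy .doc / .xls / .ppt
    if data.startswith(b"PK\x03\x04"):
        return "zip"  # OOXML is a ZIP container; confirm via [Content_Types].xml
    if data.startswith(b"MZ"):
        return "pe"
    if head.lstrip().startswith(b"{\\rt"):
        return "rtf"  # malformed headers such as "{\rt" are still opened by Word
    if b"%PDF-" in head:
        return "pdf"  # readers tolerate junk before the header
    return "unknown"
```

Usage: `NEXT_STEP[guess_format(open("sample", "rb").read())]` returns the suggested starting point for that sample.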
Very common indicators
- powershell -enc / EncodedCommand
- long Base64 blobs
- cmd /c , wscript , mshta , rundll32
- suspicious URLs / IPs
- auto-executing macros
- PDF /OpenAction , /JS , /Launch
- embedded files or object streams
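For the "suspicious URLs / IPs" item, a small extractor over raw bytes is often enough for a first pass. The sketch below is an assumption-laden helper (`extract_network_iocs` is a hypothetical name, and the regexes are deliberately simple); it will miss defanged or obfuscated indicators.

```python
import re

# Simple patterns for triage; not a full URL or IP grammar.
URL_RE = re.compile(rb"https?://[^\s\"'<>)\]]+", re.IGNORECASE)
IPV4_RE = re.compile(rb"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def extract_network_iocs(data: bytes) -> dict:
    """Return de-duplicated, sorted URLs and IPv4 addresses found in data."""
    urls = sorted({m.decode("ascii", "replace") for m in URL_RE.findall(data)})
    ips = set()
    for m in IPV4_RE.findall(data):
        text = m.decode("ascii")
        # Drop dotted strings whose octets are out of range (e.g. 999.1.1.1).
        if all(int(part) <= 255 for part in text.split(".")):
            ips.add(text)
    return {"urls": urls, "ipv4": sorted(ips)}
```

Run it on both the raw file and the output of `strings -el`, since UTF-16LE strings will not match byte-wise ASCII patterns.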
02 PDF Analysis
Fast PDF workflow
pdfid.py sample.pdf
pdf-parser.py -a sample.pdf
pdf-parser.py --search javascript sample.pdf
pdf-parser.py --search OpenAction sample.pdf
pdf-parser.py --search EmbeddedFile sample.pdf
Then inspect suspicious objects:
pdf-parser.py -o <id> -f sample.pdf
pdf-parser.py -o <id> -f -d dumped.bin sample.pdf
What to look for
/JS            JavaScript present
/JavaScript    JavaScript action, explicit name
/OpenAction    action run when the document opens
/AA            additional automatic actions
/Launch        launches an external program or file
/EmbeddedFile  embedded attachment
/ObjStm        compressed object streams (can hide other objects)
/XFA           XML Forms Architecture; sometimes abused
/URI           outbound URLs / links
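A keyword count over the raw bytes gives a quick picture of which of these names occur. The sketch below is loosely modeled on what a keyword scanner does and is not pdfid's implementation; `count_pdf_keywords` is a hypothetical name. It normalizes `#xx` hex escapes in names (so `/J#53` counts as `/JS`) but does not look inside compressed streams.

```python
import re

KEYWORDS = ["/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch",
            "/EmbeddedFile", "/ObjStm", "/XFA", "/URI"]

_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")
# A PDF name runs from "/" up to the next whitespace or delimiter character.
_NAME = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")


def count_pdf_keywords(data: bytes) -> dict:
    """Count exact occurrences of the listed names, after #xx un-escaping."""
    counts = dict.fromkeys(KEYWORDS, 0)
    for raw in _NAME.findall(data):
        name = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
        key = name.decode("latin-1")
        if key in counts:
            counts[key] += 1
    return counts
```

A nonzero `/ObjStm` count with few other hits suggests decompressing first (for example with the `qpdf --qdf --object-streams=disable` command) and counting again.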
Useful helpers
peepdf -i sample.pdf           # interactive console: JS / suspicious objects
mutool show sample.pdf xref    # inspect xref / objects
mutool extract sample.pdf      # dump embedded images / fonts
pdfdetach -saveall sample.pdf  # extract embedded files (-list to enumerate)
qpdf --qdf --object-streams=disable sample.pdf out.pdf
js-beautify dumped.js
03 Office Documents
Recognize the format
OLE / legacy
- .doc / .xls / .ppt
- compound binary format
- use oletools / oledump heavily
OOXML
- .docx / .xlsx / .pptx
- actually ZIP containers
- inspect with unzip -l or 7z l
RTF
- plain-ish text container
- can embed OLE objects / exploit blobs (extract with rtfobj from oletools)
Core Office workflow
oleid sample.doc
olevba --decode sample.doc
oledump.py sample.doc
mraptor sample.doc
msodde sample.doc
For OOXML:
unzip -l sample.docx
7z x sample.docx -oout
find out -type f | sort
High-value indicators
- AutoOpen , Document_Open , Workbook_Open
- Shell , CreateObject , WScript.Shell
- URLDownloadToFile , XMLHTTP , WinHttp
- PowerShell , cmd , mshta , rundll32
- DDE fields / remote templates
- suspicious relationships in OOXML
- embedded OLE packages / attachments
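Once macro source has been dumped (for example from `olevba` output), the keyword groups above can be checked in one pass. This is a coarse sketch: `scan_macro_source` and the category names are hypothetical, and plain case-insensitive substring matching will produce false positives (for instance `Shell` also matches inside `powershell`).

```python
# Keyword groups taken from the indicator list above; grouping is an assumption.
MACRO_INDICATORS = {
    "auto_exec": ["AutoOpen", "Document_Open", "Workbook_Open"],
    "execution": ["Shell", "CreateObject", "WScript.Shell",
                  "powershell", "cmd", "mshta", "rundll32"],
    "download": ["URLDownloadToFile", "XMLHTTP", "WinHttp"],
}


def scan_macro_source(source: str) -> dict:
    """Return {category: [keywords found]} using case-insensitive substrings."""
    lowered = source.lower()
    hits = {}
    for category, words in MACRO_INDICATORS.items():
        found = [w for w in words if w.lower() in lowered]
        if found:
            hits[category] = found
    return hits
```

An `auto_exec` hit combined with either `execution` or `download` is the pattern worth escalating first.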
04 OOXML Internals
Files worth checking
[Content_Types].xml
_rels/.rels
word/_rels/document.xml.rels
word/document.xml
word/vbaProject.bin
word/embeddings/
docProps/
Look for:
- remote templates
- weird external relationships
- embedded objects
- vbaProject.bin presence
Quick commands
unzip -l sample.docx
unzip -p sample.docx word/_rels/document.xml.rels
unzip -p sample.docx word/document.xml | head
find out -iname '*.rels' -o -iname '*.xml' | sort
grep -RniE 'http|template|oleObject|external' out/
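The same checks can be done without unpacking to disk. The sketch below reads every `.rels` part and reports relationships marked `TargetMode="External"`, which is how remote templates are referenced. Function names are hypothetical; the standard-library XML parser is used for brevity, and a hardened parser would be a safer choice for hostile input.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_relationships(ooxml_bytes: bytes) -> list:
    """Return (rels part, relationship type, target) for external targets."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
                    findings.append((name, rel_type, rel.get("Target", "")))
    return findings


def has_vba_project(ooxml_bytes: bytes) -> bool:
    """True if any part of the container is named vbaProject.bin."""
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as zf:
        return any(n.lower().endswith("vbaproject.bin") for n in zf.namelist())
```

Ordinary hyperlinks also appear as external relationships, so the relationship type (for example `attachedTemplate` or `oleObject`) matters more than the mere presence of a hit.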
05 Encoded PowerShell & Payload Hunting
Common patterns
powershell -enc
powershell -encodedcommand
FromBase64String
IEX
DownloadString
Invoke-WebRequest
Net.WebClient
Search:
strings -a file | grep -iE 'powershell|encodedcommand|frombase64string|iex'
strings -el file | grep -iE 'powershell|encodedcommand|frombase64string|iex'
Decode workflow
# Normal Base64
echo 'BASE64' | base64 -d
# PowerShell -enc takes Base64 of UTF-16LE text
echo 'BASE64' | base64 -d | iconv -f UTF-16LE -t UTF-8
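# Gzip / zlib after Base64 is also common
python3 - <<'PY'
import base64, gzip, zlib
blob = base64.b64decode('...')  # paste the Base64 string here
try:
    print(gzip.decompress(blob).decode('utf-8', 'replace'))
except OSError:  # not a gzip stream: fall back to zlib
    print(zlib.decompress(blob).decode('utf-8', 'replace'))
PY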
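The `base64 -d | iconv` step for `-enc` arguments can be wrapped in a small function for use inside larger scripts. A minimal sketch, assuming the argument has already been cut out of the command line; `decode_encoded_command` is a hypothetical name.

```python
import base64


def decode_encoded_command(b64_text: str) -> str:
    """Decode a PowerShell -EncodedCommand argument (Base64 of UTF-16LE)."""
    cleaned = "".join(b64_text.split())      # drop line breaks and spaces
    cleaned += "=" * (-len(cleaned) % 4)     # restore stripped padding
    raw = base64.b64decode(cleaned, validate=True)
    return raw.decode("utf-16-le", errors="replace")
```

`validate=True` makes the call raise on characters outside the Base64 alphabet instead of silently discarding them, which helps spot truncated or mangled blobs.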
Base64 hunting
grep -aEo '[A-Za-z0-9+/=]{20,}' sample
strings -a sample | grep -E '^[A-Za-z0-9+/=]{20,}$'
binwalk -e sample
foremost -i sample -o out/
For short noisy dumps:
- try packet / line reconstruction first
- then test Base64 candidates
- then try UTF-16LE decode
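The candidate-testing steps above can be sketched as a single pass: find long Base64-looking runs, keep those that decode cleanly, then try plain text first and UTF-16LE second. Names and thresholds here (`hunt_base64`, the 20-character minimum, the 90% printable ratio) are assumptions for illustration.

```python
import base64
import binascii
import re

B64_RE = re.compile(rb"[A-Za-z0-9+/]{20,}={0,2}")


def looks_like_text(raw: bytes) -> bool:
    """True if more than 90% of the bytes are printable ASCII or whitespace."""
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in raw)
    return bool(raw) and printable / len(raw) > 0.9


def hunt_base64(data: bytes) -> list:
    """Return (offset, decoded text) for Base64 runs that decode to text."""
    results = []
    for match in B64_RE.finditer(data):
        token = match.group()
        if len(token) % 4:
            continue  # misaligned run; a fuller tool would try other offsets
        try:
            raw = base64.b64decode(token, validate=True)
        except binascii.Error:
            continue
        if looks_like_text(raw):
            results.append((match.start(), raw.decode("ascii")))
        elif len(raw) % 2 == 0 and not any(raw[1::2]) and looks_like_text(raw[0::2]):
            # Every second byte is zero: ASCII-range text stored as UTF-16LE.
            results.append((match.start(), raw.decode("utf-16-le")))
    return results
```

Binary payloads (gzip, PE) are intentionally skipped by this text filter; those are better caught by checking the decoded bytes for magic numbers.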
06 Practical Workflow
One complete workflow
1. sha256sum / file / trid / exiftool
2. strings -a and strings -el
3. Decide:
- PDF → pdfid.py + pdf-parser.py
- Office → oleid + olevba + oledump.py
- OOXML → unzip and inspect .rels , XML, vbaProject.bin
4. Search for:
- PowerShell
- Base64 blobs
- URLs / domains
- embedded objects / files
5. Extract suspicious content
6. Decode / beautify / rescan extracted payloads
7. Only then consider dynamic execution in an isolated lab