BIPI

Data Exfiltration Investigation: Following the Bytes Out

Cybersecurity

Modern exfiltration rarely uses obvious channels. We walk through the investigation flow across netflow, proxy, DNS, and cloud DLP signals to reconstruct what left and where it went.

By Arjun Raghavan, Security & Systems Lead, BIPI · March 23, 2024 · 9 min read

#data-exfiltration #dlp #investigation

Exfil is the part of the breach that turns a contained intrusion into a regulator notification. By the time you confirm exfil, you are negotiating with lawyers about what the disclosure says. The investigation flow below is what we run when the first exfil signal lights up.

Where the signals come from

Exfil rarely announces itself. Detection usually comes from one of four sources: netflow showing an anomalous transfer volume to an external IP, proxy logs showing a host posting to a file-sharing service it never used before, DNS logs showing high-entropy subdomains (the DNS tunneling signature), or cloud DLP firing on a download volume that breaks the baseline.

  • 73 days: median dwell time before exfil detection in 2023 cases we worked
  • 4 hours: median exfil window once the attacker decides to leave
  • 5 channels: average number of attempted exfil paths in a single intrusion

Netflow: the volume signal

Pull netflow for the suspect host across the last 90 days. Aggregate by destination IP, destination port, and total bytes. The clean pattern is repetitive traffic to known services (your SaaS, your update servers, your DNS). The bad pattern is a spike to a single external IP that does not appear elsewhere in your fleet. The destination is often a residential ASN or a cheap VPS provider. Reverse-lookup the IP, check ASN reputation, and pivot to whether other hosts also talked to it.
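A minimal sketch of that aggregation, assuming the flows have already been exported as dicts and filtered to outbound traffic toward external addresses. The field names (src_ip, dst_ip, dst_port, bytes), the helper names, and the 500 MB threshold are illustrative assumptions, not the schema of any particular collector.

```python
from collections import defaultdict

def summarize_flows(flows, suspect_host):
    """Total bytes per (dst_ip, dst_port) for one host, plus fleet spread per dst_ip."""
    totals = defaultdict(int)    # (dst_ip, dst_port) -> bytes sent by the suspect host
    talkers = defaultdict(set)   # dst_ip -> internal hosts that contacted it
    for flow in flows:
        talkers[flow["dst_ip"]].add(flow["src_ip"])
        if flow["src_ip"] == suspect_host:
            totals[(flow["dst_ip"], flow["dst_port"])] += flow["bytes"]
    rows = [
        {"dst_ip": ip, "dst_port": port, "bytes": total,
         "fleet_hosts": len(talkers[ip])}
        for (ip, port), total in totals.items()
    ]
    # Largest transfers first, so the spike is at the top of the list.
    return sorted(rows, key=lambda row: row["bytes"], reverse=True)

def single_host_spikes(rows, min_bytes=500_000_000):
    """Keep destinations above the volume threshold that only one host contacted."""
    return [r for r in rows if r["bytes"] >= min_bytes and r["fleet_hosts"] == 1]

# Hypothetical sample data using documentation-reserved IP ranges.
flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.1", "dst_port": 443, "bytes": 700_000_000},
    {"src_ip": "10.0.0.6", "dst_ip": "198.51.100.1", "dst_port": 443, "bytes": 1_200},
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.9", "dst_port": 443, "bytes": 600_000_000},
]
for row in single_host_spikes(summarize_flows(flows, "10.0.0.5")):
    print(row)
```

The fleet_hosts count is what drives the pivot: a destination with a large byte total that only one internal host ever contacted is the one to take to ASN and reputation lookups first.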

Proxy and TLS inspection: the destination signal

If you have TLS inspection on egress, the proxy sees the decrypted HTTP request, including the Host header, even though the traffic is encrypted on the wire. Look for uploads (POST or PUT requests) to file-sharing services (filebin.net, anonfiles, mega.nz, transfer.sh, pCloud), pastebin clones, GitHub gists, and unusual cloud storage destinations. The 2023 trend was attackers using cloud storage under accounts they control: an S3 bucket, a Backblaze B2 bucket, a Wasabi bucket. The traffic looks legitimate because it is going to a major cloud provider, but the bucket is the attacker's.

  • Filter proxy logs for the suspect host, last 30 days minimum
  • Aggregate by destination domain and total upload bytes
  • Look for first-time-seen domains, especially file-sharing or cloud storage
  • Identify any User-Agent that does not match installed software (curl, wget, rclone)
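The checklist above can be sketched roughly as follows, assuming proxy records have been parsed into dicts with ts (a datetime), src, domain, bytes_out, and user_agent. Those field names and the triage_proxy helper are hypothetical. The User-Agent check here only matches the three tool names from the list; comparing against an actual software inventory would need data this sketch does not have.

```python
from collections import defaultdict
from datetime import datetime

TRANSFER_TOOLS = ("curl", "wget", "rclone")

def triage_proxy(records, suspect_host, window_start):
    """Upload bytes per domain inside the window, flagged against the host's own history."""
    history = set()                  # domains the host used before the window
    uploads = defaultdict(int)       # domain -> upload bytes inside the window
    tool_agents = defaultdict(set)   # domain -> transfer-tool User-Agents seen
    for rec in records:
        if rec["src"] != suspect_host:
            continue
        if rec["ts"] < window_start:
            history.add(rec["domain"])
            continue
        uploads[rec["domain"]] += rec["bytes_out"]
        if any(tool in rec["user_agent"].lower() for tool in TRANSFER_TOOLS):
            tool_agents[rec["domain"]].add(rec["user_agent"])
    rows = [
        {"domain": domain, "bytes_out": total,
         "first_seen": domain not in history,
         "tool_agents": sorted(tool_agents[domain])}
        for domain, total in uploads.items()
    ]
    return sorted(rows, key=lambda row: row["bytes_out"], reverse=True)

# Hypothetical records; example.com and example.net stand in for real destinations.
records = [
    {"ts": datetime(2024, 1, 5), "src": "10.0.0.5", "domain": "example.com",
     "bytes_out": 4_000, "user_agent": "Mozilla/5.0"},
    {"ts": datetime(2024, 3, 1), "src": "10.0.0.5", "domain": "example.com",
     "bytes_out": 9_000, "user_agent": "Mozilla/5.0"},
    {"ts": datetime(2024, 3, 2), "src": "10.0.0.5", "domain": "example.net",
     "bytes_out": 2_500_000_000, "user_agent": "rclone/v1.65.0"},
]
for row in triage_proxy(records, "10.0.0.5", datetime(2024, 2, 15)):
    print(row)
```

Records older than window_start only feed the history set, so the window should be shorter than the log retention you pull. Otherwise every domain looks first-seen.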

DNS: the tunneling and beacon signal

DNS is the channel everyone forgets to log. Modern exfil over DNS uses high-entropy subdomains to encode the payload, with queries to a domain controlled by the attacker. The signature is a high query volume to a single second-level domain, with subdomains that look like base64 or hex strings. Tools like RITA from Active Countermeasures, which analyzes Zeek logs, or the open-source DNS analysis in Zeek surface this directly. Beacon detection is the same principle applied to TLS connections: connections at regular intervals to the same destination with low jitter.
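Both signatures can be approximated in a few lines. The sketch below is not how RITA or Zeek implement their scoring. It is a simplified illustration that assumes you have a list of queried names and, for beaconing, a list of connection timestamps in seconds. The thresholds are placeholders to tune, and the last-two-labels split is a naive stand-in for a proper Public Suffix List lookup (it mishandles names under suffixes such as co.uk).

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev

def shannon_entropy(text):
    """Bits per character; random hex approaches 4, 'www' scores 0."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())

def split_name(qname):
    """Naive split into (subdomain part, last two labels)."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[:-2]), ".".join(labels[-2:])

def tunneling_candidates(qnames, min_unique=50, min_entropy=3.5):
    """Base domains with many distinct subdomains whose mean entropy is high."""
    subdomains = defaultdict(set)
    for qname in qnames:
        sub, base = split_name(qname)
        if sub:
            subdomains[base].add(sub)
    hits = []
    for base, names in subdomains.items():
        avg = mean(shannon_entropy(name.replace(".", "")) for name in names)
        if len(names) >= min_unique and avg >= min_entropy:
            hits.append((base, len(names), round(avg, 2)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

def interval_jitter(timestamps):
    """Coefficient of variation of the gaps between events; near 0 means very regular."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough data to judge regularity
    return pstdev(gaps) / mean(gaps)

# Hypothetical usage: a few ordinary lookups and one perfectly regular beacon.
print(tunneling_candidates(["www.example.com", "mail.example.com"]))
print(interval_jitter([0, 60, 120, 180, 240]))
```

Short labels cap out at low entropy no matter how random they are, so the unique-subdomain count does as much work as the entropy threshold. A jitter value near zero over many connections is the beacon signal. Attackers who add deliberate jitter will score higher, so treat the number as a ranking aid rather than a verdict.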

Cloud DLP and storage logs

When the data lives in cloud storage, the exfil happens in cloud-native ways. CloudTrail will show GetObject calls in volumes you do not expect, provided S3 data event logging is enabled on the trail. Google Workspace audit logs show massive Drive downloads. Box, Dropbox Business, and OneDrive Business all expose audit APIs that surface unusual download patterns. The 2023 pattern we saw repeatedly was an attacker with stolen credentials using a legitimate sync client to pull Drive contents to their own machine, then exfilling at a leisurely pace from there.
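For the CloudTrail case, a rough sketch of the volume check might look like this. It assumes the records have already been loaded from the trail's JSON log files into dicts, and it reads bytesTransferredOut from additionalEventData defensively in case the field is absent. The baseline dictionary, the factor of 10, and the helper names are illustrative assumptions.

```python
from collections import defaultdict

def getobject_volume(records):
    """Per (principal, bucket): GetObject call count and bytes out."""
    stats = defaultdict(lambda: {"calls": 0, "bytes_out": 0})
    for rec in records:
        if rec.get("eventSource") != "s3.amazonaws.com":
            continue
        if rec.get("eventName") != "GetObject":
            continue
        identity = rec.get("userIdentity") or {}
        principal = identity.get("arn") or identity.get("principalId", "unknown")
        bucket = (rec.get("requestParameters") or {}).get("bucketName", "unknown")
        extra = rec.get("additionalEventData") or {}
        entry = stats[(principal, bucket)]
        entry["calls"] += 1
        entry["bytes_out"] += int(extra.get("bytesTransferredOut", 0))
    return dict(stats)

def over_baseline(stats, baseline_calls, factor=10):
    """Pairs whose call count exceeds factor x baseline; unseen pairs always qualify."""
    return [
        pair for pair, entry in stats.items()
        if entry["calls"] > factor * baseline_calls.get(pair, 0)
    ]

# Hypothetical record with a placeholder account ID and bucket name.
records = [
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example-user"},
     "requestParameters": {"bucketName": "example-bucket", "key": "reports/q1.csv"},
     "additionalEventData": {"bytesTransferredOut": 1048576}},
]
stats = getobject_volume(records)
print(over_baseline(stats, baseline_calls={}))
```

A principal and bucket pair with no baseline at all is flagged on its first call, which matches the stolen-credential pattern: the identity is valid, but it has never read from that bucket before.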

Real channels from cases worked

  1. rclone to attacker-owned S3 bucket (most common in 2023, by a margin)
  2. GitHub gists with secret URLs (attacker uploads, downloads from a different IP)
  3. Telegram bot API uploading file fragments as bot messages (slow, but auditless on victim side)
  4. DNS TXT queries to attacker domain (low volume but invisible without DNS logging)
  5. Discord webhook posts (small chunks, look like legitimate app traffic)
  6. Anonymous FTP to a VPS that gets reaped a week later

Reconstructing what left

The hardest part of an exfil investigation is proving what left, not just that something left. File access logs help, host EDR with full process command-line logging helps, and network capture during the exfil window helps the most. Pull file system access timestamps, correlate them with the exfil window, and build a list of files that were accessed in the right window by the right process. That list is your best evidence of scope for the regulatory filing.
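The correlation step can be sketched as a simple filter, assuming file access events have been exported from EDR or audit logs as dicts with ts (a datetime), path, and process. The schema, the files_in_scope helper, and the sample paths are hypothetical.

```python
from datetime import datetime

def files_in_scope(events, window_start, window_end, processes):
    """Files touched inside the exfil window by the named processes, earliest access first."""
    wanted = {name.lower() for name in processes}
    first_access = {}  # path -> earliest in-window access time
    for event in events:
        if not (window_start <= event["ts"] <= window_end):
            continue
        if event["process"].lower() not in wanted:
            continue
        seen = first_access.get(event["path"])
        if seen is None or event["ts"] < seen:
            first_access[event["path"]] = event["ts"]
    return sorted(first_access.items(), key=lambda item: item[1])

# Hypothetical events around a window identified from the network signals.
events = [
    {"ts": datetime(2024, 3, 2, 1, 10), "path": "/srv/share/customers.csv", "process": "rclone"},
    {"ts": datetime(2024, 3, 2, 1, 12), "path": "/srv/share/customers.csv", "process": "rclone"},
    {"ts": datetime(2024, 3, 2, 1, 15), "path": "/srv/share/notes.txt", "process": "backup-agent"},
    {"ts": datetime(2024, 3, 1, 9, 0), "path": "/srv/share/old.csv", "process": "rclone"},
]
scope = files_in_scope(
    events, datetime(2024, 3, 2, 1, 0), datetime(2024, 3, 2, 5, 0), ["rclone"]
)
for path, first_seen in scope:
    print(first_seen.isoformat(), path)
```

Matching on process name alone is weak if the attacker renamed the binary. Where command-line or hash data is available, filtering on that instead gives a more defensible list.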

When you say 'we believe up to X records were accessed', the precision of X is the difference between a 50,000 row notification and a 5 million row notification. Reconstruct precisely.

Containment without alerting the attacker

If you block the exfil destination at the perimeter, the attacker knows. They may rotate to a backup channel, accelerate, or burn the access. The judgment call is whether to block and accept the rotation risk, or monitor and let exfil continue while you build the scoping picture. We lean toward block-and-accept when the data is sensitive enough that any additional exfil is unacceptable. We lean toward monitor when the bytes leaving are tolerable and the attribution picture is still being built.
