AUG 07 2025 // 3 min read

Post-Mortem: Incident Report 08/07/2025

Tags
  • post-mortem
  • incident-report

A post-mortem report on a Sandbox Archival and Creation Latency Incident. If you have any questions about the incident, please reach out to us via Slack.

Impact

  • Date: August 7, 2025

  • Duration: ~22 hours (5:22 AM CET / 8:22 PM PST to 3:31 AM CET / 6:31 PM PST)

  • Sandbox Creation: 2.3% of creation attempts failed; the remainder succeeded but with significantly elevated latency

  • Archival Failures: 6,389 sandboxes failed to archive (all of them will be retried)

  • Customer Impact: 4 customers with 15 sandboxes experienced brief downtime due to runner restarts. Other customers might have experienced delayed sandbox creation times or archival failures.

Note: All times are listed in Central European Time (CET) with Pacific Standard Time (PST) conversions, as our engineering team handling this incident is currently based in Europe.

Initial Detection and Mitigation

5:22 AM CET (8:22 PM PST) - A user reached out reporting that disk usage was piling up against their organization's quota because sandboxes could not be archived. Key context: archived sandboxes do not count towards the organization quota; only stopped sandboxes do.

7:50 AM CET (10:50 PM PST) - Incident acknowledged by one of our engineers.

7:51 AM CET (10:51 PM PST) - The user's disk quota was increased to mitigate sandbox creation blockages while we investigated.

7:52 AM CET (10:52 PM PST) - A query by one of our engineers revealed more than 6,000 sandbox archival failures within the previous 12 hours, with the number still climbing. The CTO was woken up along with the rest of the engineering team soon after.

Investigation and Initial Response

8:28 AM CET (11:28 PM PST) - The cause was suspected to be an unusually high number of archival operations. The team started investigating and mitigating the issue.

8:45 AM CET (11:45 PM PST) - A hotfix was deployed to relieve pressure on the sandbox archival queue and reduce stress on our runners.

8:50 AM CET (11:50 PM PST) - We observed that the queue success rate went from 0% to 78%.

8:58 AM CET (11:58 PM PST) - First reports of slower sandbox creation times started to come in.

9:00 AM CET (12:00 AM PST) - Confirmed slower sandbox creation times for non-default snapshots.

9:01 AM CET (12:01 AM PST) - Traces revealed that API calls to our runners were hanging, which is what caused the timeouts when creating sandboxes. Efforts were focused on resolving the archival backpressure, since that was believed to be the cause of runner operations slowing down. For context: archiving a sandbox involves committing the sandbox's diff layer and pushing it to our internal registry. Because this operation can be heavy, our policy is to archive at most 6 sandboxes per runner at a time. It is also worth noting that sandboxes with a completed backup are archived immediately, since their latest backup is already in our registry.
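
For illustration, here is a minimal sketch of what that archival path roughly amounts to at the Docker level. The container name and registry address are placeholders, and the CLI stands in for whatever Docker interface the runner actually uses:

# Commit the sandbox's writable diff layer as an image, without pausing the container
docker commit --pause=false sandbox-abc123 registry.internal.example/sandboxes/abc123:archive

# Push the committed image to the internal registry; this is the heavy, I/O- and
# network-bound step that the 6-archivals-per-runner limit is meant to throttle
docker push registry.internal.example/sandboxes/abc123:archive

# Once the push succeeds, the local copy can be removed to free disk space
docker rmi registry.internal.example/sandboxes/abc123:archive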

Escalation to Storage System Issues

9:05 AM CET (12:05 AM PST) - A high number of 504 errors were logged on our control plane instances, all related to our internal registry: checking whether snapshots exist, deleting tags, and pulling and pushing snapshots.

9:06 AM CET (12:06 AM PST) - Most archival and backup operations were aborted to lower the load.

9:14 AM CET (12:14 AM PST) - Our DevOps team began inspecting the load on affected Storage instances.

9:21 AM CET (12:21 AM PST) - The DevOps team reported more than 5x the usual number of connections and traffic to all Storage instances, which caused a huge bottleneck in the DB and temp storage services.

Storage System Scaling and Recovery Attempts

9:49 AM CET (12:49 AM PST) - The DevOps team reported that it had doubled the compute capacity of the affected Storage instances.

9:50 AM CET (12:50 AM PST) - Control plane hosts were reset to reduce the number of connections to our Storage System.

10:21 AM CET (1:21 AM PST) - Runner instances were also restarted to reset all connections to our Storage system. Storage got a breather and normal archival operations resumed.

1:50 PM CET (4:50 AM PST) - The 504s returned, and DevOps noticed that the bottleneck was now on the DB attached to our Storage.

1:58 PM CET (4:58 AM PST) - Storage DB was scaled 5x.

Inspecting Docker Runner Instances

2:36 PM CET (5:36 AM PST) - Sandbox creation was still slow and a thorough investigation began.

3:14 PM CET (6:14 AM PST) - Issue was identified as bottlenecks in Docker on runner instances.

3:16 PM CET (6:16 AM PST) - Docker on 2 runner instances crashed. The runners were quickly set as unschedulable to prevent sandboxes from being created there while we get Docker back up.

3:18 PM CET (6:18 AM PST) - Docker was booted back up and sandbox operations resumed normally on the 2 runners.

3:57 PM CET (6:57 AM PST) - The Storage was reconfigured to allow for 10x more connections and rebooted.

Deep Dive into Docker Issues

6:39 PM CET (9:39 AM PST) - 504s began piling up again and creation times slowed down. A deep investigation into individual runner hosts began.

7:39 PM CET (10:39 AM PST) - The issue was believed to be corruption in Docker's overlay filesystem caused by the high volume of I/O from concurrent sandbox backups. This was believed to leave a high number of connections open when pushing backups.

7:41 PM CET (10:41 AM PST) - Customers that had a sandbox on one of the affected runners were notified that the runner would be restarted and that brief downtime was to be expected. This affected only 15 sandboxes and 4 customers in total.

7:54 PM CET (10:54 AM PST) - Customers were notified that the runner had been restarted and that their sandboxes were functioning normally.

8:12 PM CET (11:12 AM PST) - The runner that was restarted began to slow down operations again.

8:13 PM CET (11:13 AM PST) - The following logs in Docker's service were noticed:

Aug 07 22:49:53 h1146 dockerd[2101]: time="2025-08-07T22:49:53.582331275Z" level=error msg="Can't add file /var/lib/docker/overlay2/552bb8c006b1594ec7e21c27e75247527b385bb6cba76eb53eea68b232b23a5d/diff/run/dbus/system_bus_socket to tar: archive/tar: sockets not supported"
Aug 07 22:49:53 h1146 dockerd[2101]: time="2025-08-07T22:49:53.582640805Z" level=error msg="Can't add file /var/lib/docker/overlay2/552bb8c006b1594ec7e21c27e75247527b385bb6cba76eb53eea68b232b23a5d/diff/tmp/.ICE-unix/181 to tar: archive/tar: sockets not supported"
Aug 07 22:49:53 h1146 dockerd[2101]: time="2025-08-07T22:49:53.582901805Z" level=error msg="Can't add file /var/lib/docker/overlay2/552bb8c006b1594ec7e21c27e75247527b385bb6cba76eb53eea68b232b23a5d/diff/tmp/.X11-unix/X1 to tar: archive/tar: sockets not supported"
Aug 07 22:49:53 h1146 dockerd[2101]: time="2025-08-07T22:49:53.583097924Z" level=error msg="Can't add file /var/lib/docker/overlay2/552bb8c006b1594ec7e21c27e75247527b385bb6cba76eb53eea68b232b23a5d/diff/tmp/plugin700343199 to tar: archive/tar: sockets not supported"

After a quick investigation, we found that docker commit could end up including sockets in image layers, which could cause pushes to a remote repository to fail.

8:15 PM CET (11:15 AM PST) - A high number of layers on the runner were discovered to have sockets inside of them.

8:17 PM CET (11:17 AM PST) - It was concluded that docker push would hang if a layer containing sockets was pushed. This was keeping a high number of connections to our Storage instances alive.

Socket Investigation and Failsafe Implementation

9:33 PM CET (12:33 PM PST) - Investigation efforts were still focused on reproducing the sockets issue and finding a fix.

9:58 PM CET (12:58 PM PST) - All backup and archival operations were stopped on the control plane hosts.

10:28 PM CET (1:28 PM PST) - We found a way to list layers that contain socket files and began implementing a failsafe in our runner code.
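
For illustration, here is a sketch of one way such a listing can be done on a runner host, assuming Docker's default overlay2 storage driver and data root; it is not necessarily the exact check we implemented in the failsafe:

# Walk every overlay2 layer diff directory and report the ones that
# contain at least one Unix socket file (-type s)
for diff in /var/lib/docker/overlay2/*/diff; do
  if find "$diff" -type s -print -quit | grep -q .; then
    echo "layer with socket(s): ${diff%/diff}"
  fi
done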

11:15 PM CET (2:15 PM PST) - We deployed a hotfix with the failsafe to one of the runners.

11:17 PM CET (2:17 PM PST) - The failsafe appeared to have no effect.

Socket Theory Testing and Real Root Cause Discovery

11:22 PM CET (2:22 PM PST) - We reproduced a container with a few sockets inside as follows (a consolidated sketch of these steps appears after the list):

  • Ran docker run --rm -it --entrypoint bash python:3, then executed python -c "import socket as s; sock = s.socket(s.AF_UNIX); sock.bind('/tmp/test.sock')" inside it to create /tmp/test.sock

  • Additionally, we ran a gnupg agent inside that spawned a few sockets in the home folder

  • We committed the container (with --pause=false to mimic our runner)

  • We inspected the image and found that the overlay layer in fact did not have the socket files present

  • A scan through the Docker service logs revealed the same errors as before: the sockets could not be included in the layer
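
A consolidated, non-interactive sketch of that reproduction (the registry name is a placeholder, and the second socket below stands in for the ones gpg-agent left in the home directory):

# Start a throwaway container from the same image used above
docker run -d --name sock-repro --entrypoint sleep python:3 infinity

# Create Unix socket files inside it: one in /tmp, one in the home directory
docker exec sock-repro python -c "import socket as s; s.socket(s.AF_UNIX).bind('/tmp/test.sock'); s.socket(s.AF_UNIX).bind('/root/agent.sock')"

# Commit without pausing the container, as our runner does
docker commit --pause=false sock-repro registry.internal.example/sock-repro:test

# dockerd logs the "archive/tar: sockets not supported" errors and simply skips the
# sockets, so the committed layer ends up without any socket files in it
journalctl -u docker --no-pager | grep "sockets not supported" | tail

# In our test the push succeeded, which ruled out the socket theory
docker push registry.internal.example/sock-repro:test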

11:24 PM CET (2:24 PM PST) - We attempted to push the committed image. The push succeeded.

11:25 PM CET (2:25 PM PST) - We attempted to push an image of a sandbox that had previously made the Docker client hang indefinitely. The push succeeded.

11:27 PM CET (2:27 PM PST) - We concluded that we had overengineered the issue a bit. The problem was not Docker; it was indeed our load management. Our backup policy was as follows (a rough per-minute estimate follows the list):

  • Every 5 minutes, take 10 sandboxes that are started or being archived on a runner and create a backup for them

  • For example, with 100 runner instances, that is 100 × 10 = 1,000 backups created every 5 minutes

  • On top of that, our auto-archive interval archives sandboxes that have been stopped for X minutes. Auto-archives admittedly had a fairly strict queuing policy, but they still contributed significantly to the number of active backup processes
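
To put that in per-minute terms, using the figures from the policy above (a back-of-the-envelope sketch, not exact production numbers):

# Old policy: 10 backups per runner every 5 minutes, across ~100 runners
runners=100; per_runner=10; interval_min=5
echo "$(( runners * per_runner )) backup operations started every ${interval_min} minutes"
echo "$(( runners * per_runner / interval_min )) commit+push operations per minute hitting Storage, before auto-archives"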

Final Resolution and Image Cleanup

12:02 AM CET (3:02 PM PST) - We did a quick scan of our runners and noticed that more than 60% of them had more than 20,000 images stored locally (each). Not only was our backup policy overwhelming; our image removal protocol was broken as well.

12:03 AM CET (3:03 PM PST) - We began procedures to clean orphaned images off of runner hosts.
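
For illustration, this is the kind of cleanup involved on a runner host, sketched with standard Docker commands rather than our exact procedure (docker image prune only removes images not referenced by any container, so images backing active sandboxes are left alone):

# How many images are stored locally on this runner?
docker image ls -q | wc -l

# Remove images that no container references and that are older than 24 hours
# (the age filter is an illustrative safety margin, not our actual setting)
docker image prune --all --force --filter "until=24h"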

12:14 AM CET (3:14 PM PST) - Tested the procedure on one runner instance, which started breathing again.

1:20 AM CET (4:20 PM PST) - The first batch of runners was done and breathing.

1:58 AM CET (4:58 PM PST) - The second batch of runners was done and breathing.

2:37 AM CET (5:37 PM PST) - The third batch of runners was done and breathing.

3:20 AM CET (6:20 PM PST) - The fourth and final batch of runners was done and breathing.

3:31 AM CET (6:31 PM PST) - Full resolution: control plane hosts were operating normally and sandbox creation times were back to normal.

Root Cause Analysis

Primary Cause

An overwhelming backup policy that generated unsustainable load on our infrastructure:

  • Backup frequency: 10 sandboxes backed up every 5 minutes per runner

  • Scale impact: 100+ runners × 10 sandboxes = 1,000+ backup operations every 5 minutes

  • Auto-archive addition: Automatic archiving of stopped sandboxes added significant additional load

Contributing Factors

1. Broken image cleanup protocol: More than 60% of runners had accumulated over 20,000 images stored locally (each)

  2. Storage system bottlenecks: database and connection limits were exceeded under more than 5x normal traffic

3. Inadequate monitoring: Missing alerts for critical failure modes

What We Learned

The incident lasted about 22 hours, during which every available Daytona engineer was working on the case. During those 22 hours, 2.3% of sandbox creations were dropped; the rest were created correctly, although with much higher latency than usual. 6,389 sandboxes initially failed to archive, but they will be retried and should transition to their archived state normally.

Our monitoring and alerts system failed us

We hadn't implemented the necessary monitoring tools for the following components:

  • Number of errored sandboxes

  • Number of error logs on control plane hosts

  • Number of non-200 responses on our Storage instances

  • Increases in request duration

We shouldn't hesitate to pause certain parts of the system to unblock others

Our hesitancy to disable archival operations significantly prolonged the period of slow sandbox creation times.

Don't overengineer issues

If it looks like a system overload, it probably is a system overload: Docker's overlay filesystem did not suddenly corrupt itself on all of our runners at the same time. Take a step back after 12 hours of debugging and you'll figure it out.

Action Items

  1. [P0] Implement comprehensive monitoring - Due: August 17, 2025

  2. [P0] Redesign backup policy with rate limiting - Due: August 21, 2025

  3. [P0] Deploy automated image cleanup system - Due: August 21, 2025

  4. [P1] Create incident response runbooks - Due: August 28, 2025

  5. [P1] Add system circuit breakers - Due: September 4, 2025

This incident, while challenging, provided valuable insights into our system's scaling characteristics and operational blind spots. We will keep our users posted on the action items listed above and on how we plan to prevent similar incidents in the future.
