Years ago I built an application called Dashman. Customers used it to show web dashboards on screens around an office, a factory, a classroom, or a public space: Google Analytics, internal Grafana, shop-floor counters, org announcements, whatever they cared about as long as it was on web pages. This post is a retrospective on how that system was put together: the architecture, the handling of rendering jobs, and the end-to-end encryption that applied least privilege to every machine in the system, so that a compromise of the server or the public computers would expose only the data that machine needed to do its job, and nothing else.

    The naive approach, and why it fails

    The naive version of the job is easy to state: log into the websites and display them. One machine with a browser could, in principle, do all of it.

    That machine would hold everything: URLs, cookies, and obviously the rendered pages. In many offices the display sat in a lobby where anyone could walk up to it. In plenty of deployments, all you needed to compromise that computer was a USB keyboard. Plug it in and you were inside a computer that still had the customer’s session cookies in memory or on disk. The attack was simple.

    So the first move is architectural, separating rendering from displaying. Rendering needs cookies and a real browser. Displaying only needs the pixels, the rendered image. The machine under the public screen should not be the machine that logged into Google Analytics.

    The second move is end-to-end encryption, making it cryptographically infeasible to read the cookies anywhere they weren’t needed. That meant excluding both the Displayer and me as the operator, so my customers didn’t have to trust me with their precious cookies. The encryption was the part that I had the most fun with.

    The architecture

    The hosted side had several components, and the customer side had three applications. Together, the whole stack was a heterogeneous distributed system that looked like this:

    Configurator, Renderer, and Displayer on the customer side; Server, PostgreSQL, S3, and PubNub on the hosted side. Screenshots flow directly between clients and S3; the Server hands out presigned URLs.

    • Configurator: a desktop app the customer ran on an administrator’s machine. They logged into dashboard sources here (cookies originated here), chose what to display and where, and approved Renderers and Displayers.
    • Renderer: ran on a customer-controlled machine inside their network. It loaded configured URLs in a built-in browser and snapshotted them. The customer chose where this ran, such as a machine in a data center.
    • Displayer: ran on machines in public places, connected to the screens. It asked for screenshots at its resolution and showed whatever it got back.
    • Server: a stateless REST API that all the client components talked to.
    • PostgreSQL: durable store for accounts, sites, screenshot metadata, and the render queue.
    • S3: encrypted screenshots at rest.
    • PubNub: push channel. The Server published; Renderers and Displayers subscribed.

    A small deployment could collapse Configurator, Renderer, and Displayer onto one physical machine. Larger ones spread many Displayers across screens, ran several Renderers for capacity and redundancy, and kept a few Configurators on the administrators’ laptops.

    Cookies and two keys

    Everything sensitive had to be end-to-end encrypted. Cookies were the most critical piece of information because with them you could just log into all the websites. So I needed one key, which we’ll call the master key, to encrypt them end-to-end. So far so good, not that hard. With that key in place, the cookies on my hosted database were encrypted blobs that meant nothing to me.

    The screenshots that the Renderers generated also needed to be encrypted so that I couldn’t see them. But I couldn’t use the master key for this task because the Displayers couldn’t ever have it. A compromised Displayer should not expose the key that could decrypt the cookies. That meant a second key, the displayer key.

    This table might help:

    Item | Configurator | Renderer | Displayer | Backend
    master key | ✅ readable | ✅ readable | 🚫 absent | 🔒 encrypted
    cookies | ✅ readable | ✅ readable | 🚫 absent | 🔒 encrypted
    URLs | ✅ readable | ✅ readable | 🚫 absent | 🔒 encrypted
    displayer key | ✅ readable | ✅ readable | ✅ readable | 🔒 encrypted
    screenshots | ✅ readable | ✅ readable | ✅ readable | 🔒 encrypted

    This is just the beginning, though. Keeping these two keys encrypted on the Backend but distributed to Renderers and Displayers is where things get really interesting. More on that after a tour of the render loop.

    The render loop

    Dashman’s day-to-day work was a loop. A Displayer on a wall asked the Server for the best screenshot at its resolution. The Server tried the cache first: it picked among recently successful renders for that tenant, scoring candidates by how closely area and aspect ratio matched the request, by their freshness, and with some randomness.

    The goal there was that if you had 10 sites to display and 10 Displayers, you didn’t want to run 100 jobs when most of those would be pixel identical. But also, you didn’t want the 10 Displayers to show exactly the same thing in sync; variety was valuable.
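A minimal sketch of that selection, written in Python for brevity (Dashman itself ran on Java). The `Candidate` type, the multiplicative formula, and the `jitter` weight are assumptions made for illustration; the original scoring weights are not described here.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    width: int
    height: int
    age_seconds: float


def score(c, want_w, want_h, max_age_seconds, rng, jitter=0.1):
    """Higher is better. The formula and weights are illustrative only."""
    have_area, want_area = c.width * c.height, want_w * want_h
    # Area match: 1.0 when the areas are equal, smaller as they diverge.
    area = min(have_area, want_area) / max(have_area, want_area)
    # Aspect match: compare the width/height ratios the same way.
    have_ratio, want_ratio = c.width / c.height, want_w / want_h
    aspect = min(have_ratio, want_ratio) / max(have_ratio, want_ratio)
    # Freshness: 1.0 for a brand-new render, 0.0 at the edge of the window.
    fresh = max(0.0, 1.0 - c.age_seconds / max_age_seconds)
    # Small random term so Displayers do not all pick the same image.
    return area * aspect * fresh + rng.uniform(0.0, jitter)


def pick(candidates, want_w, want_h, max_age_seconds, rng=None):
    """Return the best cached render, or None if nothing is fresh enough."""
    rng = rng or random.Random()
    live = [c for c in candidates if c.age_seconds <= max_age_seconds]
    if not live:
        return None  # the caller would enqueue a render job instead
    return max(live, key=lambda c: score(c, want_w, want_h, max_age_seconds, rng))
```

With a small `jitter`, a close match still wins reliably, while near-ties between similar candidates are broken at random, which is where the variety across Displayers would come from.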

    If nothing in the freshness window qualified, the Server enqueued a render job, told the Displayer to wait, and woke up the Renderers to start working. PubNub handled the fast path: the Server published to the renderer channel when a job was enqueued and to the displayer channel when a screenshot was ready. Renderers also polled on a slower cadence as a fallback. The database was authoritative and PubNub was a hint about when to look.
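The "hint plus fallback poll" pattern can be sketched as a single wait loop. This is an illustration in Python, not Dashman's code: `wake` stands in for whatever the push callback would signal, and `check_database` is a hypothetical callable that asks the Server for work.

```python
import threading


def wait_for_work(wake, check_database, poll_interval_seconds):
    """Block until the database has work, then return it.

    `wake` is a threading.Event that a push-notification callback would set.
    `check_database` returns a job, or None when there is nothing to do.
    """
    while True:
        # Wake early on a push; otherwise fall through after the poll interval.
        wake.wait(timeout=poll_interval_seconds)
        # Clear before checking, so a push that arrives mid-check is not lost.
        wake.clear()
        job = check_database()  # the database decides, not the notification
        if job is not None:
            return job
```

A lost or duplicated notification only changes how soon the loop looks; it never changes what the loop finds.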

    After being woken up by a PubNub notification, all Renderers would try to claim a job. Claiming was an atomic row lock, and only one Renderer won. The order of processing was LIFO, not FIFO, because for displaying screenshots, freshness mattered. That meant a render job could be missed and never processed (under heavy load), but that was acceptable. If a Renderer crashed mid-render, the claim aged out and the row became available again. The claiming Renderer loaded the page, snapshotted it, uploaded the PNG to S3, and reported back. To reduce latency, the Server could auto-claim the next job at that point and assign it when responding to the Renderer. That way rendering jobs were chained together.
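The claim step can be sketched with SQLite so it runs anywhere. The table layout, the `LEASE_SECONDS` value, and the function names are assumptions; on PostgreSQL the same idea is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED`, and the exact query Dashman used is not shown here.

```python
import sqlite3

LEASE_SECONDS = 60  # assumed: how long a claim is honored before it ages out


def open_queue():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, site TEXT, "
        "enqueued_at REAL, claimed_by TEXT, claimed_at REAL)"
    )
    return db


def enqueue(db, site, now):
    db.execute("INSERT INTO jobs (site, enqueued_at) VALUES (?, ?)", (site, now))
    db.commit()


def claim(db, renderer, now):
    """Claim the newest available job (LIFO) and return (id, site), or None."""
    expired_before = now - LEASE_SECONDS
    with db:  # commit on success, roll back on error
        row = db.execute(
            "SELECT id, site FROM jobs "
            "WHERE claimed_by IS NULL OR claimed_at < ? "
            "ORDER BY enqueued_at DESC, id DESC LIMIT 1",
            (expired_before,),
        ).fetchone()
        if row is None:
            return None
        # Repeat the availability check in the UPDATE so a racing claim loses.
        cur = db.execute(
            "UPDATE jobs SET claimed_by = ?, claimed_at = ? "
            "WHERE id = ? AND (claimed_by IS NULL OR claimed_at < ?)",
            (renderer, now, row[0], expired_before),
        )
        return row if cur.rowcount == 1 else None
```

Ordering by `enqueued_at DESC` is what makes the queue LIFO, and the lease comparison is what lets a job claimed by a crashed Renderer become available again.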

    Each site had two knobs the Configurator set: how long to wait after load before snapshotting (charts and fonts need time to settle), and how long a Displayer showed that screenshot before rotating. Both were per-site because a quarterly KPI and a minute-by-minute load report required different amounts of freshness and possibly different amounts of time to render fully.

    After all that, the Server notified the Displayer; the Displayer pulled the file from S3 and showed it on screen.

    Sequence: at the top, the Configurator saves a configuration to the Server, which fans the change out through PubNub to Renderer and Displayer; below, the steady-state loop where the Displayer asks the Server, the Server enqueues and publishes, the Renderer claims and uploads to S3, and the Displayer fetches from S3.

    The sequence diagram above shows two flows: an admin saving a configuration at the top (initial setup and occasional changes), then one full pass of the steady-state render loop below.

    I had sketched an alternative push path using WebSockets and SQS. I shipped PubNub instead because it was less code and less operational complexity while finding product-market fit. At higher scale, the cost savings would have justified migrating.

    The render loop assumed each component could decrypt what it received and encrypt what it sent. What follows is the cryptography, in the order a real deployment came up: register user and tenant together, log in, change passwords, approve Renderers and Displayers, render with keys, revoke a decommissioned or compromised machine.

    Cryptography overview

    The user’s password was the root of trust. It never left the Configurator in plaintext. For authentication the Configurator ran SCrypt locally and sent the result to the Server, which ran SCrypt again to store an over-hashed password (a hash of a hash). Login worked the same way: the Configurator hashed the password first, and the Server over-hashed the result to compare it against the database. That split is generally a better way to do authentication than just sending a plaintext password, but for Dashman it was a must. You will see why below.

    Item | Configurator | Renderer | Displayer | Backend
    password | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | 🚫 absent
    hashed password | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | ⏱️ briefly in memory
    over-hashed password | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | ✅ readable

    During registration the Configurator created the two symmetric keys named above: master and displayer key. They were never stored or sent in plaintext to any hosted server.

    Item | Configurator | Renderer | Displayer | Backend
    master key | ✅ readable | ✅ readable | 🚫 absent | 🔒 encrypted
    displayer key | ✅ readable | ✅ readable | ✅ readable | 🔒 encrypted

    To pass those keys through the hosted stack, each user also generated a public/private key pair using elliptic-curve cryptography. The master and displayer keys were encrypted with that public key, and that encrypted payload was stored on the server. Only the matching private key could decrypt them.

    Item | Configurator | Renderer | Displayer | Backend
    user’s private key | ✅ readable | 🚫 absent | 🚫 absent | 🔒 encrypted
    user’s public key | ✅ readable | 🚫 absent | 🚫 absent | ✅ readable

    But now we had the problem of storing the user’s private key. We needed to encrypt that key and for that, the Configurator used PBKDF2 to create the password-key, a key derived from the same password the user entered to register or log in. Only someone who knew the password could re-derive that key on a new machine and recover the private key.

    Item | Configurator | Renderer | Displayer | Backend
    password-key | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | 🚫 absent

    If you are following along, the master/displayer keys get encrypted with the public key of the user, and the private key gets encrypted with the password-key. If you feel there’s one step too many there, you’d be right, until you see how Displayers and Renderers are enrolled. Hold on tight.

    This meant that to successfully log in and start operating the system from the Configurator app all you needed was a single password. But at the API level the Configurator needed two pieces of information: the hashed password (SCrypt, for authentication) and the password-key (PBKDF2, for decryption). Both were derived from the same plaintext password, but only the hashed password ever reached the Server.

    Now picture an attacker with persistent access to the Server: they can dump the database and watch incoming login traffic. The dump only yields the over-hashed password, which can’t be replayed to log in. Watching logins over time, however, lets them collect hashed passwords as users sign in, and that is the point where most apps would leak the plaintext password. In Dashman, the hashed password still couldn’t decrypt anything: that needed the password-key, which was derived from the plaintext password and never touched the Server. That is what kept a Server compromise from exposing cookies or URLs.

    Each Renderer and each Displayer also generated their own public/private key pair locally, same as the Configurator. Each also generated a random password, kept on that machine, for authenticating to the Server. Unlike with the Configurator, these private keys never left the machine; the Server received only the public key. The credential and the private key lived and died with the machine: unlike users who could log in from anywhere, identity for Renderers and Displayers was tied to a specific machine and not transferable.

    Item | Configurator | Renderer | Displayer | Backend
    a renderer’s password | 🚫 absent | ✅ readable | 🚫 absent | 🚫 absent
    a renderer’s private key | 🚫 absent | ✅ readable | 🚫 absent | 🚫 absent
    a renderer’s public key | ✅ readable | ✅ readable | 🚫 absent | ✅ readable
    a displayer’s password | 🚫 absent | 🚫 absent | ✅ readable | 🚫 absent
    a displayer’s private key | 🚫 absent | 🚫 absent | ✅ readable | 🚫 absent
    a displayer’s public key | ✅ readable | 🚫 absent | ✅ readable | ✅ readable

    The master key reached a Renderer during enrollment: the Configurator encrypted it with the Renderer’s public key and stored the result on the Server. The Renderer fetched that record and decrypted it with its own private key. That is the payoff: the user’s elliptic-curve key pair exists so the master and displayer keys can be re-wrapped for each new Renderer or Displayer using their public keys, without the user’s password ever touching another machine. The password-key alone couldn’t have done that, because it only exists on the Configurator.

    The master and displayer keys were each put in a key ring: the master key ring and the displayer key ring respectively. That was because whenever a Renderer or Displayer was removed, new keys were generated and added to the rings as the current keys. For some time after rotation, data encrypted with the previous keys still needed to be readable, so all parts of Dashman could decrypt both old and new information.

    With all of this in place, we achieved the goal of cookies and URLs only present in the Configurator and Renderer, and screenshots only readable by the customer but not me, completing the full table of all information:

    Item | Configurator | Renderer | Displayer | Backend
    cookies and URLs | ✅ readable | ✅ readable | 🚫 absent | 🔒 encrypted
    screenshots | 🚫 absent | ✅ readable | ✅ readable | 🔒 encrypted
    password | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | 🚫 absent
    hashed password | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | ⏱️ briefly in memory
    over-hashed password | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | ✅ readable
    master key | ✅ readable | ✅ readable | 🚫 absent | 🔒 encrypted
    displayer key | ✅ readable | ✅ readable | ✅ readable | 🔒 encrypted
    user’s private key | ✅ readable | 🚫 absent | 🚫 absent | 🔒 encrypted
    user’s public key | ✅ readable | 🚫 absent | 🚫 absent | ✅ readable
    password-key | ⏱️ briefly in memory | 🚫 absent | 🚫 absent | 🚫 absent
    a renderer’s password | 🚫 absent | ✅ readable | 🚫 absent | 🚫 absent
    a renderer’s private key | 🚫 absent | ✅ readable | 🚫 absent | 🚫 absent
    a renderer’s public key | ✅ readable | ✅ readable | 🚫 absent | ✅ readable
    a displayer’s password | 🚫 absent | 🚫 absent | ✅ readable | 🚫 absent
    a displayer’s private key | 🚫 absent | 🚫 absent | ✅ readable | 🚫 absent
    a displayer’s public key | ✅ readable | 🚫 absent | ✅ readable | ✅ readable

    Registration

    Registration is the first ceremony, the moment when everything the rest of Dashman depends on comes into existence: the authentication material the Server will compare against on every future login, the elliptic-curve key pair the user will carry across machines, the symmetric rings that protect cookies and screenshots, and a single root for all of it (the password). It is the longest of the ceremonies because it has to bootstrap every kind of material at once. After it, the Configurator can fetch and decrypt the user’s data from any machine that knows the password, and the Server is left holding only ciphertext.

    Registration: the user enters a password, the Configurator hashes it and posts it for authentication, the Server creates the account and over-hashes, then the Configurator generates the cryptographic estate locally and uploads only the encrypted artifacts.

    Hashing the password

    The user enters a name, an organization, an email, and a password. The password never leaves the Configurator in plaintext. Before it leaves the machine at all, the Configurator runs it through SCrypt configured with this spec:

    Parameter | Value
    Algorithm | SCrypt
    Cost (N) | 2^14 (16,384)
    Block size (r) | 8
    Parallelization (p) | 1
    Output length | 256 bits
    Salt | 256-bit random, unique per user

    SCrypt is a memory-hard password hashing function: each guess costs not only CPU time but also a sizeable block of RAM, which makes GPU and ASIC acceleration uneconomical for a brute-force attacker. In 2018 it was the strongest password hashing function available in Bouncy Castle, the cryptography library Dashman shipped on. The cost parameter is tunable: as hardware gets cheaper, it can be raised, and because the Configurator stores the spec alongside the result, future hashes of the same password are not stuck at today’s setting. The per-user 256-bit salt prevents precomputation across the user base.

    The hash and the spec that produced it travel together in the registration request. The plaintext password does not.

    Storing nothing the Server can replay

    The Server does not store what the Configurator just sent. It runs SCrypt one more time over the incoming hash, with a fresh server-side spec, and stores the result. Call that the over-hash: it is the SCrypt of an SCrypt.

    Hash-then-store on the Server is standard. The unusual half is client-side: by SCrypting before sending, the Configurator keeps the plaintext password off the wire entirely. An attacker watching login traffic only ever sees hashes, not the password itself, and in a system like Dashman, where the plaintext password is what unlocks the rest of the cryptographic material described below, that distinction is the whole game. The client’s spec is stored alongside the over-hash so any Configurator can reproduce the same inner hash from the password later.
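The two-stage hash can be sketched with Python's standard library, using the SCrypt parameters from the table above. The function names and the dictionary layout of the spec are assumptions for illustration; Dashman's actual implementation was Java on Bouncy Castle.

```python
import hashlib
import hmac
import secrets


def new_spec():
    """SCrypt parameters as listed above, with a fresh 256-bit salt."""
    return {"n": 2**14, "r": 8, "p": 1, "dklen": 32,
            "salt": secrets.token_bytes(32)}


def _scrypt(data, spec):
    return hashlib.scrypt(data, salt=spec["salt"], n=spec["n"],
                          r=spec["r"], p=spec["p"], dklen=spec["dklen"])


def client_hash(password, client_spec):
    # Runs on the Configurator; only this output crosses the network.
    return _scrypt(password.encode("utf-8"), client_spec)


def server_over_hash(hashed, server_spec):
    # Runs on the Server at registration; this is what gets persisted.
    return _scrypt(hashed, server_spec)


def server_verify(hashed, server_spec, stored_over_hash):
    # Recompute the over-hash and compare in constant time.
    return hmac.compare_digest(_scrypt(hashed, server_spec), stored_over_hash)
```

The Server would store `client_spec`, `server_spec`, and the over-hash. A database dump then yields a value that is one SCrypt away from anything the login endpoint accepts.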

    Once the Server has committed the over-hash and created the account and tenant rows, it returns enough about the new user for the Configurator to continue. The Configurator now has an account on the Server, but the Server is not yet holding anything sensitive: no cookies, no rings, not even a public key.

    Generating the cryptographic estate

    Everything that protects sensitive data is built next, and it all happens on the Configurator before anything is sent back.

    First, the Configurator generates an elliptic-curve key pair on secp521r1. The curve choice is deliberate. secp521r1 is the largest of the standard NIST curves, and at the volumes Dashman ever expected (small numbers of agents per tenant, low frequency of cryptographic operations) the runtime cost of the larger curve is invisible. The Server will hold the public key and use it later to wrap material for the user; the private key will be the only thing on the network capable of unwrapping that material, and the Server will never see it without a layer of encryption around it.

    Next, it derives a 256-bit symmetric key from the password using PBKDF2 with this spec:

    Parameter | Value
    Algorithm | PBKDF2 with HMAC-SHA-256
    Iterations | 65,536
    Output length | 256 bits
    Salt | 256-bit random, per-derivation

    We called the output the password-key. Both SCrypt (used at registration and login) and PBKDF2 run once per login on the Configurator, and both serve the same underlying purpose: making each guess at the password expensive enough that brute force is impractical. They guard different ciphertexts. SCrypt-derived material is what the Server stores as the over-hash and what an attacker would have to crack to recover the password from a database dump. PBKDF2-derived material is what wraps the elliptic-curve private key, where an attacker who got hold of that wrapped blob would need to brute-force the password through AES-GCM to read it. Within the cross-platform Java stack Dashman ran on, SCrypt (via Bouncy Castle) was the strongest password hashing primitive available, and PBKDF2 with HMAC-SHA-256 (via the JCE’s SecretKeyFactory) was the standard option for deriving a symmetric key from a password.
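A sketch of the password-key derivation with the PBKDF2 parameters from the table, again in Python rather than the JCE. The spec dictionary and function names are hypothetical; the point is only that the same password and the same stored spec always give the same 256-bit key.

```python
import hashlib
import secrets


def new_pbkdf2_spec():
    """PBKDF2 parameters as listed above, with a fresh 256-bit salt."""
    return {"hash": "sha256", "iterations": 65_536, "dklen": 32,
            "salt": secrets.token_bytes(32)}


def derive_password_key(password, spec):
    # Deterministic: any Configurator holding the spec can re-derive the key.
    return hashlib.pbkdf2_hmac(spec["hash"], password.encode("utf-8"),
                               spec["salt"], spec["iterations"],
                               dklen=spec["dklen"])
```

Because the salt here is independent of the SCrypt salt used for authentication, the hashed password that reaches the Server reveals nothing about the password-key.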

    The Configurator then encrypts the elliptic-curve private key under the password-key with AES in GCM mode (AES/GCM/NoPadding). The IV, the PBKDF2 salt, and the iteration count travel alongside the ciphertext when it is stored on the Server. The password never does. AES-GCM gives authenticated encryption, so a wrong password produces a GCM tag failure rather than silently decrypting into garbage that might look like a private key.

    Two empty rings are created next, the master ring and the displayer ring, and one random 256-bit AES key is added to each. The keys are pulled directly from the platform CSPRNG: there is no derivation, no input, just bytes. A ring is an ordered list of AES keys where the last entry is the current key (used for fresh encryption) and earlier entries are retained so that older ciphertexts remain decryptable. At registration, each ring has exactly one key; rotations described later append more.
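The ring itself is a small data structure. The sketch below is an assumption about its shape, not Dashman's class: `decrypt` is passed in as a parameter and stands in for an authenticated cipher such as AES-GCM, which Python's standard library does not provide.

```python
import secrets


class KeyRing:
    """Ordered list of 256-bit keys; the last entry is the current key."""

    def __init__(self):
        # Raw bytes from the OS CSPRNG: no derivation, no input.
        self.keys = [secrets.token_bytes(32)]

    @property
    def current(self):
        return self.keys[-1]

    def rotate(self):
        # Older keys stay in place so earlier ciphertexts remain readable.
        self.keys.append(secrets.token_bytes(32))

    def open(self, blob, decrypt):
        """Try the current key first, then progressively older ones.

        `decrypt(key, blob)` must raise ValueError when the key does not
        verify, the way an authenticated cipher rejects a bad tag.
        """
        for key in reversed(self.keys):
            try:
                return decrypt(key, blob)
            except ValueError:
                continue
        raise ValueError("no key in the ring decrypts this blob")
```

Fresh encryption would always use `ring.current`; only decryption ever walks the older entries.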

    Finally, each ring is wrapped with the user’s elliptic-curve public key using Bouncy Castle’s ECIESwithAES-CBC profile, with a 128-bit nonce and a 128-bit MAC. ECIES, as a family, glues an ECDH agreement to a symmetric cipher and a MAC; the BC profile chosen was the straightforward bundled option at the time, rather than assembling the construction from primitives by hand. Before encryption, each ring is placed inside a small verifier envelope identifying the user and account it belongs to; on decryption the verifier must match or the operation fails. The verifier is what makes a misrouted blob (say, one user’s ring pasted into another’s row) fail closed instead of silently decrypting into something the application would try to use.
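The verifier check can be illustrated separately from the encryption. In this sketch the JSON layout and field names are invented for the example, and the ECIES step is left out: `seal_envelope` produces the plaintext that would be encrypted, and `open_envelope` runs on the plaintext that decryption returns.

```python
import json


def seal_envelope(ring_keys, user_id, account_id):
    """Build the plaintext that would then be encrypted under a public key."""
    return json.dumps({
        "verifier": {"user": user_id, "account": account_id},
        "keys": [k.hex() for k in ring_keys],
    }).encode("utf-8")


def open_envelope(plaintext, expected_user, expected_account):
    """Check the verifier after decryption; fail closed on any mismatch."""
    doc = json.loads(plaintext)
    expected = {"user": expected_user, "account": expected_account}
    if doc["verifier"] != expected:
        raise ValueError("verifier mismatch: blob belongs elsewhere")
    return [bytes.fromhex(k) for k in doc["keys"]]
```

The check costs a few bytes per blob and turns a misrouted row into an explicit error instead of a ring full of keys that decrypt nothing.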

    At this point the Configurator holds in memory: the password (still, very briefly), the password-key, the elliptic-curve key pair, both rings, and the SCrypt hash of the password.

    It sends to the Server, in one PUT request: the public key, the ciphertext of the private key, the ciphertext of each wrapped ring, and the parameters needed to rebuild the password-key on a future login. The Server persists all of that as JSON it cannot decrypt.

    By the time the request returns, the password is dropped from memory. From the Server’s vantage, it is holding pieces of mathematics whose meaning is gated on a password it has never seen and on a private key it cannot reconstruct. From the Configurator’s vantage, it now has everything it needs to manage cookies and sites for this account, and a path to recover those materials on any other machine that knows the password.

    Logging in

    Login does two things at once: it authenticates the user to the Server, and it rebuilds the same in-memory state that registration produced. Those are independent code paths that happen to share a single input (the password) and a single transport (the Configurator’s HTTPS connection). If the first one succeeds and the second one fails, the user is signed in but cannot read anything.

    Login: the Configurator fetches the user's SCrypt spec from the Server, hashes the entered password with it, authenticates with HTTP Basic, then receives the encrypted key pair, the two encrypted rings, and the encrypted cookies and decrypts them locally.

    Authenticating

    Each user has their own SCrypt salt, generated at registration. The Configurator does not yet know it on a fresh install, so it asks the Server: given this email, what spec should I use? The answer is not secret. Without the password, the spec is useless. With the spec, the Configurator can reproduce the same SCrypt hash registration computed, regardless of which machine it runs on.

    The Configurator runs SCrypt with the fetched spec and forms HTTP Basic credentials of the form email:base64(hash). The Server runs SCrypt once more over the incoming hash with the spec that sits next to the over-hash, compares byte for byte, and either accepts or rejects. There is no separate session token; subsequent requests reuse the same Basic credentials until the Configurator clears them.
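Forming and parsing those credentials might look like the following sketch. It assumes standard (not URL-safe) Base64 for the hash, which is not stated above, and the function names are hypothetical.

```python
import base64


def basic_auth_header(email, hashed_password):
    """HTTP Basic header whose password field is base64(SCrypt hash)."""
    secret = base64.b64encode(hashed_password).decode("ascii")
    token = base64.b64encode(f"{email}:{secret}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def parse_basic_auth(header):
    """Server side: recover the email and the raw hash bytes."""
    scheme, token = header.split(" ", 1)
    if scheme != "Basic":
        raise ValueError("not a Basic credential")
    # Split on the first colon only; the Base64 secret never contains one.
    email, secret = base64.b64decode(token).decode("utf-8").split(":", 1)
    return email, base64.b64decode(secret)
```

The Server would then over-hash the recovered bytes and compare them with the stored value.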

    If the spec the Configurator received is older than current defaults (say, the cost parameter has been raised since the account was created), the Configurator re-hashes the password with the current spec and uploads the upgraded hash on the next save. Work factors rise over the lifetime of the system without forcing anyone to reset their password.
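The upgrade decision reduces to comparing the stored spec with the current defaults. A minimal sketch, assuming the spec is a dictionary of SCrypt cost parameters and that `CURRENT_DEFAULTS` holds whatever the Configurator currently ships with:

```python
CURRENT_DEFAULTS = {"n": 2**14, "r": 8, "p": 1}  # assumed current settings


def needs_upgrade(stored_spec, defaults=CURRENT_DEFAULTS):
    """True when any stored cost parameter is below the current default."""
    return any(stored_spec[name] < defaults[name] for name in ("n", "r", "p"))
```

Since the check runs at login, while the plaintext password is still in memory, the Configurator can compute the stronger hash on the spot.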

    Recovering the keys

    Authentication only proves the password matches the stored over-hash. The Configurator still has to decrypt everything that follows. The Server returns, in the response to the authenticated GET, the encrypted elliptic-curve key pair, both encrypted rings, and the tenant’s encrypted cookies. The Configurator then:

    1. Reads the PBKDF2 spec stored alongside the encrypted private key and re-derives the password-key from the entered password. PBKDF2 with the same inputs yields the same key it produced at registration.
    2. Decrypts the elliptic-curve private key under the password-key. AES-GCM verifies that the ciphertext has not been tampered with, and a wrong password produces a tag failure rather than a corrupt private key.
    3. Decrypts each ring with the elliptic-curve private key, using ECIES. The verifier inside each ring’s ciphertext must match the expected user and account, otherwise decryption is rejected even if the algebra would have succeeded.
    4. Decrypts cookies under the master ring. The ring is tried current-key first, then older keys, until one verifies; that is how cookies encrypted before the last rotation remain readable.

    By the time the Configurator’s login process finishes, its memory looks identical to the state at the end of registration, except that the Configurator did not generate any of these materials; it derived and decrypted them. The password is dropped; the password-key has done its job and is discarded with it.

    A practical corner of this design: a wrong password and a tampered private-key blob both manifest as the same AES-GCM tag failure. In practice this rarely caused confusion because the authentication step ahead of decryption already filtered out the common case (a mistyped password).

    The reason the elliptic-curve private key is wrapped under the password rather than stored only in an OS keychain is that Dashman was designed to recover on a fresh install: type the password, get everything back, no preloaded secrets needed. A keychain-only design would have hurt that experience and would not have changed the security story, since whatever the keychain held would still need an unlock secret tied to the user.

    Changing the password

    A password change in Dashman is, deliberately, the cheapest cryptographic operation the system performs. The hard work is concentrated at registration; from that point on the password unwraps exactly one thing (the elliptic-curve private key) and nothing else.

    Password change: the Configurator hashes the current password to authenticate, hashes the new password with a fresh spec, derives a new password-key, re-wraps the elliptic-curve private key under it, and sends the bundle to the Server.

    The flow only runs while the user is already logged in, which means the Configurator already holds the decrypted elliptic-curve private key and both rings in memory. It does not hold the original password (login dropped it), so the user has to enter the current password again to prove identity to the Server, plus the new password to provide fresh derivation input.

    The Configurator does, in order:

    1. SCrypts the current password with the stored spec to construct authentication credentials, the same way login does. This proves to the Server that whoever is asking for the change still knows the password the account was created with.
    2. SCrypts the new password with a fresh spec (new salt, current defaults). This is the hash the Server will over-hash and store going forward.
    3. Re-derives a new password-key from the new password with PBKDF2 and a fresh spec. The old password-key was never persisted, so there is no need to invalidate it; it just stops being useful once the new one is in place.
    4. Encrypts the elliptic-curve private key, which is sitting in memory in plaintext, under the new password-key with AES-GCM. The public key does not change. The rings do not change.

    The Configurator sends, in a single request: the new hashed password and its spec, the new ciphertext of the elliptic-curve private key (with its new PBKDF2 salt embedded), and any other profile fields the user edited. The Server first verifies the request using the current password’s Basic credentials, then over-hashes the new hashed password, stores it in place of the old one, and replaces the encrypted key-pair blob.

    The master and displayer rings are untouched. Cookies, site configurations, and screenshots already in S3 do not need to be re-encrypted: their keys live in the rings, which are themselves wrapped under the unchanged elliptic-curve public key. Re-encrypting any of that on a password change would be a lot of work for no security benefit, since the password only ever protected one specific layer of the cryptographic onion.

    A password change does not retroactively invalidate any other Configurator that previously decrypted the private key. Cryptographic invalidation of in-flight or cached material is what ring rotation is for, and that runs in a different ceremony, described later.

    Approving a Renderer

    Setting up Dashman means enrolling at least one Renderer and at least one Displayer. The two ceremonies share almost the entire shape, both in the UI the customer touches and in the sequence of network calls underneath; what differs is what each agent ends up holding. The Renderer is the fuller case: it receives both rings (master to decrypt cookies and site URLs, displayer to encrypt screenshots). I’ll walk through it first.

    Renderer approval: the Renderer generates its own key pair and a random password and connects unapproved; the user approves it in the Configurator, which wraps both rings under the Renderer's public key; PubNub notifies the Renderer, which fetches and decrypts the rings and the cookies.

    Connecting without keys

    A Renderer is launched on a machine the customer controls. On first start it has no account, has nothing to render, and does not know anyone’s password. The only input it has is a target: a slug or email the user types in that identifies the tenant it wants to join.

    Locally, before contacting anyone, the Renderer generates two things. First, its own elliptic-curve key pair on secp521r1, the same curve the user picked at registration but generated independently and never derived from the user’s. Second, a random password drawn from the platform CSPRNG, which the user never sees. That password is purely a credential for talking to the Server later; it is not used to derive any encryption material. Unlike the Configurator, where a wrapped copy of the private key is uploaded so the user can log in from any machine, a Renderer’s private key never leaves: both the credential and the private key live and die with this specific machine, and onboarding another Renderer means generating fresh ones on the new machine with no way to transfer the original identity.

    It sends to the Server: the tenant identifier, a display name (typically the machine’s hostname, useful when the user is looking at a pending Renderer in the Configurator and has to recognize it), the public half of the key pair, and the random password.

    The Server hashes the password once with SCrypt and persists a Renderer record marked unapproved. A single hash, rather than the double SCrypt used for users, is appropriate here because the password was generated by a CSPRNG: it already has too much entropy for an offline guesser, so extra stretching adds nothing, and the hash exists only so the Server is not storing the credential in clear. The Renderer is now visible on the Server, but the only material attached to it is its public key, its name, and authentication state. The Server then publishes a PubNub notification on the tenant’s user channel, which is how the Configurator learns there is a pending Renderer.
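The credential half of enrollment can be sketched as follows. The elliptic-curve key pair is omitted because Python's standard library has no EC key generation, the SCrypt parameters are borrowed from the user spec as an assumption, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_agent_password():
    # 32 bytes from the OS CSPRNG, URL-safe encoded; no human ever types it.
    return secrets.token_urlsafe(32)


def server_hash_agent_password(password, salt=None):
    """Single SCrypt pass; returns (salt, digest) for the Server to persist."""
    salt = salt or secrets.token_bytes(32)
    digest = hashlib.scrypt(password.encode("ascii"), salt=salt,
                            n=2**14, r=8, p=1, dklen=32)
    return salt, digest


def server_check_agent_password(password, salt, stored_digest):
    # Recompute with the stored salt and compare in constant time.
    _, digest = server_hash_agent_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)
```

The Renderer would keep the password on local disk and present it on every request, while the Server keeps only the salt and digest.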

    Approving

    In the Configurator, the user sees the pending Renderer appear in the list and clicks Approve. This is the cryptographic step: it is the moment the user, who has the only copy of the unwrapped rings, decides that a specific Renderer, identified by its public key, is allowed to decrypt them.

    The Configurator wraps both rings, master and displayer, under the Renderer’s public key with ECIES, using the same ECIESwithAES-CBC profile and the same verifier envelope as everywhere else. The verifier here ties the wrapped material to the Renderer’s identifier and the account identifier, so a copy of one Renderer’s ciphertext cannot be reused by a different Renderer even if both belonged to the same tenant. The Configurator then PUTs the wrapped rings, with the Renderer marked approved, to the Server.

    Setting approved without wrapping the rings would be inert: the Renderer can authenticate either way once it has a password, but it cannot decrypt anything until the encrypted rings exist. Approval and key wrapping are bundled in the same request to keep the two from drifting out of sync.

    A subtle point worth pulling out. The user’s rings were already wrapped, at registration, under the user’s elliptic-curve public key. Approval wraps them a second time, under the Renderer’s elliptic-curve public key. The Server ends up holding two ciphertexts that decrypt to the same plaintext: one for recovery on a fresh Configurator (login on a new laptop), one for use on the Renderer. The Configurator only ever sends material wrapped under public keys; the user’s password never reaches the Renderer machine, and the Renderer’s private key never reaches the Configurator.

    Coming online

    The Server publishes a second PubNub notification, this time on the Renderer’s channel, saying the Renderer is now approved. The Renderer fetches its own record (now with encrypted rings populated), decrypts the rings under its own private key, decrypts the tenant’s cookies under the master ring, and from that point on can render pages.

    Approving a Displayer

    The ceremony for adding a Displayer looks almost identical to the one for a Renderer. The Displayer generates an elliptic-curve key pair and a random password locally, sends a connection request with its public key, the Server hashes the password and creates an unapproved record, the Configurator gets a PubNub notification, the user clicks Approve, the Configurator wraps a ring under the Displayer’s public key, the Server stores it, PubNub tells the Displayer, the Displayer decrypts.

    Displayer approval: same shape as the Renderer ceremony, but only the displayer ring is wrapped and shipped.

    But did you notice the difference?

    The Configurator wraps a ring (singular), not the rings. The master ring is not part of the payload, and on the Displayer side, the accessor that would return the master ring deliberately returns nothing. The Displayer’s reality is: it receives only one ring, it only ever decrypts one kind of payload (screenshots), and there is no path in the codebase by which a master key could end up in its memory.

    That asymmetry is the entire reason the Displayer exists as a separate component. A Renderer is a trusted machine in the customer’s network (a data center, an admin’s desk) that needs cookies in order to do its job. A Displayer is the machine wired up to a screen in a lobby or a hallway. Even if a stranger walked up to the lobby machine with a USB keyboard and dumped everything in memory, all they would get is whatever screenshots had recently been displayed and the key that decrypts a few more from S3. The decryption authority that could log into Google Analytics never touches that machine, and the Configurator has no way to send it there.

    The flip side of that constraint is that the Displayer cannot do anything with a screenshot until it has been encrypted under its ring. Which is what the next section is about.

    Rendering a screenshot

    Up to this point every ceremony has been about provisioning. Once Renderers and Displayers are approved, Dashman spends the rest of its life in a loop where bytes flow through the system and end up on a screen. The architectural separation from earlier and the cryptographic plumbing from the last few sections finally pay off together in this loop.

    Screenshot rendering: a Displayer asks the Server for a screenshot, a render job goes through PubNub to a Renderer which fetches encrypted site config, renders the page, encrypts the screenshot for the displayer ring, uploads it to S3, and the Displayer downloads and decrypts it.

    A Displayer asks the Server for the best screenshot at its current resolution. The Server consults its cache. If the cache cannot satisfy the request, it queues a render job for a fresh screenshot and publishes a notification on the renderer channel. The earlier render-loop section covered the cache and the queue; here the focus is on what is encrypted at each step.

    A Renderer wakes up, claims the newest job (LIFO so that a Displayer waiting on the screen sees fresh pixels first), and gets back from the Server two things: the encrypted site configuration (URL, per-site delay, anything else specific to the site), and a pair of presigned S3 URLs (one for upload, one for download), both valid for a day. The site configuration is encrypted under the current master key; the Renderer decrypts it by trying every key in its master ring until one verifies, in practice the current one.
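
    The “try every key in the ring” step can be sketched as follows. This is an illustration under assumptions, not the original implementation: the KeyRing class, the newest-first order, and the IV-then-ciphertext layout are invented for the example, and the AES-GCM tag check stands in for the full verifier check described later.

    ```java
    import java.security.GeneralSecurityException;
    import java.security.SecureRandom;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import javax.crypto.AEADBadTagException;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;

    /** Hypothetical key ring: encrypt with the last key, decrypt by trying each key. */
    final class KeyRing {
        private static final int IV_BYTES = 12;
        private static final int TAG_BITS = 128;
        private final List<SecretKey> keys = new ArrayList<>();
        private final SecureRandom random = new SecureRandom();

        /** Appends a fresh 256-bit AES key, which becomes the current key. */
        void addNewKey() throws GeneralSecurityException {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256, random);
            keys.add(generator.generateKey());
        }

        /** Encrypts under the last key. Output layout: IV, then ciphertext and tag. */
        byte[] encrypt(byte[] plaintext) throws GeneralSecurityException {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, keys.get(keys.size() - 1),
                    new GCMParameterSpec(TAG_BITS, iv));
            byte[] body = cipher.doFinal(plaintext);
            byte[] out = Arrays.copyOf(iv, iv.length + body.length);
            System.arraycopy(body, 0, out, iv.length, body.length);
            return out;
        }

        /** Tries keys newest first; a tag failure means "wrong key, try the next". */
        byte[] decrypt(byte[] blob) throws GeneralSecurityException {
            for (int i = keys.size() - 1; i >= 0; i--) {
                try {
                    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
                    cipher.init(Cipher.DECRYPT_MODE, keys.get(i),
                            new GCMParameterSpec(TAG_BITS, blob, 0, IV_BYTES));
                    return cipher.doFinal(blob, IV_BYTES, blob.length - IV_BYTES);
                } catch (AEADBadTagException wrongKey) {
                    // Not this key; fall through to the next one.
                }
            }
            throw new AEADBadTagException("no key in the ring decrypts this payload");
        }
    }
    ```

    Starting from the newest key means the common case, a payload under the current key, succeeds on the first attempt.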

    With the URL in hand, the Renderer loads it in an embedded WebView with the tenant’s cookies attached. After the page has loaded, the Renderer waits the per-site delay (some sites take seconds for charts and fonts to settle), snapshots the WebView, and serializes the result as PNG bytes.

    It then encrypts those bytes with the current displayer key, using AES-GCM, and attaches a verifier identifying the screenshot and the site the bytes belong to. On the Displayer side, decryption will reject any blob whose verifier does not match the screenshot the Displayer asked for. The resulting ciphertext, along with its IV and the small envelope around it, is uploaded to S3 via the presigned PUT URL. The Server never sees the bytes; the upload goes directly from the Renderer to S3.

    The Renderer then PUTs a small marker back to the Server reporting the job done. The Server flips the screenshot to rendered, publishes a notification on the displayer channel, and, as an opportunistic latency improvement, returns the next claimable job in the same response so the Renderer can chain into it without round-tripping through PubNub.
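
    The claim-newest-first behaviour and the “next job in the same response” shortcut can be pictured with a small in-memory stand-in. RenderQueue and its method names are hypothetical, and the real Server kept this state in its database rather than in a deque.

    ```java
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Optional;

    /** Hypothetical in-memory stand-in for the Server's render queue. */
    final class RenderQueue {
        private final Deque<String> pending = new ArrayDeque<>();

        /** A cache miss queues a job for a fresh screenshot. */
        synchronized void enqueue(String screenshotId) {
            pending.push(screenshotId); // the newest job sits at the head
        }

        /** LIFO claim: the most recently queued job is handed out first. */
        synchronized Optional<String> claim() {
            return Optional.ofNullable(pending.poll());
        }

        /** Reports a job done and returns the next claimable job, if any. */
        synchronized Optional<String> complete(String finishedScreenshotId) {
            // A real Server would flip the screenshot to rendered and notify
            // the displayer channel here before answering.
            return claim();
        }
    }
    ```

    A Renderer loop built on this would keep calling complete until it gets an empty result, and only then go back to waiting for a notification.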

    A Displayer subscribed to the displayer channel receives the notification, fetches the screenshot metadata (which includes the presigned GET URL), downloads the ciphertext from S3, and decrypts it with its own displayer ring. The verifier check rejects any blob whose screenshot or site identifiers do not match what the Displayer asked for: a misrouted file fails closed instead of decrypting into something unexpected. Decryption succeeds, the Displayer hands the PNG to the screen rendering code, and the public sees pixels.

    Four things land at once at this point. The Renderer briefly saw cookies and a fully rendered web page, but they existed only in memory while it painted. The Displayer never saw cookies or a URL; it received an opaque blob, verified it, decrypted it, and showed it. The Server stored a record of which screenshots existed and pointers to where in S3 they lived, but it stored no plaintext; an operator browsing the production database, or an attacker who dumped it, would find ciphertext indexed by tenant. S3 held the ciphertext but could not read it. The end-to-end story holds.

    Decommissioning a Renderer

    Renderers and Displayers come and go. A data center contract ends, a machine is replaced, a Displayer in a lobby is suspected of having been physically tampered with. From a cryptographic point of view, the question is whether the new state of the system can be reached without invalidating the user’s password. The answer is yes, by rotating rings rather than rotating roots.

    Renderer removal: the Server deletes the Renderer record and notifies it via PubNub; the Configurator generates new master and displayer keys, re-wraps both rings for every remaining agent, re-encrypts site data, and saves the whole thing in one transaction.

    When the user removes a Renderer in the Configurator, the Configurator first asks the Server to delete the Renderer row. The Server deletes it and publishes a notification on the Renderer’s channel. If the Renderer is online, it receives the notification, logs out, and stops; its credentials no longer exist on the Server, so even if it tried to keep polling, the Server would reject it.

    Deletion alone is not enough. The removed Renderer kept whatever plaintext it had already extracted, and it kept its copy of both rings on local disk. If it is offline at the moment of deletion (the machine was stolen, the operator only knows it is missing), it will never receive the notification at all. Anything encrypted with the current master or displayer key from this point onward (future cookies, future site configurations, future screenshots) must be unreadable to the Renderer that just left.

    So the Configurator, immediately after the delete completes, performs the rotation:

    1. Generates a new random 256-bit AES key and appends it to the master ring. The previous master key stays in the ring; it has to, because cookies and site configurations encrypted before this moment are still in the database and still need to decrypt.
    2. Generates another fresh 256-bit AES key and appends it to the displayer ring, for the same reason: screenshots already in S3 are encrypted under the old displayer key and still need to be readable by the Displayers that survived.
    3. Re-wraps both extended rings under the user’s elliptic-curve public key, since the user themselves needs to recover the new state on the next login.
    4. For each Renderer that remains approved, re-wraps both rings under that Renderer’s public key with ECIES, replacing the previous ciphertext on the Server.
    5. For each Displayer that remains approved, re-wraps the displayer ring under that Displayer’s public key, replacing the previous ciphertext.
    6. Re-encrypts every site’s configuration (URL and cookies) under the new master key. Because encryption always uses the last key in the ring, anything saved from now on is under the new key only; the old key stays in the ring solely to decrypt historical artifacts.

    All of that, plus the deletion that prompted it, is sent to the Server as one PUT and committed in one transaction. Either every agent is updated and every site is re-encrypted, or the rotation is aborted; there is no window in which some agents have the new ring and others do not.
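
    Steps 1 to 5 amount to deciding who receives which ring. The sketch below shows only that bookkeeping, under stated assumptions: RingRotation, Wrapper, and Request are hypothetical names, the Wrapper interface stands in for ECIES wrapping under a recipient’s public key, and step 6 (re-encrypting site configurations) is left out.

    ```java
    import java.security.SecureRandom;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of what the Configurator assembles when rotating rings. */
    final class RingRotation {
        /** Stand-in for ECIES wrapping of a ring under one recipient's public key. */
        interface Wrapper {
            byte[] wrap(String recipientId, List<byte[]> ring);
        }

        /** One request body: recipient id -> wrapped ring. Field names are illustrative. */
        static final class Request {
            final Map<String, byte[]> wrappedMasterRings = new LinkedHashMap<>();
            final Map<String, byte[]> wrappedDisplayerRings = new LinkedHashMap<>();
        }

        static Request rotate(String userId,
                              List<String> remainingRenderers,
                              List<String> remainingDisplayers,
                              List<byte[]> masterRing,
                              List<byte[]> displayerRing,
                              Wrapper wrapper) {
            SecureRandom random = new SecureRandom();
            // Steps 1 and 2: append a fresh 256-bit key to each ring. Older keys
            // stay so that existing ciphertext keeps decrypting.
            masterRing.add(newKey(random));
            displayerRing.add(newKey(random));

            Request request = new Request();
            // Step 3: the user recovers both rings on the next login.
            request.wrappedMasterRings.put(userId, wrapper.wrap(userId, masterRing));
            request.wrappedDisplayerRings.put(userId, wrapper.wrap(userId, displayerRing));
            // Step 4: every remaining Renderer gets both rings.
            for (String id : remainingRenderers) {
                request.wrappedMasterRings.put(id, wrapper.wrap(id, masterRing));
                request.wrappedDisplayerRings.put(id, wrapper.wrap(id, displayerRing));
            }
            // Step 5: every remaining Displayer gets the displayer ring only.
            for (String id : remainingDisplayers) {
                request.wrappedDisplayerRings.put(id, wrapper.wrap(id, displayerRing));
            }
            return request;
        }

        private static byte[] newKey(SecureRandom random) {
            byte[] key = new byte[32];
            random.nextBytes(key);
            return key;
        }
    }
    ```

    The removed agent never appears in either map, so it has no ciphertext to fetch even if it could still authenticate, and no Displayer id ever lands in the master map.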

    The removed Renderer is now in a peculiar but desirable position. It still has its old copy of both rings, so any old ciphertext it kept around (a screenshot that happened to be in transit, a site configuration it cached before deletion) would still decrypt locally. But it has no credentials to fetch anything new from the Server, and even if it had a backdoor channel, every new screenshot in S3 is encrypted under a key it never received, every site configuration the Server holds is encrypted under a key it never received, and any new cookie the Configurator saves is encrypted under a key it never received. The decryption capability the Renderer kept is bounded by what it already had, not by what the system will produce going forward.

    Removing a Displayer follows the same routine. The Displayer never held the master key, so the master rotation is, strictly speaking, more than the threat requires; the displayer rotation is what matters. The codebase rotates both anyway because it is cheap, the routine is shared with Renderer removal, and rotating both leaves no current key the removed agent ever held in either ring. Simpler to reason about, no real cost at the volumes Dashman handled.

    Verifiers and wire formats

    A note about a detail that has shown up in every ceremony so far without much explanation. Every encrypted payload, symmetric or asymmetric, is wrapped in a small typed envelope before it is encrypted. The envelope holds the payload itself plus a verifier: a small object identifying what the payload is supposed to be. For user blobs the verifier carries the user identifier and the account identifier; for renderer and displayer blobs, the agent and account identifiers; for screenshots, the screenshot identifier and the site identifier. The envelope is serialized to JSON, encrypted as one piece, and on decryption the verifier inside the plaintext must match the verifier the caller expected. If it does not, decryption raises a verification failure rather than returning the bytes.
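
    A minimal sketch of the envelope idea, with two deliberate simplifications: the verifier is a plain string, and the serialization uses the JDK’s DataOutputStream instead of JSON so the example needs no libraries. Envelope and VerificationException are hypothetical names. Encryption is out of scope here; seal produces the plaintext that would be encrypted, and open runs on the bytes that came out of decryption.

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.UncheckedIOException;

    /** Hypothetical envelope: verifier plus payload, serialized together. */
    final class Envelope {
        /** Raised instead of returning bytes when the envelope is not the expected one. */
        static final class VerificationException extends RuntimeException {
            VerificationException(String message) {
                super(message);
            }
        }

        /** Builds the plaintext that would then be encrypted as one piece. */
        static byte[] seal(String verifier, byte[] payload) {
            try {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                DataOutputStream out = new DataOutputStream(bytes);
                out.writeUTF(verifier); // e.g. "screenshot:42/site:7"
                out.writeInt(payload.length);
                out.write(payload);
                out.flush();
                return bytes.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e); // not expected for in-memory streams
            }
        }

        /** Returns the payload only if the verifier inside matches the expected one. */
        static byte[] open(String expectedVerifier, byte[] decryptedEnvelope) {
            try {
                DataInputStream in = new DataInputStream(new ByteArrayInputStream(decryptedEnvelope));
                if (!in.readUTF().equals(expectedVerifier)) {
                    throw new VerificationException("verifier mismatch");
                }
                int length = in.readInt();
                if (length < 0 || length > in.available()) {
                    throw new VerificationException("malformed envelope");
                }
                byte[] payload = new byte[length];
                in.readFully(payload);
                return payload;
            } catch (IOException e) {
                throw new VerificationException("malformed envelope");
            }
        }
    }
    ```

    The caller always states what it expects before it gets any bytes back, which is what turns a misrouted payload into an exception instead of data.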

    This is not a digital signature. It does not prove the payload came from a particular party. What it gives instead is type-level safety inside the ciphertext: a payload meant for one tenant cannot be misrouted to another and silently decrypt into something the application would try to use. Crossing tenants is the kind of bug a system like this should not tolerate as silent corruption; verifier mismatches make it loud.

    For asymmetric wrapping the system uses Bouncy Castle’s ECIESwithAES-CBC with a 128-bit nonce and a 128-bit MAC. The curve choice (secp521r1) was the largest NIST curve with a mature ECIES profile in Bouncy Castle. The bigger curve hedges against future cryptanalysis at a runtime cost invisible at Dashman’s volumes. Assembling ECIES from primitives by hand would have been more work for no real gain at the time.

    For symmetric encryption the system uses AES in GCM mode (AES/GCM/NoPadding) with a per-payload random IV. GCM gives authenticated encryption (a tag that fails closed on tampering), which composes cleanly with the verifier envelope: a wrong key produces a GCM tag failure, a right key over the wrong tenant’s payload produces a verifier mismatch, and both are surfaced as decryption failures with no plaintext returned.
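
    For reference, this is roughly what AES/GCM/NoPadding with a per-payload IV looks like through the standard JCA API. It is a sketch, not the shipped code: GcmBox is a hypothetical helper, and the 96-bit IV and 128-bit tag are assumed sizes.

    ```java
    import java.security.GeneralSecurityException;
    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    /** Hypothetical helper around AES/GCM/NoPadding. */
    final class GcmBox {
        private static final SecureRandom RANDOM = new SecureRandom();
        private static final int TAG_BITS = 128; // assumed tag length

        static SecretKey newKey() {
            byte[] raw = new byte[32]; // 256-bit key
            RANDOM.nextBytes(raw);
            return new SecretKeySpec(raw, "AES");
        }

        /** A fresh IV must be drawn for every payload encrypted under the same key. */
        static byte[] newIv() {
            byte[] iv = new byte[12]; // 96 bits, an assumed size
            RANDOM.nextBytes(iv);
            return iv;
        }

        /** Returns the ciphertext with the authentication tag appended. */
        static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext)
                throws GeneralSecurityException {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            return cipher.doFinal(plaintext);
        }

        /** Throws AEADBadTagException on a wrong key or tampering; no plaintext leaks. */
        static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext)
                throws GeneralSecurityException {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            return cipher.doFinal(ciphertext);
        }
    }
    ```

    The IV is not secret and travels next to the ciphertext; what matters is that it is never reused under the same key.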

    I would revisit the curve and the specific ECIES profile if I were starting again today; standards drift and library defaults have moved on. The compartmentalization story (which keys live where, who can wrap material for whom, and what each component is mathematically prevented from reading) would stay the same regardless.

    That was fun

    Building Dashman was a lot of fun. Thinking through how it could be attacked, from lobby USB keyboards to a malicious operator rifling through S3, was fun in the same way hard puzzles are. Putting it into production and watching real customers use it was satisfying: the design decisions held up under real configuration mistakes, flaky networks, and Renderers disappearing mid-job, not only under the load tests I ran myself.

    I do wish it had seen more sustained load in the wild. It did get real usage, just not the “millions of screens” scale that would have stress-tested every corner of the queue and cache behavior beyond what I could simulate. Still, for a retrospective on a system I built years ago, that is a good problem to have.

  • When I was building Dashman, I hit a wall that Apple’s paid developer support told me was unsolvable. Then I solved it by reading the binaries in assembly until I understood them well enough to swap them out.

    A note up front: this happened a long time ago and I’m writing it from memory. The shape of the work is right; small specifics, API names, and exact sequence of events may be slightly off.

    What was Dashman?

    Dashman was a B2B SaaS for displaying business dashboards (which are, technically, just web pages) on the screens around an office. An early version of it was a native macOS app with an embedded browser. That’s the version this post is about.

    The problem

    On macOS, an embedded browser meant Apple’s WKWebView, which shared a single, system-wide cookie jar with Safari.

    That was a non-starter. Dashman needed to be logged into the dashboards it displayed, typically with accounts different from the user’s: Google Analytics, AWS consoles, internal admin tools. Mixing Dashman’s and Safari’s cookies would have caused constant overwrites in both directions. The longer-term goal was stricter still: one cookie jar per dashboard, so the same install could display two accounts of the same web application side by side.

    Long story short, I needed full control of the cookie jar. Other forms of local storage existed by then but were still nascent; the cookies were where the leakage actually happened.

    I paid Apple’s developer support fee and asked. The reply was friendly, brief, and unambiguous:

    After investigating whether different WKWebViews can each have their own cookie jar the answer is no they cannot — they all share a single container. If you feel that this is something that you need for your application, please file an enhancement request here:

    Challenge accepted, Mr Apple.

    Reading the cookie jar in assembly

    The macOS cookie jar is NSHTTPCookieStorage, a Foundation class with a small public API: get cookies, set cookies, delete by URL, the usual. Underneath, it talks to CFNetwork, the C framework Foundation wraps. The implementation I’d need to replace lived in a binary I didn’t have source for.

    So I opened CFNetwork.framework in Hopper Disassembler and started reading.

    Disassembled CFNetwork in Hopper, showing the implementation of -[NSHTTPCookieStorageInternal initInternalWithCFStorage:] and the symbol list of every NSHTTPCookieStorage and NSHTTPCookieStorageInternal method.

    I sometimes wonder what this would look like on Apple Silicon today: ARM instead of x86, Swift sneaking into Apple’s frameworks. Different bytes, presumably the same archaeology.

    What the disassembly revealed was a three-layer onion. The public class, NSHTTPCookieStorage, was a thin façade. Most of its methods forwarded to an internal Objective-C class, NSHTTPCookieStorageInternal, that held the actual state behind an NSRecursiveLock. That class in turn called down into a family of plain C functions named _CFHTTPCookieStorageCreateDefault, _CFHTTPCookieStorageGetClass, _CFHTTPCookieStorageCreateInMemory, and friends. Foundation wrapper on top, private Objective-C in the middle, C core at the bottom.

    Three-layer diagram of Apple's cookie storage stack, with the public Foundation façade above the private internal class and the CFNetwork store shared with Safari.

    The disassembly told me five things at once. Which selectors did real work. Which were forwarded. Which were vestigial. Where cookies actually got persisted. And, as a bonus, the error strings the binary fell back to when something went wrong: literal byte sequences like Cannot get default cookie store - using a memory store for this process were sitting right there in the data segment, telling me which failure modes Apple’s engineers had anticipated and which they hadn’t.

    It was, honestly, fun. Hopper’s cross-references made it possible to climb up from any C function to every Objective-C selector that called it, and from any selector back down to the persistence calls. After a while I had a mental map of which methods I’d need to satisfy and which I could safely ignore.

    There are no debug symbols in the binary, which made this hard. Method names survive (the Objective-C runtime needs them), but everything else is x86 instructions. What looks in the source like if (self.policy == NSHTTPCookieAcceptPolicyAlways) shows up as a cmp against an immediate, with a jne to a label that another four objc_msgSend calls eventually reveal as the rejection branch. You build the abstraction back up by hand.

    The replacement: a private cookie jar

    What I built was a single Objective-C class, CAHTTPCookieStorage, that implemented the subset of NSHTTPCookieStorage’s interface that real callers actually used. Cookies live in a dictionary keyed by a composite of (domain, path, name), all lowercased, so duplicates collapse on insert:

    - (NSUInteger)hash {
        return self.domain.hash ^ self.path.hash ^ self.name.hash;
    }

    - (BOOL)isEqual:(id)other {
        if (![other isKindOfClass:[CookieKey class]]) {
            return NO;
        }
        CookieKey *otherCookieKey = (CookieKey *)other;
        return [self.domain isEqual:otherCookieKey.domain]
            && [self.path isEqual:otherCookieKey.path]
            && [self.name isEqual:otherCookieKey.name];
    }

    The lowercasing matters more than it looks. The HTTP cookie spec is case-insensitive on domains and paths in some places and case-sensitive in others, and NSHTTPCookie itself doesn’t normalise consistently. Without the lowercased key, you can end up with two distinct entries for www.example.org and WWW.EXAMPLE.ORG and behaviour that diverges from Safari’s depending on which one you set last. The test suite (next section) caught this kind of thing immediately.

    Persistence is NSKeyedArchiver to a path under ~/Library/Application Support/<bundle>/Cookies, with the directory created lazily on first write. If the file is corrupted on read we surface it via an NSLog and start with an empty jar rather than crashing the host process. That seems obvious in retrospect; it wasn’t obvious the first time the archive deserialiser threw an exception three frames deep into application launch.

    Domain matching is the one piece that always trips people up. Cookies with a leading dot match subdomains; without a leading dot, they only match exactly. The whole rule collapses to a single ternary once you accept that:

    + (BOOL)match:(NSURL *)url toDomain:(NSString *)domain {
        NSString *host = [[url host] lowercaseString];
        if (host == nil) {
            return NO;
        }
        // Prepending "." lets ".example.org" match both "example.org" and its
        // subdomains, while "prefixexample.org" still fails the suffix test.
        return [domain hasPrefix:@"."]
            ? [[@"." stringByAppendingString:host] hasSuffix:domain]
            : [host isEqualToString:domain];
    }

    Expiry I made lazy. Every call to cookies walks the dictionary and evicts anything past its expiry. Cheaper than running a timer, and it always returns a clean view:

    - (NSArray *)cookies {
        // Collect expired keys first: the dictionary must not be mutated
        // while it is being enumerated.
        NSMutableArray *cookiesToRemove = [NSMutableArray array];
        for (CookieKey *key in _cookies) {
            NSHTTPCookie *cookie = [_cookies objectForKey:key];
            if ([cookie expiresDate] != nil
                && [[cookie expiresDate] isLessThan:[NSDate date]]) {
                [cookiesToRemove addObject:key];
            }
        }
        for (CookieKey *key in cookiesToRemove) {
            [_cookies removeObjectForKey:key];
        }
        return [_cookies allValues];
    }

    About 300 lines of core logic across the storage class, the cookie key, and the swizzle category. Most of it is RFC 6265 plumbing.

    A test suite for someone else’s code

    Once I had a working mental model, I wrote a test suite. Not against my own code: against Apple’s. The suite exercised every observable behaviour I could find. Cookie expiry, including Max-Age versus Expires. Domain matching, including the leading-dot wildcard rules and the prefix trap (prefixexample.com must not match a cookie set on example.com). Path prefix matching. Secure flag handling. Set-Cookie parsing. The order in which cookies came back. Case sensitivity quirks across domain, path, and name. Cookie acceptance policies (Always, Never). Round-tripping through archive and unarchive. I deliberately included behaviours I suspected were abandoned, on the theory that if they still existed in the binary, something somewhere depended on them.

    The trick that paid off: every test was written so it could run against either the real NSHTTPCookieStorage or my CAHTTPCookieStorage. Same assertions, swap the class name at the top of the file, run again. When something passed Apple’s run and failed mine, I had a divergence to chase. When it passed both, I knew I’d matched behaviour.

    There was a charming downside, immortalised in a comment at the top of the test file:

    Be aware though that these tests are cookie destructive. If you run them against NSHTTPCookieStorage it will destroy all your Safari cookies and if you run them against CAHTTPCookieStorage it will destroy all the Ninja cookies.

    The tests cleared the jar before each run so the assertions had a known starting state. Validate them against Apple’s class to confirm the spec, and you’d lose every Safari cookie on the machine. Bank logins, session tokens, the lot.

    Once swizzling was in (more on that later), I went a step further. Every test got a twin with the suffix ThroughSharedStorage: same assertion, but reaching in through NSHTTPCookieStorage.sharedHTTPCookieStorage() rather than my class directly. This proved two things at once. First, the swizzle was alive: a call into the Foundation singleton was actually landing in my code. Second, my class behaved identically whether you called it directly or arrived through the Apple façade. Both halves had to be true for the integration to be safe, and the dual-path testing made any drift between them impossible to miss.

    This was the contract. Whatever I built next had to satisfy it.

    Method swizzling, the final step

    The last piece was making the rest of the system call my implementation instead of Apple’s. Objective-C method swizzling lets you swap the implementation of a selector with another at runtime: the class is still NSHTTPCookieStorage, the API surface is unchanged, but underneath it now calls into mine.

    I used JRSwizzle as the runtime helper, ran the swap from a +load method (which the Objective-C runtime calls before main), and guarded it with dispatch_once so a category accidentally loaded twice wouldn’t try to double-swap and undo itself:

    + (void)load {
        static dispatch_once_t swizzleMethodsToken;
        dispatch_once(&swizzleMethodsToken, ^{
            NSError *error = nil;
            [[self class] jr_swizzleMethod:@selector(deleteCookie:)
                withMethod:@selector(caDeleteCookie:) error:&error];
            [[self class] jr_swizzleMethod:@selector(cookieAcceptPolicy)
                withMethod:@selector(caCookieAcceptPolicy) error:&error];
            [[self class] jr_swizzleMethod:@selector(setCookieAcceptPolicy:)
                withMethod:@selector(caSetCookieAcceptPolicy:) error:&error];
            [[self class] jr_swizzleMethod:@selector(cookies)
                withMethod:@selector(caCookies) error:&error];
            [[self class] jr_swizzleMethod:@selector(cookiesForURL:)
                withMethod:@selector(caCookiesForURL:) error:&error];
            [[self class] jr_swizzleMethod:@selector(sortedCookiesUsingDescriptors:)
                withMethod:@selector(caSortedCookiesUsingDescriptors:) error:&error];
            [[self class] jr_swizzleMethod:@selector(setCookie:)
                withMethod:@selector(caSetCookie:) error:&error];
            [[self class] jr_swizzleMethod:@selector(setCookies:forURL:mainDocumentURL:)
                withMethod:@selector(caSetCookies:forURL:mainDocumentURL:) error:&error];
        });
    }

    Eight selectors. That was all the public API anyone called. The list maps one-to-one to the symbols I’d seen in Hopper’s left sidebar.

    There was one subtlety. My first instinct was to override +sharedHTTPCookieStorage so that anyone asking for the singleton would get my object instead of Apple’s. That didn’t work, and I wrote down the reason at the top of the swizzle header so I wouldn’t forget:

    We are not hijacking sharedHTTPCookieStorage because something from Safari calls it, gets a copy of NSHTTPCookieStorage and then calls private methods in it that CAHTTPCookieStorage doesn’t implement.

    The crash was deep in CFNetwork, with a stack trace pointing at internal selectors I’d seen but deliberately not bothered to ship. Swizzling on the class itself, on the real NSHTTPCookieStorage, sidestepped it. The shared instance stays the shared instance. Its public methods just do something different now. The private ones still work, because they’re still Apple’s.

    Two side-by-side process containers: a cyan Dashman process whose NSHTTPCookieStorage has the eight swizzled public selectors routed to a remotely synchronized cookie jar, and a red Safari process whose NSHTTPCookieStorage delegates unchanged through CFNetwork to the local cookie jar.

    There was a second subtlety, easy to miss. WebKit, by default, processes cookies for you on every request. If you only swizzle storage, you still get Safari-cookie behaviour because WebKit’s request pipeline reaches into its own cookie machinery before your storage runs. The fix is to disable WebKit’s cookie handling per-request and call the storage manually:

    request.HTTPShouldHandleCookies = false
    CAHTTPCookieStorage.sharedHTTPCookieStorage().handleCookiesInRequest(request)

    Same on the response side: receive the response yourself, extract Set-Cookie headers, hand them to the storage, then return. This had to be wired into WKWebView’s request pipeline in two places: once on outgoing requests (to attach our cookies) and once on incoming responses (to harvest Set-Cookie). Redirects had to be handled twice, on both legs, because the redirect response itself can carry cookies that the next request needs.

    What shipped

    A single, fully isolated cookie jar separate from Safari, for the macOS build. That solved the immediate problem: nothing else on the user’s machine could be authenticated as Dashman just by being on the same machine, and Dashman couldn’t accidentally inherit the user’s personal Safari sessions either. The longer-term goal of one jar per site I designed for but didn’t ship before the architecture pivoted to a Java/JavaFX client where I owned the entire HTTP stack and the swizzling story became unnecessary.

    I’d call that a healthy outcome. The hack solved a real, present problem. The future problem got solved a different way, by changing the architecture rather than by deepening the hack.

    What I took from it

    A few things I’ve carried with me since:

    One more proof: “can’t be done” is, for me, a starting line. Long before Apple’s reply, treating “impossible” as a cue to dig in had become a reflex. When someone tells me a thing can’t be done, that’s the moment I commit to doing it, and I’ll go as deep as the problem needs. Apple’s reply was technically accurate from inside their boundaries. I went outside them. Here that meant x86 assembly. Next time it’ll be whatever the next problem demands.

    Behavioural equivalence is a real specification. When the source is unavailable, the binary is the spec, and a test suite written against the binary is the next best thing. The suite was as valuable as the replacement, because it told me, precisely, when I’d drifted. The dual-target trick (run the same tests against Apple’s class and mine) and the dual-path trick (run them direct and through the swizzled singleton) were what made the substitution trustworthy rather than hopeful.

  • A few years ago, at Wifinity, around 4 in the morning, I made one of the hardest decisions of my career: I rolled back a migration we’d spent months preparing for.

    We were migrating the production infrastructure of an ISP serving 200k+ customers. Traffic was already flowing through the new environment in read-only mode, the cutover was minutes from its point of no return, and we’d burned every minute of buffer trying to fix the issue that had surfaced. As we hit the edge of the rollback window, I made the call to bail.

    The rollback executed like the most rehearsed ballet production I’ve ever seen. We re-ran the migration cleanly a week later. Zero customer complaints.

    This post is about how rollback became the thing that made the migration work.

    The situation

    I’d inherited a production environment from a third-party agency that had been running our ISP’s software co-mingled with services for unrelated clients: one shared cluster, one database server hosting multiple databases, our services intermixed with theirs. There was no clean way to separate “our” infrastructure without fully migrating off.

    So we replicated the entire production stack into our own AWS environment, set up database replication between the two, and planned a single overnight cutover. The software was authenticating every Wi-Fi user across military bases, holiday parks, hotels and universities. Communication between the application and the network gear ran in both directions.

    And we had two overnight windows, a week apart, in which to do it. The expectation going in was that one window would be enough. I negotiated for two, because I like building wide margins of error into anything this consequential. If we missed both, the next opportunity wasn’t for another six months.

    Those two windows were both the earliest and the latest realistic options we had. The months on either side were full of problematic dates. It was tight enough that one of the windows fell on my wife’s birthday. She was understanding. Still, as an apology I bought her the Hogwarts LEGO castle. In our relationship it’s become a meme. She asks if work is making me miss her birthday again so she can pick out the next big apology LEGO set.

    The reverse runbook

    Every step of the forward runbook had a corresponding reverse step. About 120 steps in total. Some reverse steps were no-ops. Most weren’t. We wrote them in parallel with the forward steps. Writing the forward runbook and “thinking about rollback later” was never on the table.
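
    As a rough illustration of that pairing (not the spreadsheet we actually used), here is a minimal Python sketch in which a step cannot exist without its reverse. The `Step` and `Runbook` names and the two step names are hypothetical, and the strictly sequential rollback is a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    forward: Callable[[], None]
    reverse: Callable[[], None]  # some reverse steps are deliberate no-ops


@dataclass
class Runbook:
    steps: List[Step]
    completed: List[Step] = field(default_factory=list)

    def advance(self) -> bool:
        """Run the next forward step; return False when none remain."""
        if len(self.completed) == len(self.steps):
            return False
        step = self.steps[len(self.completed)]
        step.forward()
        self.completed.append(step)
        return True

    def rollback(self) -> None:
        """Undo the completed steps, most recent first."""
        while self.completed:
            self.completed.pop().reverse()


log: List[str] = []


def make_step(name: str) -> Step:
    # Hypothetical steps that only record what they would have done.
    return Step(name, lambda: log.append(f"do {name}"), lambda: log.append(f"undo {name}"))


runbook = Runbook([make_step("switch traffic"), make_step("enable writes")])
runbook.advance()   # only the first step runs
runbook.rollback()  # and only that step is undone
print(log)  # ['do switch traffic', 'undo switch traffic']
```

    The point of the shape is that `Step` has no default for `reverse`: whoever writes the forward action has to decide, at that moment, what undoing it means.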

    Treating rollback as a first-class artifact had design consequences.

    Engineer the point of no return to be as late as possible. The new infrastructure ran in read-only mode for as long as we could keep it there. Traffic flowed through it, customers got internet directly without going through the captive portal, and no new state was being written to the new system. Until we crossed the read-only-to-read-write boundary, the rollback was always available. We could test that things were working properly and still bail out.

    Keep users online by default. Our internal principle: it’s better to give internet access and not charge for it than to not give internet access at all. With one exception for a customer with specific security requirements, that principle simplified a lot of judgment calls during the cutover.

    Handle the legally-required logs inside the rollback envelope. UK law required us to keep connection logs for a defined retention period. During the read-only phase we captured them via a separate path, in a separate format, in a separate location. We made the deliberate choice to never integrate those logs back into the new system. They sat in storage for the retention period and that was that. Sometimes the right call is to accept a small permanent ugliness in exchange for never having to think about it again.

    The best workflow tool was…

    The migration was more than 100 individual steps, each with its reverse counterpart, forming a dependency graph that would split and merge, with different people in different teams performing different actions. The merges meant a person stopping and waiting for someone else to catch up. Not respecting these dependencies, in some cases, could have had catastrophic results.
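
    To make the split-and-merge shape concrete, here is a small sketch using Python’s standard `graphlib` module. The step names and the `waves` helper are hypothetical. Each wave is a set of steps that different people could run in parallel, and a merge is a step that waits for everything it depends on.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# step -> the steps it must wait for (all names are made up for illustration)
dependencies: Dict[str, Set[str]] = {
    "freeze deploys": set(),
    "verify db replication": {"freeze deploys"},
    "repoint network gear": {"freeze deploys"},
    "switch traffic": {"verify db replication", "repoint network gear"},
}


def waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; every step in a wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for number, wave in enumerate(waves(dependencies), start=1):
    print(number, wave)
# 1 ['freeze deploys']
# 2 ['repoint network gear', 'verify db replication']
# 3 ['switch traffic']
```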

    We tried workflow tools that could express all of this, but in the end, the best tool was a spreadsheet. It let us easily reorder steps, which we did a lot during planning, and it was flexible enough to capture everything else.

    Adversarial rehearsals

    We rehearsed the migration repeatedly over the months before the windows. The format was a mix: some live commands against the real environment, some test commands against a test environment that didn’t quite mirror production, and a lot of tabletop walkthroughs. We didn’t have a clean production analogue to play with, so we leaned on inspecting real systems and on injecting faults verbally: “assume this server is not responding from now on, what happens next, can we finish the migration?”

    Each rehearsal was more adversarial than the last. Late in the cycle we ran one where every engineer in the company was invited and asked to aggressively poke holes. They found issues. We kept rehearsing until nobody could think of any more.

    In the final rehearsal, I dropped from the call on purpose, without warning. My role was orchestrating the steps. I wanted to know whether the team could continue without me. Someone picked up the orchestration within a minute and the rehearsal carried on. By the time of the real migration I knew the team didn’t need me present for it to succeed. You know, in case my ISP boycotted me and decided to kill my internet that evening. We had contingency plans for other engineers too, but they were more complex, sometimes forcing us into rollback.

    A note on team shape. Most of my engineers had been at the company about six months. Almost nobody had deep tenure on the inherited system. The rehearsals were both a stress test of the plan and the team’s training course in how the system actually worked. The veterans we had (on the network side, plus a few from the original agency) brought essential know-how and surfaced questions we wouldn’t have thought to ask.

    The decision

    In the first window, a database incompatibility surfaced after traffic was already switched, with the new infrastructure running in read-only mode. We worked the problem. We burned every minute of buffer. As we hit the edge of the rollback window, I had to choose: keep trying to fix it, or bail.

    Even with all the preparation, the call was one of the hardest of my career. The competing voices in my head: maybe one more hour and we solve it. What if the rollback itself fails on something we missed? What’s more risky here: solving an unknown unknown under time pressure, or executing a rehearsed rollback? And underneath all of that, the knowledge that I’d planned the second window as a buffer to not be used, and consuming it meant I’d planned for two windows when I should have planned for three.

    I’d file that as a transferable lesson: buffer that gets used isn’t buffer. If your plan needs all N windows to succeed, you didn’t have margin, you had a fallback. Plan for one more than you think you need.

    I called the rollback. It ran beautifully. Smooth, quick, parallel across teams. We landed back on the original infrastructure right at the edge of our time budget. The rehearsals had paid off.

    The week between

    The week between the two windows looked like roughly one day to understand and mitigate the database issue (a calm second look at the problem, away from the pressure of an active cutover, made it tractable), and the rest of the week double-checking, searching for further unknown unknowns. We didn’t run more rehearsals.

    The mood I shielded my team from: I had dread going in. I was confident in the rollback if we needed it (we’d just proved it worked), but a second failure would no longer have been a tech problem. It would have been a business problem, with consequences I’d be answering for. I don’t think the team felt it. I made sure of that.

    The second window ran clean. We had a five-hour window booked. We were done in under three. One of those hours was budget reserved for a rollback we didn’t need.

    We monitored customer support tickets live during the cutover. Customers complain about Wi-Fi immediately. At one point during the migration the support monitor interrupted the call to announce a ticket about no internet access. Everyone’s heart stopped. We started checking systems. Then someone re-read the ticket carefully: it was a customer of one of our competitors who had misdialled us. Zero real complaints.

    I’m now ready for my next big migration, but I’ll try to avoid my wife’s birthday. LEGO is expensive.

  • I see a lot of grandiose statements floating around:

    You will not be replaced by AI, you’ll be replaced by someone using AI.

    AI won’t replace your creativity.

    AI can’t do THIS or THAT.

    AI will never be able to do THIS or THAT.

    And many more.

    My take: maybe. Who knows! LLMs are getting better at an alarming rate and they behave in unexpected ways. They make the simplest mistakes but at the same time, they have solved some of the hardest math problems. This is well known as the jagged edge. The future is extremely hard to predict and every day it’s getting harder.

    But here’s what AI can’t do now and probably won’t for a while: have true agency.

    ChatGPT and Claude do nothing until someone tells them what to do. Same for all LLMs. Even the ones that seem to be always up, like OpenClaw, are just LLMs triggered on a timer. Without the timer, nothing happens.

    In concrete terms, no LLM is waking up one day and deciding to start a company. Some people have experimented with putting an LLM in the CEO seat, but those experiments would not have happened without a human applying their agency first. A human starting the company and prompting an LLM to make the decisions. The agency there belonged to the human, not the LLM.

    So the question is: what do you do when nobody tells you what to do?

    I think people tend to divide into two categories here. Those that wait and those that find something to do. Those that wait may be safe for a while, as long as their specific skills don’t yet transfer well to an LLM. But given enough time, AI will likely acquire all the skills. And then their jobs are at risk.

    Those that find something to do are irreplaceable.

    One expression of that is deciding to start a company. But it shows up everywhere. Deciding to paint a picture or write a poem. Deciding what they’ll be about, what their aesthetics will be. Those decisions will continue to be irreplaceable.

    Well… they’ll be the last thing to be replaced. Because you can only replace them with an entity that is always active (LLMs wake up on a prompt) and that wants things. The moment we have AI that is always active and that wants things, we have bigger problems than our jobs. We’ve created a new species that will eventually be better at everything than us. Hopefully it’ll be friendly.

    The purpose of this post is not to dream or be scared about that future. It’s to convey the fact that deciding to do something with no inputs is the final frontier.

    If you want your job to be safe, find a way for it to not require inputs. If you need a ticket to write code, you are at risk of being replaced by an LLM that can write the code from the ticket. If instead you write the ticket then you are much harder to replace.

    This is why the profession of PM, the Product Manager, is flourishing in the era of AI. Of all the roles at a company, it is the most open one, the one that has no or very few inputs (finding and deciding what inputs to use is part of the job). All building jobs will be a lot more like a PM’s in the future (or will have been replaced by an LLM).

    Having that agency to do something when nobody asks sets you apart.

  • SaaS valuations dropped. The narrative is simple: AI lets anyone build anything, so why pay for software? They’re calling it the SaaSpocalypse.

    I think part of it makes sense. But a big part of it doesn’t.

    You don’t buy code, you buy decisions

    When you buy a piece of software you’re not just buying the code. You’re buying the decisions that shaped that code. What to build, what not to build, how to model the domain, which tradeoffs to make. Those decisions were valuable before AI and they are still valuable today.

    This is why I think we’re entering the era of product managers. They are the ones making those decisions. And those decisions require deep domain knowledge. Gaining that domain knowledge is not trivial.

    The Fusion test

    Let me make this concrete. I do 3D design for 3D printing. It’s a hobby, not my life’s work. I use Autodesk Fusion.

    Should I vibe code my own 3D design tool instead?

    Vibe coding something equivalent would cost far more in my time and token spend than the Fusion subscription. But cost isn’t even the main problem. The main problem is that I don’t know the domain.

    Fusion has a specific set of tools for creating 3D designs. It starts with parameterized 2D sketches, from which you build parameterized 3D shapes. Choosing the right set of primitives to manipulate 3D geometry inside a computer (flexible enough to be useful, constrained enough to be learnable) is not trivial. It probably took the industry years of trial and error to get here. I remember the primitive AutoCAD of the 90s. We’ve come a long way. I would have to replicate that trial and error, and that is not cheap.

    Technically, I could ask an AI to clone Fusion. But that is only possible because Fusion exists. You could argue that’s fine, it exists today. But next year, Fusion will have evolved. Without Fusion pushing the domain forward, there is nothing to clone.

    And no matter how good I am at building Fusion with AI, the people at Autodesk would be better at it. They have the same AI tools, but more domain knowledge.

    This generalizes

    Pick any domain and you’ll find the same thing. I wouldn’t want to vibe code my accountancy software, or my email client, or my calendar. Don’t believe me? Think about how you would represent recurring calendar entries where each recurrence can be individually rescheduled. This is not a trivial problem.
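
    To give a taste of the calendar problem, here is a deliberately naive Python sketch: a weekly series plus a table of per-occurrence overrides, similar in spirit to how the iCalendar format overrides individual instances. The `WeeklySeries` class and its fields are hypothetical, not taken from any real calendar product.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class WeeklySeries:
    first: date
    count: int
    # original date -> new date, or None if that occurrence was cancelled
    overrides: Dict[date, Optional[date]] = field(default_factory=dict)

    def occurrences(self) -> List[date]:
        result = []
        for i in range(self.count):
            original = self.first + timedelta(weeks=i)
            actual = self.overrides.get(original, original)
            if actual is not None:
                result.append(actual)
        return sorted(result)


series = WeeklySeries(first=date(2024, 1, 1), count=3)
series.overrides[date(2024, 1, 8)] = date(2024, 1, 10)  # move the second meeting
print(series.occurrences())
# [datetime.date(2024, 1, 1), datetime.date(2024, 1, 10), datetime.date(2024, 1, 15)]
```

    Even this leaves the hard questions open. If the whole series later moves from Mondays to Tuesdays, what happens to the override keyed on the old Monday? What about time zones, series with no end date, or an edit that applies to “this and all following” occurrences? Each answer is a product decision, which is the point.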

    For some tools, the domain is trivial, and the code was the moat. That moat is gone. The drop in valuation for those companies makes sense. But when the moat is decisions, taste, or domain knowledge, it’s still there.

    The real SaaSpocalypse

    The real SaaSpocalypse is not customers vibe coding their own tools instead of buying them. It’s the new startup that vibe codes a competitor.

    The real danger to Autodesk is not me building my own Fusion. It’s someone who dedicates their life to 3D design building a competitor with AI.

    That newcomer would have an AI-friendly codebase from day one. If they invest in keeping it that way, they would add features much faster than Autodesk can on top of decades of tech debt.

    The real SaaSpocalypse is that you can now catch up to and surpass incumbents with less effort, fewer people, less investment. And the incumbents, to fight that off, will need to rebuild their internals to be AI-friendly. At some point — as crazy as it sounds — it might be faster to start from scratch than to evolve an old, tech-debt-laden codebase.

    Buy vs. build still applies

    At work we recently needed a tool. It was going to cost us tens of thousands of dollars. An engineer proposed we vibe code it instead. I seriously considered it. We understand this domain much better than I understand 3D design, and for internal tools you can move fast when security constraints are lighter.

    But it would have taken us 2 to 3 weeks. And during those weeks we would not have been developing our core product. Our unique differentiator is our own product, not a tool we can buy and use off the shelf. Our competitors can buy that same tool and have it running in hours. We shouldn’t spend weeks to get there. Focusing on our core product was the right call.

  • I’m sure this is not my idea, so I’m not claiming it to be. I’ve been wanting to do a sort of continuous AI eval in production for a while, but the situation never presented itself at work. It was a mixture of having the data to do the eval offline and wanting to avoid the risks of doing it in prod. But now I’m going to do it for a side project.

    I don’t want to reveal what my side project is yet, so I’ll keep it vague. I’m very excited about this part, so I wanted to share it early. And I’m hoping that the Internet will tell me, as it usually does, if this is a bad idea.

    I have a task that will be done by an AI, and I can measure how successfully it was done, but only 2 to 7 days after the task is completed, once I’ve seen the result out there in the world. I will gather some successful examples to use as part of the prompt, but I don’t have a good way to measure the AI’s output other than my personal vibes, which is not good enough.

    My plan is to use OpenRouter and use most models in parallel, each doing a portion of the tasks (there are a lot of instances of these tasks). So if I go with 10 models, each model would be doing 10% of the tasks.

    After a while I’m going to calculate the score of each model and then assign the proportion of tasks according to that score. So the better scoring models will take most of the tasks. I’m going to let the system operate like that for a period of time and recalculate scores.

    After I see it become stable, I’m going to make it continuous, so that day by day (hour by hour?), the models are selected according to their performance.

    Why not just select the winning model? This task I’m performing benefits from diversity, so if there are two or more models maxing it out, I want to distribute the tasks.

    But also, I want to add, maybe even automatically, new models as they are released. I don’t want to have to come back to re-do an eval. The continuous eval should keep me on top of new releases. This will mean a fixed percentage for models with no wins.
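
    Here is a minimal sketch of how I picture that allocation, assuming scores are non-negative numbers where higher is better. The `allocation` and `pick_model` functions, the model names, and the 10% floor are all placeholders, not something I’ve built yet.

```python
import random
from typing import Dict


def allocation(scores: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Give every model `floor` of the tasks, then split the rest by score."""
    n = len(scores)
    if floor * n > 1:
        raise ValueError("floor is too large for this many models")
    total = sum(scores.values())
    if total == 0:
        # No measurements yet: split evenly, as in the first phase.
        return {model: 1.0 / n for model in scores}
    remaining = 1.0 - floor * n
    return {model: floor + remaining * score / total for model, score in scores.items()}


def pick_model(shares: Dict[str, float], rng: random.Random) -> str:
    """Choose the model for one task according to its share."""
    models = list(shares)
    return rng.choices(models, weights=[shares[m] for m in models], k=1)[0]


# Hypothetical scores; "model-new" was just released and has no wins yet.
scores = {"model-a": 3.0, "model-b": 1.0, "model-new": 0.0}
shares = allocation(scores, floor=0.1)
print({model: round(share, 3) for model, share in shares.items()})
# {'model-a': 0.625, 'model-b': 0.275, 'model-new': 0.1}
print(pick_model(shares, random.Random(0)) in scores)  # True
```

    The floor is what keeps a brand-new model in the rotation long enough to earn a score, and the proportional split is what keeps diversity among the models that are already doing well.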

    What about prompts? I will do the same with prompts. Having a diversity of prompts is also useful, but having high-performing prompts is the priority. This will allow me to throw prompts into the arena and see them perform. My ideal would be all prompts in all models. I think here I will have to watch out for the number of combinations making it take too long to get statistically significant data about each combination’s score.
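
    A back-of-the-envelope sketch of that concern, with made-up numbers. It assumes tasks are spread evenly across combinations and that the last result takes the full 7 days to become measurable. The `days_until_scored` function and every number passed to it are hypothetical.

```python
import math


def days_until_scored(models: int, prompts: int, tasks_per_day: int,
                      samples_needed: int, feedback_delay_days: int = 7) -> int:
    """Days until every model/prompt combination has enough measured results."""
    combinations = models * prompts
    per_combination_per_day = tasks_per_day / combinations
    return math.ceil(samples_needed / per_combination_per_day) + feedback_delay_days


print(days_until_scored(models=10, prompts=1, tasks_per_day=200, samples_needed=100))  # 12
print(days_until_scored(models=10, prompts=5, tasks_per_day=200, samples_needed=100))  # 32
```

    Going from one prompt to five multiplies the collection time by five, and the measurement delay sits on top of that no matter what.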

    What about cost? Good question! I’m still not sure if cost affects the score, as a sort of multiplier, or whether there’s a cut-off cost and if a model exceeds it, it just gets disqualified. At the moment, since I’m in the can-AI-even-do-this phase, I’m going to ignore cost.
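
    For when I do get to it, the two options might look roughly like this. Both functions are hypothetical, and the specific multiplier (reference cost divided by actual cost, capped at 1) is just one of many possible choices.

```python
def score_with_multiplier(quality: float, cost: float, reference_cost: float) -> float:
    """Scale quality down in proportion to how far cost exceeds a reference."""
    return quality * min(1.0, reference_cost / cost)


def score_with_cutoff(quality: float, cost: float, max_cost: float) -> float:
    """Disqualify (score 0) any model that exceeds the cost limit."""
    return quality if cost <= max_cost else 0.0


# A model twice as expensive as the reference, with a quality score of 0.8:
print(score_with_multiplier(0.8, cost=2.0, reference_cost=1.0))  # 0.4
print(score_with_cutoff(0.8, cost=2.0, max_cost=1.0))            # 0.0
```

    The multiplier lets an expensive model stay in the mix if it is good enough to justify the price, while the cut-off is simpler to reason about but throws away a model the moment it crosses the line.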

  • A company’s overall goals can shape its culture. Take the popular “Employee of the Month” idea: when employees see it as meaningful, it becomes ingrained in the culture and can motivate higher performance.

    But these goals need to be achievable, and sometimes making them achievable is a small matter of phrasing. Let me show you an example. For Canva, where I work, tenure is important. One way to celebrate tenure would be by marking when people joined. For example, a company can choose to give hoodies like this:

    [Image: person wearing a gray hoodie with 'CLASS OF 2023' printed in white on the front, standing against a plain background.]

    The problem with this approach is that it would be impossible for someone to improve in this dimension. Anyone that was hired in 2024 will never achieve having been hired in 2023 no matter what they do.

    Instead, Canva celebrates Canvaversary (your anniversary of having joined Canva). It still transmits the same information: “tenure is valuable, tenure is important”. But the big difference is that every time you see someone display a Canvaversary badge with a number bigger than yours, it is a badge that you can acquire by staying around long enough. It is an achievable goal.

    My laptop now has these stickers:

    [Image: stickers on a laptop showing '1' and '2' for Canvaversary celebrations, along with a Pexels logo.]

    And I also have this beautiful pin:

    [Image: a commemorative Canvaversary badge featuring the text 'Happy Canvaversary' and the number '2', displayed on a blue background with decorative designs.]

    Oh, and at the five year mark they make a poster of you. They are really good. I’m looking forward to my 5 years at Canva poster.

  • Bring your own keys into an affiliate relationship

    I’m starting to observe a problem where a lot of LLM-enhanced apps are starting to pop up. For coding you have Cursor, but now there’s also a terminal called Warp and it costs $15/month. For individuals, consultants, small and even medium sized companies, this isn’t a workable pricing model. All apps were already turning into subscriptions and the cost of LLMs is accelerating that.

    What compounds the problem is that, because everyone feels uncomfortable with the potential surprise high bill of pay-for-what-you-use, many of these apps are charging a single monthly fee. A simple single flat fee, except that the cost of the LLMs is not flat. It grows linearly with usage. This means having to throttle the LLM usage to stay within the margin of the flat fee and have a chance at a profit. That means using cheaper models, which yield worse results on average.
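
    The squeeze is easy to see with a little arithmetic. In this sketch the $15 fee is the one mentioned above, while the $5 per million tokens is a made-up blended price, not any provider’s real rate. The function names are hypothetical too.

```python
def monthly_margin(flat_fee: float, tokens_used: float, cost_per_million: float) -> float:
    """Profit (or loss) on one subscriber for one month, in dollars."""
    return flat_fee - tokens_used / 1_000_000 * cost_per_million


def break_even_tokens(flat_fee: float, cost_per_million: float) -> float:
    """Monthly token usage at which the flat fee is fully spent on the LLM."""
    return flat_fee / cost_per_million * 1_000_000


FLAT_FEE = 15.0         # dollars per month
COST_PER_MILLION = 5.0  # hypothetical blended price per million tokens

print(break_even_tokens(FLAT_FEE, COST_PER_MILLION))          # 3000000.0
print(monthly_margin(FLAT_FEE, 1_000_000, COST_PER_MILLION))  # 10.0
print(monthly_margin(FLAT_FEE, 6_000_000, COST_PER_MILLION))  # -15.0
```

    Under these assumptions, every heavy user past the break-even point is a loss. The only levers left are throttling usage or lowering the cost per token, which in practice means a cheaper model.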

    I think this is why when I compare Cursor to Claude Code I find Claude Code to be better: I’m giving Anthropic a lot more money than Cursor. But also, I’m happy with that, because I can use Claude in many other ways, where Cursor is a single-use application.

    I think from now on, for each LLM powered application, I either want to be able to put my own keys in, or have pay-as-you-go with a lot of transparency. When I need it to work, I want it to work well. When I’m not using it, I want it to cost nothing.

    There’s another solution though. LLM providers could have affiliate systems where other companies get a commission for the token usage they generate. Using Warp as an example, Warp wouldn’t ask me for $15/month. Warp would ask me for my Claude API key. Warp would identify all those requests as caused by Warp, and then Anthropic would pay Warp a proportional fee for the usage generated.

    This is a win-win-win: Anthropic gets more token usage (more customer expansion), Warp gets more customers (I’m not paying $15/month, but I would plug in my key), and the user gets to have another LLM tool that otherwise they would not.

  • How Your Response Determines Your Growth

    I work at a pretty amazing company, Canva, that has a culture of feedback. I came in practicing Kim Scott’s Radical Candor, and Canva has been a strong environment for it. I think I have given and received more feedback in the two years I’ve been here than in the rest of my professional life put together. It has taught me something critical about the importance of a growth mindset. Managing several teams also gave me a lot of perspective here.

    I always thought a growth mindset would have an effect on what happens when you receive feedback, but now I’ve discovered it also has an effect on the frequency and complexity of the feedback. When I have a piece of feedback to give to someone, if that person has a growth mindset, I just give it.

    For the people with a fixed mindset, I know I’ll get objections, challenges, pushback, defensiveness. In those cases, for the feedback to be accepted, I need to collect evidence. I may need one clear case to base my feedback around, but then further ones to display the pattern. It takes a lot more work and effort, at a time when my calendar is back-to-back meetings and my to-do list keeps growing.

    What naturally happens, even if I try to fight it, is that for people with a growth mindset, I give feedback frequently, and for people with a fixed mindset, I drop the frequency. A side effect of that is: the size of the feedback stays lower the higher the frequency is. This means the pain of receiving that feedback is lower. Let’s not pretend that receiving growth feedback is not painful.

    I think a chart showing the two growth paths would help:

    The lesson here is that gracefully accepting feedback has a massive impact on how much of it you will get, and if feedback is a source of growth, then it’s extra valuable to be graceful. Possibly even when the feedback is not correct: saying “Oh, interesting point, I’d like to think about it” is the right strategy. This is a lesson I’m still learning.

  • I’ve been thinking about this one for a while. Imagine you are the CEO of a company and your competitors are getting ahead of you because they started to use ChatGPT to build their tech. You ask the CTO what’s going on and the CTO says “ChatGPT was not in my job description, so I ignored it”. How would you feel as the CEO? Would you think “Ok, fair enough for not putting it in the job description” or would your thoughts be a bit more… colorful?

    The CEO has no job description other than: “Making the company successful”. The CEO is responsible for everything. If tech fails, it is the CEO’s fault; if accounting is done wrong, it is the CEO’s fault; if marketing wastes money, it is the CEO’s fault. The buck stops there for everything.

    The CTO’s role is exactly that, but just the tech part. The job description should be “Do anything and everything in the tech domain to make the company successful”. Are the servers down? It is the CTO’s fault. Marketing doesn’t have the features they need for a campaign? It is the CTO’s fault. Ransomware destroys operations? It is the CTO’s fault. Tech is spending too much money? It is the CTO’s fault. Too much technical debt, too little, not delivering fast enough, devs with the wrong kind of skills, unhappy devs. Again… it all falls on the CTO’s shoulders. At least this is how I take on the role when I carry the title of CTO.

    This means the CTO is responsible for problems that nobody ever assigned to them. That’s one of the reasons this role is hard. And to be able to do such a role, the CTO needs autonomy, information, access, etc.

    Autonomy is achieved through budget authority. The CTO presents a budget to the CEO and CFO, who approve it, and the CTO then executes on it. Ideally, the CTO then receives periodic updates from the CFO comparing expenses to budget, and showing whether the company has the revenue to back that budget up. If the CTO overspends beyond the tech budget, that’s a problem, but if the company shrinks, that’s a problem too. In both cases the CTO should be proactively thinking about how to cut costs and manage expenses (before hitting a wall, having a massive layoff, etc).

    Information and access are achieved through a strong exec team, one that is all on the same page. That includes a clear vision from the CEO and a clear understanding of how all other departments are achieving their goals, and how tech helps or hinders them.

    So if there is a job description at all, it should be not for the role of the CTO, but rather for the company itself: what it means to achieve, and how it behaves to empower its C-Suite to further their ambitions.