What I do at Finegrain

published 2026-02-07

Note: Yes, I use em-dashes when I write, and no, this is not AI. As always I only use “AI” for spellchecking.

I have been at Finegrain for three years, so it is time for the next article in this series, where I share a bit about what I’m doing at work.

What Finegrain does

I am going to skip things we have done in the early days that have become irrelevant and focus on what we are doing now.

So first, we make image editing models. We train them, we evaluate them, and we ship them in a format customers can integrate. We have always focused on keeping the models small and efficient, initially so we could have a fast and cheap API, and now so they can run entirely on edge devices such as mobile phones.

We sometimes build end-user tools to demonstrate those models too. Among those you may have seen in our API era, there were a Web-based image editor, ComfyUI nodes, a chatbot for image editing, and an ad generation assistant. All those are gone, but we still have a few public Hugging Face Spaces. Now that our flagship product is the mobile SDK, we have a free mobile application to demonstrate it.

Now, let’s get a bit more into what I am working on exactly.

PHX, the server-side inference stack

The first important thing I have built is our server-side inference stack. What does this include?

A model is a bunch of layers, which basically means matrices of numbers corresponding to operations (multiplications, convolutions, etc) organized into a graph with non-linearities sprinkled in-between. The model takes tensors as input and returns tensors as output.
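
To make that concrete, here is a toy PyTorch sketch (purely illustrative, not one of our actual models): a couple of weight matrices with a non-linearity in between, taking a tensor in and giving a tensor out.

```python
import torch
from torch import nn

# A toy "model": two layers of weights with a non-linearity in between.
toy = nn.Sequential(
    nn.Linear(64, 128),  # a 64x128 matrix of numbers (plus a bias)
    nn.SiLU(),           # the non-linearity sprinkled in between
    nn.Linear(128, 64),
)

x = torch.randn(1, 64)   # tensor in
y = toy(x)               # tensor out, shape (1, 64)
```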

When you want to do something with a model, you need some preprocessing to feed it data and postprocessing to get the data back out. In the case of text, that often means tokenizing. In the case of latent image models that means encoding and decoding — which by the way also involve models.
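
As a rough sketch of that flow for a latent image model, where `vae` and `backbone` are stand-ins for real models rather than our actual API:

```python
import torch

def edit_image(image: torch.Tensor, vae, backbone) -> torch.Tensor:
    latents = vae.encode(image)   # preprocessing: pixels -> latent space
    edited = backbone(latents)    # the main model works on latents only
    return vae.decode(edited)     # postprocessing: latents -> pixels
```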

You also need a lot of things to optimize the model’s operation. That includes the mode of execution (compilation, quantization, numerical precision, etc). Crucially, in the case of models that you call several times in a loop such as diffusion / flow matching models, this also includes solvers (samplers / schedulers), which can be the most complex and math-heavy part of an inference pipeline.
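
To give an idea of what the simplest possible solver looks like, here is a plain Euler loop for a flow-matching model; `model(x, t)` is assumed to predict a velocity field, and real solvers get considerably more involved than this:

```python
import torch

def euler_solve(model, x: torch.Tensor, num_steps: int = 8) -> torch.Tensor:
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v = model(x, t)              # one function evaluation (NFE)
        x = x + (t_next - t) * v     # explicit Euler step
    return x
```

Every evaluation of `model` costs a full forward pass, which is why reducing the number of steps (and therefore the NFE) matters so much in practice.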

In our inference stack, which is called PHX (a reference to the Phénix airplane), all of this is packaged in a processor (which some call a pipeline). Processors can be used locally or deployed to Modal — which we use e.g. for model evaluation purposes — but in production they are deployed on GPU servers and exposed to callers using NVIDIA’s inference server Triton.
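
Callers talk to those deployed processors through Triton's standard Python HTTP client. The snippet below is only a sketch of that shape; the model name, tensor names and shapes are made up, not our actual interface:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical processor with one input tensor and one output tensor.
latents = np.random.rand(1, 4, 64, 64).astype(np.float32)
inp = httpclient.InferInput("latents", list(latents.shape), "FP32")
inp.set_data_from_numpy(latents)
out = httpclient.InferRequestedOutput("result")

response = client.infer(model_name="some_processor", inputs=[inp], outputs=[out])
result = response.as_numpy("result")
```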

Disciple, the Web backend

PHX is not exposed to customers or applications directly. We provide a Web API that is easier to call, asynchronous (because although our inference is fast it is still slower than a typical Web request), and deals with authentication and payment.

In addition to this API, we have a Web backend for customers, a bit of admin and monitoring… All of this lives in a modular monolith called Disciple, a reference to a character of the comic strip Léonard whose most famous line is “I serve science, and that is my joy.”

The stack is nothing too fancy; I picked technologies I had previous experience with and knew to be reliable. We use Quart with Blueprints and PostgreSQL. For asynchrony we use Nchan for SSE and Beanstalk for background jobs. For efficient image processing we rely on VIPS.
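
Here is a minimal sketch of what such an endpoint looks like in Quart; the route, payload and job id are made up for illustration, and the real thing queues work through Beanstalk and streams progress over SSE:

```python
from quart import Blueprint, Quart, jsonify, request

app = Quart(__name__)
api = Blueprint("api", __name__, url_prefix="/v1")

@api.route("/edit", methods=["POST"])
async def edit():
    payload = await request.get_json()
    # In the real backend the work is queued (Beanstalk) and the caller
    # follows progress asynchronously (SSE via Nchan); here we just
    # acknowledge the request with a made-up job id.
    return jsonify({"job_id": "hypothetical", "received": payload}), 202

app.register_blueprint(api)

if __name__ == "__main__":
    app.run()
```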

The iOS SDK

For some time now, the main focus of Finegrain has been running all our models on mobile devices directly. This means we had to port our inference code to run on iOS (for now) using Apple’s Neural Engine (ANE).

This involves, among other things, compiling and quantizing the model using Apple’s Core ML Tools. This is not a straightforward step; most models cannot be compiled out of the box and require architectural changes first. As for quantization, obtaining high average quantization ratios while retaining precision is a craft that would deserve its own blog post.
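
For the conversion part, a minimal sketch with Core ML Tools might look like the following; the toy module, shapes and quantization settings are illustrative, and the architectural rework mentioned above is exactly what this sketch glosses over:

```python
import coremltools as ct
import torch
from torch import nn

# Stand-in for a real (already ANE-friendly) model.
model = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1), nn.SiLU()).eval()
example = torch.randn(1, 4, 64, 64)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="latents", shape=example.shape)],
    minimum_deployment_target=ct.target.iOS17,
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # let it target the Neural Engine
)

# One of several possible weight compression options: 8-bit linear quantization.
op_config = ct.optimize.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
quantized = ct.optimize.coreml.linear_quantize_weights(mlmodel, config=config)
quantized.save("model.mlpackage")
```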

Once you have that model, you need the equivalent of PHX’s processors, but for the mobile device. I use PHX as the reference codebase, but porting to Swift on iOS is not as straightforward as one may think. On mobile anything can quickly become a bottleneck, and if you are not careful you end up spending more time doing pre- and post-processing than running the model. We use Apple’s Accelerate framework extensively to speed up those parts.

Science and hacks

All of this is advanced but straightforward engineering; if you want to be good in this field, though, you need to go beyond that and implement cutting-edge research or innovate.

A large part of it is in model training and the datasets we use, of course. I am hardly the person who trains the most models at the company — I do a little, mostly focusing on performance-related things. We all read papers in the domain and discuss approaches to try all the time, though. Personally, I have been studying the things “around” the main model a lot, including solvers, NFE (Number of Function Evaluations) reduction techniques, and recently auto-encoders.

In addition to that, there are a few ideas I have had and implemented that make a difference. The most important one is the set of techniques we use to edit high-resolution images with a model backbone that works at much lower resolution. We call them “fixes”, and let’s just say that having a background in signal processing and classical computer vision, as well as a good understanding of latent spaces, helps a lot there. I have spent — and still spend — a lot of time on this, and it is a crucial part of what we do, so if you were to ask me what I am most proud of so far, that would be it.

Another notable thing I have implemented recently is the ability to swap the capabilities (skills) of the Core ML model. Core ML has Multifunction Models, but those are normally built by compiling each function and merging them in a one-off step on a Mac. With the tooling I have designed, we can create new functions for a given backbone in pure PyTorch, and add / remove / swap functions in the model almost instantly on any machine, including an iPhone. This makes it possible, for instance, to update a single function in an application without re-downloading the whole model.
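
For reference, the standard Core ML Tools route (as I understand the coremltools 8 workflow, with made-up file and function names) looks roughly like this; the offline merge step it relies on is what our tooling sidesteps:

```python
from coremltools.utils import MultiFunctionDescriptor, save_multifunction

# Each function comes from a separately converted .mlpackage (hypothetical names).
desc = MultiFunctionDescriptor()
desc.add_function("skill_a.mlpackage", src_function_name="main", target_function_name="skill_a")
desc.add_function("skill_b.mlpackage", src_function_name="main", target_function_name="skill_b")
desc.default_function_name = "skill_a"
save_multifunction(desc, "combined.mlpackage")
```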

As always, I haven’t talked about everything I do. We are a small company and I am still a jack-of-all-trades kind of guy, so I always write tooling and fix bugs here and there. That should give you a decent idea of what my day-to-day is about. If something piqued your interest, feel free to get in touch.