Note: Yes, I use em-dashes when I write, and no, this is not AI. As always I only use “AI” for spellchecking.
I have been at Finegrain for three years, so it is time for the next article in this series where I share a bit about what I’m doing at work.
I am going to skip things we did in the early days that have become irrelevant and focus on what we are doing now.
So first, we make image editing models. We train them, we evaluate them, and we ship them in a format customers can integrate. We have always focused on keeping the models small and efficient, initially so we could have a fast and cheap API, and now so they can run entirely on edge devices such as mobile phones.
We sometimes build end-user tools to demonstrate those models too. Among those you may have seen in our API era, there was a Web-based image editor, ComfyUI nodes, a chatbot for image editing, and an ad generation assistant. All those are gone, but we still have a few public Hugging Face Spaces. Now that our flagship product is the mobile SDK, we have a free mobile application to demonstrate it.
Now, let’s get a bit more into what I am working on exactly.
The first important thing I have built is our server-side inference stack. What does this include?
A model is a bunch of layers, which basically means matrices of numbers corresponding to operations (multiplications, convolutions, etc.) organized into a graph with non-linearities sprinkled in between. The model takes tensors as input and returns tensors as output.
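To make this concrete, here is a minimal sketch in pure Python of a two-layer model: two matrix multiplications with a ReLU non-linearity in between. The weights and function names are made up for illustration; real models use tensor libraries and far larger matrices.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def relu(x):
    """Non-linearity: negative values become zero."""
    return [max(0.0, v) for v in x]


# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
W1 = [[1.0, -1.0], [0.5, 2.0]]  # 2 inputs -> 2 hidden values
W2 = [[1.0], [-1.0]]            # 2 hidden values -> 1 output


def model(x):
    """Tensor in, tensor out: linear layer, non-linearity, linear layer."""
    hidden = relu(matmul(x, W1))
    return matmul(hidden, W2)


print(model([2.0, 1.0]))  # [2.5]
```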
When you want to do something with a model, you need some preprocessing to feed it data and postprocessing to take it out. In the case of text, that often means tokenizing. In the case of latent image models that means encoding and decoding — which by the way also involve models.
You also need a lot of things to optimize the model’s operation. That includes the mode of execution (compilation, quantization, numerical precision, etc.). Crucially, in the case of models that you call several times in a loop, such as diffusion or flow matching models, this also includes solvers (samplers / schedulers), which can be the most complex and math-heavy part of an inference pipeline.
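To give an idea of what a solver does, here is a minimal sketch of the simplest one, a fixed-step Euler integrator, in pure Python. It is not what we run in production (practical solvers are usually higher-order or multi-step), and `toy_velocity` and `TARGET` are hypothetical stand-ins for the neural network. The point is that each step calls the model once, so the number of steps is the NFE.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Each step evaluates `velocity` exactly once, so NFE == num_steps.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [1.0, -2.0]  # hypothetical "clean sample" the toy field points to


def toy_velocity(x, t):
    """Stand-in for the model: a straight-line field from x towards TARGET."""
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, TARGET)]


sample = euler_sample(toy_velocity, [0.0, 0.0], num_steps=4)
```

With this toy field the trajectory is a straight line, so even a few Euler steps land on `TARGET` up to floating-point error. Real velocity fields are curved, which is why solver choice and step count matter.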
In our inference stack, which is called PHX (a reference to the Phénix airplane), all of this is packaged in a processor (which some call a pipeline). Processors can be used locally or deployed to Modal — which we use e.g. for model evaluation purposes — but in production they are deployed on GPU servers and exposed to callers using NVIDIA’s inference server Triton.
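As a rough illustration of the idea, and not PHX's actual API, a processor can be sketched as an object that wraps a model with its preprocessing and postprocessing. The class name, method names and the `invert` toy model below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

Tensor = List[float]


@dataclass
class Processor:
    """Bundles a model with the pre- and post-processing it needs."""

    model: Callable[[Tensor], Tensor]

    def preprocess(self, pixels: List[int]) -> Tensor:
        # Map 8-bit pixel values in [0, 255] to floats in [-1, 1].
        return [p / 127.5 - 1.0 for p in pixels]

    def postprocess(self, tensor: Tensor) -> List[int]:
        # Map back to [0, 255], rounding and clamping out-of-range values.
        return [min(255, max(0, round((v + 1.0) * 127.5))) for v in tensor]

    def __call__(self, pixels: List[int]) -> List[int]:
        return self.postprocess(self.model(self.preprocess(pixels)))


def invert(tensor: Tensor) -> Tensor:
    """Toy "model" that negates its input, which inverts the image."""
    return [-v for v in tensor]


processor = Processor(model=invert)
print(processor([0, 255, 51]))  # [255, 0, 204]
```

The caller only deals with pixels; everything tensor-related stays inside the processor, which is what makes it easy to run the same code locally or behind a server.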
PHX is not exposed to customers or applications directly. We provide a Web API that is easier to call, asynchronous (because although our inference is fast it is still slower than a typical Web request), and deals with authentication and payment.
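The asynchronous part can be sketched as follows: submitting a job returns an identifier immediately, and the result is retrieved later. This is a toy in-memory version using `asyncio`, with hypothetical names (`JobQueue`, `slow_inference`); it says nothing about how the real API is implemented.

```python
import asyncio
import itertools


class JobQueue:
    """Toy in-memory stand-in for an asynchronous inference API."""

    def __init__(self, worker):
        self._worker = worker  # async callable that does the slow work
        self._ids = itertools.count(1)
        self._tasks = {}

    def submit(self, payload):
        """Start the job in the background and return its id right away."""
        job_id = next(self._ids)
        self._tasks[job_id] = asyncio.create_task(self._worker(payload))
        return job_id

    def status(self, job_id):
        return "done" if self._tasks[job_id].done() else "pending"

    async def result(self, job_id):
        """Wait for the job to finish and return its output."""
        return await self._tasks[job_id]


async def slow_inference(payload):
    await asyncio.sleep(0.05)  # pretend the GPU is busy
    return payload.upper()


async def demo():
    queue = JobQueue(slow_inference)
    job_id = queue.submit("erase the lamp")
    first = queue.status(job_id)  # still pending right after submit
    output = await queue.result(job_id)
    return first, queue.status(job_id), output


print(asyncio.run(demo()))  # ('pending', 'done', 'ERASE THE LAMP')
```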
In addition to this API, we have a Web backend for customers, a bit of admin and monitoring… All of this lives in a modular monolith called Disciple, a reference to a character of the comic strip Léonard whose most famous line is “I serve science, and that is my joy.”
The stack is nothing too fancy; I picked technologies I had previous experience with and knew to be reliable. We use Quart with Blueprints and PostgreSQL. For asynchrony we use Nchan for SSE and Beanstalk for background jobs. For efficient image processing we rely on VIPS.
For some time now, the main focus of Finegrain has been running all our models directly on mobile devices. This means we had to port our inference code to run on iOS (for now) using Apple’s Neural Engine (ANE) and/or the GPU.
This involves, among other things, compiling and quantizing the model using Apple’s Core ML Tools. This is not a straightforward step; most models cannot be compiled out of the box and require architectural changes first. As for quantization, obtaining high average quantization ratios while retaining precision is a craft that would deserve its own blog post.
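To illustrate the trade-off behind quantization, here is a minimal sketch of symmetric per-tensor linear quantization in pure Python. It is only the textbook idea, not Core ML Tools' API or our actual recipe, and the weights are made up: fewer bits mean a coarser grid, hence a larger rounding error.

```python
def quantize(weights, num_bits=8):
    """Symmetric per-tensor linear quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [qi * scale for qi in q]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical layer weights

q8, scale8 = quantize(weights, num_bits=8)
q4, scale4 = quantize(weights, num_bits=4)

# Worst-case reconstruction error for each bit width.
err8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, scale8)))
err4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, scale4)))
```

Here `err4` is much larger than `err8`. In practice the craft lies in deciding which parts of the model tolerate the coarser grid.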
Once you have that model, you need the equivalent of PHX’s processors, but for the mobile device. I use PHX as the reference codebase, but porting to Swift on iOS is not as straightforward as one may think. On mobile, anything can quickly become a bottleneck, and if you are not careful you end up spending more time doing pre- and post-processing than running the model. We use Apple’s Accelerate framework extensively to speed up those parts.
Another notable thing I have implemented is the ability to swap the capabilities (skills) of the Core ML model. Core ML has Multifunction Models, but in theory they are built by compiling each function and merging them once on a Mac. With the tooling I have designed, we can create new functions for a given backbone in pure PyTorch, and add / remove / swap functions in the model almost instantly on any machine, including an iPhone. This makes it possible, for instance, to update a single function in an application without re-downloading the whole model and without waiting for device specialization.
Everything I have mentioned so far is important but fairly standard engineering. If you want to be notable in this field, you need to go beyond that and implement cutting-edge research or innovate.
Nobody in the current Finegrain core team is really a pure software engineer or a pure “AI researcher”. We all read papers in our field and discuss approaches to solve our problems. We each have our affinities, of course. Personally, I have been studying the things “around” the main model a lot, including solvers, NFE (Number of Function Evaluations) reduction methods and auto-encoders.
One thing I worked on that made a difference is what we call “fixes”: the set of techniques we use to edit high-resolution images with a model backbone that works at a lower intrinsic resolution. Having a background in signal processing and classical computer vision as well as a good understanding of latent spaces helps a lot there.
For the first two years I spent at Finegrain I used to say I did a bit of everything except training models. But ultimately my goal is to solve problems, and sometimes it does involve training. I am still by no means the person who trains the most on the team, but I do a little, mostly focusing on performance-related things.
For confidentiality reasons, I won’t get into details about that work, but I can say that we use (non-exclusively) Modal for serverless training and Trackio for experiment tracking. I wrote and open-sourced two tools to help with this: trackio-tool to merge and synchronize runs across remote machines and Modal environments, and multimodal to simply track the progress of several Modal jobs on the CLI. They’re vibe-coded, but I use them almost every day.
As usual, I haven’t talked about everything I do. We are a small company and I am still a jack-of-all-trades kind of guy, so I always write tooling and fix bugs here and there. Still, that should give you a decent idea of what my day-to-day is about. If something piqued your interest, feel free to get in touch.