What I wanted: ENB++ with genuine style. What we got: Instagram Filters.
The DLSS5 demo has tainted my hope for the future. My personal dream and vision of AI-enhanced rendering in games was dearly held, and quite different from what we are seeing.
Artists and gamers who’ve seen the demo share the same concern: highly capable diffusion models exert a homogenising influence. To quote the Forbes article referencing images of Resident Evil character Grace Ashcroft:
Commentators joked that DLSS 5 had “yassified” Grace, like an AI beauty filter made for social media.
Anyone who has tried to get an online image service to produce something other than an Instagram influencer snapshot will know what I mean. The expressive choices of the original artists – the specific way a character’s face was modelled, the intended mood of a scene’s lighting – risk being smoothed into the confident generic aesthetic that modern AI image models favour.
The core idea of DLSS5 – taking generated game frames and passing them through a diffusion-style layer to produce enhanced visuals – gives me cautious hope that this application of AI in gaming, which I had discussed and hoped for, is closer to realization than ever. The concern is that a mis-step in implementation might sour the whole idea before it can be adopted. History suggests that when a promising technology is deployed carelessly, the backlash attaches to the idea rather than the implementation, and the window doesn’t always reopen.
Ever since I first played EverQuest 1 I have wished for a future where games would look like their box art, like Dragon magazine covers, akin to a Caldwell, Elmore or Parkinson painting. Even then I imagined multiple “style” options being available for personal choice.
I imagined a game outputting the 3D world but with hair and cloaks blowing in the wind, skin rendered as a realistic oil painting, items depicted with material properties filtered through the rules of art as much as science.
And as soon as I saw Stable Diffusion I envisioned a future where frames could be run through such a process and enhanced with targeted AI, once we had sufficient GPU headroom to spare. It all seemed within reach.
The ultimate goal of computer games is not to be photographic; their styles are not failures to achieve realism.
Every game has a visual identity its artists fought hard to create. The hand-drawn look of Borderlands, the exaggerated physiques of 2011’s Brink, the sense of scale in Grounded, the strange beauty of Senua’s Sacrifice. The 3D artists, texture designers, animators and more all work to produce a look and feel, which we hope the AI inference layer will amplify and not homogenise.
What I had hoped for
I had imagined a metadata layer in addition to the image frame. This might look a bit like “seeing the Matrix” if we looked at it unprocessed! Give the diffusion model enough information about intent and it has less room to impose its own. A customized, highly trained checkpoint and LoRAs are other ways to constrain it.
The core vision was always that the game engine could output additional metadata relating to each pixel and object:
This is actor #ac05f3 in lighting #00bb43 with expression #505d67…
a comprehensive description of the intent of the scene. Not simply overpainting the frames, but maintaining consistency across a playthrough: the engine tells the enhancement layer what the game intended to depict, rather than relying on raw image-to-image results to pull off an enhancement conjuring trick.
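To make that concrete, here is a minimal sketch in Python of the kind of per-frame, per-object intent data an engine could hand to the enhancement layer. It is purely illustrative; every field name here is my own invention, not anything from NVIDIA or a real engine.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectIntent:
    """Per-object hints the engine could emit alongside the rendered frame."""
    object_id: str                 # stable ID, e.g. "actor:ac05f3", so the same
                                   # character is treated consistently across frames
    material: str                  # artist intent: "weathered leather", "pallid skin"
    style_tags: list[str] = field(default_factory=list)   # "oil painting", "muted palette"
    lora_ref: str | None = None    # optional per-game LoRA trained on this asset

@dataclass
class FrameIntent:
    """Scene-level intent for a single frame."""
    frame_index: int
    lighting_id: str               # e.g. "lighting:00bb43"
    mood: str                      # e.g. "torchlit crypt, low saturation"
    objects: list[ObjectIntent] = field(default_factory=list)

# One frame's worth of "seeing the Matrix": the intent data that would ride
# alongside the raw render into the diffusion layer.
frame_meta = FrameIntent(
    frame_index=1042,
    lighting_id="lighting:00bb43",
    mood="torchlit crypt, low saturation",
    objects=[
        ObjectIntent(
            object_id="actor:ac05f3",
            material="pallid skin, rendered as oil paint",
            style_tags=["Elmore-style portrait"],
            lora_ref="grace_ashcroft_v2",   # hypothetical per-character LoRA
        ),
    ],
)
```

The stable IDs are the important part: they are what would let the diffusion layer recognise that this is the same actor it drew last frame, rather than reinventing her face sixty times a second.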
Ideally each game would have its own well-trained custom checkpoint model, alongside LoRA-style adapters capturing the characters, objects and places of the world it depicts. Without that, would Deep Rock Galactic even render? A generic model has never seen a Glyphid. It has no concept of how Skyrim’s Draugr differ from generic zombies. How will it render them? We know it will try, even in deep ignorance. Without a custom checkpoint trained on each game’s specific visual vocabulary, the diffusion layer is painting confidently in a language it doesn’t know that it doesn’t speak.
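Purely as a sketch of what that could look like with today’s off-the-shelf tools, the snippet below uses the Hugging Face diffusers library to load a game-specific checkpoint plus a creature LoRA and apply a light image-to-image pass over a rendered frame. The checkpoint and LoRA names are hypothetical stand-ins; no such game-trained models exist.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Hypothetical game-trained checkpoint: a base model fine-tuned on the
# game's own concept art, textures and renders.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "studio/deep-rock-galactic-checkpoint",   # stand-in name, does not exist
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical LoRA so the model actually knows what a Glyphid is.
pipe.load_lora_weights("studio/glyphid-lora")

frame = Image.open("rendered_frame.png")      # the engine's raw output

enhanced = pipe(
    prompt="glyphid grunt in a bioluminescent cave, hand-painted sci-fi style",
    image=frame,
    strength=0.25,        # low strength: polish the render, don't repaint it
    guidance_scale=6.0,
).images[0]

enhanced.save("enhanced_frame.png")
```

The low strength value is the important knob: the closer it sits to zero, the more the output respects the original render instead of replacing it.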
These complexities are the main reasons I left the ideas to one side in my procrastination pile. If I’d ever had the expensive hardware to test it on, I like to think I would have experimented, but at the moment even demonstrations of working models require $10k of hardware, before any model-training costs are considered.
Is there a future?
My hope for DLSS5 as it develops is that it acts not as paintover-and-hope, but as a form of super-ENB – image enhancement and polish that stays close to the important details of the rendered output, maintaining the consistency of characters and objects and the intended properties of materials.
My fear is that it will lose characterization and consistency, reverting to the generic-looking AI output of the more recent highly polished checkpoints and systems. If that happens, the technology as a whole will be rejected by the gaming community and a truly powerful opportunity will be lost.
Whatever the current limitations or pitfalls, this is a big step forward on one of the features I was hoping AI would be used for: an uncontroversial, legitimate and ethical use case for the technology. If we can guide where it goes from here, we might end up with something wondrous.