Type the words “ZDNet superb reporting” into Nvidia’s new artificial intelligence demo, GauGAN 2, and you will see a picture of what looks like large pieces of foam insulation wrestling in a lake against a snowy backdrop.
Add more words, such as “ZDNet superb reporting comely,” and you’ll see the image morphed into something new, some barely recognizable form, perhaps a Formula One race car that has been digested, proceeding along what looks sort-of like a road, in front of blurry views of a man-made structure.
Roll the dice with a little button bearing an image of two dice, and the same phrase becomes a spooky, mist-shrouded landscape with a yawning mouth of some organic nature, completely unidentifiable as to its exact species.
Typing phrases is the latest way to control GauGAN, an algorithm developed by graphics chip giant Nvidia to showcase the state of the art of AI. The original GauGAN program was introduced in early 2019 as a way to draw and have the program automatically generate a photo-realistic image by filling in the drawing.
The term “GAN” in the name refers to a broad class of neural network programs, called generative adversarial networks, introduced in 2014 by Ian Goodfellow and colleagues. GANs pit two neural networks against each other: a generator produces output and steadily refines it, while a second network, the discriminator, tries to tell that output apart from real examples and so forces the generator to improve. The competitive nature of the back and forth is why they’re called “adversarial.”
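To make that back and forth concrete, here is a toy, hand-rolled sketch of the adversarial loop — emphatically not Nvidia’s code, just a one-dimensional illustration in which a linear generator learns to mimic data drawn from a Gaussian centered at 4, and a logistic discriminator tries to tell real from fake:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Clipped for numerical stability.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

# Toy 1-D GAN: real data ~ N(4, 1); generator g(z) = w*z + b,
# discriminator d(x) = sigmoid(a*x + c). Gradients are written
# out by hand so the adversarial loop is fully explicit.
w, b = 1.0, 0.0        # generator parameters
a, c = 0.1, 0.0        # discriminator parameters
lr = 0.02

for step in range(3000):
    # --- Discriminator step: push d(real) toward 1, d(fake) toward 0 ---
    x_real = rng.normal(4.0, 1.0)
    z = rng.normal()
    x_fake = w * z + b
    d_real = sigmoid(a * x_real + c)
    d_fake = sigmoid(a * x_fake + c)
    # Gradient descent on -[log d_real + log(1 - d_fake)]
    a -= lr * (-(1 - d_real) * x_real + d_fake * x_fake)
    c -= lr * (-(1 - d_real) + d_fake)

    # --- Generator step: try to make the discriminator output 1 ---
    z = rng.normal()
    x_fake = w * z + b
    d_fake = sigmoid(a * x_fake + c)
    grad_out = -(1 - d_fake) * a   # gradient of -log d_fake w.r.t. x_fake
    w -= lr * grad_out * z
    b -= lr * grad_out

samples = w * rng.normal(size=1000) + b
print(f"generated mean ≈ {samples.mean():.2f} (real data mean is 4.0)")
```

After a few thousand alternating updates, the generator’s output distribution drifts toward the real data — the same dynamic, at vastly larger scale and with images instead of scalars, that drives GauGAN.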
Nvidia has done pioneering work extending GANs, including the introduction in 2018 of “StyleGAN,” which made it possible to generate highly realistic fake photos of people. In that work, the neural network “learned” high-level aspects of faces and also low-level aspects, such as skin tone.
In the original GauGAN from 2019, Nvidia used a similar approach, letting one draw a landscape as labeled areas, known as a segmentation map. Those high-level abstractions, such as lakes, rivers and fields, became a structural template, and the GauGAN program would then fill in the drawn segmentation map with real-world forms.
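A segmentation map is nothing more exotic than a grid of class labels, one per pixel. A minimal sketch (the label names here are illustrative, not Nvidia’s actual class ids) shows the idea:

```python
import numpy as np

# Hypothetical label ids for three landscape classes.
SKY, WATER, FIELD = 0, 1, 2

# A segmentation map: one class label per pixel of a tiny 6x8 canvas.
seg = np.zeros((6, 8), dtype=np.uint8)   # start with all sky
seg[3:5, :] = FIELD                      # a middle band of field
seg[5:, :] = WATER                       # a strip of water at the bottom

# A GauGAN-style model typically consumes the map one-hot encoded,
# then synthesizes a photo-realistic image with matching regions.
one_hot = np.eye(3, dtype=np.float32)[seg]
print(one_hot.shape)   # one channel per class
```

The generator’s job is to replace each labeled region with plausible photographic texture while respecting the boundaries the user drew.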
Version two of the program has been updated to handle language. The intention is that one will prompt GauGAN 2 with sensible phrases, things pertaining to landscapes, such as “coast ripples cliffs.” The GauGAN 2 program will respond by generating a realistic-looking scene that matches that input.
The program was trained, says Nvidia, on 10 million high-quality landscape images, using the Selene supercomputer built from Nvidia GPUs.
A segmentation map can also be generated automatically, allowing one to go back and edit the layout of the landscape in the way the original GauGAN allowed.
As Nvidia describes GauGAN 2 in a blog post, the combination of text, image and segmentation map is a breakthrough in multi-modal AI:
GauGAN2 combines segmentation mapping, inpainting and text-to-image generation in a single model, making it a powerful tool to create photorealistic art with a mix of words and drawings. The demo is one of the first to combine multiple modalities — text, semantic segmentation, sketch and style — within a single GAN framework. This makes it faster and easier to turn an artist’s vision into a high-quality AI-generated image.
The practical benefit, says Nvidia, is that one can use a few words to get a basic image together without any drawing at all and then tweak details to refine the final output.
But adding words that don’t have anything to do with landscapes, such as “ZDNet,” starts to generate crazy artefacts that have at times revolting freakishness, and at times appalling beauty — depending on your taste. In the terminology of deep learning, the freakish images produced by nonsense phrases result from the program having to grapple with language that is “out of distribution,” meaning not captured in the training data fed to the machine. Faced with irreconcilable phrases, the program struggles to match an image to the phrase.
As can be seen in a series of images, the phrase “coast ripples cliffs” produces a very faithful image at first. Adding impertinent qualifiers — “bicycle,” “New York City,” the name “Cassandra” — starts to shift and shape the landscape in strange ways.
Even more interesting things happen when all the landscape words are removed, leaving only the nonsense. Strange, futuristic landscapes or multi-colored amoebae come into view.
The experiment can be taken even further with extended phrases that are suggestive without exactly being descriptive. Try feeding in the opening of T.S. Eliot’s poem The Waste Land, “April is the cruellest month, breeding lilacs out of the dead land.”
The result is some striking images that are, in fact, somewhat appropriate. As one rolls the dice, many variants of appropriate landscapes emerge, with only slight artefacts in some cases.
Thanks to the innovations of StyleGAN, GauGAN is able to apply a style to the image, to essentially condition the output to be in the form of some other image, rather like a mash-up.
The application of style to Eliot’s poem distorts the faithful landscape images beyond recognition. Once again, a whole host of weird objects appear with a kind of sickening organic quality to some of them, others merely broken shards of what was once an image.
One can also submit images and even draw on GauGAN 2. Submitting an old photograph taken at Þingvellir, the site of the ancient Icelandic parliament, didn’t do much. The image remained mostly untransformed, in limited testing.
Adding the word “Þingvellir,” however, produced a realistic-enough landscape that was in keeping with the Þingvellir site.
Adding the word “volcano” produced a striking alternative landscape, less realistic, more surreal.
Adding an impertinent word, such as “Technology,” further shook up the landscape, adding strange nonsense figures.
Rather than submit a photo of a landscape, one can draw, as in the original GauGAN. Again, choosing something not in keeping with the demo — a drawing not of a landscape but of a person’s head — produces more interesting results. The face can be re-skinned, if you will, using the mash-up function. Rolling the dice produced interesting variations.
Combining the drawing with the word “Þingvellir” produced subtle changes, as did adding additional words such as “volcano” and “rift.” The image was re-skinned to have a kind of volcano-like texture.
Note that the user interface of the app can be hard to scroll in desktop browsers. For some reason, it seems to work better in a tablet browser, such as an iPad.