Once upon a dream
About 20 years ago I wrote a poem which sadly I can't find, but the gist of it was the wonder I held at the fact that I had dreamed the night before, a dream so vivid as to be mistaken for reality by my sleeping consciousness. I awoke that day so moved by the visceral nature of the dream that I penned the poem. What makes a dream seem real? A combination of factors owing to how closely it aligns, spatially and temporally, with how the world evolves during our waking states. For one, time beats at roughly the same pace...we don't experience the dramatic speeding up or slowing down of time that can often happen in the dream state. Another obvious tell is that in dreams the laws of physics are often either nonexistent or dynamically pliable in a way that, to our waking minds, obviously puts them in the realm of fantasy. Other aspects stand out, like how things look or sound or taste.
Ultimately, the combination of flaws in our brain's ability to properly correlate a dreamed event with how it would proceed in reality is what gives the dream away as unreal...yet it is quite remarkable that we have dreams so vivid that we often fail to discern them as dreams until after we've woken up.
How is a few pounds of warm brain matter able to create such a convincing simulation of reality while in as low a power state as sleep (demonstrably lower than our waking state)?
Computer generated drawings
The field of computer generated imagery goes back more than four decades. The first widely reported work in this regard is credited to Ed Catmull, who has been at Pixar since its inception and whose contributions to the growing field of digital image synthesis have been countless since he presented his dissertation on a method of rendering imagery using computer algorithms back in 1973. A common theme from that time to the current age of photorealistic computer graphics, however, has been a consistent progression of steps:
1) A model or models of a physical process or phenomenon is constructed, leveraging mathematics and physics to ensure the model maps closely to a realistic expression of that phenomenon. These models include algorithms for global illumination (lighting), physical modeling (surfaces and their geometry), and physical reflection characteristics (translucence, reflection, anisotropy, texture mapping, etc.). They also include models for time evolved processes, like physical models of object interaction under gravity or friction, or the interaction of rendered objects with one another...like water with hair or fire with gasoline.
2) An optimization of the model for the current state of computer hardware, allowing that hardware to execute the model and generate output at a tractable time and cost. This has been a huge part of the field, with increasingly sophisticated mathematical approaches used to minimize the cost of calculating the evolution of a given model, allowing as realistic an output as possible at a minimized cost. Here the strongly physical and mathematical models are collapsed into simpler numerical approximations that are readily discretized and computed across clusters of computers, trading off computational cost and rendering time against extracted realism (a toy sketch of steps 1 and 2 appears after this list).
3) The gathering of the computational resources required to allow the optimized model to run on real hardware and satisfy the time and cost targets set in 2. This is also an important part of the operations stage: linking together clusters of machines and loading them with the optimized models so that high realism frames can be rendered efficiently under time constraints. The advance of computational power under Moore's law enabled a very long run of continued gains that could be leveraged for reduced cost and time...but the waning of those gains now pushes the desire for more realistic output against hard physical limits that necessitate new approaches to extract more realism without incurring greater computational or rendering time costs.
4) The construction of temporal workflows for generating the desired scene output within the generated model, often involving manual modeling of objects within the scene and manual placement and animation of those models using methods like forward and inverse kinematics. The last 15 years have seen a dramatic move toward more automated processes for modeling and animating large crowds of agents, letting the internal degrees of freedom and relative attributes of those agents determine how they can or cannot move in the context of the animation. Still, the large themes are generally hand constructed by human animators...the cutting edge work leverages models for automatically generating object attributes like the geography of planets or moons, or even generating realistic cityscapes. Large aspects of this work still involve manual human labor, as in, for example, the animating of human faces for performance.
5) Finally, with all that work done, the final step of rendering proceeds...where all the elements come together to generate realistic output of the desired objects and agents according to the temporal workflows of the scenes being generated.
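To make steps 1 and 2 a bit more concrete, here's a toy sketch (in Python, with sizes and values of my own choosing, not anything a production renderer would actually use) of a physically motivated illumination model, Lambertian diffuse reflection, collapsed into a cheap discretized per-pixel computation. Real renderers use vastly richer models, but the shape of the work is the same: physics, then a numerical approximation, then a per-pixel evaluation.

```python
import numpy as np

# Toy version of steps 1 and 2: a physical illumination model (Lambertian
# diffuse reflection) reduced to a cheap, discretized per-pixel computation.
# All sizes and values here are illustrative.

def lambertian_shade(normals, light_dir, albedo=0.8):
    """Diffuse intensity = albedo * max(0, N . L) for each pixel's surface normal."""
    light_dir = light_dir / np.linalg.norm(light_dir)
    ndotl = np.einsum('ijk,k->ij', normals, light_dir)  # per-pixel dot product
    return albedo * np.clip(ndotl, 0.0, 1.0)

# Analytic unit normals for a sphere filling a 256x256 image plane.
h = w = 256
ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing='ij')
r2 = xs**2 + ys**2
zs = np.sqrt(np.clip(1.0 - r2, 0.0, None))
normals = np.dstack([xs, ys, zs])
normals[r2 > 1.0] = 0.0  # background pixels get zero normals (and zero shading)

image = lambertian_shade(normals, light_dir=np.array([0.5, 0.5, 1.0]))
print(image.shape, float(image.max()))  # (256, 256) and a peak value near the albedo
```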
It is quite remarkable that in just 44 years we've come this far, but the labor involved in this method of generating computer images is clear. Labor in the theoretical step of generating mathematical and physical models; labor in the tricky process of optimizing those models to make them practical for current hardware capabilities at acceptable cost; labor in the construction of massive server clusters of highly parallel machines with custom GPUs optimized to run the optimized model mathematics, allowing efficient computation of the many parameters of a given scene; labor in the actual temporal workflow of the objects and agents in the scene and how they should evolve over time to present a scene or story; and finally the actual computational muscle of churning through all the input data to output frame after frame of realistic images...as well as a final QA pass and possible re-rendering of output should artifacts be discovered.
How dreaming is rendering
Given the incredible challenge posed by realistic image synthesis described above, a question comes to mind: how is it that our brains...cold as they are relative to the thousands of processor and memory cores churning for weeks on end to render a few thousand frames for a given CGI task...can be so successful at fooling us when we dream vivid dreams? What magic is the brain engaging that allows not just the accurate visual reproduction of objects within the dream state but also how those objects feel, how they sound, how they smell or even taste? It turns out that the rendering our brains do uses a very different approach from the complex process of CGI generation described above. Our brains do not rely on active computational processing to generate how dreamed scenes evolve over time; they instead leverage massive stored information about previously experienced object, process and event states, tying those recalled states together so efficiently that they can do it while the brain is at its lowest working power, during sleep, and achieve photorealistic results in a fraction of the time needed to render comparably complex scenes in CGI.
Along came a CNN
We know that the brain works by synthesizing memory and making predictions from memory because the work of neuroscientists over the last 20 years has illuminated this fact quite brilliantly, thanks to the use of fMRI imaging. The ability to track thoughts as they emerge and move through the brain is a reality. Today we even have the ability to take thoughts of visual imagery and use complex analysis of blood flow through the visual cortex to generate a prediction of the imagined image. This latter work has been done by using convolutional neural networks in conjunction with other machine learning models to extract features from brain state data sets so as to recreate real objects...if only as rough representations.
Looking at how CNNs make accurate predictions about objects in images shows how computer graphics will soon move away from the formal, labor intensive process described earlier and toward an approach that uses neural network models and massive trained memory configurations.
CNNs work by essentially storing how images look at multiple feature scales for all types of entities and elements in those images. Research is still investigating exactly what a given trained network "sees" at any given layer, but output for some networks after training shows clear stratification of fundamental attributes of a given training data set. For example, the first largely successful results for CNNs were achieved after training on a simple hand drawn digit data set called MNIST. These characters were trained into the network, and upon examination of the layers, attributes of the digits could be seen at each layer. Some layers specialized in horizontal lines, others in vertical lines, others in combinations of horizontal and vertical lines but not curves, others in curves. This feature decomposition then allowed the forming of a massive primitive feature data set that could be used to correctly predict that new images of different hand drawn characters were associated with members of the trained data set.
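As a rough illustration of the kind of network being described, here's a minimal sketch (the layer sizes and names are my own illustrative choices, not anyone's reference implementation): a couple of convolutional layers whose early filters tend to learn stroke-like features and whose deeper layers compose them into digit-level features, followed by a classifier.

```python
import torch
import torch.nn as nn

# Minimal sketch of a small convolutional network for 28x28 MNIST-style digits.
# Early filters tend to learn edge/stroke primitives; deeper layers compose them.
class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # low-level, edge-like filters
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # compositions of strokes
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallConvNet()
dummy = torch.randn(8, 1, 28, 28)   # a batch of 28x28 grayscale digit images
logits = model(dummy)
print(logits.shape)                 # torch.Size([8, 10])
```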
This building of a primitive feature set is what then allows near instantaneous identification of new objects, because in truth the network is not identifying anything...it is making a prediction about what it is looking at after comparing the new object against its trained dataset of built up primitive features. This is how such networks can be trained to identify a "face" as distinct from a "human face" or a "cat face", and have since been refined beyond "face", "human face" or "cat face" to identify specific individuals. This is enabled by having more input training data for specific faces, which allows the building of even more elaborate primitives for the features of particular human faces, not just faces in general. Further, beyond faces, these models are able to make predictions not just of what things are in images but of what they are independent of their orientation...an upside down face is still a face, a car is a car whether it is coming, going or parked sideways...this is possible because the learned primitive features cover the varied orientations present in the training data.
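The "prediction by comparison to stored primitives" idea can be sketched roughly like this (a hypothetical toy with an untrained stand-in feature extractor; in practice the extractor would be the penultimate layer of a trained network): embed stored examples and a new image into the same feature space, then predict the identity of the closest stored feature.

```python
import torch
import torch.nn.functional as F

# Sketch of prediction as comparison: a new image is not recognized from
# scratch; its feature vector is compared against feature vectors stored
# during training. The extractor below is only a structural stand-in.

def embed(images, extractor):
    """Map images to L2-normalized feature vectors."""
    with torch.no_grad():
        feats = extractor(images).flatten(1)
    return F.normalize(feats, dim=1)

# Stand-in extractor: in practice this would be a trained CNN backbone.
extractor = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(4),
)

stored = embed(torch.randn(100, 1, 28, 28), extractor)   # "memory" of known examples
labels = torch.randint(0, 10, (100,))                    # their known identities

query = embed(torch.randn(1, 1, 28, 28), extractor)      # a new, unseen image
similarity = query @ stored.T                            # cosine similarity to memory
prediction = labels[similarity.argmax()]                 # closest stored feature wins
print(prediction.item())
```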
The progression of this ability should ring some bells, as it strongly resembles an ability we have. A human can be given a new object, like a cup, and with a single observation be able to recognize that same object later without having seen it from the various orientations in which it may appear...what is interesting is that though this ability appears to us to be innate...it is actually learned. Our ability to recognize the many different common states our visual field creates for objects is built up over years of experience as we grow from babies to toddlers and beyond. We further have an ability to leverage aspects of our experience outside the visual system, correlating the behavior of objects with their images so as to make highly accurate predictions despite not having had direct experience.
This ability to cross correlate features across dimensions of sensory input is a massive key to how we reduce the complexity of making predictions about new entity and object states from sparse information about those entities. It's how we can observe a fraction of a second of a moving ball and roughly gauge its speed and direction. It's how we can be told that an object is made of a given material and know how to touch it so as not to damage it or be damaged by it. It's how we can be shown an object and then be told to draw it from a viewpoint we've never physically seen, inferring its appearance from a hypothesized view perspective and position.
Rerendering
So what is the future of rendering? It is essentially to use neural networks to perform the costly aspects listed earlier and then store their output states so that deep primitive feature sets are held in memory. Once those sets are stored, the ability to render photorealistic new images in real time, given only descriptive information, will be trivial.
Here's an example of how that would work:
1) A world generation and illumination model set is brought to bear on a given simulation: a model that defines the physics and geometry of a planet or moon...complete with atmospheric and weather participation...and day/night cycles enabled by an illumination model.
2) Geometric primitives are dropped into this simulated world, and a cycle of renderings of those primitives (ideally at low resolution) is used to generate a massive primitive feature set: stored examples of how the primitives look from given geometric viewpoints relative to a world camera, as well as how they are illuminated from countless light positions in that world.
3) This massive data set is fed into a very deep neural network which can highly characterize not just the geometric aspects of the primitive objects but how those are transformed by the illumination data. The purpose is to allow the trained model to create de novo illuminations of new geometric primitives...essentially to allow it to dream or imagine how a new object would look given how the old object looked as it remembers it.
4) Changing the physical attributes of the geometric primitives would then allow deeper characterization of the imagined output. For example, training on a geometric primitive made of glass or plastic or steel could then allow new primitives to be imagined. Adding more and more attributes to the trained simulations will dramatically multiply the possibilities for "imagined" output while, in theory, preserving accurate geometric detail for a sufficiently dense training under simulation.
5) Training on temporal workflow events, like objects falling or shattering depending on their material composition, or even evolving over time...like the growing of a seed into a tree...can all be deep trained into a sufficiently deep network, allowing it to "imagine" increasingly detailed novel output. (A rough sketch of the training loop these steps imply follows this list.)
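Here's a rough sketch of the training loop implied by steps 1 through 5, in the spirit of an encoder that memorizes observed renderings and a decoder that "imagines" the scene from an unseen viewpoint. The simulator inputs are faked with random tensors, and all names, sizes and the 7-number viewpoint encoding are my own assumptions, not a real API or anyone's published architecture.

```python
import torch
import torch.nn as nn

# Sketch: encode (rendered image, camera/light parameters) pairs from a
# simulated world into a scene memory, then ask the network to "imagine"
# the rendering from a viewpoint it has not seen. Illustrative only.

class ViewEncoder(nn.Module):
    def __init__(self, view_dim=7, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64 + view_dim, feat_dim)

    def forward(self, image, view):
        f = self.conv(image).flatten(1)
        return self.fc(torch.cat([f, view], dim=1))

class ViewDecoder(nn.Module):
    def __init__(self, view_dim=7, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim + view_dim, 64 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid()  # 32 -> 64
        )

    def forward(self, scene_code, query_view):
        x = self.fc(torch.cat([scene_code, query_view], dim=1))
        return self.deconv(x.view(-1, 64, 16, 16))

encoder, decoder = ViewEncoder(), ViewDecoder()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

# Placeholder for a real simulator: random "renders" plus camera/light parameters.
context_img, context_view = torch.rand(8, 3, 64, 64), torch.rand(8, 7)
target_img, target_view = torch.rand(8, 3, 64, 64), torch.rand(8, 7)

optimizer.zero_grad()
scene_code = encoder(context_img, context_view)        # memorize the observed view
predicted = decoder(scene_code, target_view)           # imagine the unseen view
loss = nn.functional.mse_loss(predicted, target_img)   # compare against the simulator's render
loss.backward()
optimizer.step()
print(loss.item())
```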
A main advantage of this approach to rendering is that though the training cost is high, over time that cost is paid back by the generation of increasingly realistic imagined output exhibiting the requested scene dynamics, output that can be generated nearly instantly at very low computational cost...just as a CNN's predictions are cheap relative to its training stage, and just as a sleeping brain is able to create a hyper realistic dreamscape while in its power saving mode.
I imagine that 20 years from now computer graphics generation will be more a conversational process than a computational one. Humans will basically tell the computer what they'd like the scenes to contain, how they should be illuminated, and how to populate them with primitives...the computer models will understand the instructions and generate output in real time matching the request, drawing on 20 years of stored deep memory and the ability to transform that stored memory when given new data cues...just as we are able to be shown a cup once and then draw it at various angles.
Comments
Incredibly, Google DeepMind just published a paper presenting a model that is essentially the first solution along the lines detailed in this post.
I am quite amazed at how similar their methodology is to what I described in this post 6 months ago.
Here's the paper.
https://drive.google.com/file/d/1yhXQqznJmfDiA17GtRhYSOT4TTlAaSxh/view?usp=sharing