Did a couple of posts on this already and wanted to continue the train of thought on the idea of handheld controllers for navigating the latent spaces of image diffusion models like Midjourney (though in theory the same concepts should apply to navigating any high-dimensional space).

When this idea first landed, I worked extensively with ChatGPT v4 to flesh the concept out in a bit more detail, but I never captured that anywhere public. So here it is…

Core design challenges

After sketching out the general idea so we were both on the same page, I asked CGPT what the core design challenges were. I won’t include every little piece of the response, but selected highlights are below.

The core design challenge in this project is indeed interpreting a high-dimensional latent space into a physically navigable form using a handheld controller with limited input options. Mapping the complexity of a multi-dimensional latent space onto a device with a restricted number of controls, while maintaining a smooth and intuitive user experience, can be difficult.

Some specific challenges include:

  1. Dimensionality reduction: Latent spaces of image diffusion models can have hundreds of dimensions. The controller needs to facilitate navigation through these dimensions effectively while only using a few input controls (joysticks, dials, and buttons). Developing a suitable method for dimensionality reduction, such as PCA or t-SNE, that retains meaningful features and provides an intuitive navigation experience, is essential.
  2. Interpolation and navigation: Designing a system that enables smooth transitions between images while navigating the latent space is challenging. Interpolation techniques, such as linear or spherical interpolation, must be carefully chosen and optimized for the best results.

I’m not actually sure what all of that means, but following those trails led to some interesting and useful discoveries. I won’t bother trying to explain it and will instead pass the baton to CGPT.

Dimensionality Reduction

ChatGPT says on this topic:

Dimensionality reduction is a technique used to reduce the number of variables or dimensions in a dataset while preserving the relationships and structures within the data. In the context of your project, dimensionality reduction would be used to simplify the high-dimensional latent space of image diffusion models into a lower-dimensional representation that can be effectively controlled and navigated using the handheld controller.

PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are two common dimensionality reduction techniques.

PCA is a linear technique that identifies the directions (principal components) in the high-dimensional space where the variance is maximized. By selecting the top principal components, you can project the high-dimensional data onto a lower-dimensional space while preserving most of the data’s structure.

t-SNE is a nonlinear technique that aims to maintain the local structure of the high-dimensional data in the lower-dimensional space. It’s particularly useful for visualizing high-dimensional data in a 2D or 3D space, as it tends to cluster similar data points together.

In your project, you would need to develop a suitable method for dimensionality reduction that:

  1. Retains meaningful features: The chosen technique should preserve important relationships and structures in the latent space, so that the reduced representation remains meaningful and allows for intuitive navigation.
  2. Provides an intuitive navigation experience: The reduced representation should be easy to navigate with the handheld controller. Users should be able to explore the latent space and understand the relationship between different points in this space.
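To make that a bit more concrete, here’s a minimal sketch (in Python, using scikit-learn) of what the PCA route might look like: fit PCA on a bank of latent vectors sampled from the model, then treat each physical control as movement along one principal component. The latent size, sample count, and control count below are placeholder assumptions, not any particular model’s real numbers.

```python
# Minimal PCA sketch: reduce a bank of sampled latent vectors to a handful
# of axes that joysticks/dials can plausibly drive. Shapes are assumptions.
import numpy as np
from sklearn.decomposition import PCA

LATENT_DIM = 512          # assumed latent size
N_SAMPLES = 2000          # latents sampled from the model's prior
N_CONTROLS = 6            # e.g. two joystick axes + four dials

# Stand-in for "sample latent vectors from the diffusion model's prior"
latents = np.random.randn(N_SAMPLES, LATENT_DIM)

pca = PCA(n_components=N_CONTROLS)
pca.fit(latents)

def controls_to_latent(control_values, origin):
    """Map N_CONTROLS control deflections (-1..1) to a full latent vector,
    treating each control as movement along one principal component."""
    offset = pca.components_.T @ np.asarray(control_values)   # (LATENT_DIM,)
    return origin + offset

current = latents[0]
print(controls_to_latent([0.2, -0.5, 0.0, 0.0, 1.0, 0.0], current).shape)
```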

Are there any existing controllers for high dimensional spaces?

Before spending a ton of time reinventing the wheel here, I asked if it knew of any existing controllers that could already handle this kind of high-dimensional navigation. It gave these ideas:

While there isn’t a specific controller designed solely for navigating high-dimensional spaces, several existing devices can be adapted or customized for this purpose. Some examples include:

  1. 3Dconnexion SpaceMouse: This 3D controller is designed primarily for navigating 3D environments in CAD, modeling, and design software. It features a pressure-sensitive puck that can be pushed, pulled, twisted, or tilted to control movement in up to six degrees of freedom (X, Y, Z, pitch, roll, and yaw). Although it was not designed specifically for high-dimensional latent spaces, it could potentially be adapted for such a purpose with custom software.
  2. MIDI controllers: MIDI controllers are used to control music production software and instruments, but they often feature a variety of knobs, sliders, and buttons that can be mapped to different parameters. These controllers could potentially be adapted for navigating high-dimensional spaces by assigning each control to a specific dimension or function.

The SpaceMouse actually looks pretty cool for 3D work like CAD:

And I saw elsewhere that those buttons are all assignable to keyboard shortcuts, so perhaps there’s something there… I’ve also experimented with using small MIDI keyboards as custom macro controllers in the past (basically just as an assignable bank of buttons) and found that pretty useful too. I also find the idea that music itself might just be a high-dimensional space pretty intriguing.
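As a rough illustration of the MIDI idea, here’s a sketch that maps a few CC knobs onto latent dimensions. It assumes the Python mido library and a class-compliant MIDI device; the CC-to-dimension assignment and the latent size are purely hypothetical.

```python
# Rough sketch: read a MIDI controller's knobs and treat each CC number as
# one latent dimension. Assumes the `mido` package plus a working MIDI
# backend; the latent update itself is illustrative only.
import mido
import numpy as np

LATENT_DIM = 512
position = np.zeros(LATENT_DIM)

# Map MIDI CC numbers (knobs/sliders) to latent dimension indices (hypothetical).
CC_TO_DIMENSION = {1: 0, 2: 1, 3: 2, 4: 3}

with mido.open_input() as port:              # first available MIDI input
    for msg in port:
        if msg.type == "control_change" and msg.control in CC_TO_DIMENSION:
            dim = CC_TO_DIMENSION[msg.control]
            # MIDI CC values are 0-127; recenter to roughly -1..1
            position[dim] = (msg.value - 64) / 64.0
            print(f"dim {dim} -> {position[dim]:+.2f}")
```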

User experience moving through High-D space

I then went in the direction of trying to nail down more specifics about what the UX might be like as you move from point to point within a latent or high-dimensional space. Here are some excerpts:

Navigating through a latent space using a physical controller can be an immersive and engaging experience for the user. Here’s a step-by-step walkthrough of what the user experience might be like:

  1. Initial setup: The user connects the handheld controller to the software interface running on their computer. The software interface presents a visual representation of the latent space, with the current position highlighted or centered.
  2. Starting point: The user begins at a specific point in the latent space, which corresponds to a unique image generated by the image diffusion model.
  3. Exploration: Using the handheld controller’s inputs (joysticks, dials, and buttons), the user navigates through the latent space by moving along different dimensions. As they move, the software interface updates the displayed image in real-time, providing a fluid and responsive experience.
  4. Discovery: As the user navigates through the latent space, they will encounter images with varying degrees of similarity to their starting point. The user can find interesting, unique, or aesthetically pleasing images during their exploration.
  5. Adjustments: The user can use the controller’s dials to adjust settings like interpolation speed, image resolution, or level of detail. These adjustments can help fine-tune the navigation experience, making it more enjoyable and tailored to the user’s preferences.
  6. Saving and organization: If the user discovers an interesting image, they can press a dedicated button on the controller to save it to their image library. The software interface can include organizational tools, like folders or tags, to help users manage their saved images.
  7. Model or category switching: The user can press a button on the controller to toggle between different image diffusion models or categories, allowing them to explore various aspects of the latent space.

I’ll come back to this notion of categories in a bit.
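Distilled down, the exploration loop in that walkthrough might look something like the sketch below. The read_controller, controls_to_direction, and decode_latent functions are hypothetical stand-ins for real device and model code.

```python
# Skeleton of the exploration loop: read the controller, nudge the current
# latent position, decode it to an image, repeat. All functions here are
# placeholders for real device/model integrations.
import numpy as np

LATENT_DIM = 512      # assumed latent size
STEP_SIZE = 0.05      # how far one tick of input moves through the space

def read_controller():
    """Stand-in: return deflections of 6 physical controls, each in -1..1."""
    return np.zeros(6)

def controls_to_direction(controls):
    """Stand-in: map 6 control values to a LATENT_DIM direction vector
    (e.g. via the PCA mapping sketched earlier)."""
    direction = np.zeros(LATENT_DIM)
    direction[: len(controls)] = controls
    return direction

def decode_latent(z):
    """Stand-in: run the diffusion model's decoder on latent z, return an image."""
    return None

position = np.random.randn(LATENT_DIM)
saved_points = []

for _ in range(1000):                      # the real app would loop until quit
    controls = read_controller()
    position = position + STEP_SIZE * controls_to_direction(controls)
    image = decode_latent(position)        # shown in the real-time preview area
    # A dedicated "save" button would do: saved_points.append(position.copy())
```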

Mapping Latent Space

First, I proposed that there should be some kind of map that gets generated from explorations, so you can see and reproduce the track you took during a session (or across sessions).

Adding a map that visualizes the user’s path of travel and marks saved images as points of interest would further enhance the user experience and provide additional context for their exploration of the latent space. Here’s how the map feature could work:

  1. Display: The software interface could include a separate panel or window displaying the map. The map can be a 2D or 3D representation of the reduced-dimensional latent space.
  2. Path visualization: As the user navigates through the latent space, the map can show their path of travel as a continuous line or a series of connected points. This provides a visual reference for the user’s exploration and can help them understand the relationships between different areas of the latent space.
  3. Saved points: When the user saves an image, the corresponding point in the latent space is marked on the map. This can be done using different symbols, colors, or icons to indicate the saved points. The user can hover over or click on these points to see a thumbnail of the saved image and access additional information or options (e.g., renaming, adding tags, or opening the full-resolution image).
  4. Zoom and pan: The user can zoom in or out and pan around the map to get a better view of their path and the saved points. This allows them to easily navigate and manage their saved images in the context of their exploration.
  5. History navigation: The map can also serve as a way to navigate through the user’s exploration history. By clicking on points along their path, the user can quickly return to previous locations in the latent space.
  6. Exporting the map: The user can have the option to export the map as an image or a file format compatible with other mapping or visualization tools. This can help them share their exploration with others or analyze their path in more detail.

What the user sees in real time

Got it to nail down more clearly what the user sees in real time as they navigate along dimensions…

When the user navigates from one point to another in the latent space using the physical controller, they will see a continuous and smooth transition of images in real-time on the software interface. The images displayed during this transition will represent intermediate points in the latent space between the starting point and the destination.

The real-time visualization of the user’s movement through the latent space can be achieved using interpolation techniques, such as linear or spherical interpolation. Interpolation allows for a smooth blending of features between the starting and destination images, creating a seamless sequence of images that gradually change as the user moves through the latent space.

As the user navigates, they will see the images evolving, with features from the starting and destination points mixing and merging in different ways. This real-time visual feedback helps the user understand the relationships between different points in the latent space and discover new, interesting, or aesthetically pleasing images during their exploration.
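For reference, spherical interpolation (slerp) between two latent vectors is straightforward to implement. Here’s a minimal version; in practice each interpolated frame would be decoded to an image to produce the smooth transition described above.

```python
# Spherical linear interpolation (slerp) between two latent vectors.
import numpy as np

def slerp(z0, z1, t):
    """Interpolate between latents z0 and z1 at fraction t in [0, 1]."""
    z0, z1 = np.asarray(z0, dtype=float), np.asarray(z1, dtype=float)
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z0 + t * z1          # vectors nearly parallel
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

start, end = np.random.randn(512), np.random.randn(512)
frames = [slerp(start, end, t) for t in np.linspace(0.0, 1.0, 30)]
# Each frame would be decoded to an image, yielding the smooth transition.
```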

Mode switching & dimension assignment

I’m flying by the seat of my pants here imagining this device, but having used image diffusion models a lot (I’m up to 92 published AI art books), I have a pretty strong, if still intuitive, sense of how it might all work. One thing I suggested was that we’d need a way to rapidly switch controller modes and/or assign dimensions to manipulate on the fly. This might be tricky given that, for instance, even a single pixel could be considered a dimension (depending on the model). Here’s what CGPT offered on this topic:

You’re correct that allowing users to rapidly flip between or reassign different dimensions to manipulate as they travel through the latent space is important for an intuitive and engaging experience. Here are a few potential solutions to achieve this:

  1. Mode switching: Assign a button on the controller to switch between different modes, where each mode corresponds to a specific set of dimensions that the user can manipulate. In each mode, the controller’s inputs (joysticks, dials, etc.) can be reassigned to control the dimensions associated with that mode. The software interface can provide visual feedback indicating the current mode and the dimensions being controlled.
  2. Dynamic dimension assignment: Include touch-sensitive controls or additional buttons on the controller that can be used to cycle through dimensions for each input control. For example, a user could tap a button to cycle the X-axis joystick control through different dimensions, while another button could be used to cycle the Y-axis control. Visual feedback on the software interface can display the currently assigned dimensions.
  3. Custom presets: Allow users to create custom presets that define specific combinations of dimensions to be controlled by the inputs. These presets can be saved and loaded using the controller’s buttons or through the software interface. This enables users to quickly switch between their preferred dimension assignments during navigation.
  4. On-screen interface: Implement an on-screen interface that users can interact with directly using a mouse, touch screen, or another pointing device. The on-screen interface can be used to change the dimensions assigned to the controller’s inputs, providing a visual way to manage dimension assignments.

It also offered voice & gesture controls as additional options, but I don’t want to include either of those in the initial design scope. To me, those are just secondary ways of accessing primary functions controlled in the app.
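The mode-switching idea in particular is easy to sketch in code: each mode is just a mapping from physical inputs to latent dimension indices, and a dedicated button cycles through them. The mode names and dimension numbers below are arbitrary placeholders.

```python
# Sketch of mode switching: each mode maps controller inputs to latent
# dimension indices; a dedicated button cycles modes. Numbers are arbitrary.
MODES = {
    "composition": {"joy_x": 12,  "joy_y": 47, "dial_1": 3,  "dial_2": 88},
    "texture":     {"joy_x": 101, "joy_y": 7,  "dial_1": 64, "dial_2": 9},
    "color":       {"joy_x": 5,   "joy_y": 33, "dial_1": 21, "dial_2": 72},
}
mode_order = list(MODES)
current_mode = 0

def on_mode_button_pressed():
    """Cycle to the next mode; the UI would display the new assignments."""
    global current_mode
    current_mode = (current_mode + 1) % len(mode_order)
    return MODES[mode_order[current_mode]]

def dimension_for(input_name):
    """Which latent dimension a given physical control currently drives."""
    return MODES[mode_order[current_mode]][input_name]

print(dimension_for("joy_x"))      # 12 in the initial "composition" mode
on_mode_button_pressed()
print(dimension_for("joy_x"))      # 101 after switching to "texture"
```

Custom presets (item 3 above) would just be user-defined entries in that same mapping, saved to disk and loaded on demand.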

What the app UI might consist of

I even got it to output very basic SVG code of the software layout, but it’s not worth sharing. So here’s the text description of what it thinks the app window would contain:

  1. Main window: The main window should be divided into two primary sections, the image preview area and the control panel.
  2. Image preview area: This section occupies the majority of the main window and displays the current image generated based on the user’s position in the latent space. It should be large enough to comfortably view the image and observe the real-time changes as the user navigates through the latent space.
  3. Control panel: The control panel can be located on the side or bottom of the main window and contain various interactive elements for controlling the software and adjusting settings. This panel can include:
    a. Dimension assignment: Include a list or grid of available dimensions, where users can click on a dimension to assign it to a specific controller input. The currently assigned dimensions can be highlighted or marked in some way.
    b. Model or category selection: Provide a dropdown menu or a list of available image diffusion models or categories for the user to choose from.
    c. Presets and modes: Include buttons or a dropdown menu for selecting custom presets or modes that define specific dimension assignments for the controller inputs.
    d. Interpolation settings: Provide sliders or input fields for adjusting interpolation speed, image resolution, or other relevant parameters.
    e. Navigation history: Display a visual timeline or a list of the user’s previous locations in the latent space, allowing them to click on a point to return to that location.
    f. Map view toggle: Include a button or checkbox to toggle the display of the map view, which shows the user’s path of travel and saved points in the latent space.
  4. Map view (optional): If the map view is enabled, it can be displayed as a separate panel or window, either floating or docked within the main window. The map should show the user’s path of travel, marked saved points, and allow for zooming and panning.

Agile user stories

I wanted to see just how far I could push it to do the job of the product manager, so I had it write up some Agile user stories describing the software product only:

  1. As a user, I want to see a real-time preview of the image generated based on my position in the latent space, so that I can observe the changes as I navigate.
  2. As a user, I want to assign different dimensions to my controller inputs, so that I can easily switch between controlling various dimensions in the latent space.
  3. As a user, I want to choose between different image diffusion models or categories, so that I can explore various aspects of the latent space.
  4. As a user, I want to save interesting images I discover during my exploration, so that I can revisit or use them later.
  5. As a user, I want to adjust interpolation settings like speed, image resolution, and level of detail, so that I can fine-tune my navigation experience.
  6. As a user, I want to view a map of my path of travel through the latent space, so that I can better understand the relationships between different areas and locate saved images.
  7. As a user, I want to access my navigation history, so that I can quickly return to previous locations in the latent space.
  8. As a user, I want to create custom presets for dimension assignments, so that I can quickly switch between my preferred combinations during navigation.

There are obviously technical issues to work out here, but as a PM reading those, they actually feel like something that could probably be built by an engineering team with suitable domain expertise.

Categories

Since it brought up ‘categories’ again, I finally inquired what it meant by that, and its answer is quite useful:

In the context of item 3, categories refer to different groups or subsets within the latent space that are associated with specific types of images, visual features, or content. These categories could be derived from the underlying image diffusion model, which may have been trained on various classes of images, or they could be based on user-defined groupings.

For example, if the image diffusion model was trained on a diverse dataset containing images of animals, landscapes, and buildings, the categories could be “Animals,” “Landscapes,” and “Buildings.” By selecting a category, the user would be able to focus their exploration within the latent space on images that correspond to that specific category.

Alternatively, users may create their own custom categories based on their interests or preferences. These custom categories could be defined by selecting and tagging images during the exploration, effectively creating a user-defined grouping within the latent space.

Allowing users to choose between different categories helps them to more efficiently explore the latent space, as they can focus on specific types of images or visual features that are of interest to them.

Visit adjacent neighborhoods

One thing I always want, especially when exploring in Midjourney, is the ability to easily and effectively scan adjacent neighborhoods once I’ve found a good area within the latent space. Here’s what CGPT says on how to potentially do that:

Local exploration mode: When the user enables local exploration mode, they will be presented with multiple smaller image previews that represent the adjacent neighborhoods in the latent space. These previews can be arranged in a grid or circular pattern around the current position.

Neighborhood selection: The user can click on any of the smaller image previews to instantly move to that adjacent neighborhood in the latent space. This will update the main image preview to reflect the new position and generate a new set of adjacent neighborhood previews.

Adjusting proximity: Provide a slider or input field in the software interface that allows the user to adjust the “proximity” of the adjacent neighborhoods. A smaller proximity value will result in neighborhoods that are closer to the current position in the latent space, while a larger value will display neighborhoods that are farther away. This enables users to fine-tune the scope of their local exploration.
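Mechanically, local exploration could be as simple as sampling a handful of points at a user-controlled distance from the current position. Here’s a sketch; the number of previews and the proximity value are arbitrary choices.

```python
# Sketch of local exploration: sample candidate points around the current
# latent position, with a "proximity" radius controlling how far afield
# the preview images land.
import numpy as np

def adjacent_neighborhoods(current, proximity=0.5, n_previews=8, rng=None):
    """Return n_previews latent points roughly `proximity` away from
    `current`, in random directions."""
    rng = rng or np.random.default_rng()
    directions = rng.standard_normal((n_previews, current.shape[0]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return current + proximity * directions

current = np.random.randn(512)
previews = adjacent_neighborhoods(current, proximity=0.25)
# Each row of `previews` would be decoded into one small preview image;
# clicking a preview sets `current` to that row and regenerates the grid.
```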

While I’m at it, the idea of actually physically mapping out a high dimensional space like this is not so crazy, it seems. I found this Google Arts & Culture experiment, which they’re calling a t-SNE map. It’s hard to do justice to this visualization in only 2D, but here’s a screenshot:

That’s a pretty zoomed out view of a landscape of visually similar images. Here’s a more zoomed in look:

The best thing is really to try it out on the website, but it seems like the sidebar items in their guided view basically correspond to what CGPT describes as ‘categories.’

What I’m envisioning here is even more fluid and, for lack of a better word, “VR,” in that not only could you see the specific images like you can above, but it would fluidly move between them, with attributes (dimensions) receding or increasing as you travel.

Other modes of use

I asked about what other modes users might want in addition to local exploration, described earlier. It said:

  1. Global exploration mode: This mode allows users to explore the entire latent space more freely, without being restricted to a specific category or local area. Users can navigate in any direction, discovering a wide variety of images and visual features along the way.
  2. Category-focused mode: As mentioned earlier, this mode lets users focus on a specific category within the latent space, such as animals, landscapes, buildings, or user-defined categories. The navigation and exploration will be tailored to that specific category, making it easier for users to find images that match their interests.
  3. Favorites mode: In this mode, users can revisit their saved images and explore the neighborhoods around those points. This can help users to discover new images that are similar to their favorites or to fine-tune their saved images by navigating the local areas around them.
  4. Guided tour mode: This mode provides a pre-defined or user-generated sequence of points in the latent space that users can navigate through. The guided tour can be based on specific themes, popular locations, or user-created paths. This mode can be useful for users who are new to the latent space exploration or who want to follow a curated experience.
  5. Random walk mode: In this mode, the software automatically generates a random path through the latent space, with the user being able to influence the direction or speed of the movement. This can create a more relaxed and serendipitous exploration experience, where users can discover unexpected images and visual features.
  6. Search mode: If the underlying image diffusion model supports it, users can search for specific images or visual features within the latent space using keywords or example images. In this mode, the software would display search results as points in the latent space, which users can then navigate to and explore.
  7. Collaborative mode: This mode allows multiple users to explore the latent space together, either by sharing a single exploration session or by viewing and interacting with each other’s paths and saved images. This can create a more social and engaging exploration experience, with users being able to share discoveries and collaborate on finding interesting images.
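Of these, random walk mode is probably the simplest to sketch: take small steps in random directions, optionally biased by whatever the user is doing with the joystick. The step size and bias weighting below are placeholders.

```python
# Sketch of random walk mode: small random steps through latent space,
# with an optional user-supplied direction to lean toward.
import numpy as np

def random_walk_step(position, step_size=0.05, user_bias=None, rng=None):
    """One step of a latent-space random walk. `user_bias`, if given, is a
    direction vector the user is nudging toward (e.g. from the joystick)."""
    rng = rng or np.random.default_rng()
    direction = rng.standard_normal(position.shape)
    if user_bias is not None:
        direction = direction + 2.0 * np.asarray(user_bias)   # lean toward user input
    direction /= np.linalg.norm(direction)
    return position + step_size * direction

z = np.random.randn(512)
for _ in range(100):
    z = random_walk_step(z)      # each step would be decoded and displayed
```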

Referent-based mode

As I explained to CGPT:

i keep visualizing the ability to like define a group of referents, points or features or whatever in the latent space (like a blue ball, an elephant, a storm, etc) and then… something… explore themes around those but with different configurations or treatments. does that make sense?

And it replied (in part):

What you’re describing is a mode that allows users to define a group of referent points, features, or themes in the latent space and then explore variations and relationships between those referents. This can help users discover new and interesting images that share similarities with the chosen referents but have different configurations, styles, or treatments.

Here’s a potential approach to implementing this idea:

  1. Referent selection: Provide a user interface element (e.g., a list or a grid) where users can add, remove, or modify referent points, features, or themes. Users can either manually select these referents by navigating the latent space or use a search functionality if the underlying image diffusion model supports it.
  2. Theme exploration: Once the referents are defined, the software generates a set of variations or combinations based on the selected referents. These variations can be displayed as a grid of image previews or a more interactive visualization like a 2D or 3D scatter plot.

As it went on (and I clipped some less useful parts), I got to wondering: isn’t this just the same thing as a simple prompt mode? How would this be different from just entering words for the things I want as my referents?

The concept you described earlier is different from a simple prompt mode. Both modes are aimed at guiding users to explore specific themes or concepts in the latent space, but they approach this goal in different ways.

In a simple prompt mode, users input a textual prompt (e.g., a keyword or a phrase), and the software generates a single image or a small set of images related to the prompt. The exploration in prompt mode is more linear, focusing on the specific concept described by the prompt. Users can adjust the prompt to refine the resulting images or explore other concepts, but the focus remains on a single theme at a time.

On the other hand, the referent-based mode you described earlier involves defining a group of referent points, features, or themes in the latent space and then exploring variations and relationships between those referents. The exploration in this mode is more dynamic and interactive, allowing users to discover new and interesting images that share similarities with multiple referents, while presenting different configurations, styles, or treatments.

In the referent-based mode, users have more control over the exploration process, as they can adjust parameters, add or remove referents, and interact with the generated variations in a more flexible way. This mode also promotes a deeper understanding of the relationships and connections between different areas of the latent space, as users can visually observe and manipulate the variations and combinations of referents.

So I think the distinction might be like… text prompts could be used to get me to a particular place in the latent space (or to modify what I’m seeing already). Whereas a referent mode might allow me, as I’m exploring, to select certain entities that I see within the visualization… and maybe add them to a basket or something that I can take along with me as I go. Those referents would show up or stick with me as I continue navigating the space, until I choose to release them. Something like that.
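In code terms, that “basket” might just be a list of referent latents that gets blended into whatever point is currently on screen until the user releases them. Here’s a very rough sketch; the averaging-and-blending rule is just one placeholder way it could work.

```python
# Sketch of the referent "basket": latents the user has picked up get
# blended into the current position. Blending rule is a placeholder.
import numpy as np

basket = []   # latent vectors the user has "picked up" while exploring

def add_referent(z):
    basket.append(np.asarray(z, dtype=float))

def release_referents():
    basket.clear()

def apply_referents(current, strength=0.3):
    """Pull the current position toward the average of the basket's referents."""
    if not basket:
        return current
    referent_center = np.mean(basket, axis=0)
    return (1.0 - strength) * current + strength * referent_center

# e.g. pick up a "blue ball" latent and a "storm" latent, then keep navigating:
add_referent(np.random.randn(512))
add_referent(np.random.randn(512))
blended = apply_referents(np.random.randn(512))
```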

Short list of required app features

So, from there I had it go back through and distill all of this into a list of required features for the app:

Here is a short list of required features for the app, based on the discussions so far:

  1. Real-time image preview: Display the generated image based on the user’s current position in the latent space, updating in real-time as the user navigates.
  2. Dimension assignment: Allow users to assign different dimensions of the latent space to specific inputs or controls, enabling them to easily switch between controlling various dimensions.
  3. Category selection: Provide the ability to choose between different categories or subsets within the latent space to focus the exploration on specific types of images or content.
  4. Save images: Enable users to save interesting images they discover during exploration for future reference or use.
  5. Interpolation settings: Allow users to adjust settings like interpolation speed, image resolution, and level of detail to fine-tune their navigation experience.
  6. Exploration modes:
    a. Local exploration mode
    b. Global exploration mode
    c. Category-focused mode
    d. Favorites mode
    e. Guided tour mode
    f. Random walk mode
    g. Search mode (if supported by the image diffusion model)
    h. Collaborative mode
    i. Referent-based mode
  7. Map view: Display a map of the user’s path of travel through the latent space, with the ability to show saved points, referent points, and other user-defined landmarks.
  8. Navigation history: Provide access to the user’s navigation history, allowing them to quickly return to previous locations in the latent space.
  9. Custom presets: Allow users to create and manage custom presets for dimension assignments, making it easy to switch between preferred combinations during navigation.
  10. Adjustable proximity for local exploration: Provide a slider or input field to adjust the proximity of adjacent neighborhoods during local exploration.
  11. Help and tutorials: Offer in-app help and tutorials to assist users in understanding and utilizing the various features and exploration modes.

Minimal feature set for a v1

So from that, I asked it to reduce those features to a stripped-down v1 prototype/proof of concept, and it said:

  1. Real-time image preview: Display the generated image based on the user’s current position in the latent space, updating in real-time as the user navigates.
  2. Dimension assignment: Allow users to assign different dimensions of the latent space to specific inputs or controls, enabling them to easily switch between controlling various dimensions.
  3. Save images: Enable users to save interesting images they discover during exploration for future reference or use.
  4. Local exploration mode: Provide a focused exploration mode that allows users to navigate the immediate neighborhood around a specific point in the latent space.
  5. Global exploration mode: Offer a more free-form exploration mode that enables users to explore the entire latent space without restrictions.
  6. Adjustable proximity for local exploration: Provide a slider or input field to adjust the proximity of adjacent neighborhoods during local exploration.
  7. Navigation history: Provide access to the user’s navigation history, allowing them to quickly return to previous locations in the latent space.

Wrap-Up

There’s a lot to digest in all of this, of course. Apart from the interesting technical challenges described in this product design, the thing I’m most excited about is just how far you can take ChatGPT in a product design & planning session. The answer, at least with v4, is really, really damn far.

And if what you’re designing is amenable to the kinds of visual representation Midjourney is capable of, this starts to become a pretty powerful set of tools. I suspect that once they publicly release the image-description tools for GPT-4, you’d also gain other possible vectors of insight into this design iteration process.

All told, this direction of a controller (hardware + software) specifically for Midjourney, but adaptable to any high-dimensional space or model, is a really compelling one for me as an artist making heavy use of the tools in the somewhat primitive state we see them today.

I’m not going to guarantee that I’ll be the one who builds it, but perhaps once GitHub Copilot X is released, I’ll be that much closer to actually building some aspects of this down the road. If nothing else, I’m hoping these blog posts can inspire other innovations in the field. Exciting stuff!