Thinking about Your Data

Programming is often thought of in terms of code - instructions for a computer. In practical terms, the code is only a means to an end - transforming data from one form to another, from inputs to intermediate forms to outputs.

Whether this be text files to twitter posts, images and meshes to 3d graphics on screen, or last frame's object state to this frame's object state - it is the data that matter.

This is not a new idea, but in my experience it is quite an uncommon framing amongst newer developers, and developers without a formal programming background. There's been some push towards "Data Orientation" in the last few years, but it seems everyone has a different interpretation of what that actually means.

If you're at the tip of that spear, this post probably isn't for you! In this post, I aim to provide a way of thinking about code, and some practical steps a developer can take to improve the efficiency, modularity, and (to a subjective degree) cleanliness of her code - particularly if they are a game developer.

Note: I'm an advocate of drawing this kind of stuff on paper - but I'm also aware that sketches are a messy way to present it for an article, especially in a hurry! 🤔 I'll hopefully find time for cleaner digital visualisations soon. Apologies for the scratchy handwriting in the meantime!

Anatomy of an Algorithm

Any algorithm can be visualised as a simple pipeline.

Data is transformed by any algorithm, producing new data.

These pipelines can be chained together if the output of one is a suitable input for another.

Transformations can also take multiple different inputs, or produce multiple outputs. This is where things start to get muddy - but to cut through the mud and see where your data is really coming from and going to, you have to break things down into their units.

Here are some "real world" game examples, start to finish.

Pathfinding is often a relatively self-contained system.

Start/End Pairs, Path Constraints, and World State, are transformed by A* into Output Paths, for consumption by other code.

Sprite Rendering needs to talk to externally defined APIs, like OpenGL, SFML, or SDL.

Images are decoded to textures, game logic drives animation and physics to compute sprite frames and positions, and these are collectively transformed by the sprite rendering system to OpenGL draw calls, resulting in pixels on the screen.

You'll notice that the "moving parts" we usually think about here are just single nodes in these graphs. In the above pathfinding representation you can swap out A* for Breath-First-Search if you wanted and the pipeline doesn't change. When you're thinking about your code in this way, code optimisation becomes a matter of improving (or replacing) bits of the pipeline.

This is what makes those node based shader (and code!) editors so powerful. They show your data flow explicitly as a matter of course, and tangled data dependencies appear as a visually tangled graph. Long pipelines appear as long chains of nodes. With most imperative programming languages, we have to do this visualisation ourselves.

Choosing the right format for your needs

Often, you can't really control the final form of your outputs. You've got to make specific function calls with specifically formatted data for your graphics API (or whatever you've got that sits above that), you've got to prepare json to POST, the sound card needs samples at a specific rate, whatever.

Much of the time, however, you can control your inputs; and for multi-stage pipelines, you generally control the intermediate representation as well.

There are also cases where you will "close the loop" and consume your own output, so you have total control of the entire pipeline. In games, this applies to things like saves, network representation, level formats, and how you choose to represent persistent data in your major systems.

Networking in games is a closed loop - the programmer ultimately controls the data representation at all stages.

Packets are recieved and decoded. Along with the old world state, these compute a new world state. This is then encoded and sent to other instances of the game, which consume them using the same pipeline.

You can often, with some effort, design a straightforward data representation that you'll enjoy working with - once you understand what your pipeline is really doing.

For example, when pathfinding, you could output a list of references to graph nodes from your pathfinding algorithm, or a list of tile indices, or a list of vectors, or a list of piecewise components - or plenty of other representations. I'd opt for the list of vectors in most cases. Most pathfinding dependent code will have an easy time consuming vectors and will likely want a vector representation for further decision making, so it saves conversion later on. You can think about the trade-offs the other representations might involve.

Often, these choices can make the difference between "clean code" and "dirty code".

Benefits

Thinking carefully about how data flows through your program can lead to many benefits; but 3 big ones come to mind.

Batching

Batching can be thought of loosely as "working with a lot of the same data at once". Folks familiar with low level graphics programming will already have an idea of what I'm talking about, but this can be applied to many, many areas of code.

To use a common graphics programming analogy: rendering 1000 particles as individual draw calls is much slower than rendering them as one big VBO.

When thinking in terms of data flow, a rule of thumb is that if your input is a plural (Sprites, not Sprite), you can often batch processing and save time.

For pathfinding, we might consider moving from an immediate interface find_path(from, to) to a deferred interface request_path(from, to), which would either give us an id to poll on, or take a callback to be run once the path is ready.

To do this gradually, we could still provide the immediate interface alongside the deferred one.

A deferred interface would allow us to collect pathfinding queries in one place, process them once per tick, and notify the caller of the result. We could process them either all at once, or up to some limit to keep a cap on the total pathfinding cost per frame.

This can be applied to many game relevant concepts. Collision Detection, Physics, Rendering, Pathfinding, Networking, Raycasting, and Particles are all systems where batching is a common important optimisation.

Parallelism

Straightforward data pipelines are (relatively) easy to parallelise - particularly if they're batched.

Where there's one thing to be transformed, there's usually many - and when there's many, and we've formalised the data dependencies of each, there's often a great deal of parallelism to be found.

To keep the pathfinding theme going, in the previous deferred pathfinding case, pathfinding queries can be processed in parallel. We can keep N workers around, and once per tick they can each burn through 1 / N of the paths to be processed. Each query is independent of the others, and the shared data (the world state) is read only as far as pathfinding is concerned (so there's no false sharing worries). If we're using a callback model we'll need to be careful about when we run callbacks for finished paths, because callbacks could do anything and be susceptible to race conditions - but we could just call them serially at the end, with the collected results.

The API doesn't change anywhere, and the data flow graph is unchanged on a macroscopic level - it's just the data flow within the pathfinding system that has changed. Because we understand how data is coming and going from the system, we can do these kinds of optimisations without causing trouble for anyone else (our ourselves).

This opportunity for parallelism can be a great performance "pressure relief valve" later on in development, if the performance-sensitive pipelines in your code are set up the right way from the start.

Happiness

This is the biggest benefit, in my opinion. I'm sure most programmers will remember a point in time where they really hated working on some code - often code they had written in the past.

Tangled code that reaches at will into someone else's variables, enormous classes trying to accomplish multiple things at once, nasty nested loops repeating ad hoc queries into the world...

All of these are symptoms of code before design, of not thinking about data flow up front. Sitting down and thinking carefully about how data moves around in your program helps keep things simple, easy to work with, and hard to screw up.

You'll be happier for it.

Thanks for reading!

This article was written for #notGDC; a not-conference from twitter started by Ben Porter in 2017 and continued in 2018 by Lucy Morris and Mike Cook. There are a lot of interesting talks about bees, balls, and radial menus; motivation, homebrew, and procedural generation. It's a great resource for all the devs, and a nice support network for those who don't make it to GDC every year. Thanks so much to the organisers for putting in the work!

In the spirit of the event, these words were prepared in quite a hurry 😉 so if you notice any typos or have anything you'd like to correct or question me on, please reach out to me on twitter!

If you liked what you read, consider checking out my other writing or my games. Thanks for your time!

Previously...