Implementation Complexity, Part I: The Tower of Babel, Gremlins, and The Mythical Man-Month
I thought I'd post a follow-up, in a sense, to an older post about complexity in consumer electronics I wrote a year and a half ago. That was kind of a rant against overly complex user interfaces. I am a huge opponent of unnecessary complexity in almost any kind of interface, whether a user interface or a programming interface or an electrical interface. Interfaces should be clean and simple.
Now, instead of interface complexity, I'll be talking about implementation complexity, with a little more of a philosophical slant. Bear with me.
The Tower of Babel
Those of you who know me personally know that just about the last thing I would do is quote Biblical verses. I'm not a superstitious person. But there's one part of the Bible that gets me a little spooked.
1 And the whole earth was of one language, and of one speech. 2 And it came to pass, as they journeyed from the east, that they found a plain in the land of Shinar; and they dwelt there. 3 And they said one to another, Go to, let us make brick, and burn them thoroughly. And they had brick for stone, and slime had they for morter. 4 And they said, Go to, let us build us a city and a tower, whose top may reach unto heaven; and let us make us a name, lest we be scattered abroad upon the face of the whole earth. 5 And the Lord came down to see the city and the tower, which the children of men builded. 6 And the Lord said, Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do. 7 Go to, let us go down, and there confound their language, that they may not understand one another's speech. 8 So the Lord scattered them abroad from thence upon the face of all the earth: and they left off to build the city. 9 Therefore is the name of it called Babel; because the Lord did there confound the language of all the earth: and from thence did the Lord scatter them abroad upon the face of all the earth.
We can look at the Tower of Babel story on a few levels.
Literal and allegorical interpretations
First there's the literal one: we have a united community with the ambition and arrogance to build a tower to heaven on their own; God doesn't like it and introduces communication barriers. Problem solved, and as a result we have languages and people are dispersed over the earth. Moral of the story — and this is a recurring theme of the Old Testament — don't turn your back on the Lord, thinking you can live life without Him, lest He retaliate in strange and mysterious ways.
I suppose the Tower of Babel could also be interpreted as an allegorical warning against the use of science and technology, or against the concentration of power.
Somehow, though, technology and power don't worry me so much. Is it scary that we can send spacecraft to the moon and beyond, or split the atom, or level mountaintops, or reshape rivers? Perhaps. Or how about that hundreds of millions of people can create celebrities or pariahs out of thin air by focusing mass media attention? Maybe. Or how about that we've had governments that have so many resources at their fingertips that they can go to war, or imprison whole communities of their people, or poison the environment? Hmmm, there's something to that. But strangely it doesn't keep me up at night.
What I see as the moral of the Tower of Babel is a warning against unmanageable complexity.
The Mythical Man-Month: Design-and-construction-time costs of complexity
Fred Brooks, in his classic book The Mythical Man-Month, interprets the story as an example of large engineering projects. (And as far as Biblical engineering projects go, the Tower of Babel is the second, the first one being Noah's Ark.) The technology is sound, but the edifices of man crumble without proper communication and organization — whether in ancient times, or today in the Information Age.
Communication and group dynamics
In a small project, one person might be able to finish alone. If two or three people are involved, they can split up the work; they just have to communicate well enough to avoid conflict, misunderstandings, and oversights. As more and more people get involved, the cost of mutual communication increases as O(N²): if 10 people each need to have one-on-one conversations with everyone else in the group, then there are 45 pairs of conversations that need to happen. With 20 people it's 190 pairs. With 100 people it's 4950 pairs. Ugh.
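The pair counts above come from the handshake formula N(N-1)/2. Here's a minimal sketch that reproduces them; the function name `conversation_pairs` is just an illustrative choice.

```python
def conversation_pairs(n: int) -> int:
    """Number of distinct one-on-one pairings among n people: n(n-1)/2."""
    return n * (n - 1) // 2


# Reproduce the group sizes used in the text.
for n in (10, 20, 100):
    print(f"{n} people -> {conversation_pairs(n)} pairs")
```

Doubling the group from 10 to 20 roughly quadruples the number of pairs, which is the quadratic growth in action.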
Not only do all these people need to communicate well, they need to be able to function together. With 100 people it's a lot more likely that two of them are not going to get along well, than if there are only two or three.
Project management: labor structure, conceptual integrity, abstraction, and partitioning
The key to beating this O(N²) business is specialization of labor: not all of the people building the Tower of Babel can actually be doing the building, not everyone can design the building, not everyone can work with everyone else, and not everyone can take the time to work on each part of the project.
Some of them have to be managers, providing direction to workers, resolving disputes, and organizing the project. Large projects need some kind of structure that reduces the number of simultaneous interactions that have to happen.
Some of them have to be architects, providing a consistent high-level structure to the design. Brooks calls this “conceptual integrity” — in order for the overall project to be successful, a relatively small number of people need to have a consistent vision about what the project should be.
A large project's success depends on the ability of management and architecture to apply a divide-and-conquer approach, and break down the problem into modular units which can be thought of in an abstract way. If you're treating a software project as a million lines of code without any modular structure, there's no way to figure out how the different areas interact, much less keep a vision of how it all works.
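To get a rough feel for how much partitioning helps, here's a sketch under a deliberately simplified model that is my assumption, not something from Brooks: people talk one-on-one only within their own team, and one lead per team talks to the other leads. The function names are hypothetical.

```python
def pairs(n: int) -> int:
    """Distinct one-on-one pairings among n people."""
    return n * (n - 1) // 2


def partitioned_pairs(n_people: int, team_size: int) -> int:
    """Pairings when communication is limited to within-team pairs plus
    lead-to-lead pairs (one lead per team). This is a simplified model."""
    full_teams, remainder = divmod(n_people, team_size)
    team_count = full_teams + (1 if remainder else 0)
    within_teams = full_teams * pairs(team_size) + pairs(remainder)
    between_leads = pairs(team_count)
    return within_teams + between_leads


print("everyone talks to everyone:", pairs(100))
print("10 teams of 10 with leads: ", partitioned_pairs(100, 10))
```

For 100 people this model drops the count from 4950 pairs to 495. Real organizations are messier, but the order-of-magnitude saving is the reason structure matters.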
Keeping the focus on the project, not on the individual
Brooks brings up several issues in large projects that are psychological in nature rather than technical or logistical. There are two in particular that I noticed on my most recent reading of The Mythical Man-Month.
One has to do with optimization. During the development of the OS/360 operating system, there were important program size limitations. As the operating system modules were divided among a team of programmers, individual team members got their piece to meet requirements, but the team as a whole had problems. Why? Because team members were focusing on their piece alone, and optimization efforts in one module to meet size targets sometimes imposed a performance cost on other parts of the system. Sometimes individual contributors have conflicts between their own individual stated goals, and what makes the most sense for the project as a whole. Good communication between individuals and managers can help prevent these conflicts, by appropriate reprioritization of subtasks.
The other issue is one of status reports and micromanagement. Brooks mentions the issue of schedule slippage and the chain of command. There is a tendency for a group manager to want to address problems within his group, before concerning upper management -- for fear that when a higher-level manager hears about a problem, he will take it into his own hands to solve it. Brooks suggests dividing operational discussions into status meetings and action meetings; status meetings are for reporting the current state of things, and action meetings are for addressing problems that come up during status meetings. This encourages an open exchange of status information, decoupled from how to solve problems. Furthermore he suggests that higher-level managers need to have the discipline to allow their team members to solve issues on their own. Again, this topic involves individual tendencies that can get in the way of what's best for the project as a whole.
Large projects need written design documentation, interface definitions, development tools, testing... and unless they're online collaborative open-source projects, they usually need financing, office space, shipping and receiving, legal counsel... argh! Do you really want to organize a large project?????? I don't.
I really can't do justice to retelling the good points of Fred Brooks's book, so if you haven't read The Mythical Man-Month, get a copy, find some quiet time, and dig in — trust me, it's worth the effort.
Gremlins: multipliers in design-time complexity, effort, and risk
I used to joke at my last job that we weren't performing the proper small-animal sacrifices to drive out the gremlins. You know these gremlins: you've got a project that's going well for the most part, and then suddenly there are three or four things that take what should be a simple aspect of a project, and for inexplicable reasons, make everything grind to a halt, where you drop all other work and do whatever voodoo adjustments are necessary to overcome some weird problem.
What causes the gremlins? Some of them are due to bad luck, and others to poor management decisions, but others are less clear.
Okay, so here's where we start to get towards the more insidious — and interesting — aspects of the Tower of Babel.
Picture this: you are living a few thousand years ago, in the plains of Shinar, and there's this neat new open-source project to build a big tower. You've got the materials, you've got the manpower, you've got only one language because the Lord hasn't yet seen fit to interfere... and the one thing you haven't got is time. Everyone is in a rush. Because there's a lot to do, and it's going to take so long to finish this thing that if you don't work full speed ahead, you might die before seeing its completion. Also, because it's a typical open source project, there is no source of funding, so you've all got day jobs like farming and hunting and pottery-making to do first, in order to keep the project going, and that doesn't leave too many hours in the day to work on the tower. Perhaps more important than that is the sense of momentum — at first the tower grew quickly and it was easy to see progress, but now that it's getting larger, the project is proceeding more slowly... without that sense of continual observable progress, your motivation may be in jeopardy, and naysayers have an easier time attracting newcomers to their side than you do.
Or maybe it is a paid project, and those venture capitalists are getting edgy.
In any case, time is of the essence.
So you all have to divide your time between getting work done, and communicating to get the work done. Some of these jobs are pretty easy to define, and it's a quick exchange:
“Dig a trench to the lake so water will flow here.”
“Make bricks 1/2 cubit by 1/4 cubit by 1/6 cubit.”
But then others have some more subtleties. Like the floor plans, and how to make sure each new floor in the building is level, and where to put the pulleys and ramps, and the food and water caches in the upper levels so the workmen don't have to spend half their day traveling up to the top and back down again.
And then there are the problems that happen from time to time. All of a sudden, a bunch of people need to stop to rip up some of the bricks and carefully rebuild them the right way... but they can't do too large a repair job at once, since they don't want the whole building to fall down. Instead of having careful time to plan, the architects and managers have to find a solution NOW that will work right, and they have to think of the things that can go wrong and make sure the problems won't get worse.
The point is, no matter how much time you'd like to spend just getting the thing done, there's some minimal amount of time needed to describe how to get the thing done. The simpler a task is to describe, the more time you and the other workers can just do it and not worry about many details. When it takes longer to describe, it takes more concentration to work out the details, and there are more things that can go wrong. Sometimes there are details you thought were simple, but they really need some careful discussion to make sure everyone understands.
So one measure of the complexity of a task is how long it takes to describe it. In computer science, this idea, as applied to algorithms, is called Kolmogorov complexity, after one of its early researchers.
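Kolmogorov complexity itself can't be computed in general, but the size of a compressed description gives a crude upper-bound proxy. Here's a minimal sketch using Python's standard `zlib` module; the two task descriptions are made up for illustration, and compressed size is only a stand-in for "how long it takes to describe."

```python
import random
import zlib


def compression_ratio(description: str) -> float:
    """Compressed size divided by original size (smaller = more regular)."""
    raw = description.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# A highly repetitive job: the same brick instruction, over and over.
simple_task = "Make a brick 1/2 cubit by 1/4 cubit by 1/6 cubit. " * 40

# A job full of one-off details (hypothetical; seeded for repeatability).
rng = random.Random(0)
words = ["brick", "ramp", "pulley", "floor", "trench", "cache", "level", "water"]
irregular_task = " ".join(
    rng.choice(words) + str(rng.randrange(1000)) for _ in range(250)
)

print(f"simple task ratio:    {compression_ratio(simple_task):.3f}")
print(f"irregular task ratio: {compression_ratio(irregular_task):.3f}")
```

The repetitive description shrinks to a small fraction of its length, while the one full of special cases does not: there is simply more that has to be said.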
The thing is, not only do large projects require more overhead for partitioning and management and structure than small projects, but it seems these sorts of complex subtleties become more common, and the ratio of planning time (or documentation!) to work time increases. I would hypothesize that in general, as projects grow larger, the Kolmogorov complexity of projects increases faster than the number of laborers or man-hours needed to actually complete the work.
How can we reduce this tendency? Disciplined modular design and clean internal interfaces will reduce the task overhead of complexity, but it's really hard to prevent. You need really good architects and managers, and a good communication chain between the people at the top of the org chart and the people at the bottom, so that everyone sees a consistent vision of completion.
Internal interfaces, abstractions as approximations, gray boxes, convergence, and farsourcing
Note that I just mentioned interfaces again. This time it's not a user interface that matters, but rather an internal interface between partitions of the project. Maybe it goes without saying, but if you can break a project down into two subprojects A and B, with one team working on each, the compatibility at the interface between A and B is critical. Ideally the interface can be specified exactly, and each team can treat the other subproject as an abstract entity (aka “black box”) which complies with the interface specification... but this is rare, and sometimes there are subtleties that evade even the most careful specification.
This is kind of an abstract idea, so let's go over a few examples.
One interface that should be very familiar is the NEMA 5-15 electric outlet. (Image from Wikipedia.)
What can we say about the two components of this interface?
On the plug side: there are three terminals here, flat-blade line and neutral conductors and a round earth grounding terminal. The plug has an earth ground pin which is longer so it makes contact first — metal housings of equipment are connected to earth ground to prevent electrical shock, and you want that connection made before you bring in the line voltages, so if there is an internal short circuit, it will blow the fuse rather than make the metal housing a shock hazard. The equipment being connected is also not supposed to draw more than 15A rms current, so it doesn't blow a fuse under normal operation.
On the socket side: the slots for the blades are asymmetric, to allow polarized 2-terminal plugs to be connected properly; the larger blade is neutral and the smaller blade is the line voltage (or “hot”) connection. The contacts in the socket are recessed to prevent shock hazards. There are two sockets here; they may both be wired to the same circuit, but not necessarily.
Sounds pretty simple. What's missing from this description? Well, for one thing, the physical dimensions and tolerances of the components. Those are going to be in the ANSI/NEMA WD6 standard.
What else? Well, we haven't specified the allowable metals that can be used for the contacts; after all, you wouldn't want wear or corrosion on your electric outlets. We also haven't specified the voltage limits (114-126V) for the receptacle — that's in the ANSI C84.1 standard — but this gets tricky, because the standard only covers the voltage that shows up at the utility service entrance (the electric meter), and since internal wiring can vary and there are multiple power loads in your house, it's not really certain what the voltage on the outlet will be. The frequency limits are even weirder; if there's a specification, it's not easy to find, and frequency can vary over time depending on what the power flows happen to be in the power grid, and on the specific electrical characteristics of the generators, which usually slow down a little as they are more heavily loaded. The devices you plug into the grid have a 15A rms limit, but limits of transients and harmonics are less clear; when we get to voltage and current harmonics, the current you draw from one device will affect the voltage seen by other devices, because house wiring has a small but significant resistance. High-frequency components are limited to comply with standards on electromagnetic interference. And there's also the IEC 61000-3-3 standard on flicker — basically, the idea is that if you have a device like an induction cooking top, or a microwave oven, that has a time-varying power load, the variations in load currents can cause the local voltage on your house wiring to vary with time as well, and with incandescent light bulbs this will translate into visible flickering that at best causes irritation to some people, and can induce epileptic attacks in others. The flicker standard defines a specific measuring process, and says that the load changes caused by devices that plug into the AC mains must have flicker statistics below an allowed threshold. 
If you make a device which doesn't meet the standard, it can be tricky to remedy.
There are a lot of subtleties here! They're necessary because there are many manufacturers of electrical supplies, and thousands of manufacturers of devices that plug into the wall. And it still doesn't guarantee that you can plug two particular devices into the same outlet and not trip a circuit breaker.
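To put a rough number on the house-wiring point: here's a back-of-the-envelope sketch of the resistive voltage drop between the service entrance and an outlet. Everything specific in it is my assumption for illustration: 14 AWG copper branch wiring at roughly 2.525 ohms per 1000 feet, a 50-foot one-way run, a purely resistive load, and the function name itself.

```python
# Assumed: approximate DC resistance of 14 AWG copper wire near room temperature.
OHMS_PER_1000_FT_14AWG = 2.525


def outlet_voltage(service_volts: float, load_amps: float,
                   one_way_run_ft: float,
                   ohms_per_1000_ft: float = OHMS_PER_1000_FT_14AWG) -> float:
    """Voltage at the outlet after the resistive drop in the branch wiring.

    Current flows out on the line conductor and back on the neutral,
    so the wire length that matters is twice the one-way run.
    """
    round_trip_ohms = 2 * one_way_run_ft * ohms_per_1000_ft / 1000.0
    return service_volts - load_amps * round_trip_ohms


# A 15 A load at the end of a 50 ft run, nominal and low-end service voltage.
for service in (120.0, 114.0):
    print(f"{service:.0f} V at the meter -> "
          f"{outlet_voltage(service, 15.0, 50.0):.1f} V at the outlet")
```

Under these assumptions a fully loaded circuit loses almost 4 V in the wiring, so an outlet can sit below 114 V even when the service entrance is within limits.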
The kinds of interfaces that are internal to a project are a little different. Wiring harnesses are a good example. Your car has wires connecting the battery, fuse box, lights, starter motor, radio, power locks and windows, and a bunch of other little things. Each car has its own different wiring harness, and it's not designed to work with arbitrary 3rd-party components: if Ford or Toyota wants to make a change, they don't really have to worry about making the car incompatible with aftermarket components, because the things that connect with the wiring harness are designed by people that are responsible to the car company. All they have to do is make sure the groups are in sync with each other.
And here's where the gremlins come in.
You have three choices about how to manage the design details of internal project interfaces.
- Write it into a strict formal interface specification, requiring no interpretation. This is the most rigorous way to handle compatibility. What you do when you create an interface specification is describe an approximation of how components interact together. If I have a starter motor, a wiring harness, and a battery, I can design my system with three different people who don't need to know all the details of each of them. They just have to know enough relevant details of compatibility.
There are a few downsides to a formal interface specification.
Another downside is that they're time-consuming. For interoperability standards, the time is usually worth the investment, since there are millions of different implementations. For an internal interface which is a one-time project, that only links Company X Component A with Company X Component B, the need for bulletproof formality is often unjustified.
And then there's always the dilemma that some details of the interface may just be impossible to nail down to a formal specification, either because it's impractical to do so, or some of the details are unknown, or the details change over the course of the project life cycle. Specifications are notoriously difficult to write for batteries, for example. How do you write a spec for a battery, when the load currents may be nearly arbitrary waveforms? You can't, and you don't. You have to take best-guess representative or statistical approximations. Does this really matter? It sure does: among other things, we've got hybrid vehicles, and the only way to make sure a battery design is going to be robust, and have a useful lifetime of years, rather than weeks or months, when thousands of different people are using it different ways, is to come up with standard driving cycles that are a reasonable representation of use patterns.
Finally, if the interface specification seems formal enough, engineers will use it as a crutch to replace good communication. (“Why didn't you tell me you had to buy expensive connectors for the brake circuit?” “Well, I was just following the interface specification.” “But we could have changed the spec to make it easier for you!”)
- Keep the teams communicating well. Interface specifications are one tool to maintain compatibility between subsystems. You still need to make sure that teams operating on different sides of an internal interface are interpreting the specifications in the same way, that they're making the same assumptions, that they are aware of any imminent changes that must be made, that they are briefing each other on any problems they have, and that they have the same concept of the project as a whole. And in addition to the interface itself, each team really should know just enough about what the other team is doing, that they have confidence their subsystems will work together. Instead of a black-box abstraction, this is a “gray-box” approach that relies on cross-interface awareness.
This famous cover of The New Yorker is one way of thinking about cross-interface awareness: you don't need to know a whole lot about what goes on across the interface in that other subsystem, but you'd better be right if you're planning to support what they're doing beyond the formal details in a specification.
This kind of convergence between teams isn't so hard when they are located near each other. But lack of frequent and open communication between teams can be problematic:
- one team is halfway around the world from the other, and there are limited opportunities for real-time communications
- one team is a contractor with different motivations and backgrounds (you have a job and a future on the line, but they just want to get paid for the project)
- for security reasons, one of the teams can't have access to all the project information
- one team is nearby but just doesn't communicate often
I call these situations “farsourcing”: it's really the psychological distance between teams that matters, rather than physical distance, or whether one team is an outside contractor.
- Rely on common assumptions. After all, everyone knows what a transistor or an R/S latch or a CRC or a quicksort or a semaphore does, right? No need for spending time stating the obvious. Er... um... well... common training or backgrounds can help start teams in sync, but it's not really a viable solution here. Too many things can go wrong.
One point on assumptions that is rather important: don't make them without writing them up front in a formal document. (Ever heard of the term “requirements document”?) If you have a design choice that you're trying to figure out in a group, and one of you says excitedly, “Oh, hey, we don't need to handle that case, because the customer's not going to be using features X and Y at the same time!”, then for goodness sake, if you're not using formal requirements documents, then at least write it down in an assumptions list and make sure that any other design decisions that depend on it are clear. Because if you ever have to remove an assumption, you're going to have to find all the pieces of your project that relied on that assumption being true, and fix them.
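An assumptions list doesn't have to be fancy. Here's a minimal sketch of one that also tracks which design decisions lean on each assumption, so that retracting an assumption tells you what to revisit. The class, its method names, and the example decisions are all hypothetical; a shared spreadsheet would do the same job.

```python
from collections import defaultdict


class AssumptionLog:
    """Tracks written-down assumptions and the decisions that depend on them."""

    def __init__(self) -> None:
        self._text: dict[str, str] = {}
        self._dependents: defaultdict[str, set[str]] = defaultdict(set)

    def record(self, key: str, text: str) -> None:
        """Write the assumption down under a short key."""
        self._text[key] = text

    def depends_on(self, decision: str, key: str) -> None:
        """Note that a design decision relies on a recorded assumption."""
        if key not in self._text:
            raise KeyError(f"assumption {key!r} was never written down")
        self._dependents[key].add(decision)

    def retract(self, key: str) -> list[str]:
        """Remove an assumption; return the decisions that must be revisited."""
        del self._text[key]
        return sorted(self._dependents.pop(key, set()))


log = AssumptionLog()
log.record("no-xy-together",
           "Customer never uses features X and Y at the same time")
log.depends_on("shared buffer for X and Y", "no-xy-together")
log.depends_on("single interrupt priority", "no-xy-together")

# The customer changes their mind: what do we have to go fix?
to_fix = log.retract("no-xy-together")
print(to_fix)
```

The `KeyError` in `depends_on` is the useful part: it refuses to let a decision rest on an assumption nobody bothered to write down.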
Gremlins are unavoidable in complex projects, and you're going to have to pay for them somehow. Either you repel them ahead of time, or you budget the time to deal with them as they come, or you're just raising the odds for failure.
Success and throwing it over the wall
OK, so let's say that the right people get together into a well-functioning team, and build our Tower of Babel, whether it's the Empire State Building, or the electrical grid, or the Internet, or a billion-transistor processor, or an operating system, or a new car, or a jumbo jet, or a space satellite. This team has written a requirements document and internal interface specifications, and performs effectively at all levels. And they've overcome the gremlins.
We'll look at what happens after design and construction in my next post.
Previous post by Jason Sachs:
Isolated Sigma-Delta Modulators, Rah Rah Rah!
Next post by Jason Sachs:
Implementation Complexity, Part II: Catastrophe, Dear Liza, and the M Word