The following were the initial research requirements for Erlang when Ericsson set out to investigate a new language for telecom (link at the bottom). They are found in the thesis written by Bjarne Däcker, and I think it would be fun to scribble down my thoughts on the different requirements. My view may very well differ from the original views, since I came into the world of Erlang pretty late.
Handling of a very large number of concurrent activities

In a telecom system, or in an internet web server, many things happen concurrently. While one person is initiating a call, another may be talking on a line, and a third caller is trying to set up a conference call between 4 parties. This requires you to be able to operate many things concurrently with each other.
In a web server, it is the same thing. While you are taking in new GET requests, somebody is doing a POST somewhere while another client is getting data through a Server-Sent Events channel.
Note that this is not a requirement for parallelism at all. The only requirement is that we can easily describe such concurrent activities; we don't care if they all execute on a single core.
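As a minimal sketch of what this looks like in Erlang (the module name `many` and the process count are my own choices), each concurrent activity is just a cheap process:

```erlang
%% Spawn 100000 independent processes and wait for each to report back.
%% Processes are cheap enough that this is an ordinary thing to do.
-module(many).
-export([run/0]).

run() ->
    Parent = self(),
    N = 100000,
    [spawn(fun() -> Parent ! done end) || _ <- lists:seq(1, N)],
    collect(N).

collect(0) -> ok;
collect(N) ->
    receive done -> collect(N - 1) end.
```

Nothing here says anything about cores; the scheduler is free to run the processes on one core or many.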
Actions to be performed at a certain point in time or within a certain time

For a telecom system, this is quite important. You must be able to handle timing quite precisely. In principle, you would like hard real-time, but in practice soft real-time is often enough.
But note: this means that you will prefer low latency over system throughput. It is more important that the system begins responding within due time than that it can deliver gigabytes of bandwidth. Often, latency and throughput are at odds with one another: getting latency down can hurt throughput and vice versa.
It also means that your system must focus on being able to run many timers at once and handle all of them precisely. You may be woken up later than the 200ms you specified, but not before.
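This semantics is built into Erlang's selective receive: the `after` clause fires no earlier than the given timeout, though possibly a bit later (soft real-time). A sketch, where the message shape `{reply, Ref, Answer}` is my own invention:

```erlang
%% Wait for a tagged reply, but at most 200 ms. The timeout never
%% fires before 200 ms have passed, though it may fire slightly after.
wait_reply(Ref) ->
    receive
        {reply, Ref, Answer} -> {ok, Answer}
    after 200 ->
        {error, timeout}
    end.
```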
Systems distributed over several computers

This is a requirement for robustness of the system. The interesting thing to note here is that there are two large categories of distributed systems: shared-nothing (SN) and those that are not. While it is highly desirable to have an SN system, it is not always attainable. The problem occurs as soon as you need to share state between the nodes. Many developers attempt to avoid sharing state, for good reasons. But for certain problems, you cannot avoid sharing data. This is where a language with seamless distribution shines.

Sharing information is very important in a telecom system. A configuration change must eventually be distributed to all end points. If one node goes down, another node must be able to keep operating. So a telecom system must share some information quickly and cannot be built as an entirely shared-nothing architecture.
There are other areas where you need to track state, preferably across machines: Computer Game servers, Instant Messaging systems, and Databases are a few such examples. Do also note that every shared-nothing system eventually has a place which shares state. It can be a database deep in the backend which handles multiple requests. It can be a memcached instance. Or a file on disk, even. In any case, few systems share no state.
Where seamless distribution really rocks is when you need in-memory objects of state. If the disk turns out to be too slow, you need to materialize the thing you are operating on in memory and then periodically checkpoint the state to persistent storage. This is the case where it becomes too expensive to take a request, load the state from disk, change and manipulate the state and then store it back to disk.
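In Erlang, sending to a process on another node looks exactly like a local send, which is what makes the distribution seamless. A sketch, where the registered name `config_store` and the message shape are assumptions of mine:

```erlang
%% Push a configuration change to a registered process on some node.
%% The `!` operator is the same whether the target is local or remote.
update_config(Node, Key, Value) ->
    {config_store, Node} ! {set, Key, Value}.
```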
Interaction with hardware

In telecom, there are certain operations which are impossible to achieve in software. Part of the 3G protocol is to recalculate optimal mobile-phone-to-mast configurations once every millisecond. This is impossible to do in software on general-purpose chips. You need to handle it with FPGAs or even purpose-built chips.
Back in the day, when Erlang was first developed, the problem was probably the need to handle ATM switching hardware from the software layer. It also suggests that efficient handling of binary protocol data is important.
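That importance shows in Erlang's bit syntax, which pattern-matches binary protocol data directly. A sketch over a hypothetical packet format of my own devising (4-bit version, 4-bit type, 16-bit length, then the payload):

```erlang
%% Destructure a packet header straight from its wire representation.
%% Len is bound by the match and then used to size the payload field.
parse(<<Version:4, Type:4, Len:16, Payload:Len/binary, Rest/binary>>) ->
    {#{version => Version, type => Type, payload => Payload}, Rest}.
```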
Very large software systems

Of course, what constitutes very large is subject to change over the years. But it does yield some thoughts on how to construct a language. In very large software projects, you will have many programmers working on the same code base. They must be able to use each other's code easily. It must also be possible to evolve the code in one end of the system without affecting other ends.
Compile speed is important. A recompile can't take too long in this setup. Also, it must be easy to construct interfaces that other programmers can use. Note that a major part is to battle change over time in the software, where certain parts of the code get modified over a period of years. This creates its own slew of problems, since the code must still fit together.
Another important point when programming-in-the-large is that you need a way to split up a program into packages and pieces. Otherwise, you can't really manage the complexity. You need a way to take different pieces, describe their dependencies and then assemble them into a working system. Preferably, you also want to be able to seamlessly upgrade one part of the software while keeping other parts constant. This suggests that you must be prepared to replace a package at some point in time, without needing to go back and change other parts of the software.
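In Erlang, the unit of assembly is the module, and its `-export` list is the interface other programmers get to use; everything unexported can change freely. A trivial sketch (the `counter` module is my own example):

```erlang
-module(counter).
-export([new/0, incr/1, value/1]).  %% the public interface

%% Internal representation: a plain integer. Callers never depend on
%% this directly, so it can change without affecting other parts of
%% the system.
new() -> 0.
incr(C) -> C + 1.
value(C) -> C.
```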
Complex functionality such as feature interaction

This requirement ties in with the shared-nothing approach from above. In certain systems, like telecom and computer game servers, the different features of the system will interact in intricate ways. You can't use a database for storing this, since the changes must be kept in main memory; otherwise it is too slow. In other words, it is important that the language allows you to write elaborate and complex solutions to problems where different parts of the system interact in non-trivial ways.
This requirement is very far from the typical web server, where there is only a single interaction chain. A client will talk to a database. Most of the other things happen to be mere glue facilitating this main requirement.
Continuous operation for many years

Telecom systems are expected to have long lifetimes. The systems are expected to run for many years without being stopped for maintenance. Hence you need to handle continuous operation of the system. If a fault occurs, you must be able to inspect the fault while the system is running. You can't stop it and have a look at the stopped system. Furthermore, the concurrency constraints mean that you can't really halt the system, since other parts of the system will continue to operate normally.
It also means that there has to be an upgrade path going forward. When Erlang was designed, it was not clear what kind of system architecture there would be in the future. There were MIPS, Digital Alpha, x86, HP PA-RISC, Sun SPARC, PowerPC and so on. And there were as many different software platforms: OS/2, Windows, UNIX in different incarnations, VxWorks, QNX, NeXT and so on. This may have been the deciding factor in making Erlang into a virtual machine, where ease of portability is more important than execution speed or hardware utilization.
Software maintenance (reconfiguration, etc.) without stopping the system

This is a requirement in internet networking equipment as well as in telecom systems. You can't stop a router when you decide to reconfigure it. It also means that configuration is not always a static thing you can keep in a configuration file. Some of the configuration may be dynamic in nature and be configured as you go along. Probably, this decision was what led to the incorporation of the Mnesia database into Erlang.
It also means that you need to introspect and upgrade the software while it is running. You can't take the system down just to bring it back up in an upgraded state. Luckily, on the internet, we can often get away with some kind of service interruption, if done correctly. In a shared-nothing architecture, we can often roll servers one at a time and thus upgrade the service without anyone noticing. We can do database upgrades by rewriting client code so it can operate on multiple different schemas at a time, and then upgrade the schema. In schemaless databases, we can even upgrade the schema lazily in a read-repair fashion as we read old records.
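In Erlang itself, the basic mechanism is that a fully qualified call such as `?MODULE:loop/1` always jumps to the newest loaded version of the module, so a long-running process picks up new code without stopping. A sketch, with a made-up key/value server:

```erlang
-module(kv).
-export([start/0, loop/1]).

start() -> spawn(fun() -> loop(#{}) end).

%% The fully qualified recursive call means that once a new version of
%% this module is loaded, the next message is handled by the new code.
loop(State) ->
    receive
        {set, K, V} ->
            ?MODULE:loop(State#{K => V});
        {get, From, Ref, K} ->
            From ! {Ref, maps:get(K, State, undefined)},
            ?MODULE:loop(State)
    end.
```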
Games like Guild Wars 2 employ rolling upgrades by running two versions of the software on the same machine. See for instance the Blue/Green deployment idea described by Martin Fowler, et al. The idea is that when they upgrade the game, they begin adding new players to the new version while keeping the old version running until the last player leaves the server. Of course, they can hint to players to reconnect when the population becomes low. It does mean, however, that the player can decide when they want to reconnect. If they are in the middle of something important in the game, they can wait a bit.
But there are important things to think about here: how do you migrate the state of the player from the old version to the new one, and so on.
Stringent quality and reliability requirements

There are certain decisions in Erlang which support these requirements. First, the language decided to use garbage collection, which eliminates many bugs pertaining to memory management right away. Note that the way garbage collection is handled in Erlang (each process has its own small heap, collected independently) means that GC pauses are usually extremely short and thus never a problem for latency.

Second, the language is mostly functional. Only a few parts operate in an imperative way, among them the messaging primitives and ETS tables. The effect is the elimination of a lot of state bugs in the code, which are often problematic in imperative languages.
Another decision is that integers are not bounded in size by default. There are no exceptional cases, and no overflow/underflow bugs can occur. Measurements showed that quite a few bugs in code bases are due to such errors, and the price of correcting faults in large systems tends to be high due to the vast amounts of QA needed. By handling arbitrary-size integers in the virtual machine, you eliminate the cost of fixing these bugs altogether.
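A small illustration: integer arithmetic simply never wraps, so the 64-bit boundary that causes overflow bugs elsewhere is invisible to the programmer:

```erlang
%% (2^64) * (2^64) is computed exactly as 2^128; there is no overflow
%% case to mishandle.
big() ->
    X = 1 bsl 64,
    X * X.
```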
The language prefers operating on functional data structures. This means your programs have few variables used for indexing into a structure; instead you operate with maps and folds over large general structures. It also means your code avoids complex if-then-else mazes and instead has a single generic flow which processes the data.
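For instance, a fold replaces the index-variable loop entirely; the traversal is one generic flow over the structure:

```erlang
%% Sum of squares, with no index variables and no branching maze.
sum_squares(Ns) ->
    lists:foldl(fun(N, Acc) -> N * N + Acc end, 0, Ns).
```

So `sum_squares([1, 2, 3])` gives 14.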
Finally, programs are written in a certain style, OTP, which means that a lot of patterns are covered once and for all. As soon as you see an OTP-compliant system, you instinctively know how to absorb its inner workings. It helps quite a lot when you need to understand a system. OTP also encourages splitting your system into multiple process contexts, which makes each part easier to understand. You only need to understand the part itself and the process contexts it communicates with. Often, this limits the complexity of the system, since you can get away with analyzing only a subset of the whole.
OTP also encourages you to think in terms of system protocols. To an Erlang programmer, an API is often a protocol which describes how you must communicate with a subsystem. It is different from usual library APIs in the sense that it is not always just function calls. It may be asynchronous messages that flow back and forth. That is, the protocol may specify that you send certain messages and will get certain, often different, messages delivered to your mailbox. Erlang terms are symbolic, so you have very good ways to describe the contents of a message.
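A sketch of such an API-as-protocol (the broker and the message shapes are my own invention): the call sends an asynchronous request, and the contract is that tagged replies later arrive in the caller's mailbox, not as return values:

```erlang
%% Subscribe to a topic. Replies arrive asynchronously, later, as
%% {msg, Ref, Payload} messages in the subscriber's mailbox.
subscribe(Broker, Topic) ->
    Ref = make_ref(),
    Broker ! {subscribe, self(), Ref, Topic},
    Ref.
```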
Fault tolerance both to hardware failures and software errors

Note the emphasis that you must be tolerant to hardware failure as well as software failure. In certain situations, the hardware breaks down partially but can still operate with degraded service. If a link is faulty, or you cannot use a given telephony channel, then you may be able to route around the given problem.
In my opinion, this is one of the places where Erlang fares best. In a highly distributed system, you have to sacrifice some failure scenarios. The reason is that handling all of them is too complex and takes too long. Some failure scenarios are even impossible to handle at all, and you are forced to aim differently.
A system cannot be free of errors in hardware or software. The thing under your control is the error rate. Even a highly consistent single-machine system may break down, so the error rate can never be 0, just as in the distributed case. Everything you did not account for is a fault, and the system must be built to tolerate those. This is a fairly complex thing to handle, and Erlang is built with a toolbox allowing you to handle the nastier errors of the lot.
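That toolbox is built on links and monitors, which turn a crash in one process into an ordinary message in another. A minimal sketch (real systems use OTP supervisors rather than a hand-rolled loop like this):

```erlang
%% Run Fun in a separate process and observe its exit instead of
%% crashing with it. A crash becomes data we can act on, e.g. by
%% restarting the worker.
run_supervised(Fun) ->
    {Pid, Ref} = spawn_monitor(Fun),
    receive
        {'DOWN', Ref, process, Pid, normal} -> ok;
        {'DOWN', Ref, process, Pid, Reason} -> {restart, Reason}
    end.
```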
In practice, you are lucky on the internet: there is a noise floor for errors. Suppose your system fails 1 in a million requests. Now suppose a user uses your service a million times. On average, the poor guy should see one service disruption. But what if his ISP fails 10 in a million requests? This is the noise floor in effect: your failures hide below the failure rate of everything around you. People will just retry the request, and if you can then give service, you are relatively safe.