[To be published in Computer Music Journal 18:4 (Winter 94)]

Examples of ZIPI Applications

Matthew Wright
Center for New Music and Audio Technologies (CNMAT)
Department of Music, University of California, Berkeley
1750 Arch St.
Berkeley, California 94720 USA
Matt@CNMAT.Berkeley.edu

In this article, we list some sample applications of ZIPI: alternate controllers, transfers of large amounts of data, and some common applications from the MIDI world. We show how each would be implemented in ZIPI and, where appropriate, evaluate ZIPI's performance. Basic knowledge of ZIPI is assumed.

A ZIPI Guitar
=============

Zeta Music will soon release a combination ZIPI controller and synthesizer. The controller will be a general-purpose sound-to-ZIPI converter, taking arbitrary sound sources as its input and producing ZIPI control data. The synthesizer portion is the opposite, taking incoming ZIPI control data and synthesizing sounds.

The controller will work with any instrument, but for now assume it is connected to a guitar with a hexaphonic pickup. ("Hexaphonic" means one output channel for each of the six guitar strings, instead of the usual case in which the sound of all six strings comes out of one output.) The controller will track pitch, loudness, even/odd balance, pitched/unpitched balance, and brightness for each of the six strings in real time, updating each parameter every 8-10 msec. It will also produce articulation information, noting when trigger and release events occur on each string.

The built-in synthesizer will use sample playback and will be able to vary the pitch, loudness, even/odd balance, pitched/unpitched balance, and brightness of a sound. Thus the synthesized sound will be able to follow all of these nuances of the acoustic sound, not just its pitch and volume.

The idea is to give an instrumentalist---a guitar player---more expressive control over the sounds produced by a synthesizer. When the guitarist picks closer to the bridge, a synthesizer producing an organ sound would make it brighter. When the guitarist plays a 12th-fret harmonic, a trumpet sound would go up an octave and change to a quieter, more delicate timbre. When the guitarist partially mutes the strings with his or her palm, a saxophone sound would take on a more breathy, noisy character. When the guitarist completely mutes the strings with his or her left hand and scratches rhythmically, a piano patch would produce just the sound of hammers hitting strings, with no pitched content.

This instrument will use ZIPI to control an external synthesizer with the same high-bandwidth continuous control information that it uses internally. In the default configuration, the controller will send the information from each guitar string to a separate MPDL note address. It would be useful for these six MPDL notes to be in the same instrument, so that all of the guitar's synthesized sound could be controlled with a single message, e.g., pan or amplitude.

How much bandwidth does this require? Every 10 msec the controller will send a ZIPI packet that looks like this: address of note 1; pitch, loudness, brightness, even/odd balance, noise balance; new address: note 2; pitch, loudness, brightness, even/odd balance, noise balance; and so on for the other four notes. Pitch and loudness are 3-byte messages (including the note descriptor ID); the other three are 2-byte messages.
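To make the byte counting concrete, here is a minimal C sketch that tallies these message sizes. The sizes come from the description above; the constant names and framing details are mine, invented for illustration, and do not represent an official ZIPI wire format. The totals it prints are derived step by step below.

    /* Hypothetical tally of the guitar controller's 10-msec update
     * packet.  Message sizes are from the text; the constant names
     * are invented for illustration, not an official wire format. */
    #include <stdio.h>

    enum {
        PACKET_OVERHEAD  = 7,   /* fixed overhead per ZIPI packet      */
        FIRST_NOTE_ADDR  = 3,   /* note address at start of frame      */
        NEW_ADDR_MESSAGE = 5,   /* "new address" for notes 2 through 6 */
        PITCH_MSG        = 3,   /* descriptor ID plus 16-bit value     */
        LOUDNESS_MSG     = 3,
        BRIGHTNESS_MSG   = 2,   /* descriptor ID plus one data byte    */
        EVEN_ODD_MSG     = 2,
        NOISE_MSG        = 2,
        NUM_STRINGS      = 6
    };

    int main(void)
    {
        int per_string = PITCH_MSG + LOUDNESS_MSG + BRIGHTNESS_MSG +
                         EVEN_ODD_MSG + NOISE_MSG;
        int addresses  = FIRST_NOTE_ADDR +
                         (NUM_STRINGS - 1) * NEW_ADDR_MESSAGE;
        int packet     = PACKET_OVERHEAD +
                         NUM_STRINGS * per_string + addresses;
        double kbaud   = packet * 8 / 0.010 / 1000.0; /* one packet per 10 msec */

        printf("%d bytes per update, %.1f kBaud\n", packet, kbaud);
        return 0;
    }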
So we have 12 bytes (3 + 3 + 2 + 2 + 2) of data per string, or 72 data bytes (6 * 12) altogether. The note address at the beginning of the frame is three bytes; the other five note addresses must be specified with 5-byte "new address" messages, for a total of 28 bytes (3 + 5*5) of addresses. Including the seven overhead bytes for each ZIPI packet, that's a total of 107 bytes per update. Sending 107 eight-bit bytes every 10 msec requires a bandwidth of 85.6 kBaud---just over 1/3 of ZIPI's bandwidth at the slowest speed, and over 2.5 times the bandwidth of MIDI. This does not include articulation messages for triggering and releasing notes, but these would occur much more than 10 msec apart on average and would not take up any appreciable amount of bandwidth.

What happens to the bandwidth if we add many other ZIPI synthesizers to layer the sound? Nothing. ZIPI synthesis modules will typically only listen to the network and not send their own messages, so there could be one or ten synthesizers connected to this controller, with the same network performance in either case.

A ZIPI Keyboard
===============

ZIPI keyboards, like MIDI keyboards, would send messages whenever a key is pressed or released. A ZIPI keyboard would probably pre-allocate as many ZIPI notes as it has keys, sending each of them a pitch message only once, as part of a setup routine. (For a keyboard, each note's address corresponds to a fixed pitch, so MIDI's model, in which address and pitch are one and the same, applies.) A "note-on" packet would have a loudness note descriptor computed from the key's velocity, followed by an articulation note descriptor ("trigger"). A "note-off" packet would consist of a single articulation note descriptor---"release." Seven overhead bytes, plus a 3-byte address, three bytes for the loudness note descriptor, and two bytes for articulation make 15 bytes. At 250 kBaud, this takes 480 microseconds to transmit.

Like all ZIPI controllers, a ZIPI keyboard should be able to send raw controller measurements instead of mapping those measurements onto parameters such as loudness and pitch. In this mode, key number would replace pitch, and key velocity would replace loudness.

It is possible that a ZIPI keyboard controller would want to implement its own note-stealing algorithm rather than rely on that of the synthesizer being controlled. In that case, it could allocate as many notes as the receiving synthesizer has voices of polyphony. In each packet that contains a trigger message, there would also be a pitch message. Now the keyboard knows that any note it asks the synthesizer to play will be played, and if a note needs to be turned off, the keyboard can choose which one. The keyboard controller could even control multiple synthesizers of the same type, allocating notes among them according to their polyphony capabilities.

On a ZIPI keyboard, alternate tunings would be a feature of the keyboard controller, not the synthesizer. Remember that pitch in ZIPI is a 16-bit quantity. The keyboard already has a mapping between key numbers and pitches; for example, it knows that the middle C key has the ZIPI pitch with the hexadecimal value 7900. If the musician wanted an alternate tuning, the keyboard could just use a different mapping, perhaps associating the middle C key with ZIPI pitch hexadecimal 78E2 or 792A.
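A minimal C sketch of such a tuning table follows. The 0x7900 value for middle C and the alternate value 0x78E2 come from the discussion above; the key numbering, table size, and number of pitch units per semitone are assumptions made only for illustration.

    /* Hypothetical tuning table for a 61-key ZIPI keyboard.  The
     * 16-bit pitch 0x7900 for middle C is from the text; the key
     * numbering and step size are assumptions for illustration. */
    #include <stdint.h>

    #define NUM_KEYS       61
    #define MIDDLE_C_KEY   24       /* assumed key index of middle C    */
    #define MIDDLE_C_PITCH 0x7900   /* 16-bit ZIPI pitch of middle C    */
    #define SEMITONE_STEPS 0x0100   /* assumed pitch units per semitone */

    static uint16_t tuning[NUM_KEYS];

    /* Fill the table with 12-tone equal temperament. */
    void init_equal_temperament(void)
    {
        for (int key = 0; key < NUM_KEYS; key++)
            tuning[key] = (uint16_t)(MIDDLE_C_PITCH +
                          (key - MIDDLE_C_KEY) * SEMITONE_STEPS);
    }

    /* An alternate tuning just overwrites entries, e.g., nudging
     * middle C down to 0x78E2, one of the values mentioned above. */
    void retune_middle_c(void)
    {
        tuning[MIDDLE_C_KEY] = 0x78E2;
    }

    /* Called from the key-scanning loop to fill in the pitch
     * note descriptor for the note assigned to this key. */
    uint16_t zipi_pitch_for_key(int key)
    {
        return tuning[key];
    }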
How to do Multis
================

Most MIDI timbre modules are "multi-timbral," meaning that the same synthesizer can produce pitches with different timbres on different MIDI channels. Since MIDI has only 16 channels, it is important to choose which 16 timbres to use at a time. Therefore, many synthesizers have the concept of a "multi," a collection of up to 16 timbres. Sometimes it is possible to select a whole multi at once, which is the equivalent of sending program change messages on all 16 MIDI channels. What is the equivalent mechanism in ZIPI?

In ZIPI, there are 8001 instruments in 63 families, so it is easy to set up an instrument for every timbre you are ever going to need, and then choose a timbre just by selecting a particular instrument to trigger. If you like the idea of "swapping in" a set of instruments, i.e., changing 16 timbres at once, that is still easy in ZIPI. The ZIPI controller would keep a data structure similar to a MIDI multi, but of any size, possibly spanning multiple synthesizers. At the push of a button, the controller could send program change messages to all the appropriate instruments of all the appropriate synthesizers.

Y-splitters and Mergers
=======================

In MIDI, two common tools are the Y-splitter and the merger. The Y-splitter has one MIDI input and multiple MIDI outputs, all of which are copies of the input signal. (Thus, it looks like the letter "Y.") It is useful for sending the control information from one computer or musical instrument to multiple synthesizers, e.g., to make them play in unison. The merger performs the opposite function, taking some number of MIDI inputs and combining the MIDI messages logically into a single stream of MIDI data at the output. It is useful for controlling the same synthesizer or computer with two different MIDI instruments.

Neither of these is needed in ZIPI. Any collection of ZIPI devices can be on the same network, and any of them can send a message to any other. If two devices want to send messages to the same synthesizer, they both send data to the appropriate network device address. Likewise, if a device wants to send a message to two synthesizers, nothing needs to change in the ring configuration: the sending device simply sends two messages addressed to the two devices, and both will reach their destinations. If a message is intended for all devices in the ring, it can be broadcast, meaning that every device sees the same message.

A ZIPI Vocoder
==============

A vocoder is a musical instrument that applies the frequency spectrum of one sound source to a second sound source. The first input signal is analyzed for its frequency content, and the result is used as the parameters of a filter applied to the second input signal. See, for example, Moore (1990) or Buchla (1974) for an explanation of this technique.

A vocoder might consist of the components shown in Figure 1. The first input signal passes through the analysis filter bank and envelope followers, which produce a set of continuously varying control signals. These signals represent the amplitude of the first input signal in each frequency region. The control signals are then used to control the gains of the components of a filter bank applied to the second input signal.

[Figure 1 would go here if this weren't the ASCII version]

One interesting variant on vocoding is to apply some transformation to the control signals between the time when they are produced by the analysis portion and the time when they are used to control the output filter bank (Buchla 1974). For example, one might want to accentuate the effect of the first input signal, e.g., by squaring the control signals. Another interesting transformation would be to have the overall amplitude of the first input signal control the brightness imposed by the output filter bank, by selectively increasing the control signals for the higher-frequency filters. Yet another possibility would be to frequency-shift the control signals, so that each control signal would control the gain of the next higher filter.

These kinds of transformations are easy if the two halves of the vocoder communicate via ZIPI. Suppose we have a vocoder with 20 filter bands, spaced in some reasonable way across the frequency spectrum. We would then use ZIPI to send the 20 control signals. For the usual vocoder behavior, we would simply connect the control ZIPI stream directly to the synthesis filter bank. For more elaborate behaviors, we would apply some transformation to the ZIPI stream, presumably with a computer program, as in the sketch below.
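Here is a minimal C sketch of such a transformation stage, assuming the 20 control signals arrive as an array of 8-bit values (the note descriptor that would carry them is worked out below). The function names are placeholders, not a real ZIPI API.

    /* Hypothetical transformation stage sitting between the analysis
     * and synthesis halves of the vocoder; the ZIPI plumbing around
     * these functions is omitted. */
    #include <stdint.h>

    #define NUM_BANDS 20

    /* Accentuate the first input's effect by squaring each control
     * value, rescaled so the result still fits in one byte. */
    void accentuate(uint8_t c[NUM_BANDS])
    {
        for (int i = 0; i < NUM_BANDS; i++)
            c[i] = (uint8_t)((c[i] * c[i]) / 255);
    }

    /* Frequency-shift the control signals: each envelope now drives
     * the gain of the next higher filter band. */
    void shift_up_one_band(uint8_t c[NUM_BANDS])
    {
        for (int i = NUM_BANDS - 1; i > 0; i--)
            c[i] = c[i - 1];
        c[0] = 0;
    }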
Exactly what would we send over ZIPI? In the best case, the analysis filter bank would produce logarithmic amplitudes and the synthesis filter bank would expect a logarithmic control signal, so we could get by with 8-bit data. Since all 20 of these numbers are produced at once, there is no need to send them individually, so we would probably send them as a single 20-data-byte note descriptor, picking one of the undefined n-byte note descriptors, e.g., hexadecimal C9, for this purpose. We might sample the envelope followers in the analysis bank every 5 msec, and so update this note descriptor every 5 msec. The update packet would have seven bytes of overhead, a 3-byte address, a 1-byte note descriptor ID, another two bytes to indicate 20 data bytes, and the 20 data bytes themselves, for a total of 33 bytes. At 250 kBaud, it would take just over a millisecond to send this packet, using about 1/5 of ZIPI's bandwidth.

If the filter banks used linear control signals instead of logarithmic ones, eight bits would not be enough resolution, so we would switch to 16-bit control signals. Our packets would then total 53 bytes, taking 1.7 msec to send. In the worst case, the two filter banks would use different units, requiring some mapping function, e.g., linear-to-logarithmic conversion.

Sample Dumps
============

ZIPI will have a separate application layer for audio samples, with a fixed header format specifying the sampling rate, number of channels, etc. This header is likely to be the same as an existing sound file standard (van Rossum 1993), but might be specially designed for ZIPI.

In most cases, sample dumps would be sent as packets with "guaranteed delivery," meaning that the receiving device has to acknowledge successful receipt of each packet. The longest legal ZIPI packet has 4096 data bytes, plus seven overhead bytes, so samples will have to be broken up into multiple ZIPI packets. After getting one of these packets, the receiving device has to send back an acknowledgment of around 8 bytes. Therefore, it takes 4111 bytes over the network to reliably transmit 4096 data bytes---99.64 percent efficiency. ZIPI's lower network levels compute a hardware CRC checksum, so if there is any corruption of the data, the receiving device will know it. (Of course, if there is data corruption, those data bytes will have to be re-transmitted, slowing down the process. But that is better than a garbled sample file!)

Assuming 16-bit monophonic PCM data, 4096 bytes is 2048 samples. At a 44.1 kHz sampling rate, that is about 46 msec of sound.
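A rough sketch of the transmit side of such a guaranteed-delivery dump appears below. The zipi_send() and zipi_wait_for_ack() calls are hypothetical stand-ins for whatever the real driver would provide; only the 4096-byte maximum packet size comes from the text.

    /* Rough sketch of a guaranteed-delivery sample dump.  The sound
     * is carved into maximum-length 4096-byte ZIPI packets, and each
     * packet is re-sent until the receiver acknowledges it.  The
     * zipi_send() and zipi_wait_for_ack() calls are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PACKET_DATA 4096    /* longest legal ZIPI packet body */

    extern void zipi_send(const uint8_t *data, size_t len);
    extern int  zipi_wait_for_ack(void);    /* nonzero on success */

    void send_sample_dump(const uint8_t *sample, size_t len)
    {
        size_t sent = 0;
        while (sent < len) {
            size_t chunk = len - sent;
            if (chunk > MAX_PACKET_DATA)
                chunk = MAX_PACKET_DATA;
            do {
                zipi_send(sample + sent, chunk);
            } while (!zipi_wait_for_ack());  /* CRC failure: re-send */
            sent += chunk;
        }
    }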
In a ZIPI network running at 1 MBaud, it would take about 33 msec to transmit those 4111 bytes, so CD-quality samples would transfer about 40 percent faster than real time. A ZIPI network running at 20 MBaud would of course be 20 times faster, transmitting the 4111 bytes in 1.64 msec---28 times faster than real time.

Real-time Digital Audio
=======================

Real-time digital audio has something in common with sample dumps---PCM audio data is sent at high speed over ZIPI---but it raises additional issues. First of all, obviously, the network must be able to send the data faster than real time!

Variations in network latency also complicate real-time digital audio. To deal with them, the sending device should time-tag all outgoing audio messages. The receiving device can then buffer data before playing it, imposing a small but fixed delay on the audio stream. Anderson and Kuivila (1986) give an explanation of this process.

For example, consider transmitting CD-quality monophonic digital audio over a 1 MBaud ZIPI network. As mentioned above, there is more than enough bandwidth for this, with 40 percent to spare. What are the expected and worst-case latencies? A maximum-length ZIPI packet and its acknowledgment take 33 msec to transmit, so that is the expected (and best-case) latency; but if a packet is lost, it takes another 33 msec to re-transmit. To be conservative, let's say the worst-case latency is 200 msec.

The receiving device would allocate a buffer of 8820 samples, i.e., 200 msec. When it receives a packet of samples, it copies them into the buffer at the appropriate location, based on the packet's time tag. It reads data out of the buffer 200 msec after the sending device transmits it, so the buffer will usually be close to full as long as no packets are lost and there is no unexpected network activity. If a packet is lost, the buffer will continue to be emptied by the playing device but will temporarily stop being filled by the sending device. However, we expect that the sending device will get some data through before the 200-msec buffer empties entirely. Since the network can transmit 40 percent faster than real time, the sending device can use that margin to catch up, eventually nearly filling the buffer again.

Note that the maximum packet size of 4096 bytes is not the only possible packet size for digital audio. Smaller packets require more bandwidth (since the amount of overhead per packet is constant) but give less latency. 256-byte packets would reduce efficiency to 94.5 percent, but would reduce the expected latency to 2.168 msec. In an extreme case, 8-data-byte packets would reduce efficiency to under 35 percent, but would give 0.184 msec latency.

References
==========

Anderson, D., and R. Kuivila. 1986. "Accurately Timed Generation of Discrete Musical Events." *Computer Music Journal* 10(3): 48-56.

Buchla, D. 1974. *Model 296 Spectral Processor Manual.* Berkeley, California: Buchla and Associates.

Moore, F. R. 1990. *Elements of Computer Music.* Englewood Cliffs, New Jersey: Prentice Hall.

van Rossum, G. 1993. *FAQ: Audio File Formats (version 2.10).* Available from Internet server mitpress.mit.edu, file /pub/Computer-Music-Journal/Documents/SoundFiles/AudioFormats2.10.t.