Ok, I know that this is, uh, LATE. Most of you will have been waiting for the continuation of this article series for more than a year, and many of you may already have given up or coded your own synthesizer without my help, or with my help via one of the thousands of e-mail messages I got and tried to reply to.
For those who are still waiting, let's begin.... no, NOT coding a synth. These are not the NeHe tutorials; these are tutorials for people who know how to code and who know what a synthesizer might be. Sorry. Let's instead begin with one of the most feared and hated things in coding: THINKING about what we're going to do next.
Oh, and if you were really waiting for the last year, just reread the previous articles, there are some references to them below :)
4.1 - The synth interface
So, first, we should clarify the interface between the audio/MIDI system and the synth itself. I decided on a small base function set: essentially an init call, a way to feed MIDI data to the synth, and a render call. (As I didn't do any dynamic memory allocations, a deinit/close function or a destructor wasn't necessary.)
All controlling of the synthesizer functions is done via MIDI commands. This can go as far as sending the patch bank over via a SysEx stream, but I preferred giving a pointer to the patch bank memory in the synthInit() function. Keep in mind that if you do this, the synth should react to changes in the patch memory instantly, so your sound editing software can simply link a slider value to a memory address and the synth will follow if you twist your knobs while editing sounds. Or do it the „hard way“ and send all patch changes via MIDI controllers or SysEx commands - it's up to you :)
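For illustration, a minimal sketch of what such an interface could look like - the names and signatures here are my assumptions, not necessarily the original fr08 ones:

// Hypothetical synth interface - names and signatures are assumptions.
// Everything operates on one global synth instance, no dynamic allocation.

// Initializes the synth and hands it a pointer to the patch bank memory;
// the synth reads patch data directly from there, so a sound editor can
// poke values into that memory and the sound follows immediately.
void synthInit(const unsigned char *patchBank, int sampleRate);

// Feeds a buffer of raw MIDI bytes (status byte plus data bytes) to the synth.
void synthProcessMIDI(const unsigned char *midiData, int numBytes);

// Renders numSamples stereo samples into destBuffer (interleaved L/R).
void synthRender(float *destBuffer, int numSamples);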
The synchronisation of MIDI commands and sample output is done by my sound system’s render() loop in the following way:
While still samples to render:
- determine the time of the next pending MIDI event (or take the end of the buffer if there is none)
- render all samples up to that point and add them to the output
- process the MIDI events that are due at that point
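In code, such a loop might look roughly like this - a sketch which assumes that every incoming MIDI event carries a sample-accurate timestamp, and which uses the hypothetical synthRender()/synthProcessMIDI() calls from above:

struct MidiEvent
{
    int           timeStamp;  // position in samples, relative to the buffer start
    unsigned char data[3];    // status byte plus up to two data bytes
    int           length;     // number of valid bytes in data[]
};

// Interleave MIDI processing and audio rendering so that each event takes
// effect at (roughly) the right sample position.
void renderBlock(float *dest, int numSamples, const MidiEvent *events, int numEvents)
{
    int pos = 0;   // current sample position in the output buffer
    int ev  = 0;   // next MIDI event to process

    while (pos < numSamples)
    {
        // render up to the next event (or to the end of the buffer)
        int until = (ev < numEvents) ? events[ev].timeStamp : numSamples;
        if (until > numSamples) until = numSamples;
        if (until > pos)
        {
            synthRender(dest + 2 * pos, until - pos);   // stereo interleaved
            pos = until;
        }
        // feed all events that are due at the current position
        while (ev < numEvents && events[ev].timeStamp <= pos)
        {
            synthProcessMIDI(events[ev].data, events[ev].length);
            ev++;
        }
    }
}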
The synth editing app, though, just had a loop that rendered 256 samples and then sent whatever had arrived at my MIDI-in port in the meantime to the synth. This led to up to 6ms of latency jitter and is generally considered „imprecise as shit“, but as said, I only had two days for the whole GUI app, and it was enough for me. As soon as I find a better way for real-time processing, I'll let you know.
Update (..."I'll let you know"): VST2 instrument plugins rock ;)
4.2 - The synth’s structure
So, what happens now if synthRender() is called? Well, let’s have a look at the following ‘1337 45C11 4R7 picture:
/-------\
|       |
|       |    /-------\   /----\         /--------\   /----\
|       |add |Channel|   |Chan|   add   |Main mix|   |Post|   /---\
| VOICE |===>|Buffer |==>| FX |========>| buffer |==>| FX |==>|OUT|
| POOL  |    \-------/   \----/   ||    \--------/   \----/   \---/
|       |    \________________/   ||         /\
|       |       16 Channels       \/         ||add
|       |                      /-------\  /-------\
\-------/                      |  AUX  |  |  AUX  |
 32 vces                       |buffers|=>|effects|
                               \-------/  \-------/
                               \__________________/
                                        2x
First, we have the voice pool (a detailed description will come later). For each played note, one physical voice is allocated from the pool and assigned to the MIDI channel the note is played on. Now, we accumulate all voices for each channel in a buffer which we call „channel buffer“, and apply the per-channel effects (such as distortion or chorus) to it. This channel buffer is then added to the main mix buffer and an arbitrary number of AUX buffers (just like the AUX sends/returns of a mixing console), which then are processed with other effects (such as delay and reverb) and mixed to the main mix buffer again. And after we’ve processed this main mix buffer with optional post processing effects (such as compressor or EQs), we have our final signal which we can give back to the sound system.
I’ve just mentioned the word mixing. This may be the wrong place, but I will most likely repeat the following thoroughly capitalized sentence anyway:
WHEN MIXING THINGS, DON'T (!!!) DECREASE THE VOLUME OF THE VOICES ACCORDING TO THEIR NUMBER. IT MAKES NO SENSE AT ALL AND WILL SEVERELY F**K UP YOUR AUDIO SIGNAL!
This means: if you've got three voices, never ever try silly things like dividing the resulting signal by three, or advanced versions like "I heard that the volume increases logarithmically, so I'll do something like dividing by ln(voices)" or similar sh*t. "But", you'll say, "if you mix two voices at full volume, your resulting signal will be too loud" - this is of course true, but avoiding signal clipping should be completely in the hands of the person doing the mix (read: the musician), not of a program shuffling volumes around in a completely unreasonable way. So, DON'T make all voices full volume, but a reasonable volume.
Imagine a string quartet. The violin player starts to play his theme, and after a few bars, the cello player also starts. Does the violin get more silent because there’s a cello sound now? Or does the violin get louder as soon as the other instrumentalists leave the stage? Think.
Imagine a mixing desk with only one channel turned on. If you turn a second channel up, does the first one get softer or does it stay at its current volume?
If you’re afraid of clipping, and you know you have several voices, keep the master volume down to a reasonable fixed maximum which won’t produce clipping too early. And then, leave the rest to the musician. But DON’T decrease the gain of your channels as soon as they get more. No. No no no no no no no. Seriously. No.
Ok, back to some more constructive things: a pseudo-code version of the synthRender() function (see the sketch below).
(Please note that though in the ASCII sketch above there's a "16 Channels" below the channel buffer, there's actually only one channel buffer which gets reused for all 16 channels.)
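A rough sketch in C-style pseudo code of how such a render function can be structured, following the signal flow from the diagram above - all names (clearBuffer, addBuffer, processChannelFX and friends) are placeholders of mine, not the actual fr08 functions:

// Per-frame render: voices -> channel buffer -> channel FX ->
// main mix + AUX sends -> AUX FX -> main mix -> post FX.
void synthRenderFrame(float *mainMix, int frameSize)
{
    clearBuffer(mainMix, frameSize);
    for (int aux = 0; aux < NUMAUX; aux++)
        clearBuffer(auxBuffer[aux], frameSize);

    for (int chan = 0; chan < NUMCHANNELS; chan++)
    {
        // the one channel buffer, reused for all 16 channels
        clearBuffer(channelBuffer, frameSize);

        // accumulate all voices currently assigned to this channel
        for (int v = 0; v < NUMVOICES; v++)
            if (voice[v].active && voice[v].channel == chan)
                renderVoiceAdd(v, channelBuffer, frameSize);

        // per-channel effects (distortion, chorus, ...)
        processChannelFX(chan, channelBuffer, frameSize);

        // add to the main mix and to the AUX buffers, scaled by the send levels
        addBuffer(mainMix, channelBuffer, frameSize, channelVolume[chan]);
        for (int aux = 0; aux < NUMAUX; aux++)
            addBuffer(auxBuffer[aux], channelBuffer, frameSize, auxSend[chan][aux]);
    }

    // AUX effects (delay, reverb, ...) and mix back into the main mix
    for (int aux = 0; aux < NUMAUX; aux++)
    {
        processAuxFX(aux, auxBuffer[aux], frameSize);
        addBuffer(mainMix, auxBuffer[aux], frameSize, auxReturn[aux]);
    }

    // post processing (compressor, EQ, ...) on the final signal
    processPostFX(mainMix, frameSize);
}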
The voices themselves are quite simple: there is one global "voice buffer" which first gets cleared, then filled with the sound of each voice and processed. It is then added to the channel buffer with an additional volume adjustment (that is, how loud the voice is).
4.3 - Modulation, MIDI processing, fragmentation et al.
A central question which arises when designing a real-time synthesizer is how to process the modulation sources (such as envelope generators or LFOs), or better: how often to process them. The most common solutions are processing them at the full sampling rate (once per sample - precise, but expensive) or at a lower "control rate", that is once per frame of n samples (much cheaper, but too coarse a granularity produces audible stepping and clicks).
As I really had to go for low CPU usage, I chose the per-frame approach with a frame size of 256 samples (yes, beat me for not following my own advice, next time I'll do it better :)) and did linear interpolation for all volume-related parameters to reduce most of the clicks.
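As an illustration, ramping a per-frame gain value linearly across the frame could look like this (a sketch; the frame size and names are just examples):

const int FRAMESIZE = 256;   // assumed frame length in samples

// Smooth a volume-type parameter across one frame to avoid clicks at the
// frame boundaries: linear interpolation between the previous frame's value
// and the newly calculated one.
void applyGainFrame(float *buffer, float prevGain, float newGain)
{
    float gain = prevGain;
    float step = (newGain - prevGain) / FRAMESIZE;   // per-sample increment
    for (int i = 0; i < FRAMESIZE; i++)
    {
        buffer[i] *= gain;
        gain += step;
    }
}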
So, all modulation stuff (read: everything that doesn't involve generating and processing audio data) is processed at a low rate, about 172Hz in my case (44100/256). And soon the next question pops up: if everything runs at this low rate, wouldn't it make sense to align the MIDI data processing to this rate as well?
Answer: yes and no. I chose not to do it, and in a certain way, I regret it. Aligning MIDI processing to the frame rate will dramatically increase the timing jitter, and a granularity of 256 samples is too much in my opinion - but remember that regardless of what you send to the synth, the result will most probably be heard no sooner than the beginning of the next frame anyway. My solution was to make note-ons and note-offs affect the envelope generators and the oscillator frequency instantly, so the voices COULD start the new notes in time; but sadly, the volume and all other parameters only get updated with the next frame, so the sound will come later, but at least with somewhat more correct timing than with aligning. On the other hand, as soon as the synth runs out of voices (and playing voices have to be re-allocated), you may hear some artefacts resulting from half-set-up voices between the note-on and the next frame. Play around with it; my next try will most probably reduce the frame length to 128 or better 120 samples (256 samples are too long for good drum sounds) and align the MIDI processing (3ms of timing jitter isn't too bad, and hey, being slightly inaccurate gives a synth this nice bitchy „analogue“ feel).
Update (a few months after I've written this *g*): I've decreased the frame length to 120 samples and aligned the MIDI processing to the frame size - and it sounds nearly perfect. The only downside is that the voice on/off clicking became somewhat more audible, but you can compensate for that with standard mixer click removal techniques (the OpenCP source code is a good example :).
4.4 - Voice allocation
On to the voice allocation. As said, whenever a note-on occurs, one voice from the voice pool will be assigned to a channel. And as soon as a voice has finished playing, it will be freed again. So far, so good. The problems arise as soon as all voices are busy playing and a note-on event happens. Let's assume that playing new notes is better than sustaining old ones. So - which voice do we throw away to make room for the new one?
Let's first refine the term „no voice free“ - in the fr08 synth, I can specify the maximum polyphony for a channel, that is, how many notes can be played simultaneously on one channel. This is great for keeping the total polyphony down and reducing unnecessary CPU usage (be warned that voices will most likely continue to be allocated even if they've already faded out below anything audible), but it complicates the matter somewhat. Now we have two different cases: either the whole synth runs out of voices, or the maximum polyphony of the channel is reached. In the first case, the following algorithm applies to all voices; in the second, only to voices belonging to the channel we're trying to play a new note on.
The scheme I used was simply a check of a few rules in top-to-bottom order, based on things like whether a voice is already playing the same note, whether it belongs to the channel in question, and how loud it currently is.
If you’re clever, you can generate a magic number from the volume, the channel and the "same note" check, and simply take the voice with the lowest/highest number. This also allows easy refining of the rules if you find out that your scheme is likely to interrupt the "wrong" voices.
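For illustration, such a magic number could be built like this (a sketch; the rules and weights here are arbitrary examples, not the actual fr08 ones):

struct Voice
{
    int   channel;        // MIDI channel the voice is playing on
    int   note;           // note number it is playing
    float currentVolume;  // current output level, 0..1
};

// Compute a "steal priority" for a playing voice: the voice with the
// highest score gets interrupted first when a new note needs a voice.
int stealScore(const Voice &v, int newChannel, int newNote)
{
    int score = 0;
    if (v.channel == newChannel && v.note == newNote)
        score += 1000;                                // same note on same channel
    if (v.channel == newChannel)
        score += 100;                                 // prefer the same channel
    score += (int)((1.0f - v.currentVolume) * 99.0f); // quieter voices first
    return score;
}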
Freeing the voices again is simple, at least with my chosen architecture: one of the envelope generators is hard-wired to the oscillator volume (more on this later), and every frame the voice allocation algorithm checks whether this EG is in its "finished" state (which is entered after the release phase, once the output level has dropped below -150dB). In this case the voice is simply freed and added to the pool again.
4.5 - Parameters, Controllers and the Modulation Matrix
Whatever you do, whatever capabilities your synthesizer will have, it will have a lot of sound parameters to play with. And you will also want to modify those parameters while playing, because twisting knobs and screwing around with the filter cutoff isn't only mandatory in most newer electronic music styles, but it's first of all helluva fun. And further than that, being able to tweak sound parameters makes the music sound much more alive and much less artificial (See also my short rant on "why trackers suck" in chapter one :).
Another thing we know from chapter one (and I assume that most of you have forgotten about that at this point) is that we wanted to save memory. My goal was to fit the synth and the music into 16K, so I had to think carefully about how to store all those parameters.
If you have worked with real (that is: hardware) synths, you might have noticed that most of them only use a 7bit resolution for most parameters. And for most purposes, it's perfectly sufficient. So, let's first state: we use a range from 0 to 127 (or less) for any parameter and store all parameters as bytes. Should be small enough.
If you worry about the resolution: 7bit for the parameters isn't too bad. We won't process them in this form, and even small controller changes (which would require more resolution) won't sound too bad. And if you're of the opinion that they do, simply do what e.g. the Virus series (made by the German company Access) does (and I can only recommend getting this kickass synth if you've got the money): smooth between the values. I can't tell you any algorithm, as I haven't done this so far, but any nice envelope follower or adaptive low pass filter scheme will most probably do. If you figure it out, please tell me. :)
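A simple variant of such smoothing - not the Virus algorithm, just a one-pole low pass ("leaky integrator") sketch of my own - might look like this:

// Ease the effective parameter value towards the 7bit target value instead
// of jumping; call tick() once per frame (or per sample for extra smoothness).
struct SmoothedParam
{
    float current = 0.0f;
    float target  = 0.0f;
    float coeff   = 0.05f;   // follow speed, 0..1 - an example value, tune by ear

    void  set(float newTarget) { target = newTarget; }

    float tick()
    {
        current += coeff * (target - current);   // one-pole low pass
        return current;
    }
};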
Keeping all parameters in the range between 0 and 127 has another big advantage: You can already bind them to MIDI controllers or update them via SysEx commands. So if you want to twist your cutoff knob now, feel free to do so.
Still, that's quite clumsy. In most cases, you simply don't want to modify your cutoff frequency over the whole range. And you want to have envelope generators and LFOs and all that neato stuff, and want them to modify your sound parameters, too. This is where the modulation matrix comes into play.
To keep it short, the modulation matrix takes care of which modulation source (such as the velocity, or a MIDI controller, or an LFO) affects which parameter, and how much it does so. The two most often used schemes are a full matrix with an amount value for every possible source/destination combination (maximum flexibility, but the storage grows with sources times destinations) and a fixed set of hard-wired routings (cheap, but inflexible).
So I had to come up with something different. And as I wanted to do a software synth (the following solution is in my opinion rather impractical for hardware devices), I came up with this:
My modulation matrix is in fact a list of "source, target, amount" records. This means that I use only three bytes for each modulation (plus one byte telling the number of used modulations), and still have the full flexibility. The first byte is an index into a source table which looks like this:
0 - Note velocity
1 - Channel pressure
2 - Pitch bend
3 - Poly pressure
4 - Envelope Generator 1
5 - Envelope Generator 2
6 - LFO 1
7 - LFO 2
8..15 - reserved for future use
16..127 - MIDI controllers 0-111
Note that all modulation sources have been normalized to a float range of 0.0 .. 1.0 before, and that the EGs/LFOs run at full float resolution instead of 7bit. The one big exception is the Pitch Bend wheel, which covers a range from -1.0 to 1.0 , and has 14bit resolution instead of 7bit, according to the MIDI standard. The Poly Pressure can safely be neglected, as you'd need a poly aftertouch capable keyboard to make use of this, and I haven't seen such a thing with a price tag below "much too high" in my whole life.
You could also say that the note number is a modulation source. I didn't (yet :), but if you want to do it, simply add a preset modulation of e.g. "note number * 1.0 to voice transpose", and you've not only eliminated a few lines of code, but can also play around with quarter-tone scales and "keyboard tracking" for other parameters.
The second byte of a modulation definition is simply an index into my sound parameters in the order in which they're defined. Making sure that no parameter which can't be modulated gets affected was the task of my GUI, which simply doesn't allow setting non-modulatable parameters as a destination. Easy as that.
The third byte is the amount, in the usual 0..127 range. Note that this is a signed value and that you've got to do a "realamount=2*(amount-64)" to get the real amount (-128.0 .. 126.0).
The modulation matrix now simply has to compile an array of all modulation sources, make another array with float representations of all parameters (to get rid of the 7bit resolution) and process the modulation list. For each modulation, value_of_source*amount is calculated and added to the target parameter (which is then clamped to the [0,128] range). And then, the parameters are sent to the synth modules to update their internal states or whatever. Once per frame.
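In code, the whole thing could look roughly like this (a sketch; the struct and function names are placeholders, and the parameter/source arrays are assumed to be prepared elsewhere):

// One modulation routing: three bytes, as described above.
struct Modulation
{
    unsigned char source;   // index into the modulation source table
    unsigned char dest;     // index into the sound parameter list
    unsigned char amount;   // 0..127, signed around 64
};

// Once per frame: start from the stored 7bit patch values, apply all
// modulations, clamp, and hand the float results to the synth modules.
void applyModulations(const unsigned char *patchParams, int numParams,
                      const Modulation *mods, int numMods,
                      const float *sources,   // normalized source values, indexed by source id
                      float *outParams)       // resulting float parameter values
{
    for (int i = 0; i < numParams; i++)
        outParams[i] = (float)patchParams[i];

    for (int m = 0; m < numMods; m++)
    {
        float amount = 2.0f * ((float)mods[m].amount - 64.0f);   // -128 .. 126
        float value  = outParams[mods[m].dest] + sources[mods[m].source] * amount;

        // clamp to the valid parameter range
        if (value < 0.0f)   value = 0.0f;
        if (value > 128.0f) value = 128.0f;
        outParams[mods[m].dest] = value;
    }
}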
4.6 - Make the parameters feel good
This may sound strange, but it is really important for the quality of the music coming out of your synthesizer. What I'm talking about is the effective range and scale of your parameters. The easiest way would be linearly scaling/translating the 0..127 range to whatever range your signal processing algorithms need, and using this value.
But: This won't work.
Ok, in theory it will, and of course the code won't crash, and what the heck, maybe there's even music coming from your speakers, but don't think you'll get ANYTHING really usable out of your synth if you handle your parameters that way. Let me clear up a few things: our complete perception of pitch, volume and sound is highly nonlinear. In fact, almost everything we perceive, we perceive on an exponential scale. Have you e.g. ever noticed that most volume knobs seem to have a drastic effect in their lower range, while e.g. from half turned up to fully turned up, there's not much change? Well, that's because we don't hear "signal amplitude doubled", but rather "volume turned up a certain bit" (which is about 6dB in this case). Try it - if you've got a volume knob on a cheap stereo, going from ¼ to ½ is about the same step as going from ½ to full volume, or from 1/8 to ¼ or whatever (assuming a linear potentiometer, of course).
Same with our perception of pitch. An average human is able to hear frequencies from 16Hz to somewhere between 12 and 20 kHz, depending on his/her age. Let's make this 16Hz .. 16384Hz. That's a factor of 1024, or 2^10, or better: 10 octaves (which is 120 semitones; that fits our typical 0..127 range quite well, don't you think?). And again, doubling the pitch doesn't make us think "the pitch has doubled", but rather "the sound is one octave higher than before".
Even for time values, like envelope attack/release times etc, an exponential scale is often better. But this heavily depends.
In a nutshell, use an exponential function for EVERYTHING dealing with pitch or frequencies, be it the oscillator pitch, be it a filter cutoff or even the operating rate of an LFO. You will most definitely regret anything else. For volumes, it depends on what you want to achieve, and for times, use an exponential scale for everything that has to "feel" right, like envelope release times, and a linear scale for everything that has to be precise, like chorus/flanger delays or compressor lookahead times or whatever.
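For example, mapping a 0..127 parameter exponentially to a frequency or a gain might look like this (a sketch; the base frequency, the ten-octave range and the 12-steps-per-halving are just example choices):

#include <math.h>

// 0..127 parameter -> frequency over ten octaves: 0 -> 16Hz, 120 -> 16384Hz
// (one step = one semitone, 12 steps = one octave).
float paramToFrequency(float param)
{
    return 16.0f * powf(2.0f, param / 12.0f);
}

// 0..127 parameter -> gain with an exponential (dB-like) feel:
// 127 -> 1.0 (full volume), and every 12 steps down halves the level (-6dB).
float paramToGain(float param)
{
    return powf(2.0f, (param - 127.0f) / 12.0f);
}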
And if neither a linear nor an exponential scale fits your need, feel free to experiment with things like square or square root, or x^(constant), or cos²(x), or atan(x). The better it "feels", the more you're able to adjust the parameter to produce something that sounds good. And this is exactly what we want.
This also affects the way that you're calculating things. If you e.g. do everything with exponential values, you'll get a hell of a lot of pow() operations. This takes time, especially when you have short frame times or even calculate the modulations sample-wise. In this case, you might want to fake the pow() function by approximating it with linear interpolation or even strange tricks like handing values from the integer to the floating point domain without conversion, and so on. Just don't do that with frequencies, these have to be as precise as possible. But for everything else (volumes/times/etc), a deviation of <5% will remain almost unnoticed and you'll have saved lots of precious CPU time.
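The "integer to floating point domain" trick mentioned above is, as far as I know, usually something along these lines - a bit-level 2^x approximation (my sketch, and the error of a few percent makes it suitable for volumes and times only, not for pitches):

#include <string.h>
#include <stdint.h>

// Rough 2^x approximation: build the float bit pattern directly. The integer
// part of x goes into the exponent field, the fractional part linearly into
// the mantissa (2^f is approximated by 1+f). Valid for roughly -126 < x < 127.
static float fastExp2(float x)
{
    int32_t bits = (int32_t)((x + 127.0f) * (float)(1 << 23));
    float result;
    memcpy(&result, &bits, sizeof(result));
    return result;
}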
Just remember that "if it sounds good, it is good", and the easier it is for the musician to produce good sounds, the better the music will get. Just don't let the coder's laziness get in the way of the musician's creativity.
4.7 - Various things to consider
If you want to calculate the pitch from the note number, use this formula:
pitch_in_hz=440.0*pow(2.0,(note_number-45)/12.0);
An octave consists of 12 semitones. An octave is a doubling of the frequency, so if you multiply the frequency by 2^(1/12), you get one semitone higher. The "chamber note" A-4 is located at exactly 440Hz (change this value to implement something like a "master tune" function), and is MIDI note #45, considering a sound playing in the 8' octave, which is quite common. So, the above formula should be clear. "note_number" is of course a float value, and you can get "between" the semitones with it. This also means that this formula is the LAST thing to calculate. Every modification of the pitch, be it through modulations, pitch bending, transposing or whatever, must be applied to the note_number variable (by simply adding to it) beforehand. Thus, the pitch will always be correct. Hint: do a range check of this value before using it. It can't go below zero because of the pow() function, but it shouldn't be allowed to go above the Nyquist frequency, which is samplerate/2 (so, 22050 in most cases).
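Put together, the per-voice pitch calculation might end up something like this (a sketch; the variable names and the set of pitch modifiers are mine):

#include <math.h>

// Everything that changes the pitch is added to the note number first,
// and only then converted to Hz - and finally clamped to Nyquist.
float noteToHz(float noteNumber, float transpose, float pitchBendSemitones,
               float modulationSemitones, float sampleRate)
{
    float n  = noteNumber + transpose + pitchBendSemitones + modulationSemitones;
    float hz = 440.0f * powf(2.0f, (n - 45.0f) / 12.0f);

    float nyquist = sampleRate / 2.0f;
    if (hz > nyquist) hz = nyquist;
    return hz;
}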
Then, many amplitude values are given in dB (Decibels). Decibels are a relative unit, so you define a certain level as 0dB, and the actual dB value is a factor with an exponential scale.
Warning: the formula presented below isn't really exact, but about 0.1% off. But as most audio devices and software I ran across treat +6dB as "double", this fits IMO more than the "official" definition.
real_level=base_level*pow(2.0,decibels/6.0);
So, +6dB is twice the level, +12dB is four times the level, and -24dB is 1/8 of the level. Simple as that. And yes, you can't reach zero with it, but there isn't a real zero anyway - just use "too little to be useful" for that.
Then, let me tell you something about panning.
To misquote a commercial ad on German TV: "The history of proper panning is maybe one of the least understood chapters of mankind". I could start ranting now about how panning envelopes have killed music, and how unreal panning slides are, and that I don't know any good "real world" song which plays around with panning beyond simply setting it (apart from sound effects, of course)... anyway, different story.
But did you ever notice that in common tracker music most sounds seem to get louder if the panning is more to the sides and softer if it's in the middle? Well, surprise, they ARE. And that's because from the very beginning, tracker coders forgot about one simple thing: what we perceive as loudness is the energy coming from the speakers to our ears, not the voltage.
Energy is W=P*t, so it's power multiplied by time. Let's forget about time and say, what we hear is the power coming from the speakers.
The power P coming from a stereo speaker system is P=Uleft*Ileft+Uright*Iright, that is, the voltage U multiplied by the current I for each speaker, and both speakers added together. Speakers normally have a constant impedance or resistance (not completely, but let's assume that), and the current I is I=U/R (voltage divided by the resistance). So, if we insert I=U/R into our term for P and assume that both speakers have the same impedance, we get P=(Uleft²+Uright²)/R. Let's also forget about R (because it's constant), so we can say: the perceived volume depends on the sum of the squares of the signal levels.
Ok, most people (and sadly all tracker coders) simply do something like left=sample*(1.0-panning); right= sample*panning; ... which seems ok, because left+right is our original sample again. BUT: Let's try our power formula for 100% left panning and then for middle panning.
For 100% left panning we get:
left = sample
right= 0
power= left²+right² = sample²
And for middle panning we get:
left = sample*0.5
right= sample*0.5
power= left²+right² = 0.25*sample²+0.25*sample² = 0.5*sample²
... Which is only half of the power we got for 100% left panning, or to apply something we learnt above, if we revert the square again, exactly 3dB softer.
The solution is simple: do so-called EQP panning, or "EQual Power" panning. Just take into consideration that the power is about the square of the level, and do something like this:
left = sample*sqrt(1.0-panning);
right= sample*sqrt(panning);
So, for middle panning we get the correct total power, and the volume won't depend on the pan position anymore. The problem now is that it won't work with mono signals. As said above, with the "simple" way, the signal "left+right" (which is mono) isn't dependent on the panning, but it is with EQP panning. So, EQP sucks for stereo signals which are likely to be converted to mono. This either means that you should take a "mono mode" into consideration when designing the synth architecture (completely leaving out any panning), or that you try to find a good middle way, e.g. replacing the sqrt(x) by pow(x,0.75) (remember that the original solution is nothing more than pow(x,1.0) and EQP is pow(x,0.5)).
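A small pan helper with an adjustable law makes it easy to move between the "simple" linear way (exponent 1.0), EQP (0.5) and the compromise just mentioned (0.75) - a sketch:

#include <math.h>

// Pan a mono sample to stereo. 'panning' is 0.0 (full left) .. 1.0 (full right),
// 'law' is the exponent: 1.0 = linear, 0.5 = equal power (EQP), 0.75 = compromise.
void panSample(float sample, float panning, float law, float *left, float *right)
{
    *left  = sample * powf(1.0f - panning, law);
    *right = sample * powf(panning,        law);
}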
EQP is also a good solution in other cases when you mix or split two signals with varying weights, first of all for cross-fades. If you ever tried the cross fade function of FastTracker 2's sample editor, you may have noticed that in most cases the volume drops to about half the original volume in the middle of the fade. With EQP fades instead of linear ones, this would have worked. And most professional software supports EQP fading for cross-fading audio files into each other among other things.
So, conforming to the last paragraph, let's rephrase everything we've learnt in one three-word sentence:
Linear is bad.