The redesign of the output code is a nice chance to document it. Here we go.

I went through quite some pain getting reliable audio output without glitches, especially with ALSA (and Tru64/mme ... yeah...). In the end, I hacked together something complicated that allows optional threaded double buffering.

It might be worthwhile to keep the ability to switch between an additional writer thread for buffering and simpler I/O done directly inside a worker thread. Let's see.

What it should be:
1. Code specific to a certain type of output collected in a simple output_file.
   This one contains only serial code for loading/play/unload (also seek? that might make sense for controlling some output transport... or writing to a specific portion of a file).
2. Management code collected in the outchannel class, just like the inchannel.

In the future, one should merge code that is obviously identical in output and input channels.


About the threads: We need one (optional?) thread for double (multi) buffering, one for work. Or not?

How does loading an output currently work? Yeah... no threading at all. I have notify stuff in place, but actually the loading happens in mixer time.
That is bad. We need some decoupling here. Then again, live output device switching with low impact on the overall operation only gets really interesting once dermixd can work with multiple output devices properly. That includes handling of clock drift between output devices.

Yes, for the purpose of supporting funky live mixing action, including combining cheap (USB) audio interfaces for cue mixes, one should be prepared for audio clock drift. That means dermixd needs to monitor accumulating differences in the buffer fill of outputs and apply resampling to keep them in time ... or, for performance's sake, just duplicate or drop some samples in each buffer on slave devices -- there should be a way to mark one output as master, to prevent any messing with the quality on that one through pseudo-resampling.
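
To make the drop/duplicate idea concrete, here is a minimal sketch (the function name `nudge_buffer` and its interface are invented for illustration, not dermixd code): depending on the sign of the accumulated drift, one sample per buffer is dropped or duplicated near the middle. A real version would operate on whole frames of interleaved channels and perhaps smooth the seam.

```cpp
#include <cstddef>
#include <vector>

// Pseudo-resampling on a slave output: drop or duplicate a single sample
// per buffer to counter clock drift, instead of real resampling.
// drift > 0: the slave's buffer fill keeps growing (device clock too slow),
// so shorten the buffer by one sample; drift < 0: lengthen it by one.
std::vector<short> nudge_buffer(const std::vector<short>& in, long drift)
{
    std::vector<short> out;
    if(drift > 0 && in.size() > 1)
    {
        // Drop the middle sample so this buffer plays one sample shorter.
        out.assign(in.begin(), in.begin() + in.size()/2);
        out.insert(out.end(), in.begin() + in.size()/2 + 1, in.end());
    }
    else if(drift < 0 && !in.empty())
    {
        // Duplicate the middle sample so this buffer plays one sample longer.
        out.assign(in.begin(), in.begin() + in.size()/2 + 1);
        out.insert(out.end(), in.begin() + in.size()/2, in.end());
    }
    else
        out = in;
    return out;
}
```

One sample per buffer is a tiny rate correction (for 1024-sample buffers, about 0.1%), which should be plenty for real-world crystal drift.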

But: Only outputs meant for realtime playback need such hackery. A file writer should be faster than any other output, naturally -- an empty buffer is no sign of trouble there. Anyhow: The tricky part is diagnosing the skew... once I have a handle on that, the fix is not that hard.

How to measure that? There's the time the writer thread spends waiting for the next buffer after finishing sending one to the output... aaagh, let's think that through later. First things first: get to less embarrassing output code at all.

Threads. Worker thread... 

Shall one be able to pause an output? Well... let's just treat the PAUSED state as an un-loaded device... start loads it again ... So the load action for outputs is rather cheap; it's the start that might take time.

I would like to be able to accurately script the starting of an output ... that would mean instant activation... well, I guess it makes more sense to keep the output rolling and do a scripted bind instead of a start.
Yes... output start/pause can take time. As a quick-response option there's re-routing, which is cheap.
I could also include output gain... another way to silence an output.

So... we've got a subtle change in semantics/behaviour between input and output. Re-think that. Do I want the complication of starting/pausing outputs at all?
What about an output simply being PLAYING after loading a file ... and IDLE when no file is loaded. Having an extra start action that really opens the resource wouldn't help scripting because this action would be of the feedback-type, just like input loading, and thus not be allowed in a script.

So... an output file has to do these things:

- load == open + activate
- unload == deactivate + release resources + forget everything
- play == play a buffer / add to queue

Appropriate states are thus IDLE and PLAYING. Nothing else. If there is nothing active, there's nothing loaded.
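
That load/unload/play list could translate into an interface roughly like this (a sketch with invented names and a dummy device, just to pin down the two-state life cycle; the real code would open/close an actual resource where the comments say so):

```cpp
#include <string>
#include <vector>

// Two states only: IDLE (nothing loaded) and PLAYING (resource open, active).
// load == open + activate; unload == deactivate + release + forget;
// play == hand one mixed buffer over to the device / queue.
class output_file
{
public:
    enum state { IDLE, PLAYING };

    output_file(): current(IDLE), played(0) {}

    bool load(const std::string& name)
    {
        if(current == PLAYING) return false; // must eject first
        devname = name;                      // a real output opens the device here
        current = PLAYING;
        return true;
    }
    void unload()
    {
        devname.clear();                     // a real output closes/releases here
        played  = 0;
        current = IDLE;
    }
    bool play(const std::vector<short>& buffer)
    {
        if(current != PLAYING) return false;
        played += buffer.size();             // a real output writes/queues the buffer
        return true;
    }

    state get_state() const { return current; }
    unsigned long samples_played() const { return played; }

private:
    state current;
    std::string devname;
    unsigned long played;
};
```

Note that play() on an IDLE output simply fails -- if nothing is active, nothing is loaded, matching the state rule above.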

This is a bit less flexible than now, but not really less powerful. Simplicity is power.

OK.. simplified it now. For outputs, there's only load and eject now. That, and binding of input channels, of course. Now I can think of getting the threading stuff right.

Currently, there's a simple output mode without an extra thread and the option of a double-buffered mode with a writer thread. I'll drop the option of having no thread for the output... I need at least one thread so that the loading action does not block other operation. One imaginable application is to seamlessly switch to another output device (internal audio to USB adapter... firewire device ... whatever).
You want to load the new device in the background, without making the current output stutter. When the load is finished... the bindings are set to mirror the old output's mix (hint: there should be an outbind command that names all inputs to bind in one step) ... the old output is ejected ... the playback continues on the new one (an external mixer would have plenty of time to cross-fade without anyone noticing...).

So, I motivated that one needs a worker thread. OK. Now, can this worker thread double as the double-(multi-)buffered writer? Would that be tricky?
It might be easier to make one thread for one task, but the worker thread would have to communicate with the writer anyway -- any action should stop when beginning to mess with the output setup.
So, the threads would be exclusive in any case... so one can think about making it one thread.

The simple worker thread would spend most of its time blocking on a work queue... acting when some action is pushed through. The writer thread would block on the availability of new buffers to write and on the action of actually writing the buffer.
The availability of buffers can be communicated via the work queue... an action can be created for every new buffer. There'd be a fixed number of buffers; the entry routine that hands the mix over to the worker thread would need to block until some buffer is free to use for the next mix. That is implemented already anyway.

One point I realize now, in the context of my experience with abysmal OpenMP scaling of my fluid dynamics model on Solaris, is the danger of having heap allocations during the course of normal computation.
They should be avoided in regular work. That goes for the input side with the play action and for the output side as well: The creation of a new action data structure might introduce an additional mutex... serializing my parallel application at points where I do not expect it.
But then, creating a new string triggers that already. So, optimizing for that would be pointless for client-triggered actions, which already involve parsing of strings... creation of several data structures.
But the internal actions triggered on every mix loop (play on both sides) should do their best to avoid explicit heap allocations.

That can be achieved by having one (or more?) action instance in each channel that is used in communicating with the worker. A new action can only be sent when the old one is done with, so that the data structure can be re-used. The simplest case, which I should start with, really only employs one instance and a semaphore that is posted when the worker thread has finished working on it ... the dispatch method waits on that semaphore before issuing a new action.
It is a trivial step to go from that setup to a batch of action instances, to have the possibility of issuing an action while another one is still being worked on. It's not at all certain that this brings any benefit... there is rarely more than one action pending at once.
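
A minimal sketch of that single-instance handshake (using a mutex/condition pair in place of the semaphore; all names are invented for illustration):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// One reusable action struct per channel: the dispatcher waits until the
// worker is done with the previous action (the "semaphore"), then fills
// the very same data structure again -- no heap allocation in the hot path.
struct single_action
{
    std::mutex mut;
    std::condition_variable cond;
    bool pending;   // action filled, waiting for the worker
    bool quit;
    long payload;   // e.g. the sample count of a play action
    long processed; // what the worker has accumulated so far

    single_action(): pending(false), quit(false), payload(0), processed(0) {}

    // Mixer side: block until the previous action is done, then re-use it.
    void dispatch(long value)
    {
        std::unique_lock<std::mutex> lock(mut);
        cond.wait(lock, [this]{ return !pending; });
        payload = value;
        pending = true;
        cond.notify_all();
    }

    // Ask the worker to exit once all dispatched work is done.
    void shutdown()
    {
        std::unique_lock<std::mutex> lock(mut);
        cond.wait(lock, [this]{ return !pending; });
        quit = true;
        cond.notify_all();
    }

    // Worker side: process actions until told to quit.
    void work()
    {
        std::unique_lock<std::mutex> lock(mut);
        while(true)
        {
            cond.wait(lock, [this]{ return pending || quit; });
            if(pending)
            {
                processed += payload; // stand-in for the real work
                pending = false;
                cond.notify_all();    // "semaphore post": instance free again
            }
            else if(quit)
                return;
        }
    }
};
```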

I shall implement that for the inputs now... then come back to reworking the outputs.
Wait. Having only one action instance turns the work queue into a joke. And: The work queue also means memory allocation when pushing an entry... unless the optimization of the std::vector kicks in to keep some excess memory allocated. Hrmpf. Am I hitting a wall here that wants to tell me that I should separate the playback requests from the mixer from the client requests? At least the latter justify the action queue: Multiple clients can ask something from one channel... they'll need to wait for each other's work being done in any case, but having the mixer thread blocking in the attempt to re-use the single action instance is a very bad idea.
Though, I do already block elaborate actions from being triggered from multiple clients at once. You cannot issue a scan while another scan is running... the channel is in an exclusive state. It would be a PITA to open up this restriction, because the mixer thread needs to know when scan operations end... when multiple operations are issued, there needs to be a counter that tells the mixer to change the state of the channel back to where it can resume normal operation. Chaining up normal operations like start/pause doesn't seem beneficial either... I'd rather have a quick response that the channel is busy right now and that I should try later, than some inexplicable delay until the action is executed.
Well, chaining up heavy operations could be done... but how much gain is there compared to "busy now, try again later"? But what about light operations that can be stacked up without trouble? Like equalizer changes... (inserting filters .. that might block playback... ?)

Hm. Think hard. Do I want to keep the chain? Then I should start with a number of action instances to use, not just one. A configurable number, perhaps. One semaphore is exactly what one needs for signalling the availability of a number of some kind of resource. So, there's an action vector in each channel class... heck, that vector can be right inside the work queue! The push operation can wrap the waiting for a free slot.
This even simplifies code... possibly. Just need to properly handle the handing over of arguments from an input action.
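
The "vector right inside the work queue" idea could look like this (a sketch with invented names): a fixed ring of pre-allocated action slots, where push() wraps the waiting for a free slot and pop() hands an action to the worker.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct action { long payload; };

// Fixed-capacity work queue that owns its action instances: no allocation
// per pushed action, only slot re-use. The condition variables stand in
// for the two counting semaphores (free slots / available actions).
class work_queue
{
public:
    explicit work_queue(std::size_t slots)
    : ring(slots), head(0), count(0) {}

    // Blocks until a slot is free, then re-uses that pre-allocated action.
    void push(long payload)
    {
        std::unique_lock<std::mutex> lock(mut);
        space.wait(lock, [this]{ return count < ring.size(); });
        ring[(head + count) % ring.size()].payload = payload;
        ++count;
        avail.notify_one();
    }

    // Blocks until an action is available; copies it out so the slot
    // becomes free for re-use immediately.
    action pop()
    {
        std::unique_lock<std::mutex> lock(mut);
        avail.wait(lock, [this]{ return count > 0; });
        action a = ring[head];
        head = (head + 1) % ring.size();
        --count;
        space.notify_one();
        return a;
    }

private:
    std::vector<action> ring;
    std::size_t head, count;
    std::mutex mut;
    std::condition_variable space, avail;
};
```

With a slot count of one this degenerates into the single-instance handshake; a configurable count gives the batch variant for free.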

OK. Took some advice... tried to look at the actual impact of always constructing the action anew... and, at least on the dual-core Linux box, there is no distinct measurable impact. I have doubts there will be one on a 16-core Solaris box (that can be tested during lunch break...). The main reason is this: What extra impact may a synchronization point in malloc have when there's already the obvious synchronization point of the action queue itself? I mean, its whole purpose is thread messaging with synchronization.
And then, the most prominent use of malloc appears to be the pushing of a single integer (the sample count) to the play action... not the action itself. So, it is questionable if there's any sense in optimizing that single action allocation, be it periodic or not.
Let's have a look at a box with more cores, but I think this issue can be shoved aside.

Backpedal: Don't confuse the synchronisation between mixer and worker thread with a global lock for memory management. It could really be an issue with mixing 16 channels on a many-core machine.

I tested the input_channel_benchmark on a 16-core Solaris box, and, while it does not scale up to using the 16 cores, the synchronization time really seems firmly grounded in the actual waiting on the semaphores, not in implicit locks from malloc(). So, I shall forget that issue for now. Just alloc actions and push them.

Back to the basics: How should the work of an output channel be structured?
Is one thread enough for playback and actions? Let's try.
As playback and actions (like load/eject) are rather exclusive, this should not hurt?
The double-buffering needs to work. The worker thread writes one buffer while the other buffer is prepared by the mixer. When the mixer finishes the next buffer, it simply puts a playback action on the chain. If there's some other action to push, it's time to interrupt playback anyway.

Eh... now back to the question of playback. How to organize the double buffering?
I need to shove one buffer in... shove the next buffer in ... and then wait for the first buffer to become available again...
Hm, that sounds like a buffer pool. There's one list of available buffers for filling, one list of full buffers for writing. The mixer writes to one from the free buffer list, pushes to the full buffer list... waits until another free one is available...
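
Such a buffer pool could be sketched like this (invented names; one lock guarding both lists, with the mixer and the writer each blocking when their list runs empty):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// A fixed set of audio buffers circulating between a free list (for the
// mixer to fill) and a full list (for the writer thread to send out).
// Pointers stay valid because the backing store is never resized.
class buffer_pool
{
public:
    buffer_pool(std::size_t count, std::size_t samples)
    {
        store.resize(count, std::vector<short>(samples));
        for(std::size_t i = 0; i < count; ++i)
            free_list.push_back(&store[i]);
    }

    // Mixer side: wait for a free buffer to fill with the next mix.
    std::vector<short>* get_free()
    {
        std::unique_lock<std::mutex> lock(mut);
        cond.wait(lock, [this]{ return !free_list.empty(); });
        std::vector<short>* b = free_list.front();
        free_list.pop_front();
        return b;
    }
    // Mixer side: hand a filled buffer over to the writer.
    void push_full(std::vector<short>* b)
    {
        std::lock_guard<std::mutex> lock(mut);
        full_list.push_back(b);
        cond.notify_all();
    }
    // Writer side: wait for the next buffer to write to the device.
    std::vector<short>* get_full()
    {
        std::unique_lock<std::mutex> lock(mut);
        cond.wait(lock, [this]{ return !full_list.empty(); });
        std::vector<short>* b = full_list.front();
        full_list.pop_front();
        return b;
    }
    // Writer side: done writing, the buffer can be re-used for mixing.
    void push_free(std::vector<short>* b)
    {
        std::lock_guard<std::mutex> lock(mut);
        free_list.push_back(b);
        cond.notify_all();
    }

private:
    std::vector<std::vector<short> > store;
    std::deque<std::vector<short>*> free_list, full_list;
    std::mutex mut;
    std::condition_variable cond;
};
```

With a pool of two buffers this is exactly the double-buffered case; more buffers give the writer more slack against jitter, at the price of latency.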
