First, a disclaimer: I do not have much experience with inter-process communication (though I have some), and no direct experience with mmapped files.

Communication through mmapped files, or shared memory in general, is inherently asynchronous, so the lack of multiple OS threads should not be a blocker if the protocol is well designed.

I expect you would need some sort of out-of-band signalling so that polling is not necessary. Pipes or sockets could be used for that: they are slower for data transfer (more copying), but they are compatible with select() and other OS polling features.
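To make the signalling idea concrete, here is a minimal sketch in Python (my choice of language for illustration, not necessarily yours). It assumes a one-byte notification is enough to say "there is new data in shared memory". The helper name wait_for_signal is hypothetical, and socketpair() stands in for whatever pipe or socket the two processes would really share.

```python
import select
import socket

def wait_for_signal(sock, timeout=None):
    """Block until a wake-up byte arrives on sock; return True if one did."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False          # timed out, nothing was signalled
    sock.recv(1)              # consume the one-byte notification
    return True

# Assumption: both ends live in one process here, purely for demonstration.
notifier, waiter = socket.socketpair()

nothing_yet = wait_for_signal(waiter, timeout=0)   # no signal sent yet
notifier.send(b"\x01")                             # "data is ready"
signalled = wait_for_signal(waiter, timeout=1.0)   # wakes up without busy polling
```

The waiting side sleeps inside select() instead of spinning on a flag in shared memory, which is the whole point of the out-of-band channel.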

You could build synchronisation primitives on top of sockets, too, but building things like mutexes that way would be inefficient. I guess you could use pipes for all the message sending and memory management, and shared memory for bulk data transfer. A large number of messages could be handled in batches this way:

1. allocate a message area (socket round trip)
2. fill the message area
3. signal to process a block of messages (socket one way)
4. optimistically check for incoming messages in shared memory
5. wait for socket input
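
A rough sketch of the fill, signal and wait steps, again in Python and again only under assumptions: the message area is a fixed anonymous mmap rather than one negotiated over a socket round trip, messages are length-prefixed byte strings, and the notice sent over the socket is an (offset, count) pair. All names here (AREA_SIZE, NOTICE, write_batch, read_batch, receive) are made up for illustration. The allocation round trip and the optimistic check of shared memory are left out to keep it short.

```python
import mmap
import select
import socket
import struct

AREA_SIZE = 4096                  # hypothetical size of the shared message area
NOTICE = struct.Struct("<II")     # (offset, message count) sent over the socket
LENGTH = struct.Struct("<I")      # length prefix stored before each message

def write_batch(area, offset, messages):
    """Fill the message area with length-prefixed messages; return the end offset."""
    pos = offset
    for msg in messages:
        LENGTH.pack_into(area, pos, len(msg))
        pos += LENGTH.size
        area[pos:pos + len(msg)] = msg
        pos += len(msg)
    return pos

def read_batch(area, offset, count):
    """Read count length-prefixed messages starting at offset."""
    pos, out = offset, []
    for _ in range(count):
        (n,) = LENGTH.unpack_from(area, pos)
        pos += LENGTH.size
        out.append(bytes(area[pos:pos + n]))
        pos += n
    return out

def receive(area, sock, timeout=None):
    """Wait for a notice on the socket, then process the whole block at once."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return []
    # A real stream socket may deliver a short read; ignored in this sketch.
    offset, count = NOTICE.unpack(sock.recv(NOTICE.size))
    return read_batch(area, offset, count)

# Assumption: one process plays both roles; real peers would map the same file.
area = mmap.mmap(-1, AREA_SIZE)
sender, receiver = socket.socketpair()

write_batch(area, 0, [b"first", b"second", b"third"])   # fill the message area
sender.send(NOTICE.pack(0, 3))                          # one-way signal for the block
received = receive(area, receiver, timeout=1.0)         # wait for socket input
```

Only the small notice crosses the socket; the message bodies stay in shared memory, so the per-batch copying cost does not grow with the payload size.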

This is not very efficient for messages that depend on one another, since those cannot be batched and each one needs its own signal. But I am not sure you could do better without things like interprocess semaphores or an FFI.

