Thursday, December 6, 2012

Recording VoIP calls using pulseaudio and avconv

For ages, I've wanted an option in Skype or Empathy to record my video and voice calls1. Text is logged constantly because it doesn't cost much in the form of resources, but voice and video are harder.

In lieu of integrated support inside Empathy, and also because I mostly use Skype (for various reasons), the workaround I have is to do an X11 screen grab and encode it to a file. This is not hard at all. A cursory glance at the man page of avconv will tell you how to do it:

avconv -s:v [screen-size] -f x11grab -i "$DISPLAY" output_file.mkv

[screen-size] is in the form of 1366x768 (Width x Height), etc, and you can extend this to record audio by passing the -f pulse -i default flags to avconv2but that's not quite right, is it? Those flags will only record your own voice! You want to record both your own voice and the voices of the people you're talking to. As far as I know, avconv cannot record from multiple audio sources, and hence we must use Pulseaudio to combine all the voices into a single audio source!

As a side note, I really love Pulseaudio for the very flexible way in which you can manipulate audio streams. I'm baffled by the prevailing sense of dislike that people have towards it! The level of script-level control you get with Pulseaudio is unparallelled compared to any other general-purpose audio server3. One would expect geeks to like such a tool—especially since all the old bugs with it are now fixed.

So, the aim is to take my voice coming in through the microphone, and the voices of everyone else coming out of my speakers, and mix them into one audio stream which can be passed to avconv, and encoded into the video file. In technical terms, the voice coming in from the microphone is exposed as an audio source, and the audio for the speakers is going to an audio sink. Pulseaudio allows applications to listen to the audio going into a sink through a monitor source. So in effect, every sink also has a source attached to it. This will be very useful in just a minute.

The work now boils down to combining two sources together into one single source for avconv. Now, apparently, there's a Pulseaudio module to combine sinks but there isn't any in-built module to combine sources. So we route both the sources to a module-null-sink, and then monitor it! That's it.

pactl load-module module-null-sink sink_name=combined
pactl load-module module-loopback sink=combined source=[voip-source-id]
pactl load-module module-loopback sink=combined source=[mic-source-id]
avconv -s:v [screen-size" -f x11grab -i "$DISPLAY" -f pulse -i combined.monitor output_file.mkv

Here's a script that does this and more (it also does auto setup and cleanup). Run it, and it should Just Work™.


1. It goes without saying that doing so is a breach of the general expectation of privacy, and must be done with the consent of all parties involved. In some countries, not getting consent may even be illegal.
2. If you don't use Pulseaudio, see the man page of avconv for other options, and stop reading now. The cool stuff requires Pulseaudio. :)
3. I don't count JACK as a general-purpose audio system. It's specialized for a unique pro-audio use case.