Wednesday, July 26, 2023

What is WebRTC? Why does it need ‘Signalling’?

If you’re new to WebRTC, you must’ve heard that it’s a way to do video calls in your browser without needing to install an app. It’s pretty great!

However, it uses a bunch of really arcane terminology because it builds upon older technologies such as RTP, RTCP, SDP, ICE, STUN, etc. To understand what WebRTC Signalling is, you must first understand these foundational technologies.

Readers who are well-versed in this subject might find some of the explanations annoyingly simplistic to read. They will also notice that I am omitting a lot of detail, leading to potentially misleading statements.

I apologize in advance to these people. I am merely trying to avoid turning this post into a book. If you find a sub-heading too simplistic, please feel free to skip it. :-)

RTP

Real-time Transport Protocol is a standardized way of taking video or audio data (media) and chopping it up into “packets” (you can literally think of them as packets / parcels) that are sent over the internet using UDP. The purpose is to try and deliver them to the destination as quickly as possible.

UDP (user datagram protocol) is a packet-based alternative to TCP (transmission control protocol), which is connection-based. So when you send something to a destination (IP address + port number), it will be delivered if possible but you have no protocol-level mechanism for finding out if it was received, unlike, say, TCP ACKs.

You can think of this as chucking parcels over a wall towards someone whom you can’t see or hear. A bunch of them will probably be lost, and you have no straightforward way to know how many were actually received.
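
To make this fire-and-forget behaviour concrete, here is a minimal sketch of sending a single UDP packet in Python. The destination address and port are made-up placeholders; the point is that sendto() returns as soon as the packet has been handed to the network, with no acknowledgement that it ever arrived.

    import socket

    # A UDP socket: no connection setup, no delivery guarantees.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Hypothetical receiver listening on port 5004.
    payload = b"pretend this is an RTP packet"
    sock.sendto(payload, ("198.51.100.7", 5004))

    # sendto() has returned, but we have no idea whether the packet was
    # received -- there is no ACK at the UDP layer.
    sock.close()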

UDP is used instead of TCP for a number of reasons, but the most important ones are:

  1. TCP is designed for perfect delivery of all data, so networks will often try too hard to do that and use ridiculous amounts of buffering (sometimes 30 seconds or more!), which leads to latencies that are too large for two people to be able to talk over a call.

  2. UDP doesn’t have that problem, but the trade-off is that it gives no guarantees of delivery at all!

    You’d be right to wonder why nothing new has been created to be a mid-way point between these two extremes. The reason is that new transport protocols don’t get any uptake because existing systems on the Internet (operating systems, routers, switches, etc) don’t (want to) support them. This is called Protocol ossification, and it's a big problem for the Internet.

    Due to this, new protocols are just built on top of UDP and try to add mechanisms to detect packet loss and such. One such mechanism is…

RTCP

RTP Control Protocol refers to standardized messages (closely related to RTP) that are sent by a media sender to all receivers, and also messages that are sent back by the receiver to the sender (feedback). As you might imagine, this message-passing system has been extended to do a lot of things, but the most important are:

  1. Receivers use this to send feedback to the sender about how many packets were actually received, what the latency was, etc.
  2. Senders send information about the stream to receivers using this, for instance to synchronize audio and video streams (also known as lipsync), to tell receivers that the stream has ended (a BYE message), etc.

Similar to RTP, these messages are also sent over UDP. You might ask, “what if these are lost too?” Good question!

RTCP packets are sent at regular intervals, so you’d know if you missed one, and network routers and switches will prioritize RTCP packets over other data, so you’re unlikely to lose too many in a row unless there was a complete loss of connectivity.

Peer

WebRTC is often called a “peer-to-peer” (P2P) protocol. You might’ve heard that phrase in a different context: P2P file transfer, such as Bittorrent.

The word “peer” contrasts with “server-client” architectures, in which “client” computers can only talk to (or via) “server” computers, not directly to each other.

We can contrast server-client architecture with peer-to-peer using a real-world example:

  • If you send a letter to your friend using a postal service, that’s a server-client architecture.
  • If you leave the letter in your friend’s mailbox yourself, that’s peer-to-peer.

But what if you don’t know what kind of messages the recipient can receive or understand? For that we have…

SDP

Stands for Session Description Protocol, which is a standardized message format to tell the other side the following:

  • Whether you want to send and/or receive audio and/or video
  • How many streams of audio and/or video you want to send / receive
  • What formats you can send or receive, for audio and/or video

This is called an “offer”. Then the other peer uses the same message format to reply with the same information, which is called an “answer”.

This constitutes media “negotiation”, also called “SDP exchange”. One side sends an “offer” SDP, the other side replies with an “answer” SDP, and now both sides know what to do.
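
To make the offer/answer idea concrete, here is a minimal sketch of generating an offer SDP in Python using the aiortc library (the browser's RTCPeerConnection API looks almost the same). This only illustrates what an offer describes; it is not a complete negotiation.

    import asyncio
    from aiortc import RTCPeerConnection

    async def main():
        pc = RTCPeerConnection()
        # Say we want one audio and one video stream.
        pc.addTransceiver("audio")
        pc.addTransceiver("video")

        # The offer SDP describes the streams and the formats we support.
        offer = await pc.createOffer()
        await pc.setLocalDescription(offer)
        print(pc.localDescription.sdp)

        # The other peer would pass this to setRemoteDescription(), call
        # createAnswer(), and send the answer SDP back to us.
        await pc.close()

    asyncio.run(main())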

As you might expect, there’s a bunch of other technical details here, and you can learn all about them on this excellent page, which explains every little detail. It even explains the format for ICE messages! Which is…

ICE

Interactive Connectivity Establishment is a standardized mechanism for peers to tell each other how to transmit and receive UDP packets. The simplest way to think of it is that it’s just a list of IP address and port pairs.

Once both sides have successfully sent each other (“exchanged”) ICE messages, both sides know how to send RTP and RTCP packets to each other.

Why do we need IP address + port pairs to know how to send and receive packets? For that you need to understand…

How The Internet Works

If you’re connected to the internet, you always have an IP address. That’s usually something like 192.168.1.150 – a private address that is specific to your local (home) network and has no meaning outside of that. Having someone’s private IP address is basically like having just their house number but no other parts of their address, like the street or the city. Useful if you're living in the same building, but not otherwise.

Most personal devices (computer or phone or whatever) with access to the Internet don’t actually have a public IP address. Picking up the analogy from earlier, a public IP address is the internet equivalent of a full address with a house number, street address, pin code, country.

When you want to connect to (visit) a website, your device actually talks to an ISP (internet service provider) router, which will then talk to the web server on your behalf and ask it for the data (website in this case) that you requested. This process of packet-hopping is called “routing” of network packets.

This ISP router with a public address is called a NAT (Network Address Translator). Like the name suggests, its job is to translate the addresses embedded in packets sent to it from public to private and vice-versa.

Let’s say you want to send a UDP packet to www.google.com. Your browser will resolve that domain to an IP address, say 142.250.205.228. Next, it needs a port to send that packet to, and both sides have to pre-agree on that port. Let’s pick 16789 for now.

Your device will then allocate a local port from which to send this packet, let’s say 11111. So the packet header looks a bit like this:

From: 192.168.1.150:11111
To:   142.250.205.228:16789

Your ISP’s NAT will intercept this packet, replace your private address and port in the From field of the packet header with its own public address, say 169.13.42.111, and allocate a new sender port, say 22222:

From: 169.13.42.111:22222
To:   142.250.205.228:16789

Due to this, the web server never sees your private address, and all it can see is the public address of the NAT.

When the server wants to reply, it can send data back to the From address, and it can use the same port that it received the packet on:

From: 142.250.205.228:16789
To:   169.13.42.111:22222

The NAT remembers that this port 22222 was recently used for your From address, and it will do the reverse of what it did before:

From: 142.250.205.228:16789
To:   192.168.1.150:11111

And that’s how packets are sent and received by your phone, computer, tablet, or whatever, when talking to a server.
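
You can see the first half of this on your own machine with a small sketch: bind a UDP socket, send a packet, and print the private address and ephemeral port the OS picked (the destination below is just the illustrative address from above). What this cannot show you is the public address and port your NAT allocated; discovering that mapping is exactly what STUN, described below, is for.

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Sending (or connecting) makes the OS allocate an ephemeral local port.
    sock.connect(("142.250.205.228", 16789))  # illustrative destination
    sock.send(b"hello")

    # Prints something like ('192.168.1.150', 54321): a private address and
    # port that only have meaning inside your local network.
    print(sock.getsockname())

    sock.close()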

Since at least one side needs to have a public IP address for this to work, how can your phone send messages to your friend’s phone? Both only have private addresses.

Solution 1: Just Use A Server As A Relay

The simplest solution is to have a server in the middle that relays your messages. This is how all text messaging apps such as iMessage, WhatsApp, Instagram, Telegram, etc work.

You will need to buy a server with a public address, but that’s relatively cheap if you want to send small messages.

For sending RTP (video and audio) this is accomplished with a TURN (Traversal Using Relays around NAT) server.

Bandwidth can get expensive very quickly, so you don’t want to always use a TURN server. But this is a fool-proof method to transmit data, so it’s used as a backup.
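
In WebRTC, using a TURN server is mostly a matter of configuration. Here is a sketch using aiortc again; the TURN URL and credentials are made-up placeholders for whatever your TURN provider gives you.

    from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

    # Hypothetical TURN server; relayed candidates from it are only used
    # when a direct path between the peers cannot be found.
    config = RTCConfiguration(iceServers=[
        RTCIceServer(
            urls="turn:turn.example.com:3478",
            username="alice",
            credential="s3cret",
        ),
    ])

    pc = RTCPeerConnection(configuration=config)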

Solution 2: STUN The NAT Into Doing What You Want

STUN stands for “Simple Traversal of UDP through NATs”, and it works due to a fun trick we can do with most NATs.

Previously we saw how the NAT will remember the mapping between a “port on its public address” and “your device’s private address and port”. With many NATs, this actually works for any packet sent to that public port by anyone.

This means if a public server can be used to create such mappings on the NATs of both peers, then the two can send messages to each other from NAT-to-NAT without a relay server!

Let’s dig into this, and let’s substitute hard-to-follow IP addresses with simple names: AlicePhone, AliceNAT, BobPhone, BobNAT, and finally STUNServer:19302.

First, AlicePhone follows this sequence:

  1. AlicePhone sends a STUN packet intended for STUNServer:19302 using UDP

    From: AlicePhone:11111
    To:   STUNServer:19302
  2. AliceNAT will intercept this and convert it to:

    From: AliceNAT:22222
    To:   STUNServer:19302
  3. When STUNServer receives this packet, it will know that if someone wants to send a packet to AlicePhone:11111, they could use AliceNAT:22222 as the To address. This is an example of an ICE candidate.

  4. STUNServer will then send a packet back to AlicePhone with this information.

Next, BobPhone does the same sequence and discovers that if someone wants to send a packet to BobPhone:33333 they can use BobNAT:44444 as the To address. This is BobPhone’s ICE candidate.

Now, AlicePhone and BobPhone must exchange these ICE candidates.

How do they do this? They have no idea how to talk to each other yet.

The answer is… they Just Use A Server As A Relay! The server used for this purpose is called a Signalling Server.

Note that these are called “candidates” because this mechanism won’t work if one of the two NATs also picks the public port based on the public To address, not just the private From address. This is called a Symmetric NAT, and in these (and other) cases, you have to fall back to TURN.
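
Putting this together, here is a sketch of how a WebRTC library gathers candidates once you point it at a STUN server (aiortc again; the Google STUN address below is a commonly used public server, but any STUN server works). After setLocalDescription() the local SDP contains a host candidate (the private address) and, if the NAT cooperates, a server-reflexive candidate (the public NAT address and port discovered via STUN).

    import asyncio
    from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

    async def main():
        config = RTCConfiguration(
            iceServers=[RTCIceServer(urls="stun:stun.l.google.com:19302")]
        )
        pc = RTCPeerConnection(configuration=config)
        pc.addTransceiver("audio")

        await pc.setLocalDescription(await pc.createOffer())

        # Candidate lines look roughly like:
        #   a=candidate:... typ host  -> AlicePhone:11111 (private address)
        #   a=candidate:... typ srflx -> AliceNAT:22222  (found via STUN)
        for line in pc.localDescription.sdp.splitlines():
            if "candidate" in line:
                print(line)

        await pc.close()

    asyncio.run(main())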

Signalling Server

Signalling is a technical term that simply means: “a way to pass small messages between peers”. In this case, it’s a way for peers to exchange SDP and ICE candidates.

Once these small messages have been exchanged, the peers know how to send data to each other over the internet without needing a relay.

Now open your mind: you could use literally any out-of-band mechanism for this. You can use Amazon Kinesis Video Signalling Channels. You can use a custom WebSocket or Protobuf-based server.

Heck, Alice and Bob can copy/paste these messages into iMessage on both ends. In theory, you can even use carrier pigeons — it’ll just take a very long time to exchange messages 😉
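
To show how little a bare-bones signalling server has to do, here is a sketch of a relay that simply forwards whatever one connected peer sends to every other connected peer. It assumes a recent version of the third-party websockets Python package, and it has none of the authentication or state management a real deployment needs.

    import asyncio
    import websockets

    peers = set()

    async def relay(websocket):
        # Each message (an SDP offer/answer or an ICE candidate, treated as
        # an opaque string) is forwarded to all other connected peers.
        peers.add(websocket)
        try:
            async for message in websocket:
                for peer in peers:
                    if peer is not websocket:
                        await peer.send(message)
        finally:
            peers.discard(websocket)

    async def main():
        async with websockets.serve(relay, "0.0.0.0", 8765):
            await asyncio.Future()  # run forever

    asyncio.run(main())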

That’s it, this is what Signalling means in a WebRTC context, and why it’s necessary for a successful connection!

What a Signalling Server gives you on top of this is state management: checking whether a peer is allowed to send messages to another peer, whether a peer is allowed to join or be invited to a call, which peers are in a call right now, etc.

Based on your use-case, this part can be really easy to implement or really difficult and full of corner cases. Most people can get away with a really simple protocol, just by adding authorization to this multi-party protocol I wrote for the GStreamer WebRTC multiparty send-receive examples. More complex setups, where all peers aren’t equal, require a more bespoke solution.

Tuesday, September 29, 2020

Building GStreamer on Windows the Correct Way

For the past 4 years, Tim and I have spent thousands of hours on better Windows support for GStreamer. It started in May 2016, when I first wrote about this, and continued with the first draft of the work, which was later revised, updated, and upstreamed.

Since then, we've worked tirelessly to improve Windows support in GStreamer  with patches to many projects such as the Meson build system, GStreamer's Cerbero meta-build system, and writing build files for several non-GStreamer projects such as x264, openh264, ffmpeg, zlib, bzip2, libffi, glib, fontconfig, freetype, fribidi, harfbuzz, cairo, pango, gtk, libsrtp, opus, and many more that I've forgotten.

More recently, Seungha has also been working on new GStreamer elements for Windows such as d3d11, mediafoundation, wasapi2, etc. Sometimes we're able to find someone to sponsor all this work, but most of the time it's on our own dime.

Most of this has been happening in the background, noticed only by people who follow GStreamer development. I think more people should know about the work that's been happening upstream, and the official and supported ways to build GStreamer on Windows. Searching for this on Google can be a very confusing experience, with the top results being outdated links or just plain clickbait.

So here's an overview of your options when you want to use GStreamer on Windows:

Installing GStreamer on Windows

 
GStreamer has released MinGW binary installers for Windows since the early 1.0 days using the Cerbero meta-build system, which Andoni created for the non-upstream "GStreamer SDK" project based on GStreamer 0.10.
 
Today it supports building GStreamer with both MinGW and Visual Studio, and even supports outputting UWP packages. So you can actually go and download all of those from the download page.


This is the easiest way to get started with GStreamer on Windows.
 

Building GStreamer yourself for Deployment

 
If you need to build GStreamer with a custom configuration for deployment, the easiest option is to use Cerbero, which is a meta-build system. It will download all the dependencies for you (including most of the build-tools), build them with Autotools, CMake, or Meson (as appropriate), and output a neat little MSI installer.
 
The README contains all the information you need, including screenshots for how to set things up.


As of a few days ago, after months of work, the native Cerbero Windows builds have also been integrated into our Continuous Integration pipeline that runs on every merge request, which further improves the quality of our Windows support. We already had native Windows CI using gst-build, but this increases our coverage.

Contributing to GStreamer on Windows

 
If you want to contribute to GStreamer from Windows, the best option is to clone the gstreamer monorepo (derived from gst-build, which was created by Thibault), which is basically a meson 'wrapper' project that has all the gstreamer repositories aggregated as subprojects. Once again, the README file is pretty easy to follow and has screenshots for how to set things up.


This is also the method used by all GStreamer developers to hack on gstreamer on all platforms, so it should work pretty well out of the box, and it's tested on the CI. If it doesn't work, come poke us on #gstreamer on OFTC IRC (or the same channel via Matrix) or on the gstreamer mailing list.
 

It's All Upstream.

 
You don't need any special steps, and you don't need to read complicated blog posts to build GStreamer on Windows. Everything is upstream.

This post previously contained examples of such articles and posts that are spreading misinformation, but I have removed those paragraphs after discussion with the people who were responsible for them, and to keep this post simple. All I can hope is that it doesn't happen again.

Monday, August 31, 2020

GStreamer 1.18 supports the Universal Windows Platform

tl;dr: The GStreamer 1.18 release ships with UWP support out of the box, with official GStreamer binary releases for it. Try out the 1.18.0 release and let us know how it goes! There's also an example gstreamer app for UWP that showcases OpenGL support (via ANGLE), audio/video capture, hardware codecs, and WebRTC.

Short History Lesson

 
Last year at the GStreamer Conference in Lyon, I gave a talk (slides) about how “Firefox Reality” for the Microsoft HoloLens 2 mixed-reality headset is actually Servo, and it uses GStreamer for all media handling: WebAudio, HTML5 Video, and WebRTC.

I also spoke about the work we at Centricular did to port GStreamer to the HoloLens 2. The HoloLens 2 uses the new development target for Windows Store apps: the Universal Windows Platform. The majority of win32 APIs have been deprecated, and apps have to use the new Windows Runtime, which is a language-agnostic API written from the ground up.

So the majority of work went into making sure that Win32 code didn't use deprecated APIs (we used a bunch of them!), and making sure that we could build using the UWP toolchain. Most of that involved two components:
  • GLib, a cross-platform low-level library / abstraction layer used by GNOME (almost all our win32 code is in here)
  • Cerbero, the build aggregator used by GStreamer to build binaries for all platforms supported: Android, iOS, Linux, macOS, Windows (MSVC, MinGW, UWP)
The target was to port the core of GStreamer, and those plugins with external dependencies that were needed to do playback in <audio> and <video> tags. This meant that the only external plugin dependency we needed was FFmpeg, for the gst-libav plugin. All this went well, and Firefox Reality successfully shipped with that work.

Upstreaming and WebRTC

 
Building upon that work, for the past few months we've been working on adding support for the WebRTC plugin, and also upstreaming as much of the work as possible. This involved a bunch of pieces:
  1. Use only OpenSSL and not GnuTLS in Cerbero because OpenSSL supports targeting UWP. This also had the advantage of moving us from two SSL stacks to one.
  2. Port a bunch of external optional dependencies to Meson so that they could be built with Meson, which is the easiest way for a cross-platform project to support UWP. If your Meson project builds on Windows, it will build on UWP with minimal or no build changes.
  3. Rebase the GLib patches that I didn't find the time to upstream last year on top of 2.62, split into smaller pieces that will be easier to upstream, update for new Windows SDK changes, remove some of the hacks, and so on.
  4. Rework and rewrite the Cerbero patches I wrote last year that were in no shape to be upstreamed.
  5. Ensure that our OpenGL support continues to work using Servo's ANGLE UWP port
  6. Write a new plugin for audio capture called wasapi2, great work by Seungha Yang.
  7. Write a new plugin for video capture called mfvideosrc as part of the media foundation plugin which is new in GStreamer 1.18, also by Seungha.
  8. Write a new example UWP app to test all this work, also done by Seungha! 😄
  9. Run the app through the Windows App Certification Kit
And several miscellaneous tasks and bugfixes that we've lost count of.

Our highest priority this time around was making sure that everything can be upstreamed to GStreamer, and it was quite a success! Everything needed for WebRTC support on UWP has been merged, and you can use GStreamer in your UWP app by downloading the official GStreamer binaries starting with the 1.18 release.

On top of everything in the above list, thanks to Seungha, GStreamer on UWP now also supports:

Try it out!

 
The example gstreamer app I mentioned above showcases all this. Go check it out, and don't forget to read the README file!
 

Next Steps

 
The most important next step is to upstream as many of the GLib patches we worked on as possible, and then spend time porting a bunch of GLib APIs that we currently stub out when building for UWP.

Other than that, enabling gst-libav is also an interesting task since it will allow apps to use FFmpeg software codecs in their gstreamer UWP app. People should use the hardware accelerated d3d11 decoders and mediafoundation encoders for optimal power consumption and performance, but sometimes it's not possible because codec support is very device-dependent. 

Parting Thoughts

 
I'd like to thank Mozilla for sponsoring the bulk of this work. We at Centricular greatly value partners that understand the importance of working with upstream projects, and it has been excellent working with the Servo team members, particularly Josh Matthews, Alan Jeffrey, and Manish Goregaokar.

In the second week of August, Mozilla restructured and the Servo team was one of the teams that was dissolved. I wish them all the best in their future endeavors, and I can't wait to see what they work on next. They're all brilliant people.

Thanks to the forward-looking and community-focused approach of the Servo team, I am confident that the project will figure things out to forge its own way forward, and for the same reason, I expect that GStreamer's UWP support will continue to grow.

Sunday, April 21, 2019

GStreamer's Meson and Visual Studio Journey


Almost 3 years ago, I wrote about how we at Centricular had been working on an experimental port of GStreamer from Autotools to the Meson build system for faster builds on all platforms, and to allow building with Visual Studio on Windows.

At the time, the response was mixed, and for good reason: Meson was a very new build system, and it needed to work well on all the targets that GStreamer supports, which meant all major operating systems. Meson did aim to support all of those, but a lot of work was required to bring platform support up to speed with the requirements of a non-trivial project like GStreamer.

The Status: Today!

After years of work across several components (Meson, Ninja, Cerbero, etc), GStreamer is being built with Meson on all platforms! Autotools is scheduled to be removed in the next release cycle (1.18). Edit: as of October 2019, Autotools has been removed.

The first stable release with this work was 1.16, which was released yesterday. It has already led to a number of new capabilities:
  • GStreamer can be built with Visual Studio on Windows inside Cerbero, which means we now ship official binaries for GStreamer built with the  MSVC toolchain.
  • From-scratch Cerbero builds are much faster on all platforms, which has aided the implementation of CI-gated merge requests on GitLab.
  • The developer workflow has been streamlined and is the same on all platforms (Linux, Windows, macOS) using the gst-build meta-project. The meta-project can also be used for cross-compilation (Android, iOS, Windows, Linux).
  • The Windows developer workflow no longer requires installing several packages by hand or setting up an MSYS environment. All you need is Git, Python 3, Visual Studio, and 15 min for the initial build.
  • Profiling on Windows is now possible, and I've personally used it to profile and fix numerous Windows-specific performance issues.
  • Visual Studio projects that use GStreamer now have debug symbols since we're no longer mixing MinGW and MSVC binaries. This also enables usable crash reports and symbol servers.
  • We can ship plugins that can only be built with MSVC on Windows, such as the Intel MSDK hardware codec plugin, Directshow plugins, and also easily enable new Windows 10 features in existing plugins such as WASAPI.
  • iOS bitcode builds are more correct, since Meson is smart enough to know how to disable incompatible compiler options on specific build targets.
  • The iOS framework now also ships shared libraries in addition to the static libraries.
Overall, it's been a huge success and we're really happy with how things have turned out!

You can download the prebuilt MSVC binaries, reproduce them yourself, or quickly bootstrap a GStreamer development environment. The choice is yours!

Further Musings

While working on this over the years, what's really stood out to me was how this sort of gargantuan task was made possible through the power of community-driven FOSS and community-focused consultancy.

Our build system migration quest has been long with valleys full of yaks with thick coats of fur, and it would have been prohibitively expensive for a single entity to sponsor it all. Thanks to the inherently collaborative nature of community FOSS projects, people from various backgrounds and across companies could come together and make this possible.

There are many other examples of this, but seeing the improbable happen from the inside is something special.

Special shout-outs to ZEISS, Barco, Pexip, and Cablecast.tv for sponsoring various parts of this work!

Their contributions also made it easier for us to spend thousands more hours of non-sponsored time to fill in the gaps, so that all the sponsored work could be upstreamed in a form that's useful for everyone who uses GStreamer. This sort of thing is, in my opinion, an essential characteristic of being a community-focused consultancy, and we make sure that it always has high priority.

Tuesday, April 10, 2018

A simple method of measuring audio latency

In my previous blog post, I talked about how I improved the latency of GStreamer's default audio capture and render elements on Windows.

An important part of any such work is a way to accurately measure the latencies in your audio path.

Ideally, one would use a mechanism that can track your buffers and give you a detailed breakdown of how much latency each component of your system adds. For instance, with an audio pipeline like this:

audio-capture → filter1 → filter2 → filter3 → audio-output

If you use GStreamer, you can use the latency tracer to measure how much latency filter1 adds, filter2 adds, and so on.

However, sometimes you need to measure latencies added by components outside of your control, for instance the audio APIs provided by the operating system, the audio drivers, or even the hardware itself. In that case it's really difficult, bordering on impossible, to do an automated breakdown.

But we do need some way of measuring those latencies, and I needed that for the aforementioned work. Maybe we can get an aggregated (total) number?

There's a simple way to do that if we can create a loopback connection in the audio setup. What's a loopback you ask?

[Image: Ouroboros snake biting its tail]

Essentially, if we can redirect the audio output back to the audio input, that's called a loopback. The simplest way to do this is to connect the speaker-out/line-out to the microphone-in/line-in with a two-sided 3.5mm jack.

[Photo: male-to-male 3.5mm jack connecting speaker-out to mic-in]

Now, when we send an audio wave down to the audio output, it'll show up on the audio input.

Hmm, what if we store the current time when we send the wave out, and compare it with the current time when we get it back? Well, that's the total end-to-end latency!

If we send out a wave periodically, we can measure the latency continuously, even as things are switched around or the pipeline is dynamically reconfigured.

Some of you may notice that this is somewhat similar to how the `ping` command measures latencies across the Internet.

[Screenshot: ping to 192.168.1.1]


Just like a network connection, the loopback connection can be lossy or noisy, f.ex. if you use loudspeakers and a microphone instead of a wire, or if you have (ugh) noise in your line. But unlike network packets, we lose all context once the waves leave our pipeline and we have no way of uniquely identifying each wave.

So the simplest reliable implementation is to have only one wave traveling down the pipeline at a time. If we send a wave out, say, once a second, we can wait about one second for it to show up, and otherwise presume that it was lost.

That’s exactly how the audiolatency GStreamer plugin that I wrote works! I used it to measure the combined latency of the WASAPI source and sink elements.


The first measurement will always be wrong because of various implementation details in the audio stack, but the next measurements should all be correct.
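
If you want to try it yourself, here is a minimal sketch that runs the element through the GStreamer Python bindings (PyGObject), assuming a physical loopback connection is in place. On Windows you could swap autoaudiosrc/autoaudiosink for wasapisrc/wasapisink to measure that specific path.

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    Gst.init(None)

    # audiolatency sends periodic waves downstream to the sink and measures
    # when they arrive back from the source via the physical loopback.
    pipeline = Gst.parse_launch(
        "autoaudiosrc ! audiolatency print-latency=true ! autoaudiosink"
    )
    pipeline.set_state(Gst.State.PLAYING)

    try:
        GLib.MainLoop().run()  # latency values are printed periodically
    except KeyboardInterrupt:
        pass
    finally:
        pipeline.set_state(Gst.State.NULL)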

This mechanism does place an upper bound on the latency that we can measure, and on how often we can measure it, but it should be possible to take more frequent measurements by sending a new wave as soon as the previous one was received (with a 1 second timeout). So this is an enhancement that can be done if people need this feature.

Hope you find the element useful; go forth and measure!