October 18, 2025
Facebook 360 audio encoding and rendering deliver an immersive experience
From 360-degree video to Oculus, Facebook’s 360 audio encoding and rendering technology delivers an immersive experience with fewer channels and less than 0.5 milliseconds of rendering latency. This advancement ensures that users feel fully engaged in the audiovisual world they're exploring.
One of the key innovations is a new 360-degree spatial audio encoding and rendering technology that maintains high-quality sound throughout the entire process—from the editor to the end user. This technology is set to be deployed on a large scale for the first time, offering creators and developers a powerful tool to enhance their content.
We also support a technique called “hybrid higher-order ambisonics,” which keeps spatialized sound at high quality throughout processing. This system uses 8-channel audio with advanced rendering and optimization, delivering superior spatial quality while using fewer channels and saving bandwidth.
Our audio system supports both spatialized audio and head-locked audio. With spatialized audio, sounds respond dynamically to the direction the user is looking within a 360-degree video. With head-locked audio, elements like dialogue and background music stay fixed relative to the user’s head. This is the first time we’ve achieved simultaneous rendering of spatialized and head-locked audio in the industry.
The spatial audio rendering system delivers a real-time experience with sub-0.5 millisecond latency, ensuring smooth and responsive audio playback. The FB360 Encoder tool packages the processed audio for delivery across multiple platforms, and the audio rendering SDK integrates seamlessly into Facebook and Oculus Video, providing a consistent experience from production to release.
Facebook’s 360-degree panoramic video experience is already immersive, but adding 360-degree spatial audio enhances it further. With spatial audio, each sound appears to come from its correct position in space, just as it would in real life. A helicopter flying above the camera sounds like it is overhead, and actors in front of the camera sound like they are right in front of the user. As the user looks around, the system adjusts the audio in real time to match their head movements, producing a truly realistic spatial experience.
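As a toy illustration of that head-tracking step (not Facebook’s implementation), keeping a source world-locked amounts to rendering it at its world-frame direction minus the listener’s current head rotation. The sketch below does this for yaw only, with illustrative angles.

```cpp
// Minimal sketch: keep a source "world-locked" under head tracking by rendering
// it at its world-frame azimuth relative to the listener's current yaw.
#include <cmath>
#include <cstdio>

// Wrap an angle in degrees to the range [-180, 180).
static float wrapDegrees(float deg) {
    deg = std::fmod(deg + 180.0f, 360.0f);
    if (deg < 0.0f) deg += 360.0f;
    return deg - 180.0f;
}

int main() {
    const float sourceAzimuthWorld = 90.0f;  // e.g. helicopter to the listener's left (world frame)
    const float headYaw = 45.0f;             // listener has turned 45 degrees toward it

    // Direction the renderer must use so the source stays fixed in the world.
    const float renderAzimuth = wrapDegrees(sourceAzimuthWorld - headYaw);
    std::printf("Render the source at %.1f degrees relative to the head\n", renderAzimuth);
    return 0;
}
```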
To achieve this, we developed an audio processing system that creates a sense of immersion without needing a large room or speaker array. One of the main challenges is reconstructing an acoustic environment that reflects the real world and presenting it over headphones at high resolution while tracking where the user is looking. Traditional stereo delivered over headphones may help users tell left from right, but it doesn’t convey depth, height, or direction accurately.
Creating and commercializing spatial audio at scale requires many new technologies. While research in this area has been active, there hasn’t been a reliable end-to-end solution for the consumer market. Recently, we introduced new tools and rendering methods that let us bring high-quality spatial audio to a broad audience. These techniques power a new tool called the Spatial Audio Workstation, which lets creators add spatial audio to 360 videos. The rendering side is built into the Facebook app, so viewers hear the same immersive audio the creators produced.
These improvements help video producers deliver more realistic experiences across devices and platforms. In this article, we’ll explore some of the technical details behind these advancements. But first, let’s take a look at the history and evolution of spatial audio.
Spatial audio has evolved significantly over the years. Head-related transfer functions (HRTFs) allow listeners to perceive sound as coming from specific directions, even over headphones. HRTFs let developers build filters that make sounds appear to come from particular positions: in front of, behind, or beside the listener. HRTFs are typically measured in anechoic chambers, though they can also be derived through other methods.
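As a rough illustration of how an HRTF is applied (a sketch under simplifying assumptions, not the production renderer), binaural rendering of a mono source boils down to convolving it with a left/right head-related impulse response (HRIR) pair measured for the desired direction. Real renderers interpolate between measured directions and use fast partitioned convolution instead of the direct form shown here.

```cpp
// Sketch: binaural rendering of a mono signal with one HRIR pair (the
// time-domain counterpart of an HRTF). Assumes non-empty inputs and
// equal-length left/right HRIRs.
#include <cstddef>
#include <vector>

struct BinauralOutput {
    std::vector<float> left;
    std::vector<float> right;
};

// Direct-form convolution of the mono signal with the left and right HRIRs.
BinauralOutput renderBinaural(const std::vector<float>& mono,
                              const std::vector<float>& hrirLeft,
                              const std::vector<float>& hrirRight) {
    const std::size_t outLen = mono.size() + hrirLeft.size() - 1;
    BinauralOutput out{std::vector<float>(outLen, 0.0f),
                       std::vector<float>(outLen, 0.0f)};
    for (std::size_t n = 0; n < mono.size(); ++n) {
        for (std::size_t k = 0; k < hrirLeft.size(); ++k) {
            out.left[n + k]  += mono[n] * hrirLeft[k];
            out.right[n + k] += mono[n] * hrirRight[k];
        }
    }
    return out;
}
```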
To make users hear spatial sound while watching a 360 video, developers must place sounds correctly in the audio scene. Object-based spatial audio is one method where each sound source (like a helicopter or actor) is saved as a discrete stream with positional metadata. This approach is common in games, where the position of each sound changes dynamically.
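A minimal sketch of what an object-based representation might look like is shown below; the types and fields are hypothetical, not the Spatial Workstation’s actual data model.

```cpp
// Sketch of an object-based scene: each source keeps its own mono audio stream
// plus positional metadata that can change over time.
#include <string>
#include <utility>
#include <vector>

struct Position {
    float azimuthDeg;    // horizontal angle relative to the scene's forward axis
    float elevationDeg;  // vertical angle
    float distanceM;     // distance from the camera/listener in meters
};

struct AudioObject {
    std::string name;              // e.g. "helicopter", "actor_1"
    std::vector<float> monoAudio;  // discrete mono stream for this source
    std::vector<std::pair<double, Position>> keyframes;  // (time in seconds, position)
};

// A scene is simply a collection of objects; at playback time the renderer
// spatializes each one using the position valid for the current timestamp.
using ObjectScene = std::vector<AudioObject>;
```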
Ambisonics is another spatial audio technique that represents the entire sound field. Think of it as a panoramic photo of sound. A multi-channel audio stream represents the full sound field, making it easier to transcode and stream than object-based systems. The order of the ambisonic sound field determines the number of channels: an order-n field uses (n + 1)² channels, so first order needs four channels and third order needs 16. Higher orders mean better quality and positional accuracy.
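The sketch below shows that channel-count rule plus a first-order encode of a mono sample, assuming the ambiX (ACN/SN3D) convention; other conventions use different channel orderings and gains.

```cpp
// Sketch: (order + 1)^2 channel counts, plus first-order encoding of one mono
// sample arriving from (azimuth, elevation) in radians, ambiX ACN/SN3D assumed.
#include <array>
#include <cmath>

constexpr int ambisonicChannelCount(int order) {
    return (order + 1) * (order + 1);   // order 1 -> 4, order 2 -> 9, order 3 -> 16
}

std::array<float, 4> encodeFirstOrder(float sample, float azimuth, float elevation) {
    const float cosEl = std::cos(elevation);
    return {
        sample,                                  // W (ACN 0): omnidirectional
        sample * std::sin(azimuth) * cosEl,      // Y (ACN 1): left/right
        sample * std::sin(elevation),            // Z (ACN 2): up/down
        sample * std::cos(azimuth) * cosEl       // X (ACN 3): front/back
    };
}

static_assert(ambisonicChannelCount(1) == 4 && ambisonicChannelCount(3) == 16,
              "first order is 4 channels, third order is 16");
```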
For creators, the Spatial Audio Workstation is a powerful tool for designing spatial audio for 360 videos and VR experiences. It offers more advanced audio processing than traditional workstations, allowing sound designers to place sounds in 3D space and preview them in real time on VR headsets. This creates a seamless end-to-end workflow from creation to distribution.
Our system outputs eight audio channels, optimized for VR and 360 videos. We call this a hybrid higher-order ambisonic system. It’s tuned to maximize sound quality and positional accuracy while minimizing performance requirements and latency. Additionally, the workstation can output two head-locked audio channels, which remain fixed relative to the user’s head, ideal for narration or background music.
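To make the 8 + 2 split concrete, here is an illustrative sketch (hypothetical layout and function names, not the Audio360 API) of how the two parts might be combined at playback: the eight spatial channels go through the head-tracked binaural renderer, while the head-locked stereo pair is added to the headphone output untouched by head rotation.

```cpp
#include <array>

constexpr int kSpatialChannels = 8;     // hybrid higher-order ambisonic bed
constexpr int kHeadLockedChannels = 2;  // stereo dialogue / music
constexpr int kTotalChannels = kSpatialChannels + kHeadLockedChannels;  // 10

struct StereoFrame { float left; float right; };

// Placeholder spatializer: a real renderer would rotate the sound field by the
// head orientation and apply HRTF filtering; here we simply fold the bed down.
StereoFrame spatializeToBinaural(const std::array<float, kSpatialChannels>& spatial,
                                 float /*yaw*/, float /*pitch*/, float /*roll*/) {
    float sum = 0.0f;
    for (float s : spatial) sum += s;
    return {0.5f * sum, 0.5f * sum};
}

// Final mix for one audio frame: binaural render of the spatial bed, then the
// head-locked stereo is added with no rotation applied.
StereoFrame mixFrame(const std::array<float, kTotalChannels>& frame,
                     float yaw, float pitch, float roll) {
    std::array<float, kSpatialChannels> spatial{};
    for (int c = 0; c < kSpatialChannels; ++c) spatial[c] = frame[c];

    StereoFrame out = spatializeToBinaural(spatial, yaw, pitch, roll);
    out.left  += frame[kSpatialChannels];      // head-locked left
    out.right += frame[kSpatialChannels + 1];  // head-locked right
    return out;
}
```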
Our Spatial Audio Renderer condenses years of technological development into a flexible system that can adapt to various configurations while maintaining high quality. It uses parameterization and representation of HRTFs to balance speed and quality, achieving sub-0.5 millisecond latency. This makes it ideal for real-time applications like 360 videos with head tracking.
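As a sanity check on that latency figure (purely arithmetic, with illustrative block sizes, since the renderer’s actual block size isn’t stated here), the algorithmic latency of a block-based renderer is roughly the block length divided by the sample rate.

```cpp
// Block size vs. algorithmic latency at 48 kHz; any look-ahead or filter delay
// would add to these numbers.
#include <cstdio>
#include <initializer_list>

int main() {
    const double sampleRate = 48000.0;
    for (int blockFrames : {8, 16, 24, 32}) {
        const double latencyMs = 1000.0 * blockFrames / sampleRate;
        std::printf("%2d frames at 48 kHz -> %.2f ms\n", blockFrames, latencyMs);
    }
    // 24 frames at 48 kHz is exactly 0.5 ms, so staying under 0.5 ms means the
    // block (plus any internal delay) has to fit inside roughly 23 frames.
    return 0;
}
```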
This flexibility allows spatial audio to be widely adopted across desktops, mobile devices, and browsers. Our renderers are optimized for different platforms, ensuring a consistent user experience. Consistency is crucial for high-quality spatial audio, as differences in performance can lead to variations in sound quality.
The renderer is part of the Audio360 engine, which spatializes and mixes the hybrid higher-order ambisonic audio and the head-locked audio. Written in C++, it uses vector instructions optimized for each platform and runs efficiently with multithreading. It also talks directly to native audio systems such as OpenSL ES on Android, CoreAudio on iOS/macOS, and WASAPI on Windows, minimizing latency and maximizing efficiency.
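The interaction with those native audio APIs is typically a pull model: the OS calls back on its own audio thread and the engine fills the buffer with freshly rendered audio. The sketch below shows that shape with a hypothetical engine interface; it is not the Audio360 API, and the callback signature is illustrative.

```cpp
#include <cstdint>

// Hypothetical engine facade; the real Audio360 interface is not shown here.
class SpatialEngine {
public:
    void setHeadOrientation(float yaw, float pitch, float roll) {
        yaw_ = yaw; pitch_ = pitch; roll_ = roll;
    }

    // Fill `frames` frames of interleaved stereo. Must be real-time safe:
    // no locks, no allocation. Rendering is stubbed out with silence here.
    void renderBinaural(float* out, std::int32_t frames) {
        for (std::int32_t i = 0; i < 2 * frames; ++i) out[i] = 0.0f;
    }

private:
    float yaw_ = 0.0f, pitch_ = 0.0f, roll_ = 0.0f;
};

// Shape of the callback a platform backend (OpenSL ES, CoreAudio, WASAPI)
// would invoke whenever the audio device needs more data.
void audioDeviceCallback(float* interleavedStereoOut, std::int32_t frames, void* userData) {
    auto* engine = static_cast<SpatialEngine*>(userData);
    engine->renderBinaural(interleavedStereoOut, frames);
}
```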
For the web, the audio engine is compiled to asm.js with Emscripten, so the same C++ code runs across browsers. The Web Audio API handles playback of the spatialized output, keeping quality consistent. Although the JavaScript build runs slower than the native C++ version, it’s fast enough for real-time processing.
As device and browser performance improves, the flexibility and cross-platform design of the system will let us keep raising audio quality over time.
From encoder to client, the spatial audio delivery pipeline is streamlined. The Spatial Workstation Encoder packages the 8-channel spatial audio and the stereo head-locked audio together with the 360 video for upload to Facebook. This ensures that users can enjoy high-quality spatial audio on any device.
We faced challenges in selecting a suitable file format. We chose to use three tracks in MP4: two 4-channel tracks and one 2-channel stereo track. Encoding at a high bit rate minimizes quality loss, as these tracks will be re-encoded on the server.
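The 4 + 4 + 2 packing can be pictured as a simple deinterleave of the 10-channel mix into the three tracks. The sketch below is only an illustration of the layout; the actual muxing is done by the encoder tool.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Track layout used for upload: two 4-channel tracks for the spatial bed and
// one stereo track for the head-locked audio (10 channels total).
constexpr std::array<int, 3> kTrackChannelCounts = {4, 4, 2};

// Split one interleaved 10-channel frame into per-track frames.
std::array<std::vector<float>, 3> splitFrame(const std::array<float, 10>& frame) {
    std::array<std::vector<float>, 3> tracks;
    std::size_t src = 0;
    for (std::size_t t = 0; t < kTrackChannelCounts.size(); ++t) {
        for (int c = 0; c < kTrackChannelCounts[t]; ++c) {
            tracks[t].push_back(frame[src++]);
        }
    }
    return tracks;
}
```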
We also developed a metadata solution using XML inside MP4 boxes, allowing for easy updates and compatibility with tools like MP4Box. Metadata is stored per track and globally, ensuring accurate channel layout mapping.
The Spatial Workstation Encoder muxes the video into the final file without transcoding it and writes the metadata needed for the file to be recognized as a 360 video.
We also support YouTube’s four-channel format for first-order ambisonics.
Once uploaded, the videos undergo transcoding to various formats for different clients. Audio metadata is extracted and mapped to the required formats, with stereo binaural rendering as a fallback option.
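Conceptually, the per-client choice is a simple capability check, something like the sketch below; the capability flags and representation names are hypothetical.

```cpp
#include <string>

struct ClientCapabilities {
    bool supportsSpatialRenderer;  // spatial renderer available on this client
    bool supportsMultichannel;     // can decode the 8 + 2 channel layout
};

// Pick the audio representation to stream to a given client.
std::string selectAudioRepresentation(const ClientCapabilities& client) {
    if (client.supportsSpatialRenderer && client.supportsMultichannel) {
        return "spatial-10ch";     // head-tracked spatial audio
    }
    return "binaural-stereo";      // static binaural fallback
}
```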
Audio and video can be processed separately and delivered via adaptive streaming protocols.
Different clients have varying capabilities, so we prepare multiple formats for iOS, Android, and web browsers. We prefer MP4 for iOS and WebM for Android and web, due to better compatibility and performance.
We are also exploring Opus as a codec for spatial audio because of its efficient compression and fast decoding. Opus is not yet widely supported inside MP4, so we are working on integrating Opus-in-MP4 support through FFmpeg.
Transcoding the three-track format into a single 10-channel Opus track posed challenges, but we solved it by using the undefined channel mapping family and transmitting the channel layout information in a manifest file.
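For the Opus step, libopus’s multistream API can encode an arbitrary channel count; a hedged sketch is shown below. The channel mapping family byte (255 for “undefined”) lives in the container or manifest rather than in this call, and the stream/coupling split shown here (ten independent mono streams) is just one workable choice, not necessarily the one used in production.

```cpp
// Sketch: encode the 10-channel mix (8 spatial + 2 head-locked) with the
// libopus multistream API. Error handling is minimal.
#include <opus/opus_multistream.h>
#include <cstdio>
#include <vector>

int main() {
    const int sampleRate = 48000;
    const int channels = 10;          // 8 spatial + 2 head-locked
    const int streams = 10;           // one mono stream per channel
    const int coupledStreams = 0;     // no stereo-coupled pairs
    const unsigned char mapping[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    int err = 0;
    OpusMSEncoder* enc = opus_multistream_encoder_create(
        sampleRate, channels, streams, coupledStreams, mapping,
        OPUS_APPLICATION_AUDIO, &err);
    if (err != OPUS_OK || enc == nullptr) {
        std::fprintf(stderr, "encoder init failed: %d\n", err);
        return 1;
    }

    // Encode one 20 ms frame of silence (960 samples per channel at 48 kHz).
    const int frameSize = 960;
    std::vector<float> pcm(frameSize * channels, 0.0f);          // interleaved
    std::vector<unsigned char> packet(1500 * streams);
    const opus_int32 bytes = opus_multistream_encode_float(
        enc, pcm.data(), frameSize, packet.data(), (opus_int32)packet.size());
    std::printf("encoded %d bytes\n", bytes);

    opus_multistream_encoder_destroy(enc);
    return 0;
}
```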
Looking ahead, we aim to develop a file format that stores all audio in one track with lossless encoding. We’re also exploring improvements in Opus for spatial audio, including adaptive bitrate and channel layout techniques to enhance user experience under varying bandwidth conditions.
As we move forward, we remain committed to pushing the boundaries of spatial audio, creating richer, more immersive experiences for users worldwide.