SAOP


Synthetic Audio Optimization by feature-driven importance. Official page for the analysis and modeling of the SAOP project.

The perceptual quality of synthetic sound models remains a significant challenge for sound designers, primarily due to the lack of standardized objective evaluation methods and the limited availability of diverse synthetic samples. These limitations complicate the identification of perceptual deficiencies in synthesized audio. While classification models can effectively differentiate between real and synthetic sounds, gaining insight into the underlying decision-making process can reveal which acoustic features need refinement.

FEATURE-DRIVEN BOTTLENECK FRAMEWORK FOR TREE-BASED

The model used in SAOP is shown in the following figure:

Image

Figure 1: Feature Concept BottleNeck.

Feature importance states how much a feature impacts the decision-making of a model. To extract the feature importance for each class, we train C models, where C is the number of classes, on one-versus-all (binary) classification tasks, so that each model learns to disambiguate between its target class and all the other classes. The models are trained to convergence on a uniform dataset split to ensure that the feature importance is not biased toward a particular class. From each of the C classifiers we extract the top-k features by global feature importance; these then form the concept labels for that class.
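As a minimal sketch of this first step, the snippet below relabels a multi-class dataset for a one-versus-all task, balances the two sides, and picks the top-k features from a global importance vector. The function names, the undersampling strategy used to obtain a uniform split, and the importance values are all assumptions made for illustration; in practice the importances would come from a trained tree-based classifier (for example the `feature_importances_` attribute of a scikit-learn tree ensemble).

```python
import random

def one_vs_all_labels(labels, target):
    """Relabel a multi-class label list as 1 (target class) vs 0 (all others)."""
    return [1 if y == target else 0 for y in labels]

def balanced_indices(binary_labels, seed=0):
    """Undersample the larger side so both sides have equal counts (assumed strategy)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))

def top_k_features(importances, k):
    """Return the k feature names with the highest global importance."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

labels = ["applause", "bubbles", "applause", "droplets", "bubbles", "droplets"]
binary = one_vs_all_labels(labels, "applause")
keep = balanced_indices(binary)

# Hypothetical importances, standing in for the output of a trained tree-based model.
importances = {
    "pitch_salience": 0.31,
    "mfcc_3": 0.05,
    "attack_start": 0.22,
    "rms": 0.12,
    "crest": 0.30,
}
concepts = top_k_features(importances, k=3)
```

Repeating this for every class would give one list of concept labels per class.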

The second part of the feature-driven bottleneck framework is to predict each class’s top-k feature importance. Therefore, instead of training a model to predict the class based on the input features, we train a model to predict the top-k feature importance for the particular class, using the softmax top-k feature importance for the associated class from step 1.
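One possible reading of the "softmax top-k feature importance" target is sketched below: the importances of the selected features are passed through a softmax so that they form a distribution the second-stage model can be trained to predict. The helper names and the importance values are hypothetical, and the exact normalisation used in SAOP may differ.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def concept_target(importances, concepts):
    """Softmax-normalise the importance of the selected top-k features."""
    scores = softmax([importances[name] for name in concepts])
    return dict(zip(concepts, scores))

# Hypothetical step-1 output for a single class.
importances = {"pitch_salience": 0.31, "crest": 0.30, "attack_start": 0.22, "rms": 0.12}
concepts = ["pitch_salience", "crest", "attack_start"]

target = concept_target(importances, concepts)
```

The resulting values sum to 1 and preserve the ranking of the original importances.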

The third part of the feature-driven bottleneck framework takes the predictions from step 2 and trains a model to use the predicted top-k feature importance to predict the associated class.

Step 2 and Step 3 are combined to create the overall predictive process, shown in Figure 2, which takes in features and predicts the top-k feature importance; this prediction is then used as input to the class classification model. At test time, a human expert can intervene between steps 2 and 3 to correct errors in the predicted feature importance, which can hypothetically enable near-perfect accuracy. Human intervention is a core contribution and, as far as we are aware, the first attempt to offer a bridge between machine learning and a human expert in audio classification.

Image

Figure 2: Feature concept bottleneck process.
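The combined process with test-time intervention can be sketched as follows. The two models are toy stand-ins written as plain functions, and the function names, threshold, and feature values are assumptions for illustration only; they are not the trained SAOP models.

```python
def predict_with_intervention(features, concept_model, class_model, corrections=None):
    """Run step 2, optionally overwrite predicted concepts, then run step 3."""
    predicted = dict(concept_model(features))
    if corrections:
        # A human expert replaces concept values they judge to be wrong.
        predicted.update(corrections)
    return class_model(predicted), predicted

def toy_concept_model(features):
    """Toy step-2 model that underestimates pitch confidence."""
    return {
        "pitch_confidence": features["pitch_confidence"] * 0.5,
        "attack_start": features["attack_start"],
    }

def toy_class_model(concepts):
    """Toy step-3 model: a single threshold on one concept."""
    return "applause" if concepts["pitch_confidence"] > 0.4 else "other"

features = {"pitch_confidence": 0.6, "attack_start": 0.2}

label_before, _ = predict_with_intervention(
    features, toy_concept_model, toy_class_model
)
label_after, fixed = predict_with_intervention(
    features, toy_concept_model, toy_class_model,
    corrections={"pitch_confidence": 0.6},
)
```

In this toy case the uncorrected prediction is "other", while correcting the single mispredicted concept yields "applause", which illustrates why intervention between the two stages can recover accuracy.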

Sound categories

The SAOP project covered 30 sound categories. All the synthetic models used are listed here.

Using the bottleneck framework, we retrieved the top 5 features for each category. The top 5 features for all 30 categories can be found here.

| Sound category | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Classification |
| --- | --- | --- | --- | --- | --- | --- |
| Applause | Pitch salience | Pitch confidence | Attack start | Dynamic complexity | Amplitude 5 | Temporal and spectral |
| Applause | Pitch confidence | MFCC 3 | Amplitude 3 | Frequency 1 | MFCC 6 | Temporal and spectral |
| Church Bells | Amplitude 2 | Amplitude 4 | Pitch salience | Amplitude 1 | Max to total | Temporal and perceptual |
| Church Bells | Pitch salience | Spectral flux | Dynamic complexity | Crest | MFCC 3 | Temporal and perceptual |
| Bubbles | Log attack time | Spectral flatness | MFCC 1 | LDB | Intensity | Temporal and spectral |
| Bubbles | Crest | Effective duration | Frequency 1 | Pitch confidence | Attack start | Temporal, perceptual and spectral |
| Droplets | Loudness | Effective duration | LDB | RMS | MFCC 1 | Temporal and spectral |
| Droplets | Loudness | RMS | LDB | Median | Spectral flux | Temporal and spectral |

Optimization

For the optimization, we provide the full list of sound categories, their top 5 important features, and the optimizations made for the subjective test.

The optimization parameters vary depending on the model used in the online procedural audio engine, Nemisindo.

**The .zip file can be downloaded here: Dataset**

| Class | Optimization 1 | Optimization 2 | Optimization 3 | Optimization 4 |
| --- | --- | --- | --- | --- |
| Applause | EQ and reverb | Distortion tone, more audience factor, and a distortion change of the knee | Change in the audience size, mix width, and distortion | More clapping rate and enthusiasm. Use of compressor and distortion (with a change in the knee) |
| Applause | EQ (high pass filter), Reverb and bell pitch 3.81 | Distortion | Bell pitch 3.81 Distortion plus changes in the source and energy | Low pass filter and compressor |
| Church Bells | EQ High band frequencies (600-1000), Reverb (Small Impulse) | Change in centre frequency 1140, Reverb (Small Impulse) | Optimization 1, distortion, changes in the source and energy | Optimization 3 plus High pass filter (EQ) on frequencies |
| Church Bells | Surface wetness and viscosity | Compressor | Compressor plus flow amount | Distortion |
| Bubbles | EQ and reverb | Distortion tone, audience factor, and distortion change of the knee | Change in audience size, mix width, and distortion | Changes in clapping rate and enthusiasm. Use of compressor and distortion (change in the knee) |
| Bubbles | High frequencies, Reverb and bell pitch 3.81 | Distortion | Bell pitch 3.81 Distortion plus changes in the source and energy | Low pass filter and compressor |
| Droplets | EQ High band frequencies (600-1000), Reverb (Small Impulse) | Change in centre frequency 1140, Reverb (Small Impulse) | Optimization 1 plus Distortion, source and energy | Optimization 3 plus EQ on high frequencies |
| Droplets | Surface wetness and viscosity changes | Compressor | Compressor plus changes on the flow amount | Distortion |

All modifications to the synthesis models were based on the top 5 features and were made with the post-production effects and parameters available in the online tool.

Here, we share a spreadsheet with the different parameter changes for our optimization samples, which are summarized in the table above.

The bar plots below show the difference in the top-5 features for the Applause category:

Image

We can see that Pitch confidence was one of the most important features, which could be the reason why the sound is perceived as more “real”.

Analysis

In this section you will find the samples used in the subjective evaluation for SAOP: Examples.