Nvidia has unveiled a new generative AI model that can create any combination of music, voices and sounds from text and audio inputs. Called Fugatto (Foundational Generative Audio Transformer Opus 1), the model generates or transforms any mix of music, voices and sounds described with prompts, using any combination of text and audio files. "While some AI models can compose a song or modify a voice, none have the dexterity of the new offering," Nvidia said in a blog post on Monday.
What Can the Fugatto AI Model Do?
Nvidia describes the model as a "Swiss Army knife for sound," one that lets users control audio output simply by using text. Fugatto can create a music snippet from a text prompt, add or remove instruments in an existing song, change the accent or emotion in a voice and even let people produce sounds never heard before, the company explained.
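Nvidia has not released a public API or SDK for Fugatto, but as a rough illustration of the prompt-driven workflow described above, the minimal Python sketch below shows what text- and audio-conditioned generation could look like. Every name in it (AudioRequest, generate_audio) is a hypothetical placeholder for illustration, not actual Nvidia code.

# Hypothetical sketch only: Nvidia has not published a Fugatto API, so all
# names below are illustrative assumptions rather than real library calls.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioRequest:
    """A prompt-driven request combining free-form text with optional audio."""
    text_prompt: str                      # e.g. "add a mournful cello under the vocals"
    input_audio: Optional[bytes] = None   # optional source clip to transform


def generate_audio(request: AudioRequest) -> bytes:
    """Stand-in for a text/audio-conditioned generation call returning audio bytes."""
    # A real system would run the generative model here; this stub returns
    # empty audio so the sketch stays self-contained and runnable.
    return b""


# Generate a new snippet from text alone.
snippet = generate_audio(AudioRequest(text_prompt="upbeat jazz trio, ten seconds"))

# Transform an existing clip by describing the change in plain language
# (placeholder bytes stand in for a real audio file).
existing_clip = b"placeholder-audio-bytes"
remix = generate_audio(AudioRequest(
    text_prompt="remove the drums and change the singer's accent",
    input_audio=existing_clip,
))

The point of the sketch is the interface idea the article describes: a single model driven by free-form instructions, with an optional audio input when the task is a transformation rather than generation from scratch.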
"We wanted to create a model that understands and generates sound like humans do," said Rafael Valle, a manager of applied audio research at Nvidia.
Key Features of Fugatto
Fugatto supports a wide range of audio generation and transformation tasks. According to Nvidia, it is the first foundational generative AI model to showcase emergent properties, capabilities that arise from the interaction of its various trained abilities, as well as the ability to combine free-form instructions.
"Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale," Valle added.
Potential Use Cases for Fugatto AI
According to Nvidia, music producers could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices and instruments. They could also add effects and enhance the overall audio quality of an existing track.
An ad agency could use Fugatto to quickly tailor an existing campaign for multiple regions or situations, applying different accents and emotions to its voiceovers.
Additionally, Nvidia says language learning tools could be personalised to use any voice a speaker chooses. Imagine an online course spoken in the voice of any family member or friend.
Video game developers could use the AI model to modify prerecorded assets in their titles to fit the changing action as users play, or to easily create new assets from text instructions and optional audio inputs.
The Technology Behind Fugatto
Nvidia said Fugatto is a foundational generative transformer model that builds on its prior work in areas such as speech modeling, audio vocoding and audio understanding. The model was built by a diverse team from around the world, including contributors from India, Brazil, China, Jordan and South Korea. "Their collaboration made Fugatto's multi-accent and multilingual capabilities stronger," the company said.
The full version of the model uses 2.5 billion parameters and was trained on a bank of Nvidia DGX systems equipped with 32 Nvidia H100 Tensor Core GPUs.