Salesforce AI Research has introduced TACO, a family of multimodal large action models designed to improve performance on complex problems that require multi-step reasoning across various data types, such as images, text, and calculations. "We present TACO, a family of multi-modal large action models designed to improve performance on complex questions that require multiple capabilities and demand multi-step solutions," Salesforce said in a blog post on January 16, 2025.
Overcoming Limitations of Current AI Systems
According to the company, TACO tackles a significant limitation of current AI systems (open-source multi-modal models), which struggle to solve realistic complex problems in a step-by-step manner. For instance, when posed with a question like "How much gas can I buy with $50?" from a photo of a gas station sign, TACO can identify price information, extract the text using OCR, and perform the necessary calculations. This capability is powered by chains-of-thought-and-action (CoTA), where the model generates both reasoning and actionable steps to arrive at the correct answer.
"To answer such questions, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation and calculator, then integrates both the thoughts and action outputs to produce coherent responses," the company explained.
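The thought-and-action loop described above can be illustrated with a minimal Python sketch. This is not Salesforce's implementation: the names `Step`, `fake_ocr`, `calculator`, `scripted_policy`, and `run_cota` are hypothetical, the OCR tool is a stub, the sign price of $3.50 per gallon is an assumed value, and a scripted function stands in for the model that would normally generate each step.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str   # the model's reasoning for this step
    action: str    # name of the tool to call, or "terminate"
    argument: str  # input passed to the tool (or the final answer)


def fake_ocr(image_name: str) -> str:
    # Stub standing in for a real OCR tool; the price is an assumed value.
    return "REGULAR 3.50"


OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    # Handles only "<number> <op> <number>", which is enough for this sketch.
    left, op, right = expression.split()
    return f"{OPS[op](float(left), float(right)):.2f}"


TOOLS: Dict[str, Callable[[str], str]] = {"ocr": fake_ocr, "calculate": calculator}


def scripted_policy(observations: List[str]) -> Step:
    # Stand-in for the model: picks the next step from the tool outputs so far.
    if not observations:
        return Step("I need to read the price on the sign.", "ocr", "gas_sign.jpg")
    if len(observations) == 1:
        price = observations[0].split()[-1]
        return Step(f"The price is ${price} per gallon; divide the budget by it.",
                    "calculate", f"50 / {price}")
    return Step(f"I can buy about {observations[1]} gallons.",
                "terminate", observations[1])


def run_cota(policy: Callable[[List[str]], Step],
             tools: Dict[str, Callable[[str], str]],
             max_steps: int = 5) -> Tuple[str, List[Step]]:
    # Alternate between a thought/action step and the resulting tool output.
    observations: List[str] = []
    trace: List[Step] = []
    for _ in range(max_steps):
        step = policy(observations)
        trace.append(step)
        if step.action == "terminate":
            return step.argument, trace
        observations.append(tools[step.action](step.argument))
    raise RuntimeError("no final answer within max_steps")


if __name__ == "__main__":
    answer, trace = run_cota(scripted_policy, TOOLS)
    for step in trace:
        print(f"{step.action}({step.argument!r}): {step.thought}")
    print("Answer:", answer, "gallons")
```

In this sketch, each tool output is fed back into the next decision, which is the part of the loop that distinguishes a chain of thought and action from a single direct answer.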
Training TACO
To train TACO, Salesforce said it created over 1 million synthetic CoTA traces through model-based and programmatic generation methods. These traces help the model learn to perform complex reasoning and execute external actions such as text recognition and mathematical operations.
Salesforce claims that TACO achieved 30-50 percent higher performance compared to models using traditional direct answers. It also outperformed baseline models by up to 20 percent on the MMVet benchmark.
Future Applications
With this framework, Salesforce AI Research hopes to pave the way for new multimodal models that can be applied across various domains, such as medical question answering and web navigation.
"With our framework, future works can train new models with different actions for other applications such as web navigation or for other domains such as medical question answering," Salesforce said.