Llama 3.3 holds its own against other large language models such as Amazon Nova Pro, Gemini Pro 1.5, and GPT-4o. On general knowledge (MMLU, zero-shot), it scores 86.0%, matching Amazon Nova Pro and edging out GPT-4o (85.9%). On instruction following (IFEval), it scores 92.1%, tying Amazon Nova Pro and outperforming GPT-4o (84.6%) and Gemini Pro 1.5 (81.9%).
On coding, Llama 3.3 scores 88.4% on HumanEval, slightly behind Amazon Nova Pro (89.0%) but ahead of GPT-4o (86.0%). It also handles math problems well: its 77.0% is marginally better than Amazon Nova Pro (76.6%) and GPT-4o (76.9%), though well short of Gemini Pro 1.5 (82.9%).
On multilingual tasks, Llama 3.3 leads with 91.1%, ahead of GPT-4o (90.6%) and Gemini Pro 1.5 (89.6%). Most importantly, it is the most cost-effective of the models compared, with the lowest price per input and output token, which makes it a strong choice for developers who want both high performance and affordability.
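To make the comparison easier to scan, here is a minimal Python sketch that collects only the scores quoted above and reports the top model on each benchmark. The `scores` dictionary, the `leaders` helper, and the "Math" and "Multilingual" labels are illustrative names, not part of any benchmark tooling. Models for which the text gives no figure on a benchmark are simply left out of that row, and pricing is not included because no per-token prices are quoted here.

```python
# Scores (in percent) exactly as quoted in the text above.
# A model is omitted from a row when the text gives no figure for it.
scores = {
    "MMLU": {"Llama 3.3": 86.0, "Amazon Nova Pro": 86.0, "GPT-4o": 85.9},
    "IFEval": {
        "Llama 3.3": 92.1,
        "Amazon Nova Pro": 92.1,
        "GPT-4o": 84.6,
        "Gemini Pro 1.5": 81.9,
    },
    "HumanEval": {"Llama 3.3": 88.4, "Amazon Nova Pro": 89.0, "GPT-4o": 86.0},
    "Math": {
        "Llama 3.3": 77.0,
        "Amazon Nova Pro": 76.6,
        "GPT-4o": 76.9,
        "Gemini Pro 1.5": 82.9,
    },
    "Multilingual": {"Llama 3.3": 91.1, "GPT-4o": 90.6, "Gemini Pro 1.5": 89.6},
}


def leaders(benchmark_scores):
    """Return the top score and the alphabetically sorted models that reach it."""
    best = max(benchmark_scores.values())
    tied = sorted(m for m, s in benchmark_scores.items() if s == best)
    return best, tied


for benchmark, row in scores.items():
    best, models = leaders(row)
    print(f"{benchmark}: {best:.1f}% ({', '.join(models)})")
```

On these figures, Llama 3.3 is first or tied for first on MMLU, IFEval, and the multilingual benchmark, while Amazon Nova Pro leads HumanEval and Gemini Pro 1.5 leads math.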