@sophiamyang - Sophia Yang, Ph.D.
Open-Source LLMs vs. ChatGPT:

1. General Capabilities: The Llama-2-chat-70B variant shows enhanced capabilities in general conversational tasks, surpassing GPT-3.5-turbo; UltraLlama matches GPT-3.5-turbo's performance on its proposed benchmark.

2. Agent Capabilities (using tools, self-debugging, following natural language feedback, exploring the environment): Lemur-70B-chat surpasses GPT-3.5-turbo when exploring the environment or following natural language feedback on coding tasks. AgentLlama-70B achieves performance comparable to GPT-3.5-turbo on unseen agent tasks. Gorilla outperforms GPT-4 at writing API calls.

3. Logical Reasoning Capabilities: Fine-tuned models (e.g., WizardCoder, WizardMath) and models pre-trained on higher-quality data (e.g., Lemur-70B-chat, Phi-1, Phi-1.5) show stronger performance than GPT-3.5-turbo.

4. Modeling Long-Context Capabilities: Llama-2-long-chat-70B outperforms GPT-3.5-turbo-16k on ZeroSCROLLS.

5. Application-specific Capabilities:
- query-focused summarization (fine-tuning on training data is better)
- open-ended QA (InstructRetro shows improvement over GPT-3)
- medical (MentalLlama-chat-13B and Radiology-Llama-2 outperform ChatGPT)
- generating structured responses (Struc-Bench outperforms ChatGPT)
- generating critiques (Shepherd is almost on par with ChatGPT)

6. Trustworthy AI:
- hallucination: during fine-tuning, improve data quality; during inference, use specific decoding strategies, external knowledge augmentation (Chain-of-Knowledge, LLM-AUGMENTER, Knowledge Solver, CRITIC, Parametric Knowledge Guiding), and multi-agent dialogue.
- safety: GPT-3.5-turbo and GPT-4 remain at the top in safety evaluations. This is largely attributed to Reinforcement Learning from Human Feedback (RLHF). RL from AI Feedback (RLAIF) could help reduce the cost of RLHF.

🔗 https://arxiv.org/abs/2311.16989

Thanks to the authors for the great paper!
@CaimingXiong @HailinChen3 @FangkaiJiao @qcwntu @XingxuanLi @RuochenZhao3 @MatRavox @JotyShafiq