TruthArchive.ai - Tweets Saved By @sophiamyang

Posted - November 30, 2023 at 6:17 AM

Saved - November 30, 2023 at 2:45 PM

reSee.it AI Summary

On general conversational tasks, the open-source Llama-2-chat-70B surpasses GPT-3.5-turbo, and UltraLlama matches it on its own proposed benchmark. Lemur-70B-chat and AgentLlama-70B match or exceed GPT-3.5-turbo on agent tasks, while Gorilla outperforms GPT-4 at writing API calls. Fine-tuned models and models pre-trained on higher-quality data show stronger logical reasoning than GPT-3.5-turbo. Llama-2-long-chat-70B outperforms GPT-3.5-turbo-16k on the ZeroSCROLLS long-context benchmark. Application-specific capabilities covered include query-focused summarization, open-ended QA, medical tasks, generating structured responses, and generating critiques. For trustworthy AI, hallucination is addressed by improving data quality during fine-tuning and, at inference time, by decoding strategies, external knowledge augmentation, and multi-agent dialogue. GPT-3.5-turbo and GPT-4 remain at the top of safety evaluations, largely due to Reinforcement Learning with Human Feedback (RLHF); RL from AI Feedback (RLAIF) could help reduce its cost.

@sophiamyang - Sophia Yang, Ph.D.

Open-Source LLMs vs. ChatGPT:

1. General Capabilities: Llama-2-chat-70B exhibits enhanced capabilities in general conversational tasks, surpassing the performance of GPT-3.5-turbo; UltraLlama matches GPT-3.5-turbo’s performance on its proposed benchmark.

2. Agent Capabilities (using tools, self-debugging, following natural language feedback, exploring the environment): Lemur-70B-chat surpasses GPT-3.5-turbo when exploring the environment or following natural language feedback on coding tasks. AgentLlama-70B achieves performance comparable to GPT-3.5-turbo on unseen agent tasks. Gorilla outperforms GPT-4 on writing API calls.

3. Logical Reasoning Capabilities: fine-tuned models (e.g., WizardCoder, WizardMath) and models pre-trained on higher-quality data (e.g., Lemur-70B-chat, Phi-1, Phi-1.5) show stronger performance than GPT-3.5-turbo.

4. Modeling Long-Context Capabilities: Llama-2-long-chat-70B outperforms GPT-3.5-turbo-16k on ZeroSCROLLS.

5. Application-specific Capabilities:
- query-focused summarization (fine-tuning on training data is better)
- open-ended QA (InstructRetro shows improvement over GPT-3)
- medical (MentalLlama-chat-13B and Radiology-Llama-2 outperform ChatGPT)
- generating structured responses (Struc-Bench outperforms ChatGPT)
- generating critiques (Shepherd is almost on par with ChatGPT)

6. Trustworthy AI:
- hallucination: during fine-tuning, improving data quality; during inference, specific decoding strategies, external knowledge augmentation (Chain-of-Knowledge, LLM-AUGMENTER, Knowledge Solver, CRITIC, Parametric Knowledge Guiding), and multi-agent dialogue.
- safety: GPT-3.5-turbo and GPT-4 remain at the top in safety evaluations. This is largely attributed to Reinforcement Learning with Human Feedback (RLHF). RL from AI Feedback (RLAIF) could help reduce the cost of RLHF.

🔗https://arxiv.org/abs/2311.16989

Thanks to the authors for the great paper!
@CaimingXiong @HailinChen3 @FangkaiJiao @qcwntu @XingxuanLi @RuochenZhao3 @MatRavox @JotyShafiq

ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?

Upon its release in late 2022, ChatGPT brought a seismic shift to the entire landscape of AI, both in research and commerce. Through instruction-tuning a large language model (LLM) with supervised fine-tuning and reinforcement learning from human feedback, it showed that a model could answer human questions and follow instructions on a broad panel of tasks. Following this success, interest in LLMs has intensified, with new LLMs appearing at frequent intervals across academia and industry, including many start-ups focused on LLMs. While closed-source LLMs (e.g., OpenAI's GPT, Anthropic's Claude) generally outperform their open-source counterparts, progress on the latter has been rapid, with claims of achieving parity or even better performance on certain tasks. This has crucial implications not only for research but also for business. In this work, on the first anniversary of ChatGPT, we provide an exhaustive overview of this success, surveying all tasks where an open-source LLM has claimed to be on par with or better than ChatGPT.

arxiv.org