[Paper Notes] The Llama 3 Herd of Models (Llama 3.1), continuously updated

Aug 10, 2024

The paper is very long, so I plan to read it slowly and write up notes at irregular intervals.

3.1.2 Determining the Data Mix

To obtain a high-quality language model, it is essential to carefully determine the proportion of different data sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification and scaling law experiments.

Knowledge classification. We develop a classifier to categorize the types of information contained in our web data to more effectively determine a data mix. We use this classifier to downsample data categories that are over-represented on the web, for example, arts and entertainment.

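In short: a classifier labels web data by the type of information it contains, and categories that are over-represented on the web, such as arts and entertainment, are then downsampled when building the data mix.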
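To make the downsampling step concrete, here is a minimal sketch. It assumes each document already carries a category label from some classifier (not shown), and the category names, keep probabilities, and the helper `downsample_by_category` are all made up for illustration; the paper does not give the actual rates or method.

```python
import random
from collections import Counter


def downsample_by_category(docs, keep_prob, seed=0):
    """Keep each document with the probability assigned to its category.

    docs: list of (text, category) pairs; the category is assumed to come
    from a classifier that is not shown here.
    keep_prob: dict mapping category -> keep probability. Categories that
    are not listed default to 1.0, so they are always kept.
    """
    rng = random.Random(seed)
    return [(text, cat) for text, cat in docs
            if rng.random() < keep_prob.get(cat, 1.0)]


# Toy corpus: labels and rates are hypothetical, not taken from the paper.
docs = [("doc", "arts_entertainment")] * 800 + [("doc", "science")] * 200
kept = downsample_by_category(docs, {"arts_entertainment": 0.25})
counts = Counter(cat for _, cat in kept)
print(counts)
```

With these toy numbers the over-represented category shrinks to roughly a quarter of its original size, while the other category is left untouched.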

Scaling laws for data mix. To determine the best data mix, we perform scaling law experiments in which we train several small models on a data mix and use that to predict the performance of a large model on that mix (see Section 3.2.1). We repeat this process multiple times for different data mixes to select a new data mix candidate. Subsequently, we train a larger model on this candidate data mix and evaluate the performance of that model on several key benchmarks.

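In short: for each candidate data mix, several small models are trained and used to predict how a large model would perform on that mix (see Section 3.2.1). Repeating this across different mixes yields a new candidate mix, which is then checked by training a larger model on it and evaluating that model on several key benchmarks.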
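The extrapolation step can be sketched as follows. This is only an illustration under strong assumptions: it fits a plain power law `loss = a * compute**b` in log-log space, which is a simplification and not necessarily the functional form used in Section 3.2.1, and the compute budgets, losses, and mix names are invented.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b


def predict(a, b, compute):
    """Extrapolate the fitted curve to a new compute budget."""
    return a * compute ** b


# Hypothetical small-model runs (FLOPs -> validation loss) for two mixes.
small_compute = [1e18, 1e19, 1e20]
runs = {
    "mix_A": [3.20, 2.80, 2.45],
    "mix_B": [3.30, 2.82, 2.41],
}

target_compute = 1e24  # hypothetical large-model budget
predicted = {
    name: predict(*fit_power_law(small_compute, losses), target_compute)
    for name, losses in runs.items()
}
best = min(predicted, key=predicted.get)
print(best, predicted)
```

The mix with the lowest extrapolated loss becomes the candidate, which, as the passage describes, would still be validated by training a larger model on it.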

Data mix summary. Our final data mix contains roughly 50% of tokens corresponding to general knowledge, 25% of mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.

In short: the final mix is roughly 50% general knowledge tokens, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
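As a quick sanity check on these proportions, the sketch below turns the mix into per-source token counts and into sampling weights. The 1T-token budget and the source names are assumptions chosen for illustration; only the four percentages come from the passage above.

```python
import random

# Proportions from the data mix summary; the key names are my own labels.
DATA_MIX = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}


def tokens_per_source(total_tokens, mix):
    """Split a token budget across sources according to the mix weights."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix weights must sum to 1"
    return {name: round(total_tokens * w) for name, w in mix.items()}


def sample_sources(mix, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[k] for k in names], k=n)


# Hypothetical 1T-token budget, used only to make the arithmetic visible.
budget = tokens_per_source(1_000_000_000_000, DATA_MIX)
print(budget)
print(sample_sources(DATA_MIX, 5))
```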

3.1.3 Annealing Data

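"Annealing data" is a concept used in machine learning and data processing, named after the annealing process in metallurgy. In general machine learning usage, the term usually refers to a data handling or training strategy meant to improve model performance and generalization. The explanation below covers that general, curriculum-style sense. In the paper's own passages quoted further down, annealing refers more specifically to linearly decaying the learning rate to 0 while training on a mix that upsamples high-quality data.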

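Definition:
In this general sense, annealing data means adjusting or processing the training data step by step over the course of training, typically moving from "simple" to "complex".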

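Goals:
1. Improve the model's learning efficiency
2. Strengthen the model's generalization
3. Keep the model from getting stuck in a local optimum too early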

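How it is done:
1. Progressive difficulty: start training on simple samples and gradually introduce more complex ones.
2. Data perturbation: gradually reduce the perturbation or augmentation applied to the data as training proceeds.
3. Curriculum learning: design a learning "curriculum" that goes from simple to complex.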

Advantages:
1. Can help the model learn complex patterns
2. May speed up training
3. May improve the model's final performance

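Challenges:
1. Defining and quantifying the "difficulty" of data can be hard
2. The annealing strategy has to be designed carefully for the specific problem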

Relation to other techniques:
1. Closely related to curriculum learning
2. Can be combined with other optimization techniques

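The core idea in this general sense is to raise the complexity or difficulty of the training data step by step, loosely mirroring how people learn, with the aim of making training more effective. Approaches of this kind have been reported to help model performance and stability in some applications.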

Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks. Akin to Li et al. (2024b), we perform annealing with a data mix that upsamples high-quality data in select domains. We do not include any training sets from commonly used benchmarks in our annealing data. This enables us to assess the true few-shot learning capabilities and out-of-domain generalization of Llama 3.

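In short: annealing (see Section 3.4.3) on a small amount of high-quality code and math data improves pre-trained models on key benchmarks. As in Li et al. (2024b), the annealing mix upsamples high-quality data in selected domains. It excludes the training sets of commonly used benchmarks, so that Llama 3's few-shot ability and out-of-domain generalization can be assessed fairly.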

Following OpenAI (2023a), we evaluate the efficacy of annealing on the GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) training sets in annealing. We find that annealing improved the performance of a pre-trained Llama 3 8B model on the GSM8k and MATH validation sets by 24.0% and 6.4%, respectively. However, the improvements on the 405B model are negligible, suggesting that our flagship model has strong in-context learning and reasoning capabilities and does not require specific in-domain training samples to obtain strong performance.

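In short: following OpenAI (2023a), the authors test annealing on the GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) training sets. The pre-trained Llama 3 8B model improves by 24.0% on the GSM8k validation set and by 6.4% on the MATH validation set. The 405B model barely changes, which the authors read as a sign that the flagship model already has strong in-context learning and reasoning and does not need in-domain training samples to perform well.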

Using annealing to assess data quality. Similar to Blakeney et al. (2024), we find that annealing enables us to judge the value of small domain-specific datasets. We measure the value of such datasets by annealing the learning rate of a 50% trained Llama 3 8B model linearly to 0 on 40B tokens. In those experiments, we assign 30% weight to the new dataset and the remaining 70% weight to the default data mix. Using annealing to evaluate new data sources is more efficient than performing scaling law experiments for every small dataset.

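In short: as in Blakeney et al. (2024), annealing also works as a way to judge the value of small domain-specific datasets. Starting from a Llama 3 8B model at 50% of training, the learning rate is annealed linearly to 0 over 40B tokens, with 30% of the weight on the new dataset and the remaining 70% on the default data mix. This is more efficient than running scaling law experiments for every small dataset.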
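The setup in this passage can be sketched as a linear learning-rate schedule plus a re-weighted data mix. The 40B-token horizon and the 30/70 split come from the passage. The starting learning rate, the dataset name, the helper names, and the use of the Section 3.1.2 proportions as the "default data mix" are assumptions made only for illustration.

```python
def linear_anneal_lr(start_lr, tokens_seen, anneal_tokens):
    """Learning rate that decays linearly from start_lr to 0 over anneal_tokens."""
    frac = min(max(tokens_seen / anneal_tokens, 0.0), 1.0)
    return start_lr * (1.0 - frac)


def blend_mix(default_mix, new_name, new_weight=0.30):
    """Give new_weight to the new dataset and rescale the default mix to the rest."""
    total = sum(default_mix.values())
    blended = {k: (1.0 - new_weight) * w / total for k, w in default_mix.items()}
    blended[new_name] = new_weight
    return blended


ANNEAL_TOKENS = 40_000_000_000  # 40B tokens, as stated in the passage
START_LR = 1e-4                 # hypothetical value; not given in the passage

# Assumption: reuse the proportions from the data mix summary as the default mix.
default_mix = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}
mix = blend_mix(default_mix, "candidate_dataset")

for tokens in (0, 10_000_000_000, 20_000_000_000, 40_000_000_000):
    print(tokens, linear_anneal_lr(START_LR, tokens, ANNEAL_TOKENS))
print(mix)
```

Comparing benchmark results after such a run against a baseline annealed on the default mix alone would then indicate whether the candidate dataset adds value.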

Written by 猛男麗莎的微笑