MM-Food-100K: Experiment Iteration and the Deep Dive into Data Value

Published August 20, 2025

In the field of AI, building a high-quality dataset is just as crucial as training a powerful model. We understand this deeply. We recently published a paper titled "MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance", aiming to introduce our work to the community: a large-scale multimodal food intelligence dataset, MM-Food-100K, and the innovative data protocol behind it. Our paper is available on arXiv at: https://arxiv.org/abs/2508.10429.

In the paper, we conducted a preliminary experiment to show that fine-tuning Large Vision-Language Models (LVLMs) on MM-Food-100K significantly improves their performance on food intelligence tasks. Although newer models such as GPT-5 have been released, the most advanced SFT (Supervised Fine-Tuning) services currently available are still based on GPT-4o. For this reason, we chose GPT-4o as our benchmark, alongside Qwen-VL-MAX, to ensure our experiments were rigorous and representative of the current state of the art.

However, a quick preliminary study is just the beginning. To more deeply understand the impact of data scale on model performance, we embarked on a more detailed experimental iteration.


Experimental Design and Hyperparameter Configuration

To explore the relationship between data volume and model performance, we conducted an iterative experiment. By using different data subsets (100, 1,000, 10,000, and 50,000 samples), we were able to plot the performance curve and reveal the "data scaling law."
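
As a rough illustration of how such nested subsets can be drawn, the sketch below reuses a single shuffled ordering for all sizes so that each smaller subset is contained in the next larger one; the file name and record format are assumptions, and our actual sampling procedure is not detailed here.

```python
import json
import random

# Hypothetical file name; records are assumed to be one JSON object per line.
with open("mm_food_100k.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

random.seed(42)          # fixed seed so every subset is reproducible
random.shuffle(records)  # one shuffle shared by all subset sizes

for size in (100, 1_000, 10_000, 50_000):
    subset = records[:size]  # nested: the 100-sample subset sits inside the 1,000-sample one, and so on
    with open(f"train_subset_{size}.jsonl", "w", encoding="utf-8") as out:
        for rec in subset:
            out.write(json.dumps(rec, ensure_ascii=False) + "\n")
```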

To ensure the reproducibility of our experiments, we meticulously recorded all hyperparameters used during the fine-tuning process. All experiments maintained the same parameter configurations to ensure that performance differences were solely attributable to changes in training data size.

Qwen-VL-MAX Hyperparameter Settings

The Qwen-VL-MAX model fine-tuning hyperparameters we used are as follows:

  • Epochs: 3
  • Learning Rate: 3e-4
  • Batch Size: 16
  • Sequence Length: 8192
  • Validation Steps: 50
  • LoRA Rank: 8
  • LoRA Alpha: 32
  • LoRA Dropout: 0.1
  • Weight Decay: 0.01
  • Learning Rate Warmup Ratio: 0.05
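
As an illustration only (our runs used a hosted fine-tuning service rather than a local trainer), the sketch below shows how these settings might map onto a generic LoRA SFT configuration with the Hugging Face peft and transformers libraries; the target modules and output path are placeholders.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings matching the values listed above.
lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # placeholder; depends on the model architecture
    task_type="CAUSAL_LM",
)

# Trainer-level settings; the 8192 sequence length is typically enforced by the
# tokenizer / data collator rather than by TrainingArguments.
training_args = TrainingArguments(
    output_dir="qwen_vl_food_sft",         # placeholder output path
    num_train_epochs=3,
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.05,
    eval_steps=50,
)
```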

GPT-4o Hyperparameter Settings

The GPT-4o model fine-tuning hyperparameters we used are as follows:

  • Epochs: 3
  • Batch size: 16
  • LR Multiplier: 2
  • Seed: 2
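
These settings correspond to the hyperparameters exposed by the OpenAI fine-tuning API; as a sketch, a job with this configuration could be launched roughly as follows, assuming a vision fine-tuning JSONL file has already been uploaded (the model snapshot name and file ID below are placeholders).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",        # placeholder snapshot name
    training_file="file-XXXXXXXX",    # placeholder ID of the uploaded training JSONL
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 16,
        "learning_rate_multiplier": 2,
    },
    seed=2,
)
print(job.id, job.status)
```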

Experimental Results and In-Depth Analysis

Our extended experiments used Qwen-VL-MAX and GPT-4o as base models, fine-tuning them on our different data subsets. We focused on two core tasks: calorie regression and multi-task classification.

Regression Task: Calorie Prediction (Kcal)

We used MAE (Mean Absolute Error), RMSE (Root Mean Square Error), and R² (coefficient of determination) to measure prediction accuracy.
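
For clarity, the snippet below shows how these three metrics are computed from predicted versus reference calorie values; the example arrays are placeholders, not values from our test set.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R² for calorie predictions (kcal)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return mae, rmse, 1.0 - ss_res / ss_tot

# Placeholder values for illustration only.
mae, rmse, r2 = regression_metrics([520, 310, 845], [480, 350, 800])
print(f"MAE={mae:.1f} kcal  RMSE={rmse:.1f} kcal  R²={r2:.3f}")
```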

| Model | Training Data Size | MAE (kcal) ↓ | RMSE (kcal) ↓ | R² ↑ |
|---|---|---|---|---|
| Qwen-VL-MAX | 0 (Base) | 126.5 | 185.3 | 0.521 |
| Qwen-VL-MAX | 100 | 125.4 | 184.2 | 0.525 |
| Qwen-VL-MAX | 1,000 | 123.8 | 181.5 | 0.539 |
| Qwen-VL-MAX | 10,000 | 107.5 | 159.1 | 0.612 |
| Qwen-VL-MAX | 50,000 | 104.2 | 154.5 | 0.638 |
| GPT-4o | 0 (Base) | 98.7 | 148.1 | 0.685 |
| GPT-4o | 100 | 98.4 | 147.8 | 0.687 |
| GPT-4o | 1,000 | 97.9 | 147.1 | 0.690 |
| GPT-4o | 10,000 | 96.2 | 144.9 | 0.702 |
| GPT-4o | 50,000 | 95.8 | 144.3 | 0.706 |


Key Findings:

  • The Threshold for Initial Gains: Fine-tuning with a small amount of data (100 or 1,000 samples) resulted in very little performance improvement. Both Qwen-VL-MAX and GPT-4o showed only a minimal decrease in MAE. This suggests that for powerful base models, there exists a data threshold that must be surpassed before the data can effectively drive significant performance gains.

  • Non-linear Scaling Effect: We observed a clear inflection point in performance when the training data size was increased from 1,000 to 10,000 samples. Qwen-VL-MAX's MAE plummeted from 123.8 kcal to 107.5 kcal, a significant drop of 13.1%. This huge leap clearly demonstrates the non-linear value that high-quality data can unlock once it reaches a certain scale, aligning more with an S-curve growth model than a simple linear scaling law.

  • Sustained Gains and Unexplored Frontiers: As the data size continued to increase to 50,000, both models showed continuous improvement. Qwen-VL-MAX's overall MAE dropped by 17.6%. However, combining our experimental observations with industry experience, we hypothesize that the data gain curve follows an S-shape. While we have observed sustained gains, the rate of return has noticeably slowed. Due to the current limitations on the number of training samples supported by large model fine-tuning services, we cannot yet conduct larger-scale experiments to determine the upper limit of this S-curve. We look forward to more community research to help explore this frontier.

Classification Task: Win Rate Comparison

We also analyzed the models' performance on the classification tasks covering dish names, ingredients, and cooking methods.

To quantify the actual effect of fine-tuning, we designed the Win Rate metric, which we will explain in detail here.

Metric Explanation: The Win Rate is calculated by performing a pairwise comparison between the fine-tuned SFT model and its un-fine-tuned Base model on the same test set. A win is recorded if the SFT model's answer is more accurate than the Base model's. The final Win Rate is the percentage of wins for the SFT model out of the total number of comparisons.

Metric Limitation: It is crucial to note that the Win Rate can only be used to measure the effect of fine-tuning on a single base model at different data volumes. For example, Qwen-VL-MAX's Win Rate with 10,000 samples is calculated by comparing it against its own Base model, and GPT-4o's Win Rate is calculated by comparing it against its own Base model. Directly comparing the Win Rate values between GPT-4o and Qwen-VL-MAX is meaningless, as their baselines (Base Models) are different. The value of this metric lies in quantifying the performance improvement of a single model as a function of data scale.
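
Concretely, a minimal sketch of this pairwise comparison looks like the following; the `accuracy` scoring function is a task-specific placeholder (e.g. exact match against the reference dish name), and how ties are handled is not specified here.

```python
from typing import Any, Callable, Sequence

def win_rate(
    samples: Sequence[Any],
    base_answers: Sequence[str],
    sft_answers: Sequence[str],
    accuracy: Callable[[Any, str], float],
) -> float:
    """Pairwise win rate of an SFT model against its own un-fine-tuned base model.

    A win is counted when the SFT answer scores strictly higher than the base
    answer on the same test sample; the result is wins / total comparisons.
    """
    wins = sum(
        1
        for sample, base, sft in zip(samples, base_answers, sft_answers)
        if accuracy(sample, sft) > accuracy(sample, base)
    )
    return wins / len(samples)
```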

| Model | Training Data Size | Dish Name (Win Rate) | Ingredients (Win Rate) | Cooking Method (Win Rate) |
|---|---|---|---|---|
| Qwen-VL-MAX | 100 | 50.5% | 50.2% | 50.3% |
| Qwen-VL-MAX | 1,000 | 51.5% | 51.3% | 51.2% |
| Qwen-VL-MAX | 10,000 | 55.4% | 57.2% | 56.1% |
| Qwen-VL-MAX | 50,000 | 57.9% | 60.2% | 58.7% |
| GPT-4o | 100 | 50.2% | 50.1% | 50.3% |
| GPT-4o | 1,000 | 50.5% | 50.4% | 50.6% |
| GPT-4o | 10,000 | 50.8% | 50.8% | 50.6% |
| GPT-4o | 50,000 | 51.1% | 51.4% | 51.2% |


Key Findings:

  • The classification results align with our regression findings. The performance gains were minimal with 100 and 1,000 samples but accelerated significantly once the data size crossed the 10,000-sample mark. This further validates the "data threshold effect": a model's true potential can only be unlocked when it's exposed to a sufficiently large amount of high-quality data.
  • These results demonstrate that the true value of MM-Food-100K lies in its scale. It not only provides high-quality annotations but, more importantly, offers a sufficient number of samples to unlock a model's full potential, enabling a significant performance breakthrough on domain-specific tasks.

Conclusion: Dataset Scale Determines True Value

Through this experimental iteration, we not only re-validated the value of the MM-Food-100K dataset but, more importantly, proved that its scale is its most crucial asset. A small amount of data may have a limited impact on a large model, but once the data volume reaches a certain level, the gains become non-linear, allowing the model to achieve a significant performance leap in a specific domain.

Although our public dataset contains 100,000 samples, we could only evaluate the effects up to 50,000 samples due to the current fine-tuning service limitations of large models. This raises a question for us: What would the performance ceiling be if we could fine-tune with 100,000, 500,000, or even more high-quality data points? We are confident that once these technical limitations are overcome, the true potential of large-scale datasets like MM-Food-100K will be fully unleashed.

We hope this post inspires more members of the community to explore this area. We look forward to seeing more research on the impact of large-scale data on model performance as we collectively explore the endless possibilities that data offers for AI development.
