ydshieh committed
Commit 9a97c24 · 1 Parent(s): 212ecaa

improve doc

Files changed (1)
  1. run_image_captioning_flax.py +8 -4
run_image_captioning_flax.py CHANGED
@@ -114,10 +114,14 @@ class TrainingArguments:
     )
     _block_size_doc = \
     """
-    Split a dataset into chunks of size `block_size`. On each block, images are transformed by the feature extractor
-    and are kept in memory, and the batches of size `batch_size` are yield before processing the next block.
-    The default value `0` indicates we don't use blocks, and the whole dataset will be preprocessed
-    (tokenization + feature extraction) and cached before training.
+    The default value `0` will preprocess (tokenization + feature extraction) the whole dataset before training and
+    cache the results. This uses more disk space, but avoids (repeated) processing time during training. This is a
+    good option if your disk space is large enough to store the whole processed dataset.
+    If a positive value is given, the captions in the dataset will be tokenized before training and the results are
+    cached. During training, it iterates the dataset in chunks of size `block_size`. On each block, images are
+    transformed by the feature extractor with the results being kept in memory (no cache), and batches of size
+    `batch_size` are yielded before processing the next block. This could avoid the heavy disk usage when the
+    dataset is large.
     """
     block_size: int = field(
         default=0,
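The block-wise iteration the new docstring describes can be sketched as below. This is a minimal stand-in, not the script's actual code: `iterate_blocks`, the toy list dataset, and the lambda feature extractor are illustrative; the real script works on a `datasets.Dataset` with a `transformers` feature extractor.

```python
def iterate_blocks(dataset, block_size, batch_size, feature_extractor):
    """Yield batches, transforming one block of images at a time.

    block_size == 0 means: transform everything up front (the "preprocess
    and cache the whole dataset" path); a positive block_size keeps only
    one transformed block in memory at a time.
    """
    if block_size == 0:
        blocks = [dataset]  # a single block covering the whole dataset
    else:
        blocks = [dataset[i:i + block_size]
                  for i in range(0, len(dataset), block_size)]

    for block in blocks:
        # Feature extraction for this block only; the results stay in
        # memory (no cache) and are dropped once its batches are yielded.
        features = [feature_extractor(example) for example in block]
        for j in range(0, len(features), batch_size):
            yield features[j:j + batch_size]


# Toy usage: 10 "images", blocks of 4, batches of 2.
dataset = list(range(10))
batches = list(iterate_blocks(dataset, block_size=4, batch_size=2,
                              feature_extractor=lambda x: x * 2))
```

With `block_size=4`, only four transformed examples are held in memory at once, which is the disk-vs-memory trade-off the docstring explains.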