ydshieh committed · Commit 9a97c24 · Parent(s): 212ecaa · improve doc

run_image_captioning_flax.py CHANGED
@@ -114,10 +114,14 @@ class TrainingArguments:
     )
     _block_size_doc = \
     """
-
-
-
-
+    The default value `0` will preprocess (tokenization + feature extraction) the whole dataset before training and
+    cache the results. This uses more disk space, but avoids (repeated) processing time during training. This is a
+    good option if your disk space is large enough to store the whole processed dataset.
+    If a positive value is given, the captions in the dataset will be tokenized before training and the results are
+    cached. During training, it iterates the dataset in chunks of size `block_size`. On each block, images are
+    transformed by the feature extractor with the results being kept in memory (no cache), and batches of size
+    `batch_size` are yielded before processing the next block. This could avoid the heavy disk usage when the
+    dataset is large.
     """
     block_size: int = field(
         default=0,
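The block-wise iteration the new docstring describes can be sketched roughly as follows. This is a minimal illustration of the idea only, not the actual implementation in run_image_captioning_flax.py; the names `dataset`, `extract_features`, and `iterate_in_blocks` are hypothetical.

```python
def iterate_in_blocks(dataset, block_size, batch_size, extract_features):
    """Yield batches, feature-extracting one block of examples at a time.

    Sketch of the behavior described in `_block_size_doc` (illustrative
    names, not the script's real helpers).
    """
    n = len(dataset)
    if block_size <= 0:
        # block_size == 0: treat the whole dataset as a single block
        # (the script instead preprocesses and caches everything up front).
        block_size = n
    for start in range(0, n, block_size):
        # Feature-extract only this block; results stay in memory (no cache).
        end = min(start + block_size, n)
        block = [extract_features(dataset[i]) for i in range(start, end)]
        # Yield batches of size `batch_size` before moving to the next block,
        # so at most one block's features are held in memory at a time.
        for b in range(0, len(block), batch_size):
            yield block[b:b + batch_size]
```

With a positive `block_size`, only one block's worth of extracted features lives in memory at a time, which trades repeated per-epoch computation for the disk space a full preprocessed cache would need.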