What was the underlying training data distribution?

#2
by treehugg3 - opened

Many base models have been released, but few give an in-depth description of the content they saw in training, or of the dataset filtering that may have biased the model. Without that insight, we have to guess what the model might be good at by benchmarking its outputs alone. It would be really helpful if the model card, or a technical report about this project, clarified what the training data distribution is.

And of course, I appreciate this model was released without synthetic data.

yeah I'm looking forward to the tech report and hope there is info in there

+1

It doesn't have to be specific if that's an issue; even vague hints about what was picked or filtered would help.

First off, kudos to this base model. It will do coherent raw completions at 120K context. In 24GB VRAM! Most base models, even those that claim such long context, fall apart well before that.

Second, this is how it completed part of the last chapter of Journey to the Center of the Earth, at that long context:

So much honor necessarily created for him many envious enemies. Of
course they existed, and as his theories, supported by certain facts,
contradicted the system of science upon the question of central heat, he
was attacked by a number of learned men.

But the Professor was not a man to be easily silenced. He defended his
views with the greatest energy, and the question was warmly discussed.

It is quite unnecessary to say that the Professor was victorious. The
new theory was accepted, and the old exploded.

As for myself, I have become a professor of geology, and have married
Gretchen.

Hans, the guide, has returned to his native land, and has become a
prosperous farmer.

As for the compass, it is still in my possession. I have tried it
several times, and it invariably points to the north.

Nevertheless, I am convinced that it is wrong, and that it is a
strange, an extraordinary, and an inexplicable phenomenon.

But, after all, what matters it?

The journey is over.




End of Project Gutenberg's Journey to the Center of the Earth

"Project Gutenberg" was nowhere in the context! So it must be an emphasized part of the pretraining data, which is neat.
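For anyone who wants to repeat this kind of check, here is a minimal sketch of how one might confirm that a phrase shows up in a completion without ever appearing in the prompt. The function name `novel_phrases` and the case-insensitive matching are my own assumptions for illustration, not anything from the model's tooling, and a hit only hints at what the pretraining data may contain.

```python
def novel_phrases(context: str, completion: str, phrases: list[str]) -> list[str]:
    """Return the phrases found in the completion but nowhere in the context.

    Hypothetical helper: matching is case-insensitive substring search,
    which is an assumption made here for simplicity.
    """
    ctx = context.casefold()
    out = completion.casefold()
    return [
        p for p in phrases
        if p.casefold() in out and p.casefold() not in ctx
    ]
```

Passing the full long-context prompt as `context` and the model's raw output as `completion` would flag "Project Gutenberg" only if it truly never occurred in the prompt.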
