When should you pretrain an LLM?
Many teams are pretraining general-purpose LLMs on internet text.
- May take tens of millions of dollars, many months, and a huge amount of data;
For building a specific application:
- Option of last resort (given the time and expense of pretraining a model from scratch);
- Could help if you have a highly specialized domain and a lot of data;
Unless you have a huge amount of resources and a huge amount of data, it may be more practical to start with an LLM that someone else has pretrained, say a general-purpose LLM that has learned from a lot of internet data and that someone has open-sourced, and then fine-tune it on your own data.
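To get a feel for why pretraining from scratch is so expensive, here is a rough back-of-the-envelope sketch. It relies on the common rule of thumb that training a dense transformer takes about 6 * N * D floating-point operations (N parameters, D training tokens). Every number in the usage example (model size, token count, GPU peak throughput, utilization, cluster size, hourly price) is an illustrative assumption, not a figure from these notes, and the function name is hypothetical.

```python
def estimate_pretraining_cost(
    n_params: float,
    n_tokens: float,
    peak_flops_per_gpu: float,
    utilization: float,
    n_gpus: int,
    usd_per_gpu_hour: float,
) -> dict:
    """Rough estimate for ONE training run, using FLOPs ~= 6 * N * D.

    This is a rule-of-thumb sketch: it ignores failed runs, experiments,
    data collection and cleaning, storage, and staff time.
    """
    # Total training compute under the 6 * N * D approximation.
    total_flops = 6.0 * n_params * n_tokens

    # GPUs rarely sustain their peak throughput; scale by the assumed utilization.
    effective_flops_per_gpu = peak_flops_per_gpu * utilization

    gpu_hours = total_flops / effective_flops_per_gpu / 3600.0
    wall_clock_days = gpu_hours / n_gpus / 24.0
    cost_usd = gpu_hours * usd_per_gpu_hour

    return {
        "total_flops": total_flops,
        "gpu_hours": gpu_hours,
        "wall_clock_days": wall_clock_days,
        "cost_usd": cost_usd,
    }


# Hypothetical inputs, chosen only to illustrate the arithmetic.
estimate = estimate_pretraining_cost(
    n_params=70e9,              # assumed model size: 70B parameters
    n_tokens=2e12,              # assumed dataset size: 2T tokens
    peak_flops_per_gpu=312e12,  # assumed peak: 312 TFLOP/s per GPU
    utilization=0.4,            # assumed fraction of peak actually achieved
    n_gpus=1024,                # assumed cluster size
    usd_per_gpu_hour=2.0,       # assumed rental price
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

The result covers compute for a single successful run only. Real budgets are typically a multiple of that, which is part of why fine-tuning an existing open-source model is usually the more practical starting point.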