FAQ

What models are supported?

The Models page has details on the (many) models you can use with Lamini.

Will my training / tuning job time out?

We have a default 4-hour timeout for all tuning jobs so that other jobs get a chance to run. If your job times out, it is automatically added back to the queue and resumes from the last checkpoint; Lamini saves checkpoints automatically, so your progress isn't lost. If you want to run longer jobs, consider requesting more GPUs via gpu_config for a speedup, or contact us for a dedicated instance.
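
For example, a minimal sketch of requesting more GPUs for a tuning job with the Python client (the model name, the dataset_id variable, and the exact gpu_config keys below are assumptions; check the docs for the options available on your plan and SDK version):

from lamini import Lamini

llm = Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder model

# Request more GPUs for this job; the accepted gpu_config keys
# may differ by plan and SDK version (assumed here).
llm.tune(
    data_or_dataset_id=dataset_id,       # your uploaded dataset ID (assumed variable)
    gpu_config={"gpus": 2, "nodes": 1},
)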

Why is my job queued?

Lamini On-Demand plan uses shared resources. We queue tuning jobs in order to serve all users. To reserve your own dedicated compute or run on your own GPUs, please contact us.

I'm getting a missing key error! What do I do?

The Authenticate page has details on getting and setting your Lamini API key.
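
For example, one common way to set the key in Python (a minimal sketch; see the Authenticate page for the full list of supported options, such as config files or environment variables):

import lamini

# Set the API key before making any requests.
lamini.api_key = "<YOUR-LAMINI-API-KEY>"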

Does Lamini run on Windows?

Lamini has not been tested on Windows and is not officially supported. While it may be possible to run Lamini on Windows, we cannot guarantee its functionality or stability. If you are using Windows, we strongly recommend using Docker to run Lamini on a Linux-based image.

What systems can run Lamini?

Lamini is tested and developed on Linux-based systems. Specifically, we recommend using Ubuntu 22.04 or later with Python 3.10, 3.11, or 3.12.

Can I turn off memory tuning / MoME?

Yes. If your use case calls for more open-ended output, such as summarization, where many answers can be valid, memory tuning may hurt performance (though it is still worth trying). You can set the following finetuning args to disable MoME:

finetune_args={
  "batch_size": 1,
  "index_ivf_nlist": 1,
  "index_method": "IndexFlatL2",
  "index_max_size": 1,
}
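
For example, a minimal sketch of passing these args when submitting a tuning job (the model name and the dataset_id variable are placeholders, and the exact tune() signature may vary by SDK version):

from lamini import Lamini

llm = Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder model

# Pass the finetuning args above to disable MoME for this job.
llm.tune(
    data_or_dataset_id=dataset_id,  # your uploaded dataset ID (assumed variable)
    finetune_args={
        "batch_size": 1,
        "index_ivf_nlist": 1,
        "index_method": "IndexFlatL2",
        "index_max_size": 1,
    },
)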

Does Lamini use LoRAs?

Yes, Lamini tunes LoRAs (low-rank adapters) on top of a pretrained LLM to get the same performance as finetuning the entire model, but with 266x fewer parameters and 1.09 billion times faster model switching. Read our blog post for more details.

Lamini applies this optimization (and others) automatically - you don't have to configure anything.

Can I run my job on an MI300?

Yes! Lamini On-Demand currently runs on MI250s and we have MI300s available for our Lamini Reserved plans. Please contact us to learn more about Lamini Reserved and our MI300 cluster.

Does model loading happen for every inference request?

No. Loading model weights into GPU memory takes time proportional to the model's parameter count. Loading the tokenizer (into CPU memory) also takes time, but the tokenizer is small enough that its load time is negligible compared to the model weights.

The model is loaded into GPU memory only once and stays there unless a failure or other unexpected event occurs. The same applies to the tokenizer.

So model loading does not happen repeatedly for each inference request.

The description above applies only to the base model.

A MoME adapter cache is not available yet.
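
For example, with the Python client, repeated calls like the sketch below run against weights already resident in GPU memory; only the first request after the server loads a model pays the loading cost (the model name is a placeholder, and the generate() usage is assumed from the standard client):

from lamini import Lamini

llm = Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder model

# The first request may wait while the server loads the weights onto the GPU;
# subsequent requests reuse the already-loaded model.
print(llm.generate("What is a LoRA adapter?"))
print(llm.generate("Summarize memory tuning in one sentence."))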