hidden_size = intermediate_size * 0.25
(Still working on this one, don't use.) num_attention_heads = 64 / (intermediate_size / 1024)
intermediate_size = num_attention_heads * 128
Usually people use powers of 2 like 128, 256, and 512 for intermediate_size. The higher the intermediate_size, the more information the model can capture, but the longer training takes.
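For a quick sanity check, here's a minimal Python sketch of these sizing rules; the function name and the example intermediate_size of 4096 are just placeholders, and the unfinished attention-head formula is left commented out.

```python
def config_from_intermediate_size(intermediate_size: int) -> dict:
    """Rough config sizing from the rules of thumb above (a sketch, not a recipe)."""
    hidden_size = int(intermediate_size * 0.25)      # hidden_size = intermediate_size * 0.25
    # num_attention_heads = 64 / (intermediate_size / 1024)  # still a work in progress, don't use
    num_attention_heads = intermediate_size // 128   # from intermediate_size = num_attention_heads * 128
    return {
        "hidden_size": hidden_size,
        "intermediate_size": intermediate_size,
        "num_attention_heads": num_attention_heads,
    }

print(config_from_intermediate_size(4096))
# {'hidden_size': 1024, 'intermediate_size': 4096, 'num_attention_heads': 32}
```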
((size of LLM in GB) / 8) * 4 = (8-bit model size in GB)
(size of LLM in GB) * 2.5 = (RAM needed to run the LLM in GB)
Usually you multiply by a number between 2 and 4; 2.5 is a good multiplier for estimating the RAM needed for a GGML LLM.
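Here is a small Python sketch of both memory estimates; the function names and the 13 GB example are made up for illustration, and the multiplier defaults to the 2.5 suggested for GGML models.

```python
def int8_size_gb(model_size_gb: float) -> float:
    """((size of LLM in GB) / 8) * 4 = 8-bit model size in GB."""
    return (model_size_gb / 8) * 4

def ram_needed_gb(model_size_gb: float, multiplier: float = 2.5) -> float:
    """RAM needed to run the LLM; 2.5 works well for GGML, anything in 2-4 is typical."""
    return model_size_gb * multiplier

print(int8_size_gb(13))    # 6.5 GB as an 8-bit model
print(ram_needed_gb(6.5))  # 16.25 GB of RAM to run it
```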
(S * 4) * (E / 1024) = (GB of VRAM needed to fine-tune the model with a batch size of 1)
S = size of model in GB
E = max_position_embeddings
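Same idea for the fine-tuning VRAM estimate; the example values (a 3 GB model with a 2048-token context) are made up just to show the arithmetic.

```python
def finetune_vram_gb(model_size_gb: float, max_position_embeddings: int) -> float:
    """(S * 4) * (E / 1024) = GB of VRAM to fine-tune with a batch size of 1."""
    return (model_size_gb * 4) * (max_position_embeddings / 1024)

print(finetune_vram_gb(3, 2048))  # 24.0 GB of VRAM
```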
(T * D) + (12 * N * D^2) = (parameter count for the model)
T = vocab_size
D = n_embd (the model's hidden size; the source below calls it d_model)
N = n_layer
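As a worked example, plugging GPT-2 small's config (vocab_size=50257, n_embd=768, n_layer=12) into the formula lands close to its reported ~124M parameters; the function name is just for illustration.

```python
def param_count(vocab_size: int, d_model: int, n_layer: int) -> int:
    """(T * D) + (12 * N * D^2), the rough transformer parameter count."""
    return vocab_size * d_model + 12 * n_layer * d_model ** 2

# GPT-2 small: vocab_size=50257, d_model (n_embd)=768, n_layer=12
print(param_count(50257, 768, 12))  # 123,532,032, close to the reported ~124M
```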
The source for the parameter count math is kipp.ly/transformer-param-count, or if that isn't up anymore you can view it here.