Conversation
Will pull that and do some reference training (B/32, various lr and bs) to see if that goes well.
@JeniaJitsev sounds good. There are ideas from different papers in there, and they wouldn't necessarily all make sense in combination:
- qk_norm and scaled_cosine_attn are explicitly disabled together; combining them does not make sense
- using qk_norm in combo with the inner attn norm is probably not worthwhile
- using scale_heads + the inner attn norm together is not worthwhile
- scale_heads + scale_attn (the 'outer' norm) was part of NormFormer, so they work well together
- scale_fc (norm in the MLP) could be worthwhile with any of the attn-focused options
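To make those rules concrete, here is a hypothetical helper that encodes them. The flag names mirror the config options discussed above, but the checker itself (and the `scale_attn_inner` name for the inner attn norm) is illustrative, not part of the codebase:

```python
def check_block_flags(qk_norm=False, scaled_cosine_attn=False,
                      scale_heads=False, scale_attn=False,
                      scale_attn_inner=False, scale_fc=False):
    """Raise on disallowed combos, return warnings for dubious ones."""
    if qk_norm and scaled_cosine_attn:
        # explicitly disabled together upstream -- does not make sense
        raise ValueError("qk_norm and scaled_cosine_attn cannot be combined")
    warnings = []
    if qk_norm and scale_attn_inner:
        warnings.append("qk_norm + inner attn norm: probably not worthwhile")
    if scale_heads and scale_attn_inner:
        warnings.append("scale_heads + inner attn norm: not worthwhile")
    # scale_heads + scale_attn ('outer' norm) was part of NormFormer -> fine
    # scale_fc with any attn-focused option -> could be worthwhile
    return warnings
```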
I've found layer scale to benefit quite a few ViT training regimes... it's been supported for a while, but I'm not sure if any from-scratch runs use it. LS init values are usually set between 1e-5 and 0.1.
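For reference, a minimal CaiT-style layer scale module looks roughly like this (a sketch, not the repo's actual implementation):

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel learnable scaling of a residual branch (CaiT-style).

    A small init value keeps residual branches near-identity at the
    start of training; as noted above, init is usually 1e-5 to 0.1.
    """
    def __init__(self, dim, init_values=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_values * torch.ones(dim))

    def forward(self, x):
        return x * self.gamma

# typical use inside a residual block: x = x + ls(attn_or_mlp_out)
```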
@JeniaJitsev also, do confirm the model architecture changed according to the config flags set :) I ran through several of them but might have missed one, and you don't want to go through training and find out it didn't change the arch.
@rwightman So, it seems it makes sense to test 2 setups:
Hopefully for 1) and 2) I can get by with a corresponding json conf for each setup, without further source code changes in this PR. Will then check with you before starting training whether the confs make sense.
The print of the model from the main script after creation gives a quick overview: you can ensure it's a CustomResidualAttentionBlock and that the norm layers you intended to enable are there and not nn.Identity(). head_scale is a parameter that does not show up in the print, so you need to check that one manually...
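That check can be automated with a small audit helper. This is a sketch; the exact sub-module names in the real blocks may differ, and the `head_scale` parameter name follows the discussion above:

```python
import torch
import torch.nn as nn

def audit_block(block):
    """Report which norm sub-modules are active (not nn.Identity), and
    whether a head_scale parameter exists -- parameters don't appear in
    print(model), so that one has to be checked explicitly."""
    report = {}
    for name, mod in block.named_modules():
        if isinstance(mod, (nn.LayerNorm, nn.Identity)):
            # True means the norm is real, False means it was disabled
            report[name] = not isinstance(mod, nn.Identity)
    report['has_head_scale'] = any(
        'head_scale' in n for n, _ in block.named_parameters())
    return report
```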
My first 'omg I need a stable model' trials would be
And would probably enable layer scale for 1. and 2. by default just because I like it, hah. |
…_norm, scaled_cosine_attn, scale_heads, attn and inner attn norm scaling
add qk_norm, and enable qk_norm, scaled_cosine_attn, scale_heads, attn and inner attn norm scaling via config