
Wire up custom attention block via config#1086

Merged
rwightman merged 3 commits into main from custom_block
Jul 22, 2025
Conversation

@rwightman
Collaborator

Add qk_norm, and enable qk_norm, scaled_cosine_attn, scale_heads, and attn / inner attn norm scaling via config.

@JeniaJitsev
Contributor

Will pull that and do some reference training (B/32, various lr and bs) to see if that goes well.

@rwightman
Collaborator Author

@JeniaJitsev sounds good, there are ideas from different papers in there and they wouldn't necessarily all make sense in combination...

- qk_norm and scaled_cosine_attn are explicitly disabled together; the combination does not make sense
- using qk_norm in combo with the inner attn norm is probably not worthwhile
- using scale_heads + the inner attn norm together is not worthwhile
- scale_heads + scale_attn (the 'outer' norm) were part of normformer, so they work well together
- scale_fc (the norm in the mlp) with any of the attn-focused options could be worthwhile
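The compatibility rules above could be captured in a small validation helper. This is a hypothetical sketch, not part of the PR; the flag names mirror the options discussed but may not match the actual config keys.

```python
# Hypothetical sanity check for attention-block config flag combinations.
# Flag names follow the discussion above; not actual open_clip code.

def validate_attn_flags(qk_norm=False, scaled_cosine_attn=False,
                        scale_heads=False, scale_attn_inner=False):
    """Raise on the explicitly-disallowed combo; return advisory notes for
    combos that are allowed but probably not worthwhile."""
    if qk_norm and scaled_cosine_attn:
        raise ValueError("qk_norm and scaled_cosine_attn are mutually exclusive")
    notes = []
    if qk_norm and scale_attn_inner:
        notes.append("qk_norm with the inner attn norm is probably not worthwhile")
    if scale_heads and scale_attn_inner:
        notes.append("scale_heads with the inner attn norm is probably not worthwhile")
    return notes
```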

@rwightman
Collaborator Author

I've found layer scale to benefit quite a few ViT training regimes... it's been supported for a while, but I'm not sure if any from-scratch runs use it. Layer scale init values are usually set between 1e-5 and 0.1.
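For reference, layer scale is just a learnable per-channel scaling of a residual branch, initialized to a small constant. A minimal PyTorch sketch (assuming torch is available; this mirrors the common LayerScale pattern, not necessarily the exact open_clip implementation):

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scale applied to a residual branch output."""

    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        # gamma starts small (typically 1e-5 to 0.1) so the branch
        # contributes little early in training.
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma
```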

@rwightman
Collaborator Author

@JeniaJitsev also, do confirm the model architecture changed according to the config flags set :) I ran through several of them but might have missed one, and you don't want to go through training only to find out it didn't change the arch.

@JeniaJitsev
Contributor

@rwightman So, it seems it makes sense to test 2 setups:

  1. qk norm active (otherwise everything else standard training)
  2. scale head + scale_attn (as already employed in normformer)

Hopefully I can get 1) and 2) with a corresponding JSON conf for each setup, without further source code changes in this PR. I will then check the confs with you before starting training to make sure they make sense.

@rwightman
Collaborator Author

The print of the model from the main script after creation gives a quick overview; you can ensure it's a CustomResidualAttentionBlock and that the norm layers you intended to enable are there and not nn.Identity(). head_scale is a parameter that does not appear in the print, so you need to check that one manually...

  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x CustomResidualAttentionBlock(
        (ln_1): LayerNormFp32((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (ln_q): LayerNormFp32((64,), eps=1e-05, elementwise_affine=True)
          (ln_k): LayerNormFp32((64,), eps=1e-05, elementwise_affine=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (ln_inner): Identity()
          (out_proj): Linear(in_features=768, out_features=768, bias=True)
          (out_drop): Dropout(p=0.0, inplace=False)
        )
        (ln_attn): Identity()
        (ls_1): Identity()
        (ln_2): LayerNormFp32((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (ln): Identity()
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )

@rwightman
Collaborator Author

rwightman commented Jun 10, 2025

> @rwightman So, it seems it makes sense to test 2 setups:
>
>   1. qk norm active (otherwise everything else standard training)
>   2. scale head + scale_attn (as already employed in normformer)
>
> Hope 1) and 2) I can get by having a corresponding json conf for each setup without further source code changes in this PR. Will check then with you before starting training whether confs make sense.

My first 'omg I need a stable model' trials would be

  1. qk norm + scale_fc (maybe start w/o scale_fc and add if still unstable?)
  2. scale_attn_inner + scale_fc (magneto / kosmos-x.. in torchscale)
  3. scale_attn + scale_head + scale_fc (normformer)

And would probably enable layer scale for 1. and 2. by default just because I like it, hah.
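The three recipes above could be expressed as flag sets. The key names here are illustrative, not the PR's exact config keys:

```python
# Hypothetical flag sets for the three stability recipes above; the exact
# config key names in open_clip may differ. All three include the mlp norm
# (scale_fc), per the discussion.
STABILITY_RECIPES = {
    "qk_norm":    {"attn_qk_norm": True, "mlp_norm": True},
    "magneto":    {"attn_scale_inner": True, "mlp_norm": True},
    "normformer": {"attn_scale": True, "attn_scale_heads": True, "mlp_norm": True},
}
```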

@rwightman rwightman merged commit bbc6558 into main Jul 22, 2025
0 of 4 checks passed
@rwightman rwightman deleted the custom_block branch July 22, 2025 17:35