
Wire up custom attention block via config#1086

Merged
rwightman merged 3 commits into main from custom_block
Jul 22, 2025
Conversation

@rwightman
Collaborator

Add qk_norm, and enable qk_norm, scaled_cosine_attn, scale_heads, and attn / inner attn norm scaling via config.

@JeniaJitsev
Contributor

Will pull that and do some reference training (B/32, various lr and bs) to see if that goes well.

@rwightman
Collaborator Author

@JeniaJitsev sounds good, there are ideas from different papers in there and they wouldn't necessarily all make sense in combination...

- qk_norm and scaled_cosine_attn are explicitly disabled together; the combination does not make sense
- using qk_norm in combo with the inner attn norm is probably not worthwhile
- using scale_heads + the inner attn norm together is not worthwhile
- scale_heads + scale_attn (the 'outer' norm) were part of normformer, so they work well together
- scale_fc (the norm in the mlp) with any of the attn-focused options could be worthwhile
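The compatibility rules above could be captured in a small validation helper. This is a hypothetical sketch, not part of the PR; the flag names mirror the options discussed but may not match the actual config keys.

```python
# Hypothetical sanity check for attention-block config flag combinations.
# Flag names follow the discussion above; not actual open_clip code.

def validate_attn_flags(qk_norm=False, scaled_cosine_attn=False,
                        scale_heads=False, scale_attn_inner=False):
    """Raise on the explicitly-disallowed combo; return advisory notes for
    combos that are allowed but probably not worthwhile."""
    if qk_norm and scaled_cosine_attn:
        raise ValueError("qk_norm and scaled_cosine_attn are mutually exclusive")
    notes = []
    if qk_norm and scale_attn_inner:
        notes.append("qk_norm with the inner attn norm is probably not worthwhile")
    if scale_heads and scale_attn_inner:
        notes.append("scale_heads with the inner attn norm is probably not worthwhile")
    return notes
```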

@rwightman
Collaborator Author

I've found layer scale to benefit quite a few ViT training regimes... it's been supported for a while, but I'm not sure if any from-scratch runs use it. Layer scale init values are usually set between 1e-5 and 0.1.
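For reference, layer scale is just a learnable per-channel scaling of a residual branch, initialized to a small constant. A minimal PyTorch sketch (assuming torch is available; this mirrors the common LayerScale pattern, not necessarily the exact open_clip implementation):

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scale applied to a residual branch output."""

    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        # gamma starts small (typically 1e-5 to 0.1) so the branch
        # contributes little early in training.
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma
```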

@rwightman
Collaborator Author

@JeniaJitsev also, do confirm the model architecture changed according to the config flags set :) I ran through several of them but might have missed one, and you don't want to go through training only to find out it didn't change the arch.

@JeniaJitsev
Contributor

@rwightman So, it seems it makes sense to test 2 setups:

  1. qk norm active (otherwise everything else standard training)
  2. scale head + scale_attn (as already employed in normformer)

Hopefully I can get 1) and 2) with a corresponding JSON conf for each setup, without further source code changes in this PR. I will then check the confs with you before starting training to make sure they make sense.

@rwightman
Collaborator Author

The print of the model from the main script after creation gives a quick overview; you can ensure it's a CustomResidualAttentionBlock and that the norm layers you intended to enable are there and not nn.Identity(). head_scale is a parameter that does not appear in the print, so you need to check that one manually...

  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x CustomResidualAttentionBlock(
        (ln_1): LayerNormFp32((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (ln_q): LayerNormFp32((64,), eps=1e-05, elementwise_affine=True)
          (ln_k): LayerNormFp32((64,), eps=1e-05, elementwise_affine=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (ln_inner): Identity()
          (out_proj): Linear(in_features=768, out_features=768, bias=True)
          (out_drop): Dropout(p=0.0, inplace=False)
        )
        (ln_attn): Identity()
        (ls_1): Identity()
        (ln_2): LayerNormFp32((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (ln): Identity()
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )

@rwightman
Collaborator Author

rwightman commented Jun 10, 2025

> @rwightman So, it seems it makes sense to test 2 setups:
>
>   1. qk norm active (otherwise everything else standard training)
>   2. scale head + scale_attn (as already employed in normformer)
>
> Hope 1) and 2) I can get by having a corresponding json conf for each setup without further source code changes in this PR. Will check then with you before starting training whether confs make sense.

My first 'omg I need a stable model' trials would be

  1. qk norm + scale_fc (maybe start w/o scale_fc and add if still unstable?)
  2. scale_attn_inner + scale_fc (magneto / kosmos-x.. in torchscale)
  3. scale_attn + scale_head + scale_fc (normformer)

And would probably enable layer scale for 1. and 2. by default just because I like it, hah.
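The three recipes above could be expressed as flag sets. The key names here are illustrative, not the PR's exact config keys:

```python
# Hypothetical flag sets for the three stability recipes above; the exact
# config key names in open_clip may differ. All three include the mlp norm
# (scale_fc), per the discussion.
STABILITY_RECIPES = {
    "qk_norm":    {"attn_qk_norm": True, "mlp_norm": True},
    "magneto":    {"attn_scale_inner": True, "mlp_norm": True},
    "normformer": {"attn_scale": True, "attn_scale_heads": True, "mlp_norm": True},
}
```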

@rwightman rwightman merged commit bbc6558 into main Jul 22, 2025
0 of 4 checks passed
@rwightman rwightman deleted the custom_block branch July 22, 2025 17:35