NeurIPS 2021 | MST: Masked Self-Supervision for Transformer Visual Representation, Explained
MST: Masked Self-Supervised Transformer for Visual Representation
Part 1 Problems and Challenges
Yann LeCun once said, "If artificial intelligence is a cake, the bulk of the cake is unsupervised learning." The remark reflects the important role unsupervised learning plays in deep learning. Current mainstream methods focus on designing effective pretext tasks so that good visual representations can be learned from unlabeled input data. In computer vision, the most popular and directly effective approach today is contrastive learning, which treats every instance in the training data as its own class. Building on this instance discrimination, many self-supervised methods have achieved clear gains on classification tasks and have successfully closed the gap between self-supervised and supervised methods. The task nevertheless remains challenging:
a. Masked language models are widely used in natural language processing. Images, however, are high-dimensional, noisy, and more complex in form than text. In vision, the important information of an image is spread arbitrarily across different tokens, and masking those tokens at random leads to poor performance. A random masked language model easily masks the tokens of an image's key regions, which causes misjudgment, so it is not suitable for direct use in self-supervised vision Transformers.
b. Many self-supervised methods use global features to learn image-level prediction and are under-optimized for pixel-level prediction. Current self-supervised learning methods may overfit to image classification and perform poorly on downstream dense prediction tasks.
Part 2 Method
To address the problems above, we propose a masked Transformer self-supervised learning method, shown in the figure below. MST introduces a masking strategy guided by the attention map and uses the masked features for a global image restoration task. We first describe how the attention-guided masking strategy helps bring masked language modeling to vision, and then describe the network structure and the experimental details.

Figure 1: Overall pipeline of MST
1. Network architecture of the self-supervised method
Following multi-crop with several common data augmentations, we generate multiple views of each image. This yields two standard-resolution crops (the global views) and N low-resolution crops (the local views). As shown in Figure 1, the method encodes the views with two encoders, a teacher network and a student network, each with its own parameters. Both encoders consist of a Transformer backbone and a projection head. The teacher encoder's parameters are updated as a moving average of the student encoder's parameters. The update rule is given in Eq. 1:
![]()
where m is the momentum coefficient.
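The equation image is not reproduced here, so the following is only a minimal sketch of a momentum (moving-average) update consistent with the description above. The function name `ema_update`, the dictionary layout of the parameters, and the value `m=0.9` are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def ema_update(teacher_params, student_params, m=0.9):
    """One momentum step: teacher <- m * teacher + (1 - m) * student.

    Assumption: parameters are stored as dicts of NumPy arrays with
    matching keys; m is the momentum coefficient from the text.
    """
    return {
        name: m * teacher_params[name] + (1.0 - m) * student_params[name]
        for name in teacher_params
    }

# Toy parameters (hypothetical values, for illustration only).
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, m=0.9)
```

With m close to 1 the teacher changes slowly, which is why it can be treated as a fixed target while the student is optimized.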
Given a fixed teacher network, the student network learns its parameters by minimizing the cross-entropy loss shown in Eq. 2:

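As a rough illustration of the cross-entropy objective described above, the sketch below computes the loss for a single pair of views, with the teacher output treated as a fixed target distribution. The helper names, the use of a plain softmax without temperature, and the random toy logits are assumptions made for the example; how views are paired and summed in the actual method is not specified in this text.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(teacher_logits, student_logits):
    """Mean over the batch of H(p_teacher, p_student) = -sum p_t * log p_s."""
    p_teacher = np.exp(log_softmax(teacher_logits))  # fixed target
    return float(-(p_teacher * log_softmax(student_logits)).sum(axis=-1).mean())

# Toy batch of 4 samples with 8 output dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 8))
student_logits = rng.normal(size=(4, 8))
loss = cross_entropy(teacher_logits, student_logits)
```

The loss is smallest when the student distribution matches the teacher distribution, which is what drives the student toward the teacher's view-invariant output.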
2. Token masking strategy for the visual masked model

Figure 2: Illustration of MST's attention-guided masking strategy. Compared with the original random masking, it improves results by preserving the key regions of the image. From left to right: (a) the input image, (b) the attention map obtained from the self-attention module, (c) the random masking strategy, which may lose key features, and (d) MST's attention-guided masking strategy, which masks only non-essential regions. In practice, the masking strategy masks tokens.
Random masking strategy: Inspired by the masked language modeling strategy in natural language processing, we apply random masking to self-supervised learning. According to Eq. 3, tokens in important regions and tokens in unimportant regions have the same probability of being masked. As Figure 2 shows, random masking can remove the tokens of important regions, which makes the semantic content of the input image hard to distinguish. This random sampling suppresses the important regions of the input image and hurts the network's recognition ability. The strategy is therefore not suitable for direct use in self-supervised vision Transformers: if the masking strategy is not tuned properly, overall performance degrades.

where m denotes the mask, p is the masking probability (0.15 by default), and prob is a randomly generated probability value.
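The random strategy can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes a token is masked when its random value `prob` falls below `p`, and the function name `random_mask` and the token count of 196 are hypothetical.

```python
import numpy as np

def random_mask(num_tokens, p=0.15, rng=None):
    """Return a boolean mask; True marks a token selected for masking.

    Assumption: a token is masked when its uniform random value
    prob is below p, so every token has the same chance of masking.
    """
    rng = np.random.default_rng() if rng is None else rng
    prob = rng.random(num_tokens)  # uniform values in [0, 1)
    return prob < p

# Example: 196 tokens, e.g. a 14 x 14 patch grid (hypothetical size).
mask = random_mask(196, p=0.15, rng=np.random.default_rng(0))
```

Because the decision ignores image content, tokens from the key region of the image are as likely to be masked as background tokens, which is the weakness discussed above.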
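Attention-guided masking strategy: We propose an attention-guided masking strategy that dynamically controls which regions are masked and lowers the probability that important regions are masked. The method adds no extra computation time; the overall procedure is given as pseudocode. As shown in Eq. 4, we sort the attention values of the patches of each image in ascending order and take a value from the sorted list as the threshold; regions whose attention is below the threshold become masking candidates. The student branch receives the importance of the different patches and generates the mask according to the masking probability.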
![]()
where Attn denotes the attention features.
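A minimal sketch of the attention-guided selection, under several assumptions: each patch has one scalar attention score, the candidates are the `1/num` fraction of patches with the lowest scores (equivalent to thresholding the sorted scores when there are no ties), and each candidate is then masked with probability `p`. The function name and the toy scores are hypothetical.

```python
import numpy as np

def attention_guided_mask(attn, p=0.15, num=8, rng=None):
    """Mask only low-attention patches, each with probability p.

    attn: 1-D array with one attention score per patch (assumed layout).
    num:  the lowest 1/num of patches become masking candidates.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = attn.shape[0]
    k = n // num                    # number of candidate patches
    order = np.argsort(attn)        # ascending attention
    candidates = np.zeros(n, dtype=bool)
    candidates[order[:k]] = True    # lowest-attention patches only
    prob = rng.random(n)
    return candidates & (prob < p)  # high-attention patches are never masked

# Toy scores for 8 patches (hypothetical values).
attn = np.array([0.30, 0.02, 0.25, 0.01, 0.20, 0.05, 0.10, 0.07])
mask = attention_guided_mask(attn, p=1.0, num=4)
```

Setting `p=1.0` in the toy call masks every candidate, which makes it easy to see that only the two lowest-scoring patches are ever eligible.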
Following BERT, the masked regions are filled with a learnable mask embedding [MASK]. As shown in Eq. 5, the attention-guided masking strategy guarantees that high-scoring patches are not masked.
![]()
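The replacement of masked tokens by a shared embedding can be sketched as below. The array shapes, the function name `apply_mask_embedding`, and the constant toy values are assumptions for illustration; in training, the mask embedding would be a learnable vector rather than a fixed array.

```python
import numpy as np

def apply_mask_embedding(tokens, mask, mask_embedding):
    """Replace masked token embeddings with one shared [MASK] vector.

    tokens:         (n, d) patch embeddings
    mask:           (n,) boolean, True where the token is masked
    mask_embedding: (d,) the [MASK] vector (learnable in practice)
    """
    return np.where(mask[:, None], mask_embedding[None, :], tokens)

# Toy input: 4 tokens of dimension 3 (hypothetical sizes).
tokens = np.ones((4, 3))
mask = np.array([True, False, False, True])
mask_embedding = np.zeros(3)
masked_tokens = apply_mask_embedding(tokens, mask, mask_embedding)
```

Unmasked tokens pass through unchanged, so the sequence length and positions seen by the Transformer stay the same.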

The attention-guided masking strategy benefits the pre-trained model in two ways:
a. The model uses semantic information to learn the relationships between different patches, so it retains the global semantic information of the image while also attending to local detail.
b. Our strategy avoids masking key regions and replaces the masked tokens with a learnable mask feature, which keeps the model focused on the key regions.
3. Masked decoding for vision Transformers
In a masked language model, the features of unmasked regions are used to predict the masked tokens. Unlike the original masked language model, our method uses the features of unmasked regions to restore the original input image. This lets the network perform pixel-level restoration on vision tasks, which strengthens its ability to capture pixel-level information and fine-grained spatial structure. To exploit the inductive bias of convolution, the restoration task uses a convolutional neural network as the decoder, built by alternately stacking convolutional layers and upsampling operations. Eq. 6 gives the restoration loss:
![]()
where x denotes the input image and g the decoder; the remaining symbols denote the student-branch encoder, the student-branch parameters, and the decoder parameters.
The overall loss function is given in Eq. 7:
![]()
where λ is the weighting coefficient.
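To make the combination of the two objectives concrete, here is a small sketch. It assumes the restoration loss is a mean absolute error between the input image and the decoder output, and that the overall loss is the cross-entropy term plus λ times the restoration term; the text above does not state the exact norm, so both the L1 choice and the function names are assumptions.

```python
import numpy as np

def restoration_loss(x, x_restored):
    """Pixel-level restoration error (assumed: mean absolute error)."""
    return float(np.abs(x - x_restored).mean())

def total_loss(ce_loss, restore_loss, lam=1.0):
    """Overall objective: cross-entropy term plus lam * restoration term."""
    return ce_loss + lam * restore_loss

# Toy image and a poor reconstruction (hypothetical values).
x = np.ones((3, 8, 8))
x_restored = np.zeros((3, 8, 8))
l_restore = restoration_loss(x, x_restored)
l_total = total_loss(ce_loss=1.0, restore_loss=l_restore, lam=0.5)
```

The coefficient λ controls how strongly pixel-level restoration is weighted against the teacher-student objective.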
Part 3 Experimental Results
We pre-trained models with different Transformer architectures on the ImageNet benchmark, then evaluated their transfer ability on downstream tasks such as object detection and semantic segmentation, and studied the effect of different masking parameters.
1. Comparison on the ImageNet benchmark
Table 1 compares our method with current mainstream self-supervised algorithms. All methods use the same backbone for a fair comparison. Our 300-epoch model reaches 76.9% top-1 accuracy under linear evaluation. With the same number of training epochs, our method exceeds DINO, the best self-supervised method at the time, by about 1.7%, and even comes close to DINO trained on a much longer schedule (77.0% at 800 epochs). It is worth emphasizing that our algorithm reduces the need for extremely long training in self-supervised learning and achieves a good result (75.0%) with only 100 epochs.
MST is a general method that can be applied to any self-supervised method built on a Transformer architecture. Here we use the popular Swin-T as an example; it has a similar number of parameters to DeiT-S. With the same number of training epochs, MST outperforms MoBY, a self-supervised learning method designed specifically for Swin-T, by 1.8%. Swin-T shares the same hyperparameters as DeiT-S here, so it could still be improved with further tuning.

Table 1: Comparison of popular self-supervised learning methods on ImageNet
2. Object detection and instance segmentation as downstream tasks
Table 2 shows the performance of representations learned by different self-supervised methods and by supervised training. For a fair comparison, all methods are pre-trained for 100 epochs. Our method achieves the best results, with 42.7% bbox mAP and 38.8% mask mAP. It exceeds the ImageNet supervised model by 1.2% and 0.5%, respectively, and margins of 1.2% and 0.5% are likewise reported relative to MoBY at the same number of epochs. The results show that MST performs well not only on image classification but also on downstream dense prediction tasks, so it has strong transfer ability.

Table 2: Object detection and instance segmentation results fine-tuned on MS COCO
Table 3 compares the supervised method, DINO, and our method on this evaluation. Our method achieves the highest mIoU (74.7%) and mAcc (82.35%). It outperforms both the supervised result (+2.71% mIoU and +2.05% mAcc) and the DINO pre-trained result (+1.08% mIoU and +1.03% mAcc). Our model therefore also transfers well to semantic segmentation.

Table 3: Semantic segmentation results fine-tuned on Cityscapes
3. Effect of different masking strategies
Table 4 shows the effect of different masking strategies. We train DeiT-S with random masking, with attention-guided masking, and with no masking. For a fair comparison, all methods mask with the same probability p. Performance drops under random masking: the strategy can suppress the ability to recognize the image (from 73.1 to 63.2). Random masking may destroy tokens in the key regions of the original image, and those tokens may be indispensable for recognizing the object. The masked input may then contain incomplete or even misleading information. In contrast, our attention-guided masking strategy improves performance (from 73.1 to 73.7). The essential regions are mostly preserved, which supports our hypothesis.

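Table 4: Linear probing results for different masking strategies (DeiT-S)
4. Effect of different masking hyperparameters
Table 5 evaluates different masking hyperparameters under the attention-guided masking strategy. We sort the attention values of the patches of each image in ascending order and take the first 1/num of the patches as masking candidates. Masking these candidates forces the network to learn local features from neighboring patches, which strengthens its ability to model local context without destroying the semantics. The candidates are masked with probability p. The top-1 accuracy under linear evaluation on ImageNet is shown in the table below. When num is set to 8, any choice of p gives a reliable result, which indicates that using the lowest-attention 1/8 of the patches as masking candidates is relatively safe.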

Table 5: Hyperparameter settings for the attention-based masking strategy
5. Differences from BERT
In Table 6, we run experiments with a pure masked language model on DeiT-S for 100 epochs; under the same experimental settings the result is about 40%. After further tuning the learning rate and other hyperparameters, the best result is only 61%, which is 10.6% below DINO (71.6%) and 7.7% below the supervised result (68.7%). This indicates that a pure masked language model may not be suitable for computer vision tasks. We also experiment with a contrastive loss plus BERT-style solution (that is, DINO plus a pure masked language model), which gives a linear evaluation result of 71.9%. Our method is 2.0% higher (73.9%), which confirms that it is better than the original approach. We further run an experiment in which only the [mask] token handling is replaced with the pure masked language model strategy; the linear result is 73.5%, which also falls behind our result. These results show a better way of setting up MLM for computer vision and further highlight the technical contribution of our paper.

Table 6: Differences from BERT
Part 4 Conclusion
This article discussed two problems in current visual self-supervised learning: insufficient extraction of local information and loss of spatial information. To overcome them, we propose a new Transformer-based self-supervised learning method called MST. MST uses an attention-guided masking strategy to capture local relationships between patches while preserving global semantic information. Note that the attention-guided masking strategy is based on the multi-head self-attention maps extracted from the teacher model, so it incurs no extra computational cost. In addition, under the attention-guided masking strategy, a global image decoder is used to restore the spatial information of the image, which is essential for dense prediction tasks. The method shows good generality and scalability across multiple downstream vision tasks.
References:
[1] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.N.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
[2] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. arXiv: Computer Vision and Pattern Recognition
[3] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
[4] Xie, Z., Lin, Y., Yao, Z., Zhang, Z., Dai, Q., Cao, Y., Hu, H.: Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553 (2021)




