Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

Bibliographic Details
Title:	Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks
Authors:	Beneventano, Pierfrancesco; Woodworth, Blake
Publication Year:	2025
Collection:	Computer Science; Mathematics; Statistics
Subject Terms:	Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Description:	We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize of about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.
Comment:	23 pages, 3 figures
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2501.09137
Accession Number:	edsarx.2501.09137
Database:	arXiv

Publication Date:	15 January 2025
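
The Description's comparison between GD and gradient flow can be illustrated numerically. The sketch below is not the paper's setting or code: it assumes the smallest possible toy case, a hidden width of one and a single training pair, so the loss is $L(a, b) = \tfrac{1}{2}(ab - y)^2$. The function names and all numeric values are hypothetical choices made for illustration. In this toy case gradient flow conserves the imbalance $a^2 - b^2$, so the sharpness $a^2 + b^2$ of its limit has the closed form $\sqrt{(a_0^2 - b_0^2)^2 + 4y^2}$, whereas each GD step multiplies the imbalance by $1 - \eta^2 (ab - y)^2$.

```python
def gd_scalar_two_layer(a, b, y, eta, steps):
    """Run GD on the toy loss L(a, b) = 0.5 * (a*b - y)**2; return final (a, b)."""
    for _ in range(steps):
        r = a * b - y  # residual of the two-layer scalar network
        # Simultaneous update: dL/da = r * b, dL/db = r * a
        a, b = a - eta * r * b, b - eta * r * a
    return a, b


def sharpness_at_minimum(a, b):
    """Top Hessian eigenvalue at a global minimum (a*b == y).

    With zero residual the Hessian is the rank-one matrix g g^T, g = (b, a),
    so its largest eigenvalue is a^2 + b^2.
    """
    return a * a + b * b


# Hypothetical initialization, target, and stepsize (assumptions, not from the paper).
a0, b0, y, eta = 3.0, 0.1, 1.0, 0.2

a_gd, b_gd = gd_scalar_two_layer(a0, b0, y, eta, steps=500)
gd_loss = 0.5 * (a_gd * b_gd - y) ** 2
gd_sharpness = sharpness_at_minimum(a_gd, b_gd)

# Gradient flow keeps a^2 - b^2 fixed, so at its limit (a*b = y):
# (a^2 + b^2)^2 = (a^2 - b^2)^2 + 4 * (a*b)^2 = imbalance0^2 + 4 * y^2.
imbalance0 = a0 ** 2 - b0 ** 2
gf_sharpness = (imbalance0 ** 2 + 4.0 * y ** 2) ** 0.5

print("GD final loss:         ", gd_loss)
print("GD sharpness at limit: ", gd_sharpness)
print("GF sharpness at limit: ", gf_sharpness)
```

Under these assumptions the stepsize is a little below 2 divided by the sharpness of the gradient flow limit, so the iterates oscillate around the minimum while the residual shrinks. Every oscillation reduces the imbalance, which is why the GD limit comes out flatter than the gradient flow limit in this toy problem.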