Report
Subtle Data Crimes: Naively training machine learning algorithms could lead to overly-optimistic results
Title: | Subtle Data Crimes: Naively training machine learning algorithms could lead to overly-optimistic results |
---|---|
Authors: | Shimron, Efrat; Tamir, Jonathan I.; Wang, Ke; Lustig, Michael |
Publication Year: | 2021 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Machine Learning |
Description: | While open databases are an important resource in the Deep Learning (DL) era, they are sometimes used "off-label": data published for one task are used to train algorithms for a different one. This work aims to highlight that, in some cases, this common practice may lead to biased, overly-optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data preprocessing pipelines. We describe two preprocessing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for Magnetic Resonance Imaging (MRI) reconstruction: Compressed Sensing (CS), Dictionary Learning (DictL), and DL. In this large-scale study, we performed extensive computations. Our results demonstrate that the CS, DictL, and DL algorithms yield systematically biased results when naively trained on seemingly appropriate data: the Normalized Root Mean Square Error (NRMSE) improves consistently with the extent of preprocessing, showing an artificial improvement of 25%-48% in some cases. Since this phenomenon is generally unknown, biased results are sometimes published as state-of-the-art; we refer to that as subtle data crimes. This work hence raises a red flag regarding naive off-label usage of Big Data and reveals the vulnerability of modern inverse problem solvers to the resulting bias. Comment: 16 pages, 7 figures, two tables. Submitted to a journal |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2109.08237 |
Accession Number: | edsarx.2109.08237 |
Database: | arXiv |
Publication Date: | 16 September 2021 |
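
The Description reports its findings in terms of NRMSE and a relative change of 25%-48%. This record does not state the exact NRMSE normalization used in the paper, so the following is only a minimal sketch under an assumed, commonly used definition (the L2 norm of the error divided by the L2 norm of the reference). The function names `nrmse` and `relative_improvement` are hypothetical and chosen for illustration.

```python
import numpy as np


def nrmse(reference, reconstruction):
    """NRMSE under an ASSUMED definition: ||rec - ref||_2 / ||ref||_2.

    The paper's exact normalization is not given in this record.
    Works for real or complex arrays of any shape.
    """
    ref = np.asarray(reference).ravel()
    rec = np.asarray(reconstruction).ravel()
    return np.linalg.norm(rec - ref) / np.linalg.norm(ref)


def relative_improvement(nrmse_baseline, nrmse_processed):
    """Fractional drop in NRMSE relative to the baseline.

    Positive values mean the processed case has a lower (better) error.
    """
    return (nrmse_baseline - nrmse_processed) / nrmse_baseline


if __name__ == "__main__":
    # Synthetic data only; these numbers are not results from the paper.
    rng = np.random.default_rng(0)
    image = rng.standard_normal((64, 64))
    noisy = image + 0.10 * rng.standard_normal((64, 64))
    less_noisy = image + 0.07 * rng.standard_normal((64, 64))

    e_base = nrmse(image, noisy)
    e_proc = nrmse(image, less_noisy)
    print(f"baseline NRMSE:  {e_base:.4f}")
    print(f"processed NRMSE: {e_proc:.4f}")
    print(f"relative improvement: {100 * relative_improvement(e_base, e_proc):.1f}%")
```

Because a lower NRMSE is better, an "improvement" here means a decrease in error; a reported 25% improvement would correspond to `relative_improvement` returning roughly 0.25 under this assumed definition.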