Bibliographic Details
Title: FP8 Formats for Deep Learning
Authors: Micikevicius, Paulius; Stosic, Dusan; Burgess, Neil; Cornea, Marius; Dubey, Pradeep; Grisenthwaite, Richard; Ha, Sangwon; Heinecke, Alexander; Judd, Patrick; Kamalu, John; Mellempudi, Naveen; Oberman, Stuart; Shoeybi, Mohammad; Siu, Michael; Wu, Hao
Publication Year: 2022
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning
Description: FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training quantization of language models trained using 16-bit formats that resisted fixed-point int8 quantization.
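The E4M3 and E5M2 layouts named in the description can be illustrated with a short decoder. The Python sketch below is illustrative and not taken from the paper; it assumes exponent biases of 7 (E4M3) and 15 (E5M2) and applies the special-value conventions stated above: E5M2 keeps IEEE 754 infinities and NaNs, while E4M3 encodes no infinities and reserves only the all-ones exponent with all-ones mantissa pattern for NaN.

def decode_fp8(byte, exp_bits, man_bits, bias, e4m3_special=False):
    """Decode an 8-bit pattern split as 1 sign / exp_bits / man_bits.

    e4m3_special=True applies the E4M3 convention described above:
    no infinities, and only the all-ones exponent + all-ones mantissa
    bit pattern is NaN (other all-ones-exponent patterns are normal numbers).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1

    if exp == max_exp:
        if e4m3_special:
            if man == (1 << man_bits) - 1:
                return float("nan")
            # fall through: treated as a normal number, extending the range
        else:
            # E5M2 follows IEEE 754: zero mantissa -> infinity, otherwise NaN
            return float("nan") if man else sign * float("inf")
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# Largest finite magnitudes under these assumptions:
# E4M3 pattern S.1111.110 -> 1.75 * 2^8 = 448.0
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3, bias=7, e4m3_special=True))
# E5M2 pattern S.11110.11 -> 1.75 * 2^15 = 57344.0
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2, bias=15))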
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2209.05433
Accession Number: edsarx.2209.05433
Database: arXiv