PubChem compounds before and after standardization

Bibliographic details
Title: PubChem compounds before and after standardization
Authors: Miruna T. Cretu, Alessandra Toniato, Amol Thakkar, Amin Debabeche, Teodoro Laino, Alain C. Vaucher
Publication info: Zenodo
Publication year: 2023
Collection: Zenodo
Description: This repository contains a subset of 200k PubChem compounds before and after standardization. This dataset is the basis of the PubChem-pretrained model presented in "Standardizing chemical compounds with language models" (available on ChemRxiv; see also the associated GitHub repository). The associated pretrained model is also provided here, along with the splits used for training. The data is provided under the CDLA-Sharing-1.0 license.
Provided files:
  README.md: General README.
  src_and_tgt_all.csv: All the compounds before and after standardization, in CSV format.
  src-train.txt: The tokenized compounds before standardization in the train split.
  tgt-train.txt: The tokenized compounds after standardization in the train split.
  src-valid.txt: The tokenized compounds before standardization in the validation split.
  tgt-valid.txt: The tokenized compounds after standardization in the validation split.
  src-test.txt: The tokenized compounds before standardization in the test split.
  tgt-test.txt: The tokenized compounds after standardization in the test split.
  LICENSE.md: The details of the CDLA-Sharing-1.0 license.
  pretrained_pubchem_step_120000.pt: The pretrained model.
Document type: other/unknown material
Language: English
Relation: https://doi.org/10.26434/chemrxiv-2022-14ztf-v2; https://github.com/rxn4chemistry/rxn-standardization; https://zenodo.org/communities/rxn4chemistry; https://doi.org/10.5281/zenodo.7842043; https://doi.org/10.5281/zenodo.7842044; oai:zenodo.org:7842044
DOI: 10.5281/zenodo.7842044
Availability: https://doi.org/10.5281/zenodo.7842044
Rights: info:eu-repo/semantics/openAccess ; Community Data License Agreement Sharing 1.0 ; https://cdla.io/sharing-1-0
Accession number: edsbas.65C8D3A0
Database: BASE
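
The description above lists parallel src-*.txt and tgt-*.txt files holding tokenized compounds before and after standardization. As a minimal sketch of how such a pair might be read, the Python below assumes (this is not stated in the record) that each line holds one compound, that tokens are separated by whitespace, and that line i of a src file corresponds to line i of the matching tgt file. The helper names `detokenize` and `read_pairs` are hypothetical and are not part of the dataset or its repository.

```python
from typing import Iterator, Tuple


def detokenize(line: str) -> str:
    """Join whitespace-separated tokens back into one string.

    Assumption: tokens are separated by spaces or tabs and carry
    no meaningful internal whitespace.
    """
    return "".join(line.split())


def read_pairs(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (before, after) compound strings from two parallel files.

    Assumption: both files have the same number of lines and are aligned
    line by line. zip() stops at the shorter file if they differ.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield detokenize(src_line), detokenize(tgt_line)
```

For example, `list(read_pairs("src-test.txt", "tgt-test.txt"))` would collect the test split as a list of (before, after) pairs, provided the assumptions above hold for the downloaded files.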