Bibliographic Details
Title: |
PubChem compounds before and after standardization |
Authors: |
Miruna T. Cretu, Alessandra Toniato, Amol Thakkar, Amin Debabeche, Teodoro Laino, Alain C. Vaucher |
Publication Information: |
Zenodo |
Publication Year: |
2023 |
Collection: |
Zenodo |
Description: |
This repository contains a subset of 200k PubChem compounds before and after standardization. This dataset is the basis of the PubChem-pretrained model presented in "Standardizing chemical compounds with language models" (available on ChemRxiv; see also the associated GitHub repository). The associated pretrained model is also provided here, along with the splits used for training. The data is provided under the CDLA-Sharing-1.0 license. Provided files: README.md: general README; src_and_tgt_all.csv: all the compounds before and after standardization, in CSV format; src-train.txt: the tokenized compounds before standardization in the train split; tgt-train.txt: the tokenized compounds after standardization in the train split; src-valid.txt: the tokenized compounds before standardization in the validation split; tgt-valid.txt: the tokenized compounds after standardization in the validation split; src-test.txt: the tokenized compounds before standardization in the test split; tgt-test.txt: the tokenized compounds after standardization in the test split; LICENSE.md: the details of the CDLA-Sharing-1.0 license; pretrained_pubchem_step_120000.pt: the pretrained model. |
Document Type: |
other/unknown material |
Language: |
English |
Relation: |
https://doi.org/10.26434/chemrxiv-2022-14ztf-v2; https://github.com/rxn4chemistry/rxn-standardization; https://zenodo.org/communities/rxn4chemistry; https://doi.org/10.5281/zenodo.7842043; https://doi.org/10.5281/zenodo.7842044; oai:zenodo.org:7842044 |
DOI: |
10.5281/zenodo.7842044 |
Availability: |
https://doi.org/10.5281/zenodo.7842044 |
Rights: |
info:eu-repo/semantics/openAccess ; Community Data License Agreement Sharing 1.0 ; https://cdla.io/sharing-1-0 |
Accession Number: |
edsbas.65C8D3A0 |
Database: |
BASE |
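
The description lists parallel src-*.txt and tgt-*.txt files holding tokenized compounds before and after standardization. The record does not specify the tokenization or file layout, so the following is only a minimal sketch under two assumptions: each file has one compound per line, and tokens are separated by whitespace. The helper names `detokenize`, `pair_lines`, and `read_pairs` are hypothetical and are not part of the dataset or the associated repository.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def detokenize(line: str) -> str:
    """Join whitespace-separated tokens into one string.

    Assumption: tokens are separated by spaces, which the record does not confirm.
    """
    return "".join(line.split())


def pair_lines(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (before, after) pairs from two line-aligned sequences.

    Plain zip stops at the shorter input, so a length mismatch is silently
    truncated; check the line counts separately if that matters.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        yield detokenize(src), detokenize(tgt)


def read_pairs(src_path: Path, tgt_path: Path) -> Iterator[Tuple[str, str]]:
    """Read two parallel files (e.g. src-test.txt and tgt-test.txt) as pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_lines(src, tgt)
```

For example, `read_pairs(Path("src-test.txt"), Path("tgt-test.txt"))` would iterate over the test split once the files have been downloaded to the working directory.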