Skip to content

Latest commit

 

History

History
29 lines (23 loc) · 1.64 KB

README.md

File metadata and controls

29 lines (23 loc) · 1.64 KB

QM1B dataset (to be released soon)

QM1B is a low-resolution DFT dataset generated using PySCF IPU. It is composed of one billion training examples containing 9-11 heavy atoms. It was created by taking 1.09M SMILES strings from the GDB-11 database and computing molecular properties (e.g. HOMO-LUMO gap) for a set of up to 1000 conformers per molecule.

Dataset schema

QM1B dataset is stored in the open-source columnar Apache Parquet format, with the following schema:

  • smile: The SMILES string taken from GDB11. There are up to 1000 rows (i.e. conformers) with the same SMILES string.
  • atoms: String representing the atom symbols of the molecule, e.g. ”COOH”.
  • z: Integer representation of atoms used by SchNet (the atomic numbers).
  • energy: energy of the molecule computed by PySCF IPU (unit eV).
  • homo: The energy of the Highest Occupied Molecular Orbital (HOMO) (unit eV).
  • lumo: The energy of the Lowest occupied Molecular Orbital (LUMO) (unit eV).
  • N: The number of atomic orbitals for the specific DFT computation (depends on the basis set STO3G).
  • std: The standard deviation of the energy of the last five iterations of running PySCFIPU, used as convergence criteria std < 0.01 (unit eV).
  • y: The HOMO-LUMO Gap (unit eV).
  • pos: The atom positions (unit Bohr).

Dataset exploration

Dataset exploration can easily done using Pandas library. For instance, to load the validation set:

import pandas as pd

# 20m entries in the validation set.
print(pd.read_parquet("qm1b_val.parquet").head())