Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models

1 Introduction

The paper proposes a new dataset, Alchemy, which contains 12 quantum mechanical properties of 119,487 molecules. It also presents a molecular property challenge based on this dataset.

3 Literature Survey about Molecular Machine Learning

3.1 Molecular Representations: SMILES and Fingerprint

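The essential features of a molecule can be represented as a molecular graph, in which each vertex is an atom and each edge is a bond. Machine learning has long struggled with this kind of graph-structured data. Two common ways of dealing with it are SMILES and fingerprints.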
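To make the graph view concrete, here is a minimal sketch of a molecular graph as a plain data structure. The class name `MolecularGraph` and its methods are hypothetical, introduced only for illustration; they are not taken from the paper or from any library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MolecularGraph:
    """Toy molecular graph: vertices are atoms, edges are bonds."""
    atoms: List[str]  # element symbol of each vertex
    bonds: Dict[Tuple[int, int], int] = field(default_factory=dict)  # (i, j) -> bond order

    def add_bond(self, i: int, j: int, order: int = 1) -> None:
        # Store each undirected edge once, with the smaller index first.
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i: int) -> List[int]:
        # Indices of all atoms bonded to atom i.
        return sorted(b if a == i else a for (a, b) in self.bonds if i in (a, b))


# Heavy-atom skeleton of ethanol, C-C-O (hydrogens omitted for brevity).
ethanol = MolecularGraph(atoms=["C", "C", "O"])
ethanol.add_bond(0, 1)
ethanol.add_bond(1, 2)

print(ethanol.neighbors(1))
```

The middle carbon (index 1) is bonded to both the other carbon and the oxygen, which is exactly the structure that a flat string or fixed-length vector has to encode indirectly.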

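SMILES converts the 2D graph into a 1D string, so many NLP techniques, such as RNNs and LSTMs, can be applied to it. However, SMILES was not designed to represent sequences, which causes several problems:
1. If molecules are generated by generating SMILES, many of the generated strings are invalid SMILES.
2. The same molecule can have several valid and very different SMILES. Even if the strings can be canonicalized so that the same molecule always yields the same SMILES, different algorithms produce different canonical SMILES.
As a result, distances computed on SMILES do not reflect the differences between molecules well.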
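The last point can be illustrated with a small sketch. It uses Levenshtein edit distance as one possible string distance; that choice is an assumption made here for illustration, since the notes do not name a specific metric. `CCO` and `OCC` are two valid SMILES for ethanol, and `CCC` is propane.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


ethanol_a = "CCO"  # ethanol, written from the carbon end
ethanol_b = "OCC"  # the same molecule, written from the oxygen end
propane = "CCC"    # a different molecule

same_molecule = levenshtein(ethanol_a, ethanol_b)
different_molecule = levenshtein(ethanol_a, propane)
print(same_molecule, different_molecule)
```

Under this metric the two spellings of ethanol are two edits apart, while ethanol and propane are only one edit apart, so the string distance ranks a different molecule as closer than the same molecule.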

3.2 Prediction of Molecular Properties by Deep Neural Networks

3.3 Graph Neural Networks

4 The Alchemy Dataset

4.1 Workflow for the Data Generation

The molecular properties are computed at the B3LYP/6-31G(2df,p) level: ground state equilibrium geometry, ground state electronic properties, and ground state thermochemical properties.

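The equilibrium geometry is optimized in three stages:
1. First, OpenBabel parses the SMILES, and MMFF94 force field optimization is used to build the corresponding Cartesian coordinates.
2. Next, HF/STO-3G theory is used to generate a preliminary geometry.
3. Finally, the B3LYP/6-31G(2df,p) model is applied with the density fitting approximation for electron repulsion integrals.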

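Density fitting techniques are not used when computing the ground state electronic properties and the ground state thermochemical properties. In addition to the molecular properties originally reported in QM9, we also report the dipole moment vector, the 3×3 polarizability tensor, and the atomic charges obtained by the meta-Lowdin population analysis. The advantage of the meta-Lowdin population analysis is that the resulting atomic charges have better transferability across different molecular environments.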
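The vector and tensor quantities can be reduced to the scalar forms that QM9 reports. The sketch below assumes the usual conventions: the dipole moment magnitude is the Euclidean norm of the vector, and the isotropic polarizability is one third of the trace of the tensor. The function names and the input numbers are placeholders chosen for easy arithmetic; they are not values from Alchemy.

```python
import math
from typing import Sequence


def dipole_magnitude(mu: Sequence[float]) -> float:
    """Euclidean norm of a 3-component dipole moment vector."""
    return math.sqrt(sum(c * c for c in mu))


def isotropic_polarizability(alpha: Sequence[Sequence[float]]) -> float:
    """One third of the trace of a 3x3 polarizability tensor."""
    return (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0


# Placeholder inputs (NOT dataset values), in arbitrary units.
mu = (3.0, 4.0, 0.0)
alpha = [
    [6.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 12.0],
]

mu_norm = dipole_magnitude(mu)
alpha_iso = isotropic_polarizability(alpha)
print(mu_norm, alpha_iso)
```

Keeping the full vector and tensor preserves directional information that these two scalars discard.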

4.2 Comparison to the QM9 dataset

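To build a high-quality dataset, we first selected 21,310 molecules from Alchemy and then selected the same molecules from QM9. We compared the molecular properties in the two datasets; the statistics are summarized in Table 2. Each error there is the difference between our result and the original QM9 value. Our results agree well with those of QM9, given that the two sets of calculations used different definitions of the B3LYP functional. Even so, a few inconsistencies remain.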
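A comparison of this kind can be sketched as follows. The molecule identifiers, the property values, and the choice of statistics (mean absolute error and maximum absolute error) are all assumptions for illustration; the notes do not record which statistics Table 2 contains, and none of the numbers below come from either dataset.

```python
from typing import Dict, Tuple


def compare_property(ours: Dict[str, float],
                     reference: Dict[str, float]) -> Tuple[float, float]:
    """Return (mean absolute error, max absolute error) over shared molecules.

    The error for each molecule is defined as ours - reference.
    """
    shared = sorted(set(ours) & set(reference))
    if not shared:
        raise ValueError("no molecules in common")
    abs_errors = [abs(ours[m] - reference[m]) for m in shared]
    return sum(abs_errors) / len(abs_errors), max(abs_errors)


# Placeholder values (NOT real data); keys are hypothetical molecule ids.
alchemy_values = {"mol_a": 1.5, "mol_b": 2.0, "mol_c": -0.5}
qm9_values = {"mol_a": 1.0, "mol_b": 2.25, "mol_c": -0.5, "mol_d": 9.0}

mae, max_err = compare_property(alchemy_values, qm9_values)
print(mae, max_err)
```

Only molecules present in both dictionaries are compared, which mirrors selecting the same molecules from both datasets.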

Check Table 2!!!

5 The Benchmark Results

The two benchmarking experiments evaluate the following baselines: Graph Convolutional Networks, Chebyshev Networks, Graph Attention Networks, Relational Graph Convolutional Networks, Gated Graph Neural Networks, Lanczos Networks, Graph Isomorphism Networks, and Message Passing Neural Networks.

6 Conclusions and Discussions

Our two benchmarking results clearly demonstrate the superiority of GNN models, especially those that use pairwise distances.
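As a reminder of what the pairwise distance input looks like, here is a minimal sketch that builds the full interatomic distance matrix from Cartesian coordinates. How each benchmarked model actually consumes these distances is not recorded in these notes, so the sketch stops at the matrix itself. The coordinates are toy values, not a real geometry.

```python
import math
from typing import List, Sequence


def pairwise_distances(coords: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of Euclidean distances between all atom pairs."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


# Toy coordinates for three atoms (NOT an optimized geometry).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.0, 0.0, 0.0),
    (0.0, 4.0, 0.0),
]

dist_matrix = pairwise_distances(toy_coords)
for row in dist_matrix:
    print(row)
```

Unlike a bond-only graph, this matrix has an entry for every pair of atoms, so a model that uses it also sees non-bonded atom pairs.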
