Learning Protein Structure with a Differentiable Simulator
1 Intro
The Boltzmann distribution is the natural equilibrium distribution of many systems, e.g. brains, materials, and molecules. However, it is rarely used to fit data directly, because Monte Carlo sampling is slow.
NEMO (Neural energy modeling and optimization) can learn at scale to generate 3D protein structures consisting of hundreds of points directly from sequence information.
2 Model
2.1 Representation
Proteins are mostly linear chains of amino acids. Once placed in solvent, they fold into specific 3D structures through specific interactions.
Coordinate Representation Each amino acid is represented by 5 positions: the 4 backbone heavy atoms and the center of the side-chain R group. In practice, the differentiable simulator produces an initial coarse-grained structure (one position per amino acid), and the loss function targets the center of mass and the $$C_\alpha$$ carbon.
Sequence Conditioning Two kinds of input features are considered. One is a one-hot encoding of the amino acids; the other additionally includes a profile of evolutionarily related sequences.
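A minimal sketch of the one-hot sequence featurization (not the paper's code; the 20-letter alphabet ordering below is an illustrative assumption):

```python
import numpy as np

# Standard 20 amino acids; the ordering here is an arbitrary choice for the demo.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, AA_INDEX[aa]] = 1.0
    return x

feats = one_hot("MKV")
print(feats.shape)  # (3, 20)
```

The evolutionary-profile variant would replace each one-hot row with per-position amino-acid frequencies from related sequences; the shape stays (L, 20).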
Internal coordinates x gives absolute positions; here relative (internal) coordinates z are used instead.
- $$b_i$$ is the distance to the preceding point (bond length)
- $$a_i$$ is the bond angle (defined with the two preceding points)
- $$d_i$$ is the dihedral angle (defined with the three preceding points)
- $$z_i = \{\tilde b_i, \tilde a_i, d_i\}$$
$$x=F(z)$$ maps internal to Cartesian coordinates; the inverse $$z=F^{-1}(x)$$ is simpler to compute, because each internal coordinate depends only on local information.
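The locality of $$F^{-1}$$ can be seen in a direct computation: each bond length, bond angle, and dihedral uses only a few neighboring points. A sketch (one point per residue; function and variable names are illustrative):

```python
import numpy as np

def internal_coords(x: np.ndarray):
    """x: (N, 3) Cartesian positions. Returns bond lengths b, bond angles a,
    and dihedral angles d. Each entry depends only on local neighbors, which
    is why z = F^{-1}(x) is cheap compared with the global map x = F(z)."""
    u = x[1:] - x[:-1]                                   # bond vectors
    b = np.linalg.norm(u, axis=1)                        # bond lengths b_i
    u_hat = u / b[:, None]
    # bond angle a_i between consecutive bonds
    cos_a = np.clip(np.sum(-u_hat[:-1] * u_hat[1:], axis=1), -1.0, 1.0)
    a = np.arccos(cos_a)
    # dihedral d_i from three consecutive bond vectors (atan2 formulation)
    n1 = np.cross(u[:-2], u[1:-1])
    n2 = np.cross(u[1:-1], u[2:])
    m = np.cross(n1, u[1:-1] / b[1:-1, None])
    d = np.arctan2(np.sum(m * n2, axis=1), np.sum(n1 * n2, axis=1))
    return b, a, d

x = np.array([[0., 0, 0], [1.5, 0, 0], [2.3, 1.2, 0], [3.0, 1.2, 1.0]])
b, a, d = internal_coords(x)
print(b.shape, a.shape, d.shape)  # (3,) (2,) (1,)
```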
2.2 Neural Energy Function
Deep Markov Random Field models x given s with the Boltzmann distribution, i.e., $$p_\theta(x|s) = \frac{1}{Z} \exp(-U_\theta([x;s]))$$, where $$U_\theta([x;s])$$ is a sequence-conditioned energy function.
Decomposition: $$U_\theta([x;s]) = \sum_i l_i(s;\theta) f_i(x;\theta)$$
- Markov Random Field with coefficients $$l_i(s;\theta)$$ computed by a sequence network
- Structured features $$f_i(x;\theta)$$ computed by a structural network
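The decomposition can be sketched as a dot product between sequence-derived coefficients and structure-derived features. The two linear maps below stand in for the paper's learned sequence and structure networks; they are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W_seq = rng.normal(size=(20, 8))     # stand-in for the sequence network
W_struct = rng.normal(size=(3, 8))   # stand-in for the structure network

def energy(x: np.ndarray, s: np.ndarray) -> float:
    """U([x; s]) = sum_i l_i(s) * f_i(x).
    x: (N, 3) coordinates, s: (N, 20) sequence features."""
    l = s @ W_seq                    # coefficients l_i(s; theta), shape (N, 8)
    f = np.tanh(x @ W_struct)        # features f_i(x; theta), shape (N, 8)
    return float(np.sum(l * f))      # sum over all residues and feature terms

x = rng.normal(size=(5, 3))
s = np.eye(20)[rng.integers(0, 20, size=5)]
U = energy(x, s)
print(U)
```

The key design point is that the coefficients depend only on the sequence and the features only on the structure, so the same structural feature bank can be reweighted per sequence.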
Sequence Network takes the one-dimensional sequence as input and outputs:
- energetic coefficients
- simulator initial state $$z_0$$
- simulator hyper-parameters, e.g. the preconditioning matrix $$C$$
- predicted secondary structure
Structure Network The energy function is made invariant to rigid-body motions by building it from invariant base features, such as:
- Internal coordinates $$z$$: all internal coordinates are invariant to rotation and translation; they are masked in the loss function
- Distances: the distance between every pair of points, processed by 4 (learned) radial basis functions
- Orientation vectors $$\hat v_{ij}$$: the relative position of $$x_j$$ expressed in a local coordinate system at $$x_i$$, built from unit vectors such as $$\hat u_i = \frac{x_i - x_{i-1}}{\| x_i - x_{i-1} \|}$$
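The pairwise-distance feature can be sketched directly; RBF centers and width are illustrative constants here, whereas in the model they are learned:

```python
import numpy as np

def rbf_distance_features(x: np.ndarray,
                          centers=np.array([2.0, 4.0, 8.0, 16.0]),
                          width: float = 2.0) -> np.ndarray:
    """x: (N, 3) positions -> (N, N, 4) features: each pairwise distance
    passed through 4 Gaussian radial basis functions. Distances are
    invariant to rotation and translation, so the features are too."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # (N, N)
    return np.exp(-((dist[..., None] - centers) / width) ** 2)

x = np.random.default_rng(1).normal(size=(6, 3))
feats = rbf_distance_features(x)
print(feats.shape)  # (6, 6, 4)
```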
2.3 Efficient Simulator
Langevin dynamics
$$ x^{t+\epsilon} = x^t - \frac{\epsilon}{2} \nabla_x U^t + \sqrt{\epsilon}\, p $$
where $$p \sim \mathcal{N}(0, I)$$.
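A minimal sketch of the update rule on a toy quadratic energy $$U(x) = \frac{1}{2}\|x\|^2$$ (so $$\nabla U = x$$); this is the update equation only, not the paper's simulator:

```python
import numpy as np

def langevin_step(x, grad_U, eps, rng):
    """One Langevin update: x <- x - (eps/2) * grad U(x) + sqrt(eps) * p."""
    p = rng.normal(size=x.shape)                  # p ~ N(0, I)
    return x - 0.5 * eps * grad_U(x) + np.sqrt(eps) * p

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
for _ in range(1000):
    x = langevin_step(x, lambda x: x, eps=0.01, rng=rng)
# after many steps, x is approximately distributed as N(0, I),
# the Boltzmann distribution exp(-U) for this energy
print(x.std())
```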
Internal coordinate dynamics interleave the Cartesian Langevin dynamics with preconditioned Internal Coordinate dynamics
$$ z^{t+\epsilon} = z^t - \frac{\epsilon C}{2} \nabla_z U^t + \sqrt{\epsilon C}\, p $$
where $$C$$ is a preconditioning matrix.
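The preconditioned step is the same update with $$C$$ scaling both the drift and the noise; one way to realize $$\sqrt{C}$$ is a Cholesky factor. A sketch (the SPD matrix $$C$$ below is random, purely for illustration):

```python
import numpy as np

def preconditioned_step(z, grad_U, C, eps, rng):
    """z <- z - (eps/2) * C grad U(z) + sqrt(eps) * L p, with L L^T = C,
    so the injected noise has covariance eps * C."""
    L = np.linalg.cholesky(C)
    noise = L @ rng.normal(size=z.shape)
    return z - 0.5 * eps * (C @ grad_U(z)) + np.sqrt(eps) * noise

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
C = A @ A.T + 4 * np.eye(4)        # a symmetric positive definite matrix
z = rng.normal(size=4)
z = preconditioned_step(z, lambda z: z, C, eps=0.01, rng=rng)
print(z.shape)  # (4,)
```

In the model, $$C$$ is predicted by the sequence network, letting the simulator take larger steps along soft internal-coordinate directions.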
Transform integrator
2.4 Atomic Imputation
$$ X_{i,j} = x_i + e_{i,j}(z;\theta)\, [\hat u_i \;\; \hat n_{i+1} \;\; \hat n_{i+1} \times \hat u_i]\, r_{i,j}(z;\theta) $$
where $$e_{i,j}(z;\theta)$$ and $$r_{i,j}(z;\theta)$$ are computed by a 1D neural network.
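The geometry of the imputation step can be sketched as follows: build a local orthonormal frame from backbone points and place the atom by a displacement expressed in that frame. The scalars `e` and vector `r` stand in for the network outputs; the exact frame construction here is an illustrative assumption:

```python
import numpy as np

def place_atom(x_prev, x_i, x_next, e, r):
    """Place an imputed atom near x_i using a local frame
    [u_hat, n_hat, n_hat x u_hat] built from neighboring points."""
    u_hat = (x_i - x_prev) / np.linalg.norm(x_i - x_prev)
    n = np.cross(u_hat, x_next - x_i)             # normal to the local plane
    n_hat = n / np.linalg.norm(n)
    frame = np.stack([u_hat, n_hat, np.cross(n_hat, u_hat)], axis=1)  # 3x3
    return x_i + e * (frame @ r)                  # gain * frame * local offset

X = place_atom(np.array([0., 0, 0]), np.array([1.5, 0, 0]),
               np.array([2.3, 1.2, 0]), e=1.0, r=np.array([0.5, 0.5, 0.5]))
print(X)  # [2.  0.5 0.5]
```

Because the frame rotates with the backbone, the imputed atoms move rigidly with the coarse-grained structure.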