Learning Protein Structure with a Differentiable Simulator
1 Intro
The Boltzmann distribution is the natural equilibrium distribution of many systems, e.g. brains, materials, and molecules. However, it is rarely used to fit data directly, because Monte Carlo sampling is slow.
NEMO (Neural energy modeling and optimization) can learn at scale to generate 3D protein structures consisting of hundreds of points directly from sequence information.
2 Model
2.1 Representation
Proteins are mostly linear chains of amino acids. Once placed in solvent, they fold into specific 3D structures through specific interactions.
Coordinate Representation Each amino acid is represented by 5 positions: the 4 backbone heavy atoms and the center of the side-chain R group. In practice, the differentiable simulator produces an initial coarse-grained structure (one position per amino acid), and the loss function targets the center of mass and the $$C_\alpha$$ carbon.
Sequence Conditioning Two kinds of input features are considered. One is a one-hot encoding of the amino acids; the other additionally includes a profile of evolutionarily related sequences.
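A minimal sketch of the one-hot sequence featurization (not the paper's code; the 20-letter alphabet ordering below is an illustrative assumption):

```python
import numpy as np

# Standard 20 amino acids; the ordering here is an arbitrary choice for the demo.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, AA_INDEX[aa]] = 1.0
    return x

feats = one_hot("MKV")
print(feats.shape)  # (3, 20)
```

The evolutionary-profile variant would replace each one-hot row with per-position amino-acid frequencies from related sequences; the shape stays (L, 20).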
Internal coordinates x gives absolute positions; here relative (internal) coordinates z are used instead.
- $$b_i$$ is the distance to the preceding point (bond length)
- $$a_i$$ is the bond angle (defined with the two preceding points)
- $$d_i$$ is the dihedral angle (defined with the three preceding points)
- $$z_i = \{\tilde b_i, \tilde a_i, d_i\}$$
$$x=F(z)$$ maps internal to Cartesian coordinates; the inverse $$z=F^{-1}(x)$$ is simpler to compute, because each internal coordinate depends only on local information.
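The locality of $$F^{-1}$$ can be seen in a direct computation: each bond length, bond angle, and dihedral uses only a few neighboring points. A sketch (one point per residue; function and variable names are illustrative):

```python
import numpy as np

def internal_coords(x: np.ndarray):
    """x: (N, 3) Cartesian positions. Returns bond lengths b, bond angles a,
    and dihedral angles d. Each entry depends only on local neighbors, which
    is why z = F^{-1}(x) is cheap compared with the global map x = F(z)."""
    u = x[1:] - x[:-1]                                   # bond vectors
    b = np.linalg.norm(u, axis=1)                        # bond lengths b_i
    u_hat = u / b[:, None]
    # bond angle a_i between consecutive bonds
    cos_a = np.clip(np.sum(-u_hat[:-1] * u_hat[1:], axis=1), -1.0, 1.0)
    a = np.arccos(cos_a)
    # dihedral d_i from three consecutive bond vectors (atan2 formulation)
    n1 = np.cross(u[:-2], u[1:-1])
    n2 = np.cross(u[1:-1], u[2:])
    m = np.cross(n1, u[1:-1] / b[1:-1, None])
    d = np.arctan2(np.sum(m * n2, axis=1), np.sum(n1 * n2, axis=1))
    return b, a, d

x = np.array([[0., 0, 0], [1.5, 0, 0], [2.3, 1.2, 0], [3.0, 1.2, 1.0]])
b, a, d = internal_coords(x)
print(b.shape, a.shape, d.shape)  # (3,) (2,) (1,)
```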
2.2 Neural Energy Function
Deep Markov Random Field models x given s with the Boltzmann distribution, i.e., $$p_\theta(x|s) = \frac{1}{Z} \exp(-U_\theta([x;s]))$$, where $$U_\theta([x;s])$$ is a sequence-conditioned energy function.
Decomposition: $$U_\theta([x;s]) = \sum_i l_i(s;\theta) f_i(x;\theta)$$
- Markov Random Field with coefficients $$l_i(s;\theta)$$ computed by a sequence network
- Structured features $$f_i(x;\theta)$$ computed by a structural network
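The decomposition can be sketched as a dot product between sequence-derived coefficients and structure-derived features. The two linear maps below stand in for the paper's learned sequence and structure networks; they are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W_seq = rng.normal(size=(20, 8))     # stand-in for the sequence network
W_struct = rng.normal(size=(3, 8))   # stand-in for the structure network

def energy(x: np.ndarray, s: np.ndarray) -> float:
    """U([x; s]) = sum_i l_i(s) * f_i(x).
    x: (N, 3) coordinates, s: (N, 20) sequence features."""
    l = s @ W_seq                    # coefficients l_i(s; theta), shape (N, 8)
    f = np.tanh(x @ W_struct)        # features f_i(x; theta), shape (N, 8)
    return float(np.sum(l * f))      # sum over all residues and feature terms

x = rng.normal(size=(5, 3))
s = np.eye(20)[rng.integers(0, 20, size=5)]
U = energy(x, s)
print(U)
```

The key design point is that the coefficients depend only on the sequence and the features only on the structure, so the same structural feature bank can be reweighted per sequence.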
Sequence Network takes the one-dimensional sequence as input and outputs:
- energetic coefficients
- simulator initial state $$z_0$$
- simulator hyper-parameters, e.g. the preconditioning matrix $$C$$
- predicted secondary structure
Structure Network The energy function is made invariant to rigid-body motions by building it from invariant base features, such as:
- Internal coordinates $$z$$: all internal coordinates are invariant to rotation and translation; they are masked in the loss function
- Distances: the distance between every pair of points, processed by 4 (learned) radial basis functions
- Orientation vectors $$\hat v_{ij}$$: the relative position of $$x_j$$ expressed in a local coordinate system at $$x_i$$, built from unit vectors such as $$\hat u_i = \frac{x_i - x_{i-1}}{\| x_i - x_{i-1} \|}$$
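The pairwise-distance feature can be sketched directly; RBF centers and width are illustrative constants here, whereas in the model they are learned:

```python
import numpy as np

def rbf_distance_features(x: np.ndarray,
                          centers=np.array([2.0, 4.0, 8.0, 16.0]),
                          width: float = 2.0) -> np.ndarray:
    """x: (N, 3) positions -> (N, N, 4) features: each pairwise distance
    passed through 4 Gaussian radial basis functions. Distances are
    invariant to rotation and translation, so the features are too."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # (N, N)
    return np.exp(-((dist[..., None] - centers) / width) ** 2)

x = np.random.default_rng(1).normal(size=(6, 3))
feats = rbf_distance_features(x)
print(feats.shape)  # (6, 6, 4)
```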
2.3 Efficient Simulator
Langevin dynamics
$$ x^{t+\epsilon} = x^t - \frac{\epsilon}{2} \nabla_x U^t + \sqrt{\epsilon}\, p $$
where $$p \sim \mathcal{N}(0, I)$$.
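A minimal sketch of the update rule on a toy quadratic energy $$U(x) = \frac{1}{2}\|x\|^2$$ (so $$\nabla U = x$$); this is the update equation only, not the paper's simulator:

```python
import numpy as np

def langevin_step(x, grad_U, eps, rng):
    """One Langevin update: x <- x - (eps/2) * grad U(x) + sqrt(eps) * p."""
    p = rng.normal(size=x.shape)                  # p ~ N(0, I)
    return x - 0.5 * eps * grad_U(x) + np.sqrt(eps) * p

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
for _ in range(1000):
    x = langevin_step(x, lambda x: x, eps=0.01, rng=rng)
# after many steps, x is approximately distributed as N(0, I),
# the Boltzmann distribution exp(-U) for this energy
print(x.std())
```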
Internal coordinate dynamics interleave the Cartesian Langevin dynamics with preconditioned Internal Coordinate dynamics
$$ z^{t+\epsilon} = z^t - \frac{\epsilon C}{2} \nabla_z U^t + \sqrt{\epsilon C}\, p $$
where $$C$$ is a preconditioning matrix.
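The preconditioned step is the same update with $$C$$ scaling both the drift and the noise; one way to realize $$\sqrt{C}$$ is a Cholesky factor. A sketch (the SPD matrix $$C$$ below is random, purely for illustration):

```python
import numpy as np

def preconditioned_step(z, grad_U, C, eps, rng):
    """z <- z - (eps/2) * C grad U(z) + sqrt(eps) * L p, with L L^T = C,
    so the injected noise has covariance eps * C."""
    L = np.linalg.cholesky(C)
    noise = L @ rng.normal(size=z.shape)
    return z - 0.5 * eps * (C @ grad_U(z)) + np.sqrt(eps) * noise

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
C = A @ A.T + 4 * np.eye(4)        # a symmetric positive definite matrix
z = rng.normal(size=4)
z = preconditioned_step(z, lambda z: z, C, eps=0.01, rng=rng)
print(z.shape)  # (4,)
```

In the model, $$C$$ is predicted by the sequence network, letting the simulator take larger steps along soft internal-coordinate directions.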
Transform integrator
2.4 Atomic Imputation
$$ X_{i,j} = x_i + e_{i,j}(z;\theta)\, [\hat u_i \;\; \hat n_{i+1} \;\; \hat n_{i+1} \times \hat u_i]\, r_{i,j}(z;\theta) $$
where $$e_{i,j}(z;\theta)$$ and $$r_{i,j}(z;\theta)$$ are computed by a 1D neural network.
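The geometry of the imputation step can be sketched as follows: build a local orthonormal frame from backbone points and place the atom by a displacement expressed in that frame. The scalars `e` and vector `r` stand in for the network outputs; the exact frame construction here is an illustrative assumption:

```python
import numpy as np

def place_atom(x_prev, x_i, x_next, e, r):
    """Place an imputed atom near x_i using a local frame
    [u_hat, n_hat, n_hat x u_hat] built from neighboring points."""
    u_hat = (x_i - x_prev) / np.linalg.norm(x_i - x_prev)
    n = np.cross(u_hat, x_next - x_i)             # normal to the local plane
    n_hat = n / np.linalg.norm(n)
    frame = np.stack([u_hat, n_hat, np.cross(n_hat, u_hat)], axis=1)  # 3x3
    return x_i + e * (frame @ r)                  # gain * frame * local offset

X = place_atom(np.array([0., 0, 0]), np.array([1.5, 0, 0]),
               np.array([2.3, 1.2, 0]), e=1.0, r=np.array([0.5, 0.5, 0.5]))
print(X)  # [2.  0.5 0.5]
```

Because the frame rotates with the backbone, the imputed atoms move rigidly with the coarse-grained structure.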