adBuffer, until a full batch_size of trials has completed; the accumulated gradients are then used to update the model parameters.
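The trial loop runs for at most total_episodes iterations. Once the average reward per batch exceeds 100, the agent is performing well and env.render() is called to display the environment. tf.reshape reshapes observation into the input format of the policy network. The result is fed to the network, and sess.run evaluates probability to obtain the network output tfprob, the probability that Action takes the value 1. A random number is then drawn from (0, 1): if it is less than tfprob, Action is set to 1, otherwise to 0, so Action equals 1 with probability tfprob.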
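The sampling rule described above can be sketched with NumPy alone. This is a minimal illustration, not the original loop: sample_action, rng, and the fixed value 0.7 standing in for the network output are assumptions made for the example.

```python
import numpy as np

def sample_action(tfprob, rng):
    """Return 1 with probability tfprob, otherwise 0."""
    # rng.uniform() draws from [0, 1), so the comparison succeeds
    # with probability tfprob.
    return 1 if rng.uniform() < tfprob else 0

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
tfprob = 0.7                     # hypothetical stand-in for the network output
actions = [sample_action(tfprob, rng) for _ in range(10000)]
empirical = sum(actions) / len(actions)  # fraction of 1s, close to tfprob
```

Over many draws the fraction of actions equal to 1 approaches tfprob, which is what lets the network output act as a stochastic policy.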
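The input observation is appended to the list xs. A fake label y is created with the opposite value of Action, y = 1 - Action, and appended to the list ys. env.step executes the Action once and returns observation, reward, done, and info. reward is added to reward_sum and also appended to the list drs.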
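Why the label is inverted becomes clearer when it is plugged into the expression that the listing passes to tf.log. The NumPy check below is a sketch under the assumption that p is the network output (the probability that Action is 1); chosen_action_prob is a hypothetical helper name.

```python
import numpy as np

def chosen_action_prob(y, p):
    # Same expression as the argument of tf.log in the listing,
    # written with plain arithmetic instead of TensorFlow ops.
    return y * (y - p) + (1 - y) * (y + p)

p = 0.7  # hypothetical network output: probability that Action is 1
results = {}
for action in (0, 1):
    y = 1 - action                      # fake label, opposite of Action
    results[action] = chosen_action_prob(y, p)
# results[1] is p and results[0] is 1 - p
```

With y = 1 - Action, the expression always evaluates to the probability the policy assigned to the action that was actually taken, so its logarithm is the log-likelihood of that action.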
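When done is True, the trial has ended and episode_number is incremented by 1. np.vstack stacks the elements of xs, ys, and drs vertically to give epx, epy, and epr, and xs, ys, and drs are then cleared for the next trial. epx, epy, and epr hold all the observations, labels, and rewards collected during one trial. The discount_rewards function computes the potential value of each Action step, and the result is standardized (subtract the mean, then divide by the standard deviation) to give a distribution with zero mean and a standard deviation of 1. The standardized result, discounted_epr, takes part in the model's loss computation.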
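The body of discount_rewards is not shown in this part of the text, so the following is a reconstruction under the assumption that it computes the usual discounted sum of future rewards with factor gamma. The five-step reward vector is made up for illustration.

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Return the discounted sum of future rewards for every step."""
    r = np.asarray(r, dtype=np.float64)
    discounted = np.zeros_like(r)
    running_add = 0.0
    # Walk backwards so each step sees the already-discounted future.
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

epr = np.ones(5)                       # hypothetical: reward 1 at every step
discounted_epr = discount_rewards(epr)
# Standardize: subtract the mean, then divide by the standard deviation.
standardized = (discounted_epr - discounted_epr.mean()) / discounted_epr.std()
```

Early steps receive larger values than late ones because more future reward follows them. After standardization roughly half of the steps get a negative weight, which pushes the policy away from the actions taken shortly before the trial ended.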
epx, epy, and discounted_epr are fed into the network, and newGrads computes the gradients, which are accumulated into gradBuffer.
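When the number of trials reaches an integer multiple of batch_size, gradBuffer has accumulated enough gradients. updateGrads applies the gradients in gradBuffer to the policy network's parameters, and gradBuffer is cleared in preparation for the next batch. Parameters are updated once per batch. Each gradient is computed from all the samples of one trial (one Action is one sample), so the number of samples in a batch is the sum of the samples over 25 (batch_size) trials. The current trial count episode_number and the average reward per trial within the batch are printed. If that average exceeds 200, the policy network has completed the task and the loop terminates. If the target has not been reached, reward_sum is cleared and the total reward of the next batch is accumulated afresh. At the end of every trial, the environment env is reset.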
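The accumulate-then-apply pattern can be sketched without TensorFlow. This is an illustration only: fake_episode_grads is a hypothetical stand-in for evaluating newGrads, and the update is a plain gradient-descent step rather than the Adam optimizer used in the listing.

```python
import numpy as np

batch_size = 25
learning_rate = 0.1
# Hypothetical parameters with the shapes of W1 [4, 50] and W2 [50, 1].
params = [np.zeros((4, 50)), np.zeros((50, 1))]
gradBuffer = [np.zeros_like(p) for p in params]
updates = 0

def fake_episode_grads():
    # Stand-in for sess.run(newGrads, ...): one gradient per parameter.
    return [np.ones_like(p) for p in params]

for episode_number in range(1, 51):
    # Accumulate the gradient of this trial into the buffer.
    for buf, grad in zip(gradBuffer, fake_episode_grads()):
        buf += grad
    # Every batch_size trials, apply the buffered gradients and clear them.
    if episode_number % batch_size == 0:
        for p, buf in zip(params, gradBuffer):
            p -= learning_rate * buf   # plain SGD step, not Adam
            buf[...] = 0               # clear for the next batch
        updates += 1
```

Fifty trials with batch_size 25 produce exactly two parameter updates. Summing over a batch before updating averages out some of the noise in the per-trial reward signal.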
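According to the training log, the policy network reached the goal after 200 trials, that is, 8 batches of training and parameter updates, with an average reward of 230 within the batch. You can try changing the policy network structure, the number of hidden nodes, batch_size, or the learning rate to tune training and speed up learning.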
import numpy as np
import tensorflow as tf
import gym
env = gym.make('CartPole-v0')
env.reset()
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    env.render()
    observation, reward, done, _ = env.step(np.random.randint(0,2))
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:",reward_sum)
        reward_sum = 0
        env.reset()
# hyperparameters
H = 50 # number of hidden layer neurons
batch_size = 25 # every how many episodes to do a param update?
learning_rate = 1e-1 # feel free to play with this to train faster or more stably.
gamma = 0.99 # discount factor for reward
D = 4 # input dimensionality
tf.reset_default_graph()
# This defines the network as it goes from taking an observation of the environment to
# giving a probability of choosing the action of moving left or right.
observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
W1 = tf.get_variable("W1", shape=[D, H],
                     initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,W1))
W2 = tf.get_variable("W2", shape=[H, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1,W2)
probability = tf.nn.sigmoid(score)
# From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")
# The loss function. This sends the weights in the direction of making actions
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages)
newGrads = tf.gradients(loss,tvars)
# Once we have collected a series of gradients from multiple episodes, we apply them.
# We don't apply gradients after every episode, in order to account for noise in the reward signal.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.p