
Study Notes TF037: Implementing a Reinforcement Learning Policy Network (Part 3)

2017-10-09 14:10:17

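...gradBuffer until batch_size episodes have completed, at which point the accumulated gradients are applied to update the model parameters.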

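The training loop runs for at most total_episodes episodes. Once the batch-average reward exceeds 100, the Agent is performing well, so env.render() is called to display the environment. tf.reshape reshapes observation into the input format of the policy network; the result is fed to the network, and sess.run evaluates probability to obtain tfprob, the network's output probability that Action is 1. A random value is then sampled uniformly from (0, 1): if it is smaller than tfprob, Action is set to 1, otherwise to 0, so Action takes the value 1 with probability tfprob.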
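The sampling rule in the paragraph above can be sketched in plain Python. This is only an illustration under stated assumptions: sample_action is a hypothetical helper, and the fixed value 0.7 stands in for the tfprob that sess.run would return.

```python
import random

def sample_action(prob_action_1, rng=random):
    """Return 1 with probability prob_action_1, otherwise 0.

    Hypothetical helper: prob_action_1 plays the role of tfprob.
    """
    # rng.random() is uniform on [0, 1), so the comparison is true
    # with probability prob_action_1.
    return 1 if rng.random() < prob_action_1 else 0

# Empirical check with a seeded generator: the share of 1s
# should be close to the requested probability.
rng = random.Random(0)
samples = [sample_action(0.7, rng) for _ in range(10000)]
frequency = sum(samples) / len(samples)
print("fraction of Action == 1:", frequency)
```

Because the action is drawn at random rather than taken greedily, the Agent keeps exploring both actions while the network is still uncertain.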

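The input observation is appended to the list xs. A fake label y is created with the value opposite to Action, y = 1 - Action, and appended to the list ys. env.step executes the Action once and returns observation, reward, done, and info; reward is added to reward_sum and also appended to the list drs.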
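The per-step bookkeeping described above might look like the following sketch. StubEnv is a hypothetical stand-in for the gym environment (it only mimics the four-value return of env.step), and run_episode and policy_prob are illustrative names that do not appear in the notes.

```python
import random

class StubEnv:
    """Hypothetical stand-in for a gym environment with fixed-length episodes."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        # Mimics the (observation, reward, done, info) tuple of env.step.
        self.t += 1
        observation = [float(self.t)] * 4
        done = self.t >= self.episode_length
        return observation, 1.0, done, {}

def run_episode(env, policy_prob, rng):
    """Collect observations (xs), fake labels (ys) and rewards (drs) for one episode."""
    xs, ys, drs = [], [], []
    reward_sum = 0.0
    observation = env.reset()
    done = False
    while not done:
        action = 1 if rng.random() < policy_prob(observation) else 0
        xs.append(observation)   # input seen before acting
        ys.append(1 - action)    # fake label, opposite of the action
        observation, reward, done, info = env.step(action)
        reward_sum += reward
        drs.append(reward)
    return xs, ys, drs, reward_sum

xs, ys, drs, reward_sum = run_episode(StubEnv(3), lambda obs: 0.5, random.Random(0))
print(len(xs), len(ys), len(drs), reward_sum)
```

The three lists always have the same length, one entry per Action, which is what allows them to be stacked row by row at the end of the episode.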

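When done is True, the episode has ended and episode_number is incremented by 1. np.vstack stacks the elements of the lists xs, ys, and drs vertically to produce epx, epy, and epr, and the three lists are then cleared for the next episode. epx, epy, and epr hold all observations, labels, and rewards collected during one episode. The discount_rewards function computes the potential value of the Action at each step; the result is then standardized (subtract the mean, then divide by the standard deviation) to give a distribution with zero mean and unit standard deviation. These standardized discounted rewards, discounted_epr, enter the computation of the model loss.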
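As a rough illustration of the discounting and standardization step, here is a pure-Python sketch. The notes' own discount_rewards is not shown on this page, so the body below is an assumption about its behavior (a backward running sum with factor gamma); standardize is a hypothetical helper name.

```python
import statistics

def discount_rewards(rewards, gamma=0.99):
    """Discounted return at each step: r[t] + gamma * r[t+1] + gamma**2 * r[t+2] + ..."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation.

    Assumes the values are not all identical, otherwise the deviation is zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

discounted = discount_rewards([1.0, 1.0, 1.0], gamma=0.5)
print(discounted)  # [1.75, 1.5, 1.0]
normalized = standardize(discounted)
print(normalized)
```

Early actions receive larger discounted values because more future reward follows them. After standardization, roughly half the actions in an episode get a positive weight and half a negative one.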

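epx, epy, and discounted_epr are fed into the neural network, and newGrads is evaluated to compute the gradients. The resulting gradients are added to gradBuffer.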
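To see what the loss in the listing computes from these three inputs, the expression inside tf.log can be evaluated by hand. The sketch below re-implements it in plain Python for illustration; chosen_action_likelihood and policy_loss are hypothetical names, not part of the notes' code.

```python
import math

def chosen_action_likelihood(y, p):
    """Same expression as inside tf.log in the listing.

    y is the fake label (1 - Action) and p the network output P(Action == 1).
    """
    return y * (y - p) + (1 - y) * (y + p)

def policy_loss(ys, probs, advantages):
    """Negative mean of log-likelihood times advantage, mirroring the listing's loss."""
    logliks = [math.log(chosen_action_likelihood(y, p)) for y, p in zip(ys, probs)]
    return -sum(l * a for l, a in zip(logliks, advantages)) / len(logliks)

p = 0.8
for action in (0, 1):
    y = 1 - action
    print("action", action, "likelihood", chosen_action_likelihood(y, p))

print(policy_loss([0], [0.5], [1.0]))
```

With y = 1 - Action, the expression reduces to p when Action is 1 and to 1 - p when Action is 0, that is, the probability the network assigned to the action actually taken. Minimizing the loss therefore raises the probability of actions with positive advantage and lowers it for actions with negative advantage.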

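When the episode count reaches a multiple of batch_size, gradBuffer has accumulated enough gradients: updateGrads applies the gradients in gradBuffer to the policy network's parameters, and gradBuffer is cleared in preparation for the next batch. Each parameter update therefore uses one batch of gradients, where each gradient is computed from all samples of a single episode (one Action is one sample), so the number of samples in a batch is the sum of the sample counts of 25 (batch_size) episodes. The current episode count episode_number and the average reward per episode within the batch are then printed. If that average exceeds 200, the policy network has completed the task and the loop terminates. Otherwise, reward_sum is reset to zero and accumulation starts over for the next batch. At the end of every episode, the environment env is reset.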
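The accumulate-then-apply pattern described above can be sketched without TensorFlow. In this illustration each episode's gradient is just a list of floats, and accumulate_and_apply is a hypothetical function that records what would be handed to updateGrads rather than updating a real model.

```python
def accumulate_and_apply(per_episode_grads, batch_size):
    """Sum gradients episode by episode and emit one update every batch_size episodes.

    Returns the list of summed gradients that would be applied; gradients of
    a trailing partial batch stay in the buffer and are not applied.
    """
    applied = []
    grad_buffer = None
    for episode_number, grads in enumerate(per_episode_grads, start=1):
        if grad_buffer is None:
            grad_buffer = [0.0] * len(grads)
        for i, g in enumerate(grads):
            grad_buffer[i] += g
        if episode_number % batch_size == 0:
            applied.append(list(grad_buffer))      # stands in for running updateGrads
            grad_buffer = [0.0] * len(grads)       # clear the buffer for the next batch
    return applied

# Five episodes with identical gradients and batch_size = 2:
# two updates are applied, the fifth episode is left waiting in the buffer.
updates = accumulate_and_apply([[1.0, 2.0]] * 5, batch_size=2)
print(updates)  # [[2.0, 4.0], [2.0, 4.0]]
```

Summing over several episodes before each update averages out some of the noise in per-episode rewards, at the cost of fewer parameter updates per episode.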

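According to the training log, the policy network reaches the goal after 200 episodes, that is, 8 batches of training and parameter updates, with an average reward of 230 within the batch. You can try changing the policy network structure, the number of hidden nodes, batch_size, and the learning rate to tune training and speed up learning.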

    import numpy as np
    import tensorflow as tf
    import gym
    env = gym.make('CartPole-v0')
    env.reset()
    random_episodes = 0
    reward_sum = 0
    while random_episodes < 10:
        env.render()
        observation, reward, done, _ = env.step(np.random.randint(0,2))
        reward_sum += reward
        if done:
            random_episodes += 1
            print("Reward for this episode was:",reward_sum)
            reward_sum = 0
            env.reset()
        
    # hyperparameters
    H = 50 # number of hidden layer neurons
    batch_size = 25 # every how many episodes to do a param update?
    learning_rate = 1e-1 # feel free to play with this to train faster or more stably.
    gamma = 0.99 # discount factor for reward
    D = 4 # input dimensionality        
    tf.reset_default_graph()
    # This defines the network as it goes from taking an observation of the environment to
    # giving a probability of choosing the action of moving left or right.
    observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
    W1 = tf.get_variable("W1", shape=[D, H],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.nn.relu(tf.matmul(observations,W1))
    W2 = tf.get_variable("W2", shape=[H, 1],
                         initializer=tf.contrib.layers.xavier_initializer())
    score = tf.matmul(layer1,W2)
    probability = tf.nn.sigmoid(score)
    #From here we define the parts of the network needed for learning a good policy.
    tvars = tf.trainable_variables()
    input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
    advantages = tf.placeholder(tf.float32,name="reward_signal")
    # The loss function. This sends the weights in the direction of making actions 
    # that gave good advantage (reward over time) more likely, and actions that didn't less likely.
    loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
    loss = -tf.reduce_mean(loglik * advantages) 
    newGrads = tf.gradients(loss,tvars)
    # Once we have collected a series of gradients from multiple episodes, we apply them.
    # We don't apply gradients after every episode, in order to account for noise in the reward signal.
    adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
    W1Grad = tf.p
