Study Notes TF037: Implementing a Reinforcement Learning Policy Network (Part 3)
The gradients are accumulated in gradBuffer until a full batch_size of episodes has completed; the summed gradients are then used to update the model parameters.

The episode loop runs for at most total_episodes iterations. Once the average reward per batch reaches 100 or more, the Agent is performing well and env.render() is called to display the environment. tf.reshape reshapes the observation into the policy network's input format; the observation is fed into the network and sess.run evaluates probability to obtain the network output tfprob, the probability of Action taking the value 1. A random number is sampled uniformly from (0, 1): if it is smaller than tfprob, Action is set to 1, otherwise to 0, so Action equals 1 with probability tfprob.
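This sampling step can be sketched as follows. It is not the article's verbatim loop code, just a minimal illustration assuming sess is the running session, probability and observations are the tensors defined in the listing below, D is the input dimensionality from the hyperparameters, and observation is the current CartPole state:

    # Sample an action: reshape the observation, run the network, then take
    # action 1 with probability tfprob (names follow the prose above).
    x = np.reshape(observation, [1, D])
    tfprob = sess.run(probability, feed_dict={observations: x})
    action = 1 if np.random.uniform() < tfprob else 0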

The environment observation is appended to the list xs, and a virtual label y is created with the opposite value of the Action, y = 1 - Action; y is appended to the list ys. env.step executes the Action once and returns observation, reward, done, and info; reward is added to reward_sum and appended to the list drs.
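A minimal sketch of this bookkeeping, using the list names from the prose (xs, ys, drs) and the action sampled above; the article's full training loop presumably appears on a later page:

    # Record the input and the virtual label, execute the action, record the reward.
    xs.append(x)                       # observation fed to the network
    y = 1 - action                     # virtual label, opposite of the sampled action
    ys.append(y)
    observation, reward, done, info = env.step(action)
    reward_sum += reward
    drs.append(reward)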

When done is True, one episode has finished and episode_number is incremented by 1. np.vstack stacks the elements of the lists xs, ys, and drs vertically to produce epx, epy, and epr, after which xs, ys, and drs are cleared for the next episode. epx, epy, and epr therefore hold all of the observations, labels, and rewards collected in one episode. The discount_rewards function computes the potential (discounted) value of each step's Action; the result is standardized (subtract the mean, divide by the standard deviation) to a zero-mean, unit-standard-deviation distribution. This discounted reward then enters the model's loss computation.
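The discount_rewards function itself is not shown on this page; the sketch below is one way to implement what the text describes, using the gamma hyperparameter from the listing below, followed by the standardization step:

    def discount_rewards(r):
        """Walk the reward array backwards, accumulating a gamma-discounted sum."""
        discounted_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(r.size)):
            running_add = running_add * gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r

    discounted_epr = discount_rewards(epr)
    discounted_epr -= np.mean(discounted_epr)   # zero mean
    discounted_epr /= np.std(discounted_epr)    # unit standard deviation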

epx, epy, and discounted_epr are fed into the neural network, newGrads computes the gradients, and the gradients are accumulated into gradBuffer.
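A sketch of this accumulation step, assuming gradBuffer was initialized earlier as zero arrays with the shapes of the trainable variables (for example, by zeroing the result of sess.run(tvars)):

    # Compute this episode's gradients and add them into the buffer.
    tGrad = sess.run(newGrads, feed_dict={observations: epx,
                                          input_y: epy,
                                          advantages: discounted_epr})
    for ix, grad in enumerate(tGrad):
        gradBuffer[ix] += grad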

When the episode count reaches a multiple of batch_size, gradBuffer has accumulated enough gradients: updateGrads applies the gradients in gradBuffer to the policy network's parameters, and gradBuffer is cleared in preparation for computing the next batch's gradients. Parameters are updated once per batch of gradients; each gradient is computed from all the samples of one episode (one Action per sample), so one batch contains the samples of 25 (batch_size) episodes. The current episode count episode_number and the average reward per episode within the batch are printed. If the average reward per episode within the batch exceeds 200, the policy network has accomplished the task and the loop terminates. If the goal has not been reached, reward_sum is cleared and the total reward of the next batch is accumulated from scratch. The environment env is reset at the end of every episode.
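A sketch of this per-batch update, assuming the W1Grad/W2Grad placeholders and the updateGrads op that close the listing below; the threshold and printout follow the prose rather than any verbatim source:

    if episode_number % batch_size == 0:
        # Push the accumulated gradients through the placeholders and apply them.
        sess.run(updateGrads, feed_dict={W1Grad: gradBuffer[0],
                                         W2Grad: gradBuffer[1]})
        for ix, grad in enumerate(gradBuffer):
            gradBuffer[ix] = grad * 0            # clear the buffer for the next batch
        print('Average reward for episode %d : %f.' %
              (episode_number, reward_sum / batch_size))
        if reward_sum / batch_size > 200:
            print("Task solved in", episode_number, "episodes!")
            # terminate the training loop here (e.g. break)
        reward_sum = 0
    observation = env.reset()                    # reset the environment after every episode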

The training log shows the policy network reaching the goal after about 200 episodes, i.e. 8 batches of training and parameter updates, with an average reward within a batch of about 230. You can try modifying the network architecture, the number of hidden units, batch_size, and the learning rate to optimize training and speed up learning.

    import numpy as np
    import tensorflow as tf
    import gym
    env = gym.make('CartPole-v0')
    env.reset()
    random_episodes = 0
    reward_sum = 0
    while random_episodes < 10:
        env.render()
        observation, reward, done, _ = env.step(np.random.randint(0,2))
        reward_sum += reward
        if done:
            random_episodes += 1
            print("Reward for this episode was:",reward_sum)
            reward_sum = 0
            env.reset()
        
    # hyperparameters
    H = 50 # number of hidden layer neurons
    batch_size = 25 # every how many episodes to do a param update?
    learning_rate = 1e-1 # feel free to play with this to train faster or more stably.
    gamma = 0.99 # discount factor for reward
    D = 4 # input dimensionality        
    tf.reset_default_graph()
    #This defines the network as it goes from taking an observation of the environment to
    #giving a probability of choosing the action of moving left or right.
    observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
    W1 = tf.get_variable("W1", shape=[D, H],
           initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.nn.relu(tf.matmul(observations,W1))
    W2 = tf.get_variable("W2", shape=[H, 1],
               initializer=tf.contrib.layers.xavier_initializer())
    score = tf.matmul(layer1,W2)
    probability = tf.nn.sigmoid(score)
    #From here we define the parts of the network needed for learning a good policy.
    tvars = tf.trainable_variables()
    input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
    advantages = tf.placeholder(tf.float32,name="reward_signal")
    # The loss function. This sends the weights in the direction of making actions 
    # that gave good advantage (reward over time) more likely, and actions that didn't less likely.
    loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
    loss = -tf.reduce_mean(loglik * advantages) 
    newGrads = tf.gradients(loss,tvars)
    # Once we have collected a series of gradients from multiple episodes, we apply them.
    # We don't just apply gradients after every episode in order to account for noise in the reward signal.
    adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
    W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
    W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
    batchGrad = [W1Grad,W2Grad]
    updateGrads = adam.apply_gradients(zip(batchGrad,tvars))