What will be the policy if the state space is continuous in Reinforcement learning



I have recently started with reinforcement learning, and I have a few doubts regarding the policy of an agent when it comes to a continuous space. From my understanding, the policy tells the agent which action to perform given a particular state. This makes sense for the maze example, where the state space is discrete and limited. What if the state space is continuous? Will the agent have information about every possible state in the state space?
Also, will an RL agent be able to make a decision if it is in a new state that it has not encountered during training?










      reinforcement-learning






      asked Apr 18 at 4:52









Chinni

          1 Answer
You can still define state value functions $v(s)$, action value functions $q(s,a)$ and policy functions $\pi(s)$ or $\pi(a|s)$ when the state $s$ is from a very large or continuous space. Reinforcement Learning (RL) is still a well-defined problem in that space.



          What becomes harder is iterating through the state space. That rules out two simple approaches:



• Tabular methods, which store a value or best action separately for every state.


          • Any method that needs to iterate through all states, e.g. the dynamic programming methods Policy Iteration or Value Iteration.


These are important methods for RL. With tabular methods, and assuming you can iterate through all possibilities, you can prove that you will find the optimal policy.



          However, RL methods can still work with large state spaces. The main method to do so is to use some form of function approximation, which then generalises the space so that knowledge learned about a single state is used to assess similar states.



Function approximation can be as simple as discretising the space to make the number of states manageable. Or you can use a parametrisable machine learning approach, such as neural networks. The combination of neural networks with reinforcement learning methods is behind the "deep" reinforcement learning approaches that have been the subject of much recent research.
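As a concrete illustration of the discretisation option, here is a minimal sketch (the value ranges and bin counts are illustrative assumptions, not from the answer) that maps a continuous state vector to a tuple of bin indices, which can then serve as a key in an ordinary tabular Q-table:

```python
def discretise(state, lows, highs, bins):
    """Map a continuous state vector to a tuple of bin indices."""
    key = []
    for x, lo, hi, n in zip(state, lows, highs, bins):
        frac = (min(max(x, lo), hi) - lo) / (hi - lo)  # clamp, then scale to [0, 1]
        key.append(min(int(frac * n), n - 1))          # bin index in [0, n-1]
    return tuple(key)

# A Q-table keyed by (discretised state, action), usable by any tabular method
q_table = {}
s = discretise([0.37, -1.2], lows=[0.0, -2.0], highs=[1.0, 2.0], bins=[10, 8])
q_table[(s, 0)] = 0.5  # e.g. store an action value for action 0 in this bin
```

Each distinct tuple acts as one "state" for the tabular method; finer bins trade memory and learning speed for resolution.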



If you use any function approximation with RL, then you are not guaranteed to find the optimal policy. Instead you will find an approximation of that policy. However, that is often good enough in practice.



          To answer the questions more directly:




          What will be the policy if the state space is continuous in Reinforcement learning




There is no change at the theoretical level. You can express the policy as $\pi(s)$ for a deterministic policy, or $\pi(a|s)$ for a stochastic policy, regardless of the space of $s$.
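For example, a stochastic policy $\pi(a|s)$ over a small discrete action set can be parameterised directly; this sketch (the linear preference functions and weights are illustrative assumptions) produces a valid probability distribution over actions for any continuous state $s$:

```python
import math

def preferences(s, params):
    # one linear preference h(s, a) per action: h = w0 + w1 * s
    return [w0 + w1 * s for (w0, w1) in params]

def pi(s, params):
    # softmax turns preferences into action probabilities that sum to 1
    h = preferences(s, params)
    m = max(h)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in h]
    z = sum(exps)
    return [e / z for e in exps]

params = [(0.0, 1.0), (0.0, -1.0)]  # assumed weights, one pair per action
probs = pi(0.5, params)             # a valid distribution for any continuous s
```

The continuous state never needs to be enumerated; the parameters $(w_0, w_1)$ are what gets learned.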



At the implementation level, you will need to implement a parametric function that takes $s$ as one of its inputs. The function parameters $\theta$ are what is learned. For instance, if you use an action-value-based method such as Q-learning, then you will create an approximation to $Q(s,a)$; in the literature you may see this represented directly as $\hat{q}(s,a,\theta) \approx Q(s,a)$.



Using a neural network for $\hat{q}(s,a,\theta)$ is one common way to achieve this, where the neural network's weight and bias values are in $\theta$.
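As a minimal stand-in for a neural network, here is a sketch of a linear approximation $\hat{q}(s,a,\theta)$ with a one-step semi-gradient Q-learning update (the feature map, transition, and hyperparameters are illustrative assumptions, not from the answer):

```python
N_ACTIONS, N_FEATURES = 2, 3
theta = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]  # one weight vector per action

def features(s):
    # simple polynomial features of a scalar state: an assumed encoding
    return [1.0, s, s * s]

def q_hat(s, a):
    # q_hat(s, a, theta) = dot product of theta[a] with features(s)
    return sum(w * f for w, f in zip(theta[a], features(s)))

def td_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    # semi-gradient Q-learning: move theta[a] toward the bootstrapped target
    target = r + gamma * max(q_hat(s_next, b) for b in range(N_ACTIONS))
    err = target - q_hat(s, a)
    for i, f in enumerate(features(s)):
        theta[a][i] += alpha * err * f

td_update(0.5, 0, 1.0, 0.2)  # one learning step from an assumed transition
```

A neural network plays the same role as `theta` and `features` combined, but learns the feature representation as well.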




What if the state space is continuous? Will the agent have information about every possible state in the state space?




It depends what you mean by "have information". The agent cannot possibly store separate data about each state. However, it may have information about similar states, or store its knowledge about states in a more abstract fashion (such as in the parameters $\theta$).




Also, will an RL agent be able to make a decision if it is in a new state that it has not encountered during training?




Yes. For this to work well with function approximation, it relies on successful generalisation between similar states, so it is important that the state space representation supports this. For instance, if two states are close together in the representation you use, you should expect the value and policy functions to often be similar. Not always (the functions can have arbitrary shape), but trying to learn an effectively random mapping would be impossible.



















                edited Apr 18 at 8:34

























                answered Apr 18 at 7:42









Neil Slater



























