Bandit Problem Reinforcement Learning