Please point me in the right direction for further exploring my line of thinking in AI alignment

Although I’m not a researcher in AI, I have a hobby interest in the field and have been reading research papers related to my idea. I’m trying to create a reinforcement learning AI that is empathetic towards conscious agents and assists them in achieving their goals. My focus is on Cooperative Inverse Reinforcement Learning (CIRL), extended so that the AI helps any agent-like entity rather than a predefined target.

I aim to develop a reward function that fosters friendliness without relying on labeled examples and allows the AI to help non-human agents. A potential approach involves using empowerment as a measure of alignment between agents, with the ultimate goal of maximizing joint empowerment. However, I need to consider how to detect agent-like behavior and aggregate the preferences of multiple agents. I’m looking for relevant research to support my exploration.
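To make the empowerment idea concrete, here is a rough sketch of what I have in mind, assuming a tiny deterministic gridworld: with deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states the agent can reach within n steps (in general it’s the channel capacity between action sequences and resulting states). The grid size, actions, and names are purely illustrative, not from any existing library.

```python
# Rough sketch: n-step empowerment of a cell in a deterministic 5x5 gridworld.
# With deterministic dynamics, empowerment reduces to log2 of the number of
# distinct states reachable within n steps. All names are illustrative.
from itertools import product
from math import log2

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # four moves plus "stay"
GRID_W, GRID_H = 5, 5

def step(state, action):
    """Deterministic transition: move unless the target cell is off the grid."""
    x, y = state
    dx, dy = action
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID_W and 0 <= ny < GRID_H:
        return (nx, ny)
    return state  # bumping into a wall leaves the agent in place

def empowerment(state, horizon):
    """log2 of the number of distinct states reachable in `horizon` steps."""
    reachable = set()
    for plan in product(ACTIONS, repeat=horizon):
        s = state
        for a in plan:
            s = step(s, a)
        reachable.add(s)
    return log2(len(reachable))

# A corner cell is less empowered than the center of the grid.
print(empowerment((0, 0), 2))  # ~2.58 bits (6 reachable states)
print(empowerment((2, 2), 2))  # ~3.70 bits (13 reachable states)
```

For two agents, the idea would be to score joint states by some aggregate of each agent’s empowerment, which is exactly where the aggregation question above comes in.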

5 Likes

You might be interested in Bayesian inverse planning.
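The core idea is to treat goal inference as Bayes’ rule over candidate goals, with a Boltzmann-rational likelihood for the observed actions. A minimal sketch, assuming a toy 1-D world and made-up names:

```python
# Minimal sketch of Bayesian inverse planning (goal inference), assuming a
# 1-D world where the agent moves left/right and is Boltzmann-rational
# with respect to distance-to-goal. All names are illustrative.
import math

GOALS = [0, 4, 9]            # candidate goal positions on a line
BETA = 2.0                   # rationality: higher = more deterministic agent
ACTIONS = [-1, +1]           # step left / step right

def action_likelihood(state, action, goal):
    """P(action | state, goal) under a Boltzmann policy on negative distance."""
    def q(a):
        return -abs((state + a) - goal)          # closer to goal = higher value
    num = math.exp(BETA * q(action))
    den = sum(math.exp(BETA * q(a)) for a in ACTIONS)
    return num / den

def posterior_over_goals(trajectory):
    """P(goal | trajectory) proportional to P(trajectory | goal) * P(goal), uniform prior."""
    scores = []
    for g in GOALS:
        lik = 1.0
        for state, action in trajectory:
            lik *= action_likelihood(state, action, g)
        scores.append(lik)
    z = sum(scores)
    return {g: s / z for g, s in zip(GOALS, scores)}

# An agent observed stepping right from 2 to 5 is most likely heading for goal 9.
traj = [(2, +1), (3, +1), (4, +1)]
print(posterior_over_goals(traj))
```

That’s the label-free ingredient you’re after: goals are inferred from behavior rather than annotated.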

But OP, solving what you’re describing would first require, and would then itself amount to, major breakthroughs in the field. People study and research for decades without making even minor contributions. It’s fine to be interested in the philosophy of AI on a hobby level (I am), but don’t confuse it with doing AI research (I was).

4 Likes

The major missing element in your theory is intention. Your conscious agent possesses it, while your ML model does not and cannot have it. It’s understandable to be confused about this, as the hype surrounding AI often suggests that ML models can somehow have intention.

3 Likes

What do you mean by “intention”? Is there a formal definition for it, or are you highlighting a vague concept that I overlooked? When I searched for “deep learning intention,” I found papers that used intentions as labels for trajectories in their datasets, but I’m more interested in a method to infer intention without relying on labels.

2 Likes

“Intention” refers to a characteristic of optimization or search algorithms. It signifies what you are searching for or aiming to maximize. This concept applies not only to ML models but also to various traditional algorithms, such as A*.
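To make that concrete, here is a minimal A* sketch (illustrative names only): the “intention” is nothing more than the explicit goal test and the cost-plus-heuristic objective the search minimizes.

```python
# Minimal A* sketch on a grid. The "intention" is explicit in the code:
# the goal test and the cost-plus-heuristic objective being minimized.
import heapq

def a_star(start, goal, neighbors):
    """Return a shortest path from start to goal on a unit-cost graph."""
    def h(node):                                   # admissible Manhattan heuristic
        return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]     # (f = g + h, g, node, path)
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:                           # the "intended" outcome
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in neighbors(node):
            if nxt not in seen:
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None

# 5x5 grid with 4-connected moves.
def grid_neighbors(node):
    x, y = node
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in cand if 0 <= a < 5 and 0 <= b < 5]

print(a_star((0, 0), (4, 4), grid_neighbors))
```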

1 Like

In humanities research, the concept of “intersubjectivity” refers to a pattern of activity from which many useful behaviors arise, centered on an agent’s ability to intuit or infer another agent’s subjective mental states from their objective behavior or presentation. This capability is simply beyond the reach of LLMs. They can simulate the appearance of understanding, much as fiction makes us believe in the mental states of characters on screen, but it’s unrealistic to expect that simulation to be any more helpful than reading a book whose protagonist faces a problem similar to your own, though not quite the same.