Open Source · MIT License · Listed on Glama

Skills tell your AI how. OpenExp teaches it what works.

Your agent follows instructions perfectly — but doesn't learn from results. OpenExp adds outcome-based learning: approaches that led to commits, closed deals, and shipped code surface first next time.

```shell
# Install
pip install openexp-memory

# Start Qdrant
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant

# Register hooks with Claude Code
openexp hooks install

# Done. Use Claude Code as normal.
```

The Learning Loop

Every session makes the next one smarter. Q-learning, the same family of reinforcement-learning techniques behind game-playing AI, applied to your AI's working memory.

🧠 Recall: top memories injected, ranked by Q-value

⚙️ Work: every action captured as observations

📊 Evaluate: session ends — was it productive?

🔄 Reward: good session? Memories get higher scores

Skills Say "How." Nobody Says "What Works."

Static: skills don't learn

You wrote a skill once: "how to work with CRM." The agent follows it perfectly. But it doesn't know that approach A closed deals and approach B didn't. Tomorrow it'll do the same thing as yesterday — even if yesterday didn't work.

No Feedback: no outcome signal

Your agent sent 200 emails this month. Which ones got replies? Which formulations closed deals? Which debugging approaches actually fixed bugs on the first try? Your skills don't know. There's no feedback loop.

No Signal: memory services store, they don't learn

Mem0, Zep, and LangMem store and retrieve. But to them, every memory is equally important: a critical decision and a random grep carry the same weight. Storage without learning is just a database.

How OpenExp Works

Your skills say how. OpenExp learns what actually works — from real results.

1. Automatic capture. Every action in your Claude Code session — file edits, commits, commands, decisions — is automatically recorded. Hooks handle it. Zero manual work.

2. Smart retrieval. Before each response, the system finds the most relevant memories. Not by similarity alone — by proven usefulness. Five ranking signals.

3. Reward loop. After every session, the system evaluates what happened. Productive sessions reward the memories that were used. Empty sessions penalize them.
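The three steps above can be condensed into a minimal sketch. All class, method, and variable names here are hypothetical and for illustration only; this is not OpenExp's actual API, and the hook names merely echo Claude Code's event vocabulary.

```python
class SessionMemory:
    """Illustrative sketch of the recall -> work -> evaluate -> reward loop.
    Hypothetical names, not OpenExp's actual API."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha  # learning rate for the reward update
        self.q = {}         # memory text -> Q-value
        self.used = set()   # memories injected into the current session

    def on_session_start(self, k=3):
        # Recall: inject the top-k memories, ranked by Q-value.
        top = sorted(self.q, key=self.q.get, reverse=True)[:k]
        self.used = set(top)
        return top

    def on_tool_use(self, observation):
        # Work: capture each action as an observation (Q starts at 0).
        self.q.setdefault(observation, 0.0)

    def on_session_end(self, reward):
        # Evaluate + reward: nudge the used memories toward the session reward.
        for m in self.used:
            self.q[m] += self.alpha * (reward - self.q[m])
        self.used = set()
```

A real implementation would persist the Q-table between sessions (OpenExp keeps a JSON Q-cache on disk) and rank with more signals than the Q-value alone.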

Session Signals

After each session, OpenExp checks what was produced and assigns a reward score.

| Session outcome | Reward |
|---|---|
| Code committed | +0.30 |
| Pull request created | +0.20 |
| Deployed to production | +0.10 |
| Tests passed | +0.10 |
| Deal closed (CRM) | +0.80 |
| Nothing produced | -0.10 |
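A session reward could then be computed by summing the signals that fired. The weights below come from the table above; the dictionary keys and the function itself are illustrative, not OpenExp's internals.

```python
# Reward weights from the session-signal table; keys are illustrative.
SESSION_REWARDS = {
    "commit": 0.30,
    "pull_request": 0.20,
    "deploy": 0.10,
    "tests_passed": 0.10,
    "deal_closed": 0.80,
}

def session_reward(signals):
    """Sum the rewards for the observed signals; penalize an empty session."""
    if not signals:
        return -0.10
    return sum(SESSION_REWARDS.get(s, 0.0) for s in signals)
```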

Experiences — Your Process, Your Rewards

One memory can be valuable in one context and worthless in another. Define what "productive" means for your workflow.

Coding (default) pipeline: backlog → in_progress → review → merged → deployed
Signal weights: Commit +0.30 · Pull Request +0.20 · Tests pass +0.10 · Deploy +0.10 · Decisions +0.10

Sales pipeline: lead → contacted → qualified → proposal → negotiation → won
Signal weights: Decisions +0.20 · Email sent +0.15 · Follow-up +0.10 · Commit +0.05 · Pull Request +0.05

Dealflow pipeline: lead → discovery → nda → proposal → negotiation → invoice → paid
Signal weights: Payment received +0.30 · Proposal sent +0.25 · Invoice sent +0.20 · Email sent +0.15 · Decisions +0.15

Support pipeline: new_ticket → investigating → responded → resolved → closed
Signal weights: Ticket closed +0.25 · Email sent +0.10 · Decisions +0.10 · Follow-up +0.10
Same memory, different scores: "Discussed NDA with client — lawyers took 2 weeks, 10+7 year term"
Coding experience: 0.05 (no commits, useless)
Dealflow experience: 0.72 (the NDA led to payment)
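The profiles above amount to different weight tables over a shared signal vocabulary. A sketch, with weights taken from the coding and dealflow tables; the profile keys and function are illustrative:

```python
# Two experience profiles as weight tables (weights from the tables above;
# the dictionary structure and function are illustrative, not OpenExp's code).
EXPERIENCES = {
    "coding":   {"commit": 0.30, "pull_request": 0.20, "tests_pass": 0.10,
                 "deploy": 0.10, "decision": 0.10},
    "dealflow": {"payment_received": 0.30, "proposal_sent": 0.25,
                 "invoice_sent": 0.20, "email_sent": 0.15, "decision": 0.15},
}

def reward_for(experience, signals):
    """Score one session's signals under a given experience profile."""
    weights = EXPERIENCES[experience]
    return sum(weights.get(s, 0.0) for s in signals)

# A session where an NDA discussion led to a payment scores high under
# dealflow but earns almost nothing under coding:
nda_session = ["decision", "email_sent", "payment_received"]
```

The learned scores in the example above (0.05 vs 0.72) are Q-values accumulated over many such sessions, not a single weighted sum.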

How OpenExp Compares

| Feature | OpenExp | Mem0 | Zep | LangMem |
|---|---|---|---|---|
| Learns from outcomes | Q-learning | No | No | No |
| Process-aware | Pipeline stages + signals | No | No | No |
| Memory type filtering | Reward only decisions | No | No | No |
| Hybrid retrieval | 5 signals | Vector only | Graph + vector | Vector only |
| Claude Code native | Zero-config hooks | Integration required | Integration required | Integration required |
| Fully local | Qdrant + FastEmbed | Cloud API | Cloud or self-hosted | Cloud API |

Five-Factor Retrieval

Not just "find similar text." Five signals weighted together. After 100 sessions, your retrieval is personalized by actual outcomes.

30% Q-value: proven usefulness
30% Semantic: meaning, not keywords
15% Recency: recent memories get a boost
15% Importance: decisions outrank commands
10% BM25: exact keyword matches

Fully Local. No SaaS.

No data leaves your machine. All data lives under ~/.openexp/. You own everything.

🐳 Qdrant: vector DB in a Docker container on your machine

FastEmbed: local embeddings, no API calls needed

💾 Q-Cache: a JSON file on disk, fully inspectable

🔍 Explainable: 5-level audit trail from raw logs to LLM reasoning

FAQ

Real questions from developers, founders, sales teams, and skeptics.

Installation & Setup
How long does installation take?
If you already have Docker and Claude Code — realistically 5 minutes. Clone the repo and run ./setup.sh — the script creates a venv, starts Qdrant in Docker, creates the collection, copies .env, and registers the MCP server and hooks in Claude Code. Requires Python 3.11+ and Docker. No API key needed for core functionality — embeddings run locally via FastEmbed. First launch downloads the model (~1 min), then it’s cached.
I’m not a programmer. Can I install this myself?
Honestly — it’ll be tough on your own. Best option: ask whoever set up Claude Code for you to spend 15-20 minutes. After installation everything runs in the background — you don’t do anything extra, just work as usual.
How much disk space does it use?
Budget 500MB-1GB on startup (Qdrant Docker image + embedding model). Memories themselves are tiny: 10,000 records = ~15MB. With active use (50 sessions/week) observations take 10-20MB/month. RAM: Qdrant uses 50-100MB idle.
How do I uninstall it?
Clean removal in 4 steps: (1) docker stop/rm the Qdrant container, (2) rm -rf ~/.openexp/, (3) remove the openexp block from ~/.claude/settings.local.json, (4) delete the openexp folder. Nothing installs system-wide, zero leftover files.
How It Works
How is this different from CLAUDE.md?
CLAUDE.md is static context that you write and update by hand. OpenExp adds dynamic context: what you did yesterday, which approaches worked, which didn't. They work together. The real advantage shows when you return to a project after a week away, or when you fixed a similar bug a month ago: the solution surfaces automatically.
What exactly gets remembered?
Everything you do through Claude Code: file edits, commands, decisions, emails. You can also explicitly say “remember that the client wants a 15% discount” — stored as a separate fact. It doesn’t record calls directly (it’s a text tool), but if you tell Claude to write down the summary after a call — that gets stored.
How does the system decide what’s important?
Q-learning. Every memory has a Q-value (from -0.5 to 1.0). If a memory was retrieved before a productive session (commit, closed deal) — its Q-value rises. If the session was empty — it drops. Over dozens of sessions, useful memories surface first, noise sinks.
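The mechanic described here is a standard tabular Q-learning update. A minimal sketch, assuming the alpha = 0.25 learning rate quoted elsewhere in this FAQ and clamping to the stated [-0.5, 1.0] range:

```python
def update_q(q, reward, alpha=0.25):
    """One Q-learning step: move Q toward the session reward,
    clamped to [-0.5, 1.0]. Illustrative, not OpenExp's actual code."""
    q = q + alpha * (reward - q)
    return max(-0.5, min(1.0, q))

# A memory used before several productive sessions rises steadily:
q = 0.0
for _ in range(4):
    q = update_q(q, reward=0.8)  # e.g. repeated "deal closed" sessions
# q crosses the 0.5 threshold after 4 such positive updates
```

Because each step moves Q only a fraction of the way toward the reward, one lucky session cannot dominate; the score reflects a trend.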
Reward System
What reward signals does the system use?
Two types. Session rewards evaluate each working session automatically: commit = +0.3, PR = +0.2, deploy = +0.1, tests = +0.1, decisions = +0.1, files written = +0.02 each. Empty session = -0.1 base + -0.1 penalty. Separately, business outcome rewards fire through the CRM resolver: closed deal = +0.8, proposal sent = +0.25, payment received = +0.3. These are different reward paths — session rewards work automatically, business outcomes require CRM integration.
Different Workflows
Is this only for programmers?
No. There are ready profiles: sales with funnel stages (lead → contacted → qualified → proposal → negotiation → won) and dealflow (includes NDA, invoicing, payment). For a salesperson, a “productive session” means a sent email or a decision made, not a commit. Enable with one variable: OPENEXP_EXPERIENCE=sales. But honestly — these profiles are new and haven’t been battle-tested by many users yet. For other workflows you can create your own via openexp experience create.
Same memory, different value in different contexts?
Exactly. “Discussed NDA with client” in dealflow experience has Q-value 0.72 (led to payment), but in coding experience — 0.05 (no commits). This is called Experiences — different scoring profiles for different workflows.
What if I debug for 8 hours, find the root cause, but don’t commit?
Fair problem. By default such a session gets negative reward, and its memories are penalized. Partial solutions: create a separate Experience for research workflow with different signals, or manually calibrate via calibrate_experience_q. But by default the system is biased toward “visible productivity.”
Privacy & Reliability
Does my data go anywhere?
No. Qdrant runs in Docker on your machine, and embeddings are generated locally via FastEmbed. Zero cloud API calls for core operations. The only exception is optional LLM enrichment through the Anthropic API (memory classification). Disable it with OPENEXP_EXPLANATION_ENABLED=false; this doesn't affect core functionality.
If Docker crashes or computer shuts down — do I lose everything?
No. Qdrant persists data to disk. When Docker restarts — the container starts automatically (restart: unless-stopped). Q-cache is also on disk. The only thing you might lose is observations from the current unfinished session.
Integrations & Limitations
Does this work with Cursor or aider?
Currently Claude Code only. Integration is built on the hooks system (SessionStart, PostToolUse, SessionEnd) and MCP — these are Claude Code-specific APIs. Cursor and aider aren't supported. The core engine is a generic Python library, so in theory you could write an adapter, but nobody has done that yet.
We’re on LangChain/LangGraph. How to integrate without Claude Code?
You can use the core Python library directly: search, QCache, add_memory(). But you’ll need to: (1) capture observations instead of the PostToolUse hook, (2) determine session end and its productivity, (3) integrate retrieval into your pipeline. REST API or LangChain package — not available yet.
I have 5+ projects. Won’t it get confused?
Full multi-project isolation doesn’t exist yet — one Qdrant collection for everything. Q-learning partially self-corrects: if a memory from a React project didn’t help in a Go session — its Q-value drops. Workaround: different OPENEXP_COLLECTION via .env for different projects.
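Under that workaround, per-project isolation might look like one .env file per project directory. The paths and collection names below are hypothetical; only the OPENEXP_COLLECTION variable name comes from the answer above.

```shell
# ~/work/react-app/.env  (hypothetical path and value)
OPENEXP_COLLECTION=react_app_memories

# ~/work/go-service/.env  (hypothetical path and value)
OPENEXP_COLLECTION=go_service_memories
```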
Does it support multi-tenant for SaaS?
No. Currently a single-tenant library: one Qdrant, one Q-cache, one set of hooks. For SaaS with hundreds of users you’d need a custom HTTP layer with tenant routing. Not on the near-term roadmap.
Metrics & Evidence
Are there benchmarks? Retrieval quality graphs?
Honest answer — no benchmarks. None. We openly state this in CONTRIBUTING.md as an area where help is needed. “After 100 sessions” is a projection from Q-learning math (at alpha=0.25 you need ~4 positive updates to reach Q>0.5), not a result from a controlled experiment.
A/B test of “with Q-learning” vs “just vector search”?
No. The theoretical argument: similarity can’t distinguish current information from outdated, Q-value adds the signal “this has helped before.” But no ablation study has been conducted. At this stage Q-value reranking barely affects results because most memories have Q near 0. The potential is there, the proof is not.
You retrieve 10 memories, all get equal reward. But maybe only 1 actually helped?
Fundamental credit assignment problem, and we haven’t solved it. Partial mitigation: Experiences let you filter which memory types receive rewards (only “decision” and “insight,” not “action”). With enough sessions the noise averages out, but it’s slow.
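The type-filtering mitigation can be sketched in a few lines. The memory-dict shape and function name are hypothetical; only the idea of rewarding "decision" and "insight" but not "action" comes from the answer above.

```python
# Types that are allowed to receive session rewards; "action" memories
# (e.g. a random grep) are excluded. Structure is illustrative.
REWARDABLE_TYPES = {"decision", "insight"}

def apply_session_reward(memories, reward, alpha=0.25):
    """Update Q only for retrieved memories of rewardable types.
    Each memory is a dict with 'type' and 'q' keys (hypothetical shape)."""
    for m in memories:
        if m["type"] in REWARDABLE_TYPES:
            m["q"] += alpha * (reward - m["q"])
    return memories
```

Filtering narrows the credit-assignment noise but does not solve it: a useless decision retrieved alongside a useful one still shares the reward.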
Reward weights (commit=0.3, PR=0.2) — aren’t those just your personal patterns?
Fair point. Default weights are literally my workflow. A data scientist in Jupyter who never commits — every session gets negative reward. Experiences are an attempt to fix this: create a separate reward profile. But only 3 profiles ship (default, sales, dealflow), and none have been tested by other users.
Why OpenExp
Why not just Mem0? They have 51K stars and $24M funding.
Mem0 is a different weight class in infrastructure maturity. What OpenExp offers that Mem0 doesn’t: Q-learning ranking, outcome-based reward loop, process-aware memory. No competitor has learned prioritization. Realistic approach: use OpenExp’s Q-learning engine as a reranking step on top of your existing memory layer, rather than a full replacement.
How do I know it’s actually working?
After 2-3 weeks you’ll notice Claude starts “knowing” your context: conventions, past decisions, working approaches. There are also inspection tools: experience_insights shows the most valuable memory types, experience_top_memories shows top by Q-value, explain_q explains in plain language why a specific memory has its rating. But be realistic — the system needs time to accumulate data.

Stop telling. Start teaching.

Skills say how. OpenExp teaches what works. Open source. MIT license.