feat(core): Engram, Store, Retriever, CLI - Second Brain base system

- src/engram.py: memory unit with confidence, correctness, links
- src/store.py: persistent SQLite FTS5 store
- src/retriever.py: hybrid search + reranking
- src/cli.py: command-line interface

Issue: #1
4
.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
__pycache__/
*.pyc
.venv/
data/
169
docs/ARCHITECTURE.md
Normal file
@@ -0,0 +1,169 @@
# Second Brain - Architecture

## Vision

A second brain for OpenClaw that provides:
- **Short-term memory**: current sessions, context, unprocessed information
- **Long-term memory**: accumulated knowledge, scored, linked, prioritized
- **Scoring system**: every fact carries a confidence value (0-1) that can be corrected
- **Proactivity**: the agent wakes up, checks, and acts without an explicit command
- **Self-healing**: detects its own errors, corrects them, and learns from them

## Modules

### 1. Engram Store (memory units)
Every piece of information is stored as an "engram":
```
{
  id: uuid
  content: string (Markdown)
  vector: [float...] (embedding)
  metadata: {
    source: "user|agent|web|file"
    confidence: 0.0-1.0
    created: timestamp
    modified: timestamp
    access_count: int
    last_accessed: timestamp
    tags: [string...]
    session_id: string|null
    agent_id: string|null
  },
  correctness: {
    confirmed: bool
    confirmations: int
    rejections: int
    last_reviewed: timestamp
    review_history: [
      { by: "user|agent", action: "confirm|reject|modify", at: timestamp, note: string }
    ]
  },
  links: [uuid...] (linked engrams)
  hierarchy: {
    parent: uuid|null
    children: [uuid...]
    depth: int
  }
}
```

### 2. Vector Store (ChromaDB)
- Local, SQLite-based vector database
- No external service required
- Embeddings via sentence-transformers (all-MiniLM-L6-v2)
- ~22M-parameter model, CPU-only, 384 dimensions

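As a rough illustration of what the vector store does at query time, the sketch below ranks stored embeddings by cosine similarity in plain Python. The toy 3-dimensional vectors and the helper names `cosine_similarity` and `top_k` are assumptions made for this example; a real setup would use the 384-dimensional vectors from all-MiniLM-L6-v2 and leave the search to ChromaDB.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k stored keys most similar to the query, best first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones would come from the embedding model.
toy_vectors = {
    "cats": [1.0, 0.0, 0.0],
    "kittens": [0.9, 0.1, 0.0],
    "taxes": [0.0, 0.0, 1.0],
}
ranking = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(ranking)
```

The brute-force scan is only meant to show the ranking criterion; a vector database replaces it with an index once the collection grows.
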
### 3. Neural Scorer
- Small feed-forward network (PyTorch)
- Input: embedding + metadata (age, access frequency, source)
- Output: confidence score (0-1)
- Training: reinforcement from user feedback (correct/incorrect)

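To make the input/output contract of the scorer concrete, here is a minimal forward pass of a one-hidden-layer network written in plain Python. It is a stand-in, not the planned PyTorch model: the three input features, the layer sizes, and all weights are made-up values for illustration, and nothing here is trained.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score(
    features: List[float],
    w_hidden: List[List[float]],
    b_hidden: List[float],
    w_out: List[float],
    b_out: float,
) -> float:
    """One ReLU hidden layer followed by a sigmoid output unit."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Assumed features: normalized age, normalized access count, source flag.
features = [0.2, 0.5, 1.0]
# Hand-picked weights, purely illustrative.
w_hidden = [[-1.0, 0.5, 0.3], [0.4, 0.8, -0.2]]
b_hidden = [0.0, 0.1]
w_out = [1.2, 0.7]
b_out = -0.5

confidence = score(features, w_hidden, b_hidden, w_out, b_out)
print(f"{confidence:.3f}")
```

The sigmoid on the output is what keeps the result inside the 0-1 range that the rest of the system expects for a confidence value.
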
### 4. Retrieval Engine
- Hybrid: semantic search (vector) + keyword search (BM25-like)
- Reranking by confidence, recency, relevance
- Contextual compression: return only the relevant parts

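One common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The architecture above does not name a fusion method, so the sketch below is only one possible choice; the function name, the constant `k = 60`, and the example IDs are assumptions.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic = ["a", "b", "c"]  # hypothetical vector-search result, best first
keyword = ["b", "d", "a"]   # hypothetical keyword-search result, best first
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['b', 'a', 'd', 'c']
```

Engrams found by both searches ("a" and "b") rise to the top, and only ranks are used, so the two searches do not need comparable score scales.
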
### 5. Proactivity Engine
- Cron-driven background tasks
- Heartbeat-driven checks
- Triggers: time, events, state changes
- Decides on its own: what matters right now?

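A heartbeat check could be as simple as asking which tasks are overdue and ordering them by importance. The sketch below assumes a hypothetical `Task` record and `due_tasks` helper; the task names, intervals, and priorities are invented for illustration and only cover the time-based trigger.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    interval_s: float  # how often the task should run, in seconds
    last_run: float    # timestamp of the previous run
    priority: int      # higher means more important


def due_tasks(tasks: List[Task], now: float) -> List[Task]:
    """Tasks whose interval has elapsed, most important first."""
    due = [t for t in tasks if now - t.last_run >= t.interval_s]
    return sorted(due, key=lambda t: t.priority, reverse=True)


tasks = [
    Task("review_low_confidence", interval_s=3600, last_run=0, priority=2),
    Task("backup_export", interval_s=86400, last_run=0, priority=1),
    Task("retry_failed_calls", interval_s=600, last_run=3000, priority=3),
]
names = [t.name for t in due_tasks(tasks, now=4000.0)]
print(names)  # ['retry_failed_calls', 'review_low_confidence']
```
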
### 6. Error Correction
- Detects failed tool calls
- Stores errors together with their context
- Analyzes patterns: "whenever X, Y fails"
- Auto-fix: alternative strategies, fallbacks

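The pattern analysis can start out as plain counting: group failed calls by context and tool, and report the combinations that recur. The event fields (`tool`, `context`, `ok`) and the helper name `failure_patterns` are assumptions for this sketch, not an existing log format.

```python
from collections import Counter
from typing import Dict, List, Tuple


def failure_patterns(
    events: List[Dict[str, object]], min_count: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Count (context, tool) pairs among failed calls and keep the recurring ones."""
    counts = Counter(
        (str(e["context"]), str(e["tool"])) for e in events if not e["ok"]
    )
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Hypothetical tool-call log.
log = [
    {"tool": "fetch", "context": "offline", "ok": False},
    {"tool": "fetch", "context": "offline", "ok": False},
    {"tool": "fetch", "context": "online", "ok": True},
    {"tool": "parse", "context": "offline", "ok": False},
]
patterns = failure_patterns(log)
print(patterns)  # [(('offline', 'fetch'), 2)]
```

Each reported pair reads as "whenever the context is X, tool Y fails" and could be stored as an engram of its own so that a fallback strategy can be looked up later.
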
### 7. Visualization
- Local Streamlit dashboard
- Graph view: linked engrams
- Timeline: what was learned when?
- Stats: confidence, correctness, coverage

## Tech Stack

| Component | Technology | Why |
|-----------|------------|-----|
| Vector DB | ChromaDB (local) | No external service, SQLite-based |
| Embeddings | sentence-transformers | Small, fast, offline |
| Neural Scorer | PyTorch (custom) | Trainable, local, no API key |
| Frontend | Streamlit | Fast, Python-native, interactive |
| Data layer | Python classes + SQLite | Controllable, debuggable |
| Processes | Cron (OpenClaw built-in) + heartbeat | No external scheduler |

## Data Flow

```
User Input / Event
    |
    v
[Parser] -> create engram
    |
    v
[Embedding] -> generate vector
    |
    v
[Vector Store] -> store
    |
    v
[Neural Scorer] -> initial confidence
    |
    v
[Link Engine] -> link to existing engrams
    |
    v
[Retrieval] <- query
    |
    v
[Rerank] -> best results
    |
    v
[Response] -> to user / agent
    |
    v
[Feedback Loop] <- correct/incorrect?
    |
    v
[Learn] -> train scorer, adjust confidence
```

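To see how feedback moves a score before any neural scorer exists, the sketch below restates the Phase 1 heuristic from `Engram.compute_confidence` in `src/engram.py` as a standalone function and evaluates it for a fresh, a confirmed, and a rejected engram. The function name `phase1_confidence` and its flat argument list are assumptions for illustration; the weights are the ones used in that method.

```python
def phase1_confidence(
    base: float,
    confirmations: int,
    rejections: int,
    access_count: int,
    age_days: float,
    grounding: int,
) -> float:
    """Weighted sum: 0.3 base + 0.3 correctness + up to 0.1 access, 0.1 recency, 0.2 grounding."""
    total = confirmations + rejections
    correctness = 0.5 if total == 0 else confirmations / total
    access = min(access_count / 10, 1.0) * 0.1
    recency = max(0.0, 1.0 - age_days / 30) * 0.1
    grounding_boost = (grounding / 4) * 0.2
    combined = base * 0.3 + correctness * 0.3 + access + recency + grounding_boost
    return min(max(combined, 0.0), 1.0)


# New engram: base 0.5, no reviews, never accessed, just created, grounding ASSUMPTION (1).
fresh = phase1_confidence(0.5, 0, 0, 0, 0.0, 1)      # 0.15 + 0.15 + 0 + 0.10 + 0.05 = 0.45
confirmed = phase1_confidence(0.5, 1, 0, 0, 0.0, 1)  # 0.15 + 0.30 + 0 + 0.10 + 0.05 = 0.60
rejected = phase1_confidence(0.5, 0, 1, 0, 0.0, 1)   # 0.15 + 0.00 + 0 + 0.10 + 0.05 = 0.30
print(round(fresh, 2), round(confirmed, 2), round(rejected, 2))
```

A single confirmation or rejection shifts the score by 0.15 in either direction, because the correctness term carries a weight of 0.3 and starts at the neutral value 0.5.
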
## File Structure

```
second-brain/
├── src/
│   ├── __init__.py
│   ├── engram.py          # Engram model
│   ├── store.py           # ChromaDB wrapper
│   ├── embedder.py        # Embedding engine
│   ├── scorer.py          # Neural confidence scorer
│   ├── retriever.py       # Hybrid retrieval
│   ├── linker.py          # Linking engine
│   ├── proactivity.py     # Proactivity manager
│   ├── error_handler.py   # Error detection & correction
│   ├── trainer.py         # RL training
│   └── config.py          # Configuration
├── data/
│   ├── chromadb/          # Vector DB files
│   ├── engrams.jsonl      # Backup of all engrams
│   └── scorer_model.pt    # Trained scorer network
├── docs/
│   ├── ARCHITECTURE.md
│   └── API.md
├── tests/
│   └── test_core.py
├── scripts/
│   └── init_db.py
└── app.py                 # Streamlit dashboard
```

## Next Steps

1. Implement the core modules (Store, Embedder, Engram)
2. Train the scorer on dummy data
3. Validate the retrieval engine with test data
4. Build the dashboard
5. Set up cron jobs for proactivity
6. Address issues #1 and #2 (prevent looping)
8
src/__init__.py
Normal file
@@ -0,0 +1,8 @@
"""Second Brain - memory system for OpenClaw."""

from .engram import Engram, Grounding, Correctness, ReviewEntry
from .store import EngramStore
from .retriever import Retriever

__version__ = "0.1.0"
__all__ = ["Engram", "Grounding", "Correctness", "ReviewEntry", "EngramStore", "Retriever"]
172
src/cli.py
Normal file
@@ -0,0 +1,172 @@
#!/usr/bin/env python3
"""
Second Brain CLI - direct use without external dependencies.

Usage:
    python -m src.cli add "This is a fact" --tag important --source user
    python -m src.cli search "fact"
    python -m src.cli show <id>
    python -m src.cli confirm <id>
    python -m src.cli reject <id>
    python -m src.cli list
    python -m src.cli stats
    python -m src.cli export backup.jsonl
"""

import json
import argparse
from pathlib import Path

from .store import EngramStore
from .engram import Engram, Grounding
from .retriever import Retriever

DB_PATH = Path(__file__).parent.parent / "data" / "brain.sqlite"


def get_store():
    """Open the store, creating the data directory if it does not exist."""
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    return EngramStore(str(DB_PATH))


def cmd_add(args):
    store = get_store()
    eg = Engram.create(
        content=" ".join(args.content),
        source=args.source,
        tags=args.tag,
        grounding=Grounding[args.grounding] if args.grounding else Grounding.ASSUMPTION,
    )
    store.save(eg)
    print(f"Created: {eg.id}\n Content: {eg.content[:100]}\n Confidence: {eg.compute_confidence():.2f}")


def cmd_search(args):
    store = get_store()
    ret = Retriever(store)
    results = ret.retrieve(
        " ".join(args.query),
        limit=args.limit,
        min_confidence=args.min_confidence,
        tag_filter=args.tag,
    )
    print(f"\n=== {len(results)} Results ===")
    for r in results:
        eg = r["engram"]
        conf = eg.compute_confidence()
        marker = "✅" if conf > 0.7 else "⚠️" if conf > 0.4 else "❌"
        print(f"\n{marker} [{str(eg.id)[:8]}] Score: {conf:.2f} ({r['match_type']})")
        print(f" {eg.content[:120]}{'...' if len(eg.content) > 120 else ''}")
        print(f" Tags: {', '.join(eg.metadata.get('tags', []))} | Source: {eg.metadata.get('source')}")
        print(f" Access: {eg.metadata.get('access_count', 0)} | Reviews: +{eg.correctness.confirmations}/-{eg.correctness.rejections}")


def cmd_show(args):
    store = get_store()
    eg = store.get(args.id)
    if not eg:
        print(f"Not found: {args.id}")
        return
    print(json.dumps(eg.to_dict(), indent=2, ensure_ascii=False, default=str))


def cmd_confirm(args):
    store = get_store()
    eg = store.get(args.id)
    if not eg:
        print(f"Not found: {args.id}")
        return
    eg.correctness.confirm(by="user", note=args.note or "Confirmed via CLI")
    store.save(eg)
    print(f"✅ Confirmed [{str(eg.id)[:8]}] -> Confidence: {eg.compute_confidence():.2f}")


def cmd_reject(args):
    store = get_store()
    eg = store.get(args.id)
    if not eg:
        print(f"Not found: {args.id}")
        return
    eg.correctness.reject(by="user", note=args.note or "Rejected via CLI")
    store.save(eg)
    print(f"❌ Rejected [{str(eg.id)[:8]}] -> Confidence: {eg.compute_confidence():.2f}")


def cmd_list(args):
    store = get_store()
    egs = store.get_all(limit=args.limit)
    print(f"\n=== {len(egs)} Engrams ===")
    for eg in egs:
        conf = eg.compute_confidence()
        marker = "✅" if conf > 0.7 else "⚠️" if conf > 0.4 else "❌"
        print(f"{marker} [{str(eg.id)[:8]}] ({conf:.2f}) {eg.content[:60]}{'...' if len(eg.content) > 60 else ''}")


def cmd_stats(args):
    store = get_store()
    ret = Retriever(store)
    s = ret.stats()
    print("\n=== Second Brain Stats ===")
    print(f" Total Engrams: {s['total_engrams']}")
    print(f" Confirmed: {s['confirmed']}")
    print(f" Unconfirmed: {s['unconfirmed']}")
    print(" Sources:")
    for src, count in s.get("sources", {}).items():
        print(f" {src}: {count}")
    print(f" DB Size: {s['db_size_bytes'] / 1024:.1f} KB")


def cmd_export(args):
    store = get_store()
    count = store.export_jsonl(args.path)
    print(f"Exported {count} engrams to {args.path}")


def main():
    parser = argparse.ArgumentParser(description="Second Brain CLI")
    sub = parser.add_subparsers(dest="cmd")

    p_add = sub.add_parser("add", help="Add a new engram")
    p_add.add_argument("content", nargs="+")
    p_add.add_argument("--tag", action="append", default=[])
    p_add.add_argument("--source", default="user")
    p_add.add_argument("--grounding", choices=[g.name for g in Grounding])

    p_search = sub.add_parser("search", help="Search engrams")
    p_search.add_argument("query", nargs="+")
    p_search.add_argument("--limit", type=int, default=5)
    p_search.add_argument("--min-confidence", type=float, default=0.0)
    p_search.add_argument("--tag", default=None)

    p_show = sub.add_parser("show", help="Show engram details")
    p_show.add_argument("id")

    p_confirm = sub.add_parser("confirm", help="Confirm an engram")
    p_confirm.add_argument("id")
    p_confirm.add_argument("--note", default="")

    p_reject = sub.add_parser("reject", help="Reject an engram")
    p_reject.add_argument("id")
    p_reject.add_argument("--note", default="")

    p_list = sub.add_parser("list", help="List recent engrams")
    p_list.add_argument("--limit", type=int, default=20)

    p_stats = sub.add_parser("stats", help="Show statistics")

    p_export = sub.add_parser("export", help="Export to JSONL")
    p_export.add_argument("path")

    args = parser.parse_args()
    if not args.cmd:
        parser.print_help()
        return

    # Dispatch the chosen subcommand to its handler.
    {"add": cmd_add, "search": cmd_search, "show": cmd_show,
     "confirm": cmd_confirm, "reject": cmd_reject, "list": cmd_list,
     "stats": cmd_stats, "export": cmd_export}[args.cmd](args)


if __name__ == "__main__":
    main()
230
src/engram.py
Normal file
@@ -0,0 +1,230 @@
"""
Engram - memory unit for the Second Brain.
Pure Python, no external dependencies.
"""

import json
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional, List, Dict, Any
from uuid import uuid4, UUID


class Grounding(IntEnum):
    """Origin/reliability of a piece of information."""
    UNKNOWN = 0
    ASSUMPTION = 1
    INFERRED = 2
    SOURCED = 3
    VERIFIED = 4


@dataclass
class ReviewEntry:
    """One entry in the review history."""
    by: str      # "user" or an agent_id
    action: str  # "confirm", "reject", "modify"
    at: str      # ISO-8601 timestamp
    note: str = ""

    def to_dict(self) -> dict:
        return {"by": self.by, "action": self.action, "at": self.at, "note": self.note}

    @classmethod
    def from_dict(cls, d: dict) -> "ReviewEntry":
        return cls(d["by"], d["action"], d["at"], d.get("note", ""))


@dataclass
class Correctness:
    """Tracks the correctness of an engram over time."""
    confirmed: bool = False
    confirmations: int = 0
    rejections: int = 0
    last_reviewed: Optional[str] = None
    review_history: List[ReviewEntry] = field(default_factory=list)

    def confirm(self, by: str, note: str = "") -> None:
        self.confirmations += 1
        self.confirmed = True
        self.last_reviewed = _now()
        self.review_history.append(ReviewEntry(by, "confirm", self.last_reviewed, note))

    def reject(self, by: str, note: str = "") -> None:
        self.rejections += 1
        self.confirmed = False
        self.last_reviewed = _now()
        self.review_history.append(ReviewEntry(by, "reject", self.last_reviewed, note))

    def score(self) -> float:
        """Confidence score derived from the review history."""
        total = self.confirmations + self.rejections
        if total == 0:
            return 0.5  # undetermined
        return self.confirmations / total

    def to_dict(self) -> dict:
        return {
            "confirmed": self.confirmed,
            "confirmations": self.confirmations,
            "rejections": self.rejections,
            "last_reviewed": self.last_reviewed,
            "review_history": [r.to_dict() for r in self.review_history],
        }

    @classmethod
    def from_dict(cls, d: dict) -> "Correctness":
        c = cls()
        c.confirmed = d.get("confirmed", False)
        c.confirmations = d.get("confirmations", 0)
        c.rejections = d.get("rejections", 0)
        c.last_reviewed = d.get("last_reviewed")
        c.review_history = [ReviewEntry.from_dict(r) for r in d.get("review_history", [])]
        return c


@dataclass
class Engram:
    """
    A memory unit (engram).

    Every fact, observation, and error is stored as an engram.
    It carries its own confidence value and its own review history.
    """
    id: UUID
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    correctness: Correctness = field(default_factory=Correctness)
    links: List[UUID] = field(default_factory=list)
    hierarchy: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None  # computed on demand

    @classmethod
    def create(
        cls,
        content: str,
        source: str = "agent",
        confidence: float = 0.5,
        tags: Optional[List[str]] = None,
        session_id: Optional[str] = None,
        agent_id: Optional[str] = None,
        grounding: Grounding = Grounding.ASSUMPTION,
        parent: Optional[UUID] = None,
    ) -> "Engram":
        """Factory: creates a new engram with sensible defaults."""
        now = _now()
        return cls(
            id=uuid4(),
            content=content,
            metadata={
                "source": source,
                "confidence": confidence,
                "created": now,
                "modified": now,
                "access_count": 0,
                "last_accessed": now,
                "tags": tags or [],
                "session_id": session_id,
                "agent_id": agent_id,
                "grounding": grounding.value,
                "hash": _hash(content),
            },
            correctness=Correctness(),
            links=[],
            hierarchy={"parent": str(parent) if parent else None, "children": [], "depth": 0},
        )

    def touch(self) -> None:
        """Records an access: updates the counter and the timestamp."""
        self.metadata["access_count"] = self.metadata.get("access_count", 0) + 1
        self.metadata["last_accessed"] = _now()

    def add_link(self, other: "Engram") -> None:
        """Bidirectional link to another engram."""
        if other.id not in self.links:
            self.links.append(other.id)
        if self.id not in other.links:
            other.links.append(self.id)

    def set_parent(self, parent: "Engram") -> None:
        """Sets the parent-child relationship."""
        self.hierarchy["parent"] = str(parent.id)
        self.hierarchy["depth"] = parent.hierarchy.get("depth", 0) + 1
        if str(self.id) not in parent.hierarchy.get("children", []):
            parent.hierarchy.setdefault("children", []).append(str(self.id))

    def compute_confidence(self) -> float:
        """
        Computes the overall confidence from several factors.
        No neural network needed - a heuristic for Phase 1.
        """
        base = self.metadata.get("confidence", 0.5)
        # Correctness from the review history
        correctness_score = self.correctness.score()
        # Access frequency (frequently used engrams tend to matter more)
        access = min(self.metadata.get("access_count", 0) / 10, 1.0) * 0.1
        # Age (newer information is more relevant)
        age_days = _age_days(self.metadata.get("created", _now()))
        recency = max(0.0, 1.0 - (age_days / 30)) * 0.1  # drops to 0 after 30 days
        # Grounding
        grounding_boost = (self.metadata.get("grounding", 0) / 4) * 0.2

        # The weights sum to 1.0: 0.3 + 0.3 + 0.1 + 0.1 + 0.2.
        combined = (
            base * 0.3 +
            correctness_score * 0.3 +
            access +
            recency +
            grounding_boost
        )
        return min(max(combined, 0.0), 1.0)

    def to_dict(self) -> dict:
        return {
            "id": str(self.id),
            "content": self.content,
            "metadata": self.metadata,
            "correctness": self.correctness.to_dict(),
            "links": [str(link) for link in self.links],
            "hierarchy": self.hierarchy,
            "embedding": self.embedding,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "Engram":
        return cls(
            id=UUID(d["id"]),
            content=d["content"],
            metadata=d.get("metadata", {}),
            correctness=Correctness.from_dict(d.get("correctness", {})),
            links=[UUID(link) for link in d.get("links", [])],
            hierarchy=d.get("hierarchy", {}),
            embedding=d.get("embedding"),
        )

    def to_json(self) -> str:
        return json.dumps(self.to_dict(), ensure_ascii=False, indent=2)

    @classmethod
    def from_json(cls, s: str) -> "Engram":
        return cls.from_dict(json.loads(s))


# --- Helpers ---

def _now() -> str:
    """Current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()


def _hash(content: str) -> str:
    """Short content fingerprint: first 16 hex characters of the SHA-256 digest."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]


def _age_days(iso_str: str) -> float:
    """Age in days of an ISO-8601 timestamp; 0.0 if it cannot be parsed."""
    try:
        dt = datetime.fromisoformat(iso_str)
        if dt.tzinfo is None:
            # Treat naive timestamps as UTC so the subtraction is well-defined.
            dt = dt.replace(tzinfo=timezone.utc)
        return (datetime.now(timezone.utc) - dt).total_seconds() / 86400
    except (ValueError, TypeError):
        return 0.0
55
src/retriever.py
Normal file
@@ -0,0 +1,55 @@
"""
Hybrid retrieval engine.
Phase 1: FTS keyword search + confidence reranking.
Phase 2: + embeddings + fusion.
"""

from typing import List, Dict, Any, Optional
from .engram import Engram
from .store import EngramStore


class Retriever:
    def __init__(self, store: EngramStore):
        self.store = store

    def retrieve(
        self,
        query: str,
        limit: int = 5,
        min_confidence: float = 0.0,
        source_filter: Optional[str] = None,
        tag_filter: Optional[str] = None,
    ) -> List[Dict[str, Any]]:
        results = []
        # Over-fetch so that filtering still leaves enough candidates.
        keyword_results = self.store.search_text(query, limit=limit * 3)
        for eg in keyword_results:
            conf = eg.compute_confidence()
            if conf < min_confidence:
                continue
            if source_filter and eg.metadata.get("source") != source_filter:
                continue
            if tag_filter and tag_filter not in eg.metadata.get("tags", []):
                continue
            results.append({"engram": eg, "score": conf, "match_type": "keyword"})
        results.sort(key=lambda r: r["score"], reverse=True)
        results = results[:limit]
        # Record the access only for engrams that are actually returned.
        for r in results:
            r["engram"].touch()
            self.store.save(r["engram"])
        return results

    def related(self, engram_id: str, limit: int = 5) -> List[Engram]:
        eg = self.store.get(engram_id)
        if not eg:
            return []
        out = []
        for lid in eg.links:
            linked = self.store.get(str(lid))
            if linked:
                out.append(linked)
        return sorted(out, key=lambda e: e.compute_confidence(), reverse=True)[:limit]

    def recent(self, limit: int = 10) -> List[Engram]:
        return self.store.get_all(limit=limit)

    def stats(self) -> Dict[str, Any]:
        return self.store.stats()
253
src/store.py
Normal file
@@ -0,0 +1,253 @@
"""
SQLite-based engram store.
No external dependencies beyond sqlite3 (stdlib).
"""

import json
import sqlite3
from pathlib import Path
from typing import List, Optional, Dict, Any

from .engram import Engram


class EngramStore:
    """
    Persistent engram store with a full-text index.

    Create an instance:
        store = EngramStore("/path/to/db.sqlite")
    """

    def __init__(self, db_path: str):
        self.db_path = Path(db_path)
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = sqlite3.connect(str(self.db_path), check_same_thread=False)
        self._conn.row_factory = sqlite3.Row
        self._init_schema()

    def _init_schema(self) -> None:
        """Creates the tables if they do not exist yet."""
        self._conn.executescript("""
            CREATE TABLE IF NOT EXISTS engrams (
                id TEXT PRIMARY KEY,
                content TEXT NOT NULL,
                metadata_json TEXT NOT NULL,
                correctness_json TEXT NOT NULL,
                links_json TEXT NOT NULL,
                hierarchy_json TEXT NOT NULL,
                embedding_json TEXT,
                created_at TEXT NOT NULL,
                modified_at TEXT NOT NULL
            );

            CREATE VIRTUAL TABLE IF NOT EXISTS engrams_fts USING fts5(
                content,
                tags,
                source,
                content_rowid=rowid,
                tokenize='porter'
            );

            CREATE TABLE IF NOT EXISTS engrams_links (
                from_id TEXT NOT NULL,
                to_id TEXT NOT NULL,
                PRIMARY KEY (from_id, to_id)
            );
        """)
        self._conn.commit()

    # ---- CRUD ----

    def save(self, engram: Engram) -> Engram:
        """Stores or updates an engram."""
        now = _now()
        data = {
            "id": str(engram.id),
            "content": engram.content,
            "metadata_json": json.dumps(engram.metadata, ensure_ascii=False),
            "correctness_json": json.dumps(engram.correctness.to_dict(), ensure_ascii=False),
            "links_json": json.dumps([str(link) for link in engram.links], ensure_ascii=False),
            "hierarchy_json": json.dumps(engram.hierarchy, ensure_ascii=False),
            "embedding_json": json.dumps(engram.embedding, ensure_ascii=False) if engram.embedding else None,
            "created_at": engram.metadata.get("created", now),
            "modified_at": now,
        }
        self._conn.execute("""
            INSERT INTO engrams (id, content, metadata_json, correctness_json, links_json, hierarchy_json, embedding_json, created_at, modified_at)
            VALUES (:id, :content, :metadata_json, :correctness_json, :links_json, :hierarchy_json, :embedding_json, :created_at, :modified_at)
            ON CONFLICT(id) DO UPDATE SET
                content=excluded.content,
                metadata_json=excluded.metadata_json,
                correctness_json=excluded.correctness_json,
                links_json=excluded.links_json,
                hierarchy_json=excluded.hierarchy_json,
                embedding_json=excluded.embedding_json,
                modified_at=excluded.modified_at
        """, data)

        # Refresh the FTS index (DELETE + INSERT, virtual tables have no UPSERT)
        tags = " ".join(engram.metadata.get("tags", []))
        source = engram.metadata.get("source", "")
        rowid = self._conn.execute("SELECT rowid FROM engrams WHERE id=?", (str(engram.id),)).fetchone()
        if rowid:
            self._conn.execute("DELETE FROM engrams_fts WHERE rowid=?", (rowid[0],))
        self._conn.execute("""
            INSERT INTO engrams_fts(rowid, content, tags, source)
            VALUES ((SELECT rowid FROM engrams WHERE id=:id), :content, :tags, :source)
        """, {"id": str(engram.id), "content": engram.content, "tags": tags, "source": source})

        # Store the links
        self._conn.execute("DELETE FROM engrams_links WHERE from_id=?", (str(engram.id),))
        for link in engram.links:
            self._conn.execute(
                "INSERT OR IGNORE INTO engrams_links (from_id, to_id) VALUES (?, ?)",
                (str(engram.id), str(link))
            )

        self._conn.commit()
        return engram

    def get(self, engram_id: str) -> Optional[Engram]:
        """Loads an engram by its ID."""
        row = self._conn.execute(
            "SELECT * FROM engrams WHERE id=?", (engram_id,)
        ).fetchone()
        if not row:
            return None
        return self._row_to_engram(row)

    def get_all(self, limit: int = 1000, offset: int = 0) -> List[Engram]:
        """Loads all engrams (paginated), newest first."""
        rows = self._conn.execute(
            "SELECT * FROM engrams ORDER BY created_at DESC LIMIT ? OFFSET ?",
            (limit, offset)
        ).fetchall()
        return [self._row_to_engram(r) for r in rows]

    def delete(self, engram_id: str) -> bool:
        """Deletes an engram and all of its links."""
        rowid = self._conn.execute(
            "SELECT rowid FROM engrams WHERE id=?", (engram_id,)
        ).fetchone()
        if not rowid:
            return False
        self._conn.execute("DELETE FROM engrams_fts WHERE rowid=?", (rowid[0],))
        self._conn.execute("DELETE FROM engrams_links WHERE from_id=? OR to_id=?", (engram_id, engram_id))
        self._conn.execute("DELETE FROM engrams WHERE id=?", (engram_id,))
        self._conn.commit()
        return True

    def count(self) -> int:
        """Number of stored engrams."""
        row = self._conn.execute("SELECT COUNT(*) FROM engrams").fetchone()
        return row[0] if row else 0

    # ---- Search ----

    def search_text(self, query: str, limit: int = 10) -> List[Engram]:
        """Full-text search over engram content via SQLite FTS5 (terms joined with OR)."""
        # Quote every term as an FTS5 string so that punctuation or operator
        # words in user input cannot break the MATCH syntax.
        words = ['"' + w.replace('"', '""') + '"' for w in query.split()]
        if not words:
            # A bare "*" is not a valid FTS5 query, so an empty query matches nothing.
            return []
        safe_query = " OR ".join(words)
        sql = """
            SELECT e.* FROM engrams e
            JOIN engrams_fts fts ON e.rowid = fts.rowid
            WHERE engrams_fts MATCH ?
            ORDER BY rank
            LIMIT ?
        """
        rows = self._conn.execute(sql, (safe_query, limit)).fetchall()
        return [self._row_to_engram(r) for r in rows]

    def search_tag(self, tag: str, limit: int = 50) -> List[Engram]:
        """Search by tag (JSON contains)."""
        # Simple substring search in the JSON; it also matches other string
        # values in the metadata that equal the tag.
        rows = self._conn.execute(
            "SELECT * FROM engrams WHERE metadata_json LIKE ? ORDER BY created_at DESC LIMIT ?",
            (f'%"{tag}"%', limit)
        ).fetchall()
        return [self._row_to_engram(r) for r in rows]

    def search_source(self, source: str, limit: int = 50) -> List[Engram]:
        """Search by source."""
        rows = self._conn.execute(
            "SELECT * FROM engrams WHERE metadata_json LIKE ? ORDER BY created_at DESC LIMIT ?",
            (f'%"source": "{source}"%', limit)
        ).fetchall()
        return [self._row_to_engram(r) for r in rows]

    # ---- Stats ----

    def stats(self) -> Dict[str, Any]:
        """Basic statistics about the store."""
        total = self.count()
        # Relies on json.dumps' default separators ('"confirmed": true').
        confirmed = self._conn.execute(
            "SELECT COUNT(*) FROM engrams WHERE correctness_json LIKE '%\"confirmed\": true%'"
        ).fetchone()[0]
        sources = {}
        for row in self._conn.execute(
            "SELECT metadata_json FROM engrams"
        ).fetchall():
            meta = json.loads(row["metadata_json"])
            src = meta.get("source", "unknown")
            sources[src] = sources.get(src, 0) + 1

        return {
            "total_engrams": total,
            "confirmed": confirmed,
            "unconfirmed": total - confirmed,
            "sources": sources,
            "db_size_bytes": self.db_path.stat().st_size if self.db_path.exists() else 0,
        }

    # ---- Backup / Export ----

    def export_jsonl(self, path: str) -> int:
        """Exports all engrams as JSONL (one line per engram)."""
        count = 0
        with open(path, "w", encoding="utf-8") as f:
            for row in self._conn.execute("SELECT * FROM engrams"):
                eg = self._row_to_engram(row)
                f.write(json.dumps(eg.to_dict(), ensure_ascii=False) + "\n")
                count += 1
        return count

    def import_jsonl(self, path: str) -> int:
        """Imports engrams from JSONL."""
        count = 0
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                eg = Engram.from_json(line)
                self.save(eg)
                count += 1
        return count

    # ---- Helpers ----

    def _row_to_engram(self, row: sqlite3.Row) -> Engram:
        d = {
            "id": row["id"],
            "content": row["content"],
            "metadata": json.loads(row["metadata_json"]),
            "correctness": json.loads(row["correctness_json"]),
            "links": json.loads(row["links_json"]),
            "hierarchy": json.loads(row["hierarchy_json"]),
        }
        emb = row["embedding_json"]
        if emb:
            d["embedding"] = json.loads(emb)
        return Engram.from_dict(d)

    def close(self) -> None:
        self._conn.close()


def _now() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat()