# Brain in the Fish

Score any document. Prove every claim.
## What It Does

Give it a document. Get back a score, a knowledge graph, and proof.

- Input: a tender response, essay, policy document, or clinical report
- Output: score + OWL ontology + verdict (CONFIRMED / FLAGGED / REJECTED)
Every claim the system makes about your document is backed by an exact quote from the text. If the quote doesn't exist in the document, the structural score drops and the gate flags it.
Source quote verification built into the pipeline. Each node's quote is checked against the original document. Unverified quotes penalise the structural score — the system catches its own sloppy decomposition.
## Who It's For
| Domain | What BITF catches |
|---|---|
| Tender evaluation | Claims without case studies, missing KPIs, unsupported methodology |
| Essay grading | Fluent writing with no argument, fabricated citations, circular reasoning |
| Policy review | Buzzword boilerplate vs evidence-backed proposals |
| Clinical reports | Missing guideline references, vague assessments without measurements |
If you need to prove a score is fair, BITF gives you the audit trail.
## One Example
Input — an essay that sounds impressive but says nothing:
> "In the grand tapestry of contemporary discourse, one finds oneself inexorably drawn to the contemplation of matters that, by their very nature, resist facile categorisation..."
Raw LLM scores it 6.9/12 — "demonstrates sophisticated vocabulary."
BITF decomposes it into an ontology, finds 4 bare claims and 0 evidence, and rejects:
```
Ontology: 4 nodes, all claims, 0 evidence, 0% connected
Verdict: REJECTED — score has no evidentiary support

arg:node_1 [Claim] 0.10 "No subject, no position, no evidence"
└─ source: "In the grand tapestry of contemporary discourse..."
arg:node_2 [Claim] 0.10 "Continues without substance"
└─ source: "The eloquence with which modern thinkers..."
```
The LLM was fooled by fluency. The ontology proved there was nothing there.
Run `brain-in-the-fish demo` to see all three verdicts (REJECTED, CONFIRMED, FLAGGED).
## The BITF Badge
Documents evaluated by BITF can display a verification badge. The badge means:
- **BITF Verified** (green): The LLM score is consistent with the structural evidence. Source quotes verified against the document. Gate confirmed.
- **BITF Flagged** (yellow): Score diverges from evidence. Some claims may lack support. Requires review.
- **BITF Rejected** (red): Insufficient evidence to verify claims. Score withheld.
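The tier-to-badge mapping can be sketched in Rust. Only the "verified" badge URL is confirmed by this README's CLI output; the "flagged" and "rejected" URLs below are assumptions that follow the same shields.io pattern.

```rust
// Hypothetical verdict-to-badge mapping. The CONFIRMED URL matches the CLI
// output shown in this README; the other two are assumed, not documented.
fn badge_url(verdict: &str) -> &'static str {
    match verdict {
        "CONFIRMED" => "https://img.shields.io/badge/BITF-verified-brightgreen",
        "FLAGGED" => "https://img.shields.io/badge/BITF-flagged-yellow",
        _ => "https://img.shields.io/badge/BITF-rejected-red",
    }
}
```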
### How to get your badge
```bash
# Evaluate your document
brain-in-the-fish evaluate your-document.pdf --intent "assess quality" --badge

# Output includes:
# verdict: CONFIRMED
# badge: https://img.shields.io/badge/BITF-verified-brightgreen
# report: evaluation-report.md
# ontology: your-document.ttl
```
Add to your document or repo:

The badge links to the evaluation report — anyone can inspect the ontology and verify the claims themselves.
### This README is BITF verified
We ran the pipeline on this document. 15 claims extracted, all verified against experiment data. 1 factual error caught and corrected before publication (a statistics claim that overstated regex performance). The system caught a real mistake in its own documentation.
## How It Works
Three layers, three jobs:
```mermaid
graph LR
  A[Document] --> B[LLM Decomposes]
  B --> C[OWL Knowledge Graph]
  C --> D[SPARQL Rules<br/>mine facts]
  C --> E[Structural Metrics<br/>density, depth, connectivity]
  D --> F[Gate]
  E --> F
  B -->|holistic score| F
  F --> G{Verdict}
  G -->|evidence matches| H[✅ CONFIRMED]
  G -->|score exceeds evidence| I[⚠️ FLAGGED]
  G -->|no evidence| J[❌ REJECTED]
```
1. LLM decomposes the document into an OWL knowledge graph. Every claim becomes a typed node with an exact source quote.
```turtle
arg:thesis_1 a arg:Thesis ;
    arg:hasText "Voting should be compulsory." .

arg:ev_1 a arg:QuantifiedEvidence ;
    arg:hasText "Australia's mandatory voting, enacted in 1924, consistently yields 90%+ turnout" .

arg:ev_1 arg:supports arg:thesis_1 .
```
2. open-ontologies verifies the ontology. SPARQL extracts structural metrics (density, evidence ratio, connectivity, depth), and 8 SPARQL rules mine derived facts:
```sparql
# A claim with 2+ supporting pieces of evidence is Strong
INSERT { ?claim a arg:StrongClaim }
WHERE {
    ?claim a arg:SubClaim .
    ?ev1 arg:supports ?claim . ?ev1 a arg:Evidence .
    ?ev2 arg:supports ?claim . ?ev2 a arg:Evidence .
    FILTER(?ev1 != ?ev2)
}
```
Rules derive: StrongClaim, UnsupportedClaim, SophisticatedArgument, DeepChain, and more. All weights are learned from data — no hardcoded thresholds.
3. Gate checks consistency between the LLM's score and the structural evidence:
```
tolerance = gate_a × ln(nodes + 1) + gate_b
```
Fewer nodes = tighter tolerance. Low-quality evidence = even tighter. The gate is strictest when evidence is weakest.
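As a minimal sketch in Rust, assuming the gate compares the absolute difference between the two scores against the tolerance (the coefficient values below are placeholders; the real weights are learned from data, and the exact REJECTED condition is an assumption):

```rust
// Tolerance formula from above: gate_a * ln(nodes + 1) + gate_b.
fn tolerance(nodes: usize, gate_a: f64, gate_b: f64) -> f64 {
    gate_a * ((nodes as f64) + 1.0).ln() + gate_b
}

// Sketch of the gate: confirm when the LLM score stays within tolerance
// of the structural score, flag when it diverges, reject when there is
// no structural evidence at all (assumed rule, for illustration).
fn gate_verdict(llm_score: f64, structural_score: f64, nodes: usize) -> &'static str {
    if structural_score == 0.0 {
        return "REJECTED";
    }
    let tol = tolerance(nodes, 0.5, 0.2); // placeholder weights
    if (llm_score - structural_score).abs() <= tol {
        "CONFIRMED"
    } else {
        "FLAGGED"
    }
}
```

Because the tolerance grows with `ln(nodes + 1)`, small ontologies get a tight band and large, well-evidenced ones get more slack, which matches the "strictest when evidence is weakest" behaviour described above.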
## Benchmarks

### How source verification works
The pipeline checks every node's source quote against the original document. Quotes that don't exist in the document are flagged as unverified. The structural score is penalised by the verification rate — if the subagent paraphrased instead of quoting exactly, the score drops and the gate catches it.
| What happens | Effect |
|---|---|
| All quotes verified | Structural score unchanged |
| 50% verified | Structural score halved → gate likely flags |
| 0% verified | Structural score = 0 → gate rejects |
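The penalty in the table above can be sketched as a small Rust function. This is illustrative, not the actual implementation: it assumes "verified" means the quote appears verbatim in the document, and that the structural score is scaled linearly by the verification rate.

```rust
// Sketch of the verification penalty: scale the structural score by the
// fraction of node quotes found verbatim in the source document.
fn verified_structural_score(document: &str, quotes: &[&str], raw_score: f64) -> f64 {
    if quotes.is_empty() {
        return 0.0; // nothing to verify -> no structural support
    }
    let verified = quotes.iter().filter(|q| document.contains(**q)).count();
    let rate = verified as f64 / quotes.len() as f64;
    raw_score * rate // all verified: unchanged; half: halved; none: zero
}
```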
Self-test result: Running BITF on its own README produced 47% verification rate (29/62 quotes found). The gate flagged it — the decomposition paraphrased instead of quoting. This is the system catching its own sloppy work.
### Scoring accuracy
LLM holistic scores on 200 ASAP essays (blind, Set 1) vs expert scores:
| Metric | Value |
|---|---|
| Topology-only Pearson r | 0.510 (node count as single feature) |
| LLM holistic Pearson r | ~0.75 (estimated from subagent scoring) |
The topology score's job is not to beat the LLM — it's to verify the LLM. When they agree, the gate confirms. When they diverge, the gate flags.
Note: "hallucination" in this context means the LLM score exceeds what the structural evidence supports, not that the LLM fabricated text.
## Case Study: Catching Fabricated Evidence in Tenders
Tender responses often contain specific-sounding claims that are hard to verify: project references, certifications, named staff, statistics. A raw LLM scores them highly because they look like strong evidence.
We tested 7 documents with fabricated evidence — fake frameworks ("TrustFrame™"), invented project references ("NHS-2024-AI-0891"), fabricated academic citations, and fictional staff CVs using real employer names (DeepMind, Google Brain).
Raw LLM scored them 7.6/10 — completely fooled by specific-sounding lies.
BITF decomposed each claim and checked verifiability:
```
Document: fab_04 (fabricated staff CVs)

arg:staff_1 [Evidence] "Dr Maria Santos, PhD Cambridge 2018, former DeepMind"
  → Web search: "Maria Santos DeepMind Cambridge" → 0 relevant results
  → Status: UNVERIFIABLE — person appears fabricated

arg:staff_2 [Evidence] "James Chen, ex-Google Brain, built Revolut fraud detection"
  → Web search: "James Chen Google Brain Revolut" → Revolut credits Dmitri Lihhatsov
  → Status: CONTRADICTED — different person built this system

arg:staff_3 [Evidence] "Dr Aisha Patel, test lead GOV.UK Pay"
  → Web search: "Aisha Patel GOV.UK Pay" → 0 relevant results
  → Status: UNVERIFIABLE

Verifiable claims: 0/7
BITF score: 0.5/10 (vs Raw LLM: 8.5/10)
```
Results across all 7 fabricated documents:
| Approach | Average score | Fooled? |
|---|---|---|
| Raw LLM | 7.6/10 | Yes — 7/7 scored above 6.5 |
| BITF (knowledge check) | 2.1/10 | No — flagged suspicious claims |
| BITF + web verification | 2.1/10 + 6/36 claims verified | No — external confirmation |
Web verification adds real-time search for each claim. Out of 36 specific claims across 7 fabricated documents, only 6 could be verified (ISO standards and government frameworks that actually exist). The rest were invented, unverifiable, or contradicted by public records.
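A minimal sketch of how those per-claim outcomes roll up into a figure like "6/36 claims verified". The enum mirrors the three statuses in the case study; the types and function are illustrative, not the project's API.

```rust
// Per-claim outcomes of web verification, as in the case study above.
#[derive(Clone, Copy, PartialEq, Eq)]
enum VerificationStatus {
    Verified,
    Unverifiable,
    Contradicted,
}

// Returns (verified, total) — the shape of the "6/36 claims verified" figure.
fn verification_summary(statuses: &[VerificationStatus]) -> (usize, usize) {
    let ok = statuses
        .iter()
        .filter(|s| **s == VerificationStatus::Verified)
        .count();
    (ok, statuses.len())
}
```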
### Using web verification
```bash
# Default: decompose + knowledge check (fast, no web)
brain-in-the-fish evaluate tender.pdf --intent "assess methodology"

# With web verification (slower, checks each claim)
brain-in-the-fish evaluate tender.pdf --intent "assess methodology" --verify
```
Each claim gets tagged in the ontology:
```turtle
arg:claim_1 arg:verificationStatus "verified" .
arg:claim_1 arg:verificationSource "https://www.iso.org/standard/81230.html" .

arg:claim_2 arg:verificationStatus "unverifiable" .
arg:claim_2 arg:searchQuery "TrustFrame methodology framework" .
arg:claim_2 arg:searchResults "0 relevant results" .
```
## More Case Studies
- Fabrication Detection in Tenders — Raw LLM scores fake evidence 7.6/10, BITF catches it at 2.1/10
- Prompt Firewall — Dual-layer injection defense with OWL attack ontology (26 classes, 314 patterns, 8 languages)
## What Didn't Work
We tried everything. Here's what we learned:
| Approach | What happened |
|---|---|
| Ontology as scorer (replacing LLM) | Pearson 0.56 max — structure captures ~25% of quality |
| Regex extraction | Found ~20% of what LLM finds |
| More features (30 instead of 14) | Overfitting — made things worse |
| Model stacking | Collapsed at N=100 |
The insight: The ontology's job isn't to score — it's to decompose and verify. The LLM scores. The ontology proves. The gate checks.
## Quick Start
```bash
git clone https://github.com/fabio-rovai/open-ontologies.git
git clone https://github.com/fabio-rovai/brain-in-the-fish.git
cd brain-in-the-fish
cargo build --release

# See it work — 3 examples with verdicts
brain-in-the-fish demo

# Evaluate a document
brain-in-the-fish evaluate document.pdf --intent "assess quality"

# As MCP server (Claude orchestrates)
brain-in-the-fish serve
```
## MCP Server Config
```json
{
  "mcpServers": {
    "brain-in-the-fish": {
      "command": "/path/to/brain-in-the-fish-mcp"
    }
  }
}
```
No API keys needed. Claude acts as the subagent via MCP — reads the document, builds the ontology, calls the scorer tools. Everything runs locally.
## Built With
- open-ontologies — OWL knowledge graph engine (GraphStore, Reasoner, SPARQL, AlignmentEngine)
- Rust — deterministic scoring, structural analysis, gate logic
- ARIA Safeguarded AI — gatekeeper architecture: don't make the LLM deterministic, make the verification deterministic
## License
MIT