screen-voice-agent
Health: Warn
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push today
- Low visibility — Only 8 GitHub stars

Code: Pass
- Code scan — Scanned 12 files during light audit; no dangerous patterns found

Permissions: Pass
- Permissions — No dangerous permissions requested
This desktop AI agent continuously monitors your screen and system audio to help you learn languages through real-time voice interaction. Its distinguishing feature is a self-modifying architecture: the agent writes new tools, asks for user approval, and hot-loads them on the fly without requiring a restart.
Security Assessment
Risk Rating: High. While the automated code scan found no dangerous patterns or hardcoded secrets, the tool's core functionality requires access to highly sensitive data: it captures continuous screenshots and records system audio in the background, sending both to external APIs (OpenAI). Its most significant security risk, however, is its self-modifying capability; the agent is designed to write code to your local disk and execute it after user approval. Even with an approval gate, this drastically increases the attack surface for arbitrary code execution if a prompt injection succeeds or a user approves hastily.
Quality Assessment
The project is very new and lacks community validation, as the low-visibility warning (only 8 GitHub stars) shows. On the positive side, it appears to be actively maintained, with the most recent push landing today. However, the repository fails a basic health check: it contains no license file, despite displaying an MIT License badge in its documentation.
Verdict
Use with caution — the tool legitimately requires deep system access for its features, but its newness, unlicensed status, and self-modifying code execution present substantial security risks.
A desktop AI agent that sees your screen, hears your audio, teaches you languages by voice, and extends itself with new tools on demand.
Samuel — Your Always-On AI Assistant That Lives on Your Desktop
A voice-first AI agent that watches your screen, listens to your audio, learns your preferences, and can teach itself new skills at runtime — no rebuild required.
TL;DR: Say "Hey Samuel" and talk. He sees your screen, hears your audio, remembers everything, and writes his own tools when he needs new capabilities.
See It In Action
Samuel interprets Japanese news in real time — watching the screen and listening to audio simultaneously:
https://github.com/user-attachments/assets/36fdd220-e1af-443a-99d3-31803160625c
Ambient teaching while watching anime — vocab cards, scene clip flashcards, and voice explanations:
https://github.com/user-attachments/assets/65314d07-694d-47c5-8209-24e5bdbdf55c
https://github.com/user-attachments/assets/338f8194-49e6-496d-b218-715af4afa1ee
What Makes Samuel Different
Self-Modifying — Writes Its Own Tools at Runtime
Most AI agents have a fixed tool set. Samuel doesn't.
You: "Hey Samuel, add a weather tool"
Samuel: "I'll create a tool that fetches weather from wttr.in. [Approve] [Reject]"
You: *clicks Approve*
Samuel: *generates code → writes to disk → hot-loads into live session*
Samuel: "Done. What's the weather in Tokyo?"
No rebuild. No restart. The new tool is live in the same voice conversation. If a plugin breaks, Samuel reads the error, proposes a fix, and rewrites it — with your approval.
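A minimal sketch of what an approval-gated hot-load can look like, assuming plugins are plain JS source that defines a `run` function; `PluginProposal`, `liveTools`, and `approveAndLoad` are illustrative names, not Samuel's actual internals:

```typescript
// Minimal sketch of an approval-gated hot-load. Assumes a plugin is plain JS
// source that defines a `run` function; these names are illustrative.
interface PluginProposal {
  name: string;        // e.g. "get_weather"
  description: string; // shown to the user in the [Approve] [Reject] dialog
  source: string;      // model-generated JS for the tool body
}

const liveTools = new Map<string, (args: unknown) => Promise<unknown>>();

async function approveAndLoad(proposal: PluginProposal, approved: boolean) {
  if (!approved) return; // the approval gate is the security boundary

  // new Function() compiles the generated source in the current JS context
  // (no OS sandbox), which is why every load requires explicit approval.
  const factory = new Function(`${proposal.source}; return run;`);
  const run = factory() as (args: unknown) => Promise<unknown>;

  // Registering into the live tool map makes the new tool callable in the
  // same voice session, with no rebuild or restart.
  liveTools.set(proposal.name, run);
}
```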
Always Watching, Always Listening
Samuel runs a continuous perception loop in the background:
- Screen — captures via GPT-4o Vision every 20s with smart change detection
- Audio — transcribes system audio via ScreenCaptureKit with PID-level filtering (excludes his own voice)
- Context injection — feeds observations silently into the conversation so he always knows what's happening
Ask "what did they just say?" or "what's on my screen?" at any point — he already knows.
Remembers Everything
Three types of persistent local memory:
| Type | Example | Effect |
|---|---|---|
| Preferences | "Be more concise" | Applied every session |
| Corrections | "That explanation was wrong" | Never repeated |
| Facts | "I'm intermediate at Japanese" | Adjusts behavior permanently |
Say "I already know that word" — permanently suppressed. Say "be more direct" — communication style changes from that session forward. All memory is local, auditable, and editable.
Voice-Controlled Everything
Samuel is his own settings panel. No menus, no preferences screen (a sketch of the underlying tool follows the table):
| You say | What happens |
|---|---|
| "Make yourself smaller" | Avatar shrinks |
| "Make the font bigger" | Speech bubble text grows |
| "Show me word cards while I watch" | Switches to auto vocab card mode |
| "Cards every 20 seconds" | Adjusts card frequency |
| "Only show cards when I ask" | Switches back to manual mode |
| "Hide the romaji" | Annotations hidden |
| "Reset the UI" | All visual settings restored |
Core Features
Recording Mode — Your AI Audience
Record any audio (meetings, lectures, videos) and ask Samuel anything about the transcript:
You: "Hey Samuel, start recording"
*attends a meeting*
You: "Stop recording"
Samuel: "Transcript ready. What would you like me to do with it?"
You: "Summarize the key decisions"
or "Find anything about pricing"
or "Did anyone say something incorrect about our API?"
or "Break down the Japanese grammar"
or "What were the action items?"
One recording. Any question. Samuel holds the full transcript and applies his reasoning to whatever you ask — no hardcoded analysis pipeline.
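That design can be as simple as keeping the raw transcript in context and forwarding whatever the user asks. A sketch against the standard Chat Completions endpoint (model choice and prompt wording are illustrative):

```typescript
// Illustrative sketch: the transcript is plain context, the question is a
// plain prompt, and any analysis the model can do, Samuel can do.
async function askAboutTranscript(transcript: string, question: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini", // illustrative model choice
      messages: [
        { role: "system", content: `Answer using only this transcript:\n${transcript}` },
        { role: "user", content: question },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```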
Song Teaching Mode
Drop a YouTube link into the envelope and Samuel becomes a music tutor:
- Downloads audio via `yt-dlp`, fetches synced lyrics from LRCLIB and falls back to Whisper transcription (see the lyrics-lookup sketch below)
- You say "play the first 3 lines" — original audio plays, mic auto-mutes
- Audio finishes → mic unmutes → Samuel explains the vocabulary and grammar
- Fully conversational — ask "what does that word mean?", "play it again", "skip to the chorus"
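The lyrics lookup can be sketched against LRCLIB's public `GET /api/get` endpoint; error handling is trimmed, and a `null` return stands in for the Whisper fallback:

```typescript
// Returns LRC-format synced lyrics, or null to signal the Whisper fallback.
async function fetchSyncedLyrics(artist: string, track: string): Promise<string | null> {
  const url = new URL("https://lrclib.net/api/get");
  url.searchParams.set("artist_name", artist);
  url.searchParams.set("track_name", track);

  const res = await fetch(url);
  if (!res.ok) return null; // 404: no match, fall back to Whisper

  const data = await res.json();
  // syncedLyrics looks like "[00:12.34] 歌詞..."; it can be null for
  // plain-lyrics-only entries, which also means falling back.
  return data.syncedLyrics ?? null;
}
```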
Teach Me From This — Drop Anything
Drop content into Samuel's envelope (the icon below his avatar):
- YouTube link → song teaching mode with audio playback + lyrics
- Article URL → extracts text, annotates vocabulary and grammar
- Image / manga → OCR + breakdown
- Raw text → immediate analysis
- API key → Samuel asks what it's for and stores it securely
Ambient Language Assistance
Set your learning language once ("I'm learning Japanese") and Samuel assists in the background — forever:
- Manual mode (default) — ask Samuel to explain any word; he shows a vocabulary card via `show_word_card`
- Auto mode — say "show me cards while I watch" and Samuel periodically reviews what he hears/sees, picking out interesting words based on your proficiency level
- Cross-language hints — say "tell me the Japanese for any English words you hear" and he does that too
- Frequency control — "cards every 30 seconds" / "less often" / "stop auto cards"
All driven by Samuel's own judgment, not rigid rules. He knows your level, what you've already learned, and what's worth highlighting.
Scene Clip Flashcards
When Samuel spots a word, a vocab card appears. Tap "Save it" — he saves the actual 20-second audio clip plus a screenshot. Flashcards aren't text — they're real scenes with the original voice actor's delivery.
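A plausible record for such a card (field names are guesses; the payload follows directly from the description above):

```typescript
// Assumed field names; the payload follows the description above.
interface SceneClipFlashcard {
  word: string;            // the vocabulary item Samuel spotted
  reading?: string;        // romaji/kana; hidden if you said "hide the romaji"
  meaning: string;
  audioClipPath: string;   // the saved 20-second system-audio clip
  screenshotPath: string;  // the frame captured when the word appeared
  capturedAt: string;      // ISO timestamp, handy for future SRS scheduling
}
```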
Architecture
"Hey Samuel" → Wake word → OpenAI Realtime API → 20+ tools → Voice response
↕
Screen capture (GPT-4o Vision, change detection, every 20s)
System audio (ScreenCaptureKit, PID-level filtering)
Ambient context → silent injection OR periodic Samuel review
Plugin system: propose → approve → generate → hot-load
Song playback: yt-dlp → local audio → HTML5 <audio> with seek
Recording: Whisper transcribe → raw transcript → user-directed analysis
Secrets store: ~/.samuel/secrets.json (local)
Personality memory: preferences + corrections + facts
Scene clip flashcards: audio + screenshot per word
Models
| Model | Purpose | Latency |
|---|---|---|
| OpenAI Realtime API | Voice conversation, all interactive features | ~500ms |
| GPT-4o Vision | Screen scanning, ambient observation | ~3–5s |
| GPT-4o-mini | Annotation, plugin code generation | ~1s |
| gpt-4o-transcribe | Recording transcription (high-fidelity) | ~3–10s |
| whisper-1 | Song segmentation with timestamps | ~3–5s |
Key Tools Samuel Has
| Tool | What it does |
|---|---|
| `observe_screen` | Captures and analyzes what's on screen |
| `start/stop_recording` | System audio capture + transcription |
| `teach_from_content` | Analyzes any dropped content for learning |
| `play_song_lines` / `pause_song` | Controls song audio playback |
| `show_word_card` | Displays a vocabulary card on demand |
| `set_card_mode` | Toggles manual/auto vocab card behavior |
| `remember_preference` | Stores persistent user preferences |
| `record_correction` | Stores behavioral corrections |
| `mark_vocabulary_known` | Permanently suppresses known words |
| `update_ui` | Changes visual settings by voice |
| `propose_plugin` / `write_plugin` | Self-modification pipeline |
| `store_secret` | Saves API keys for plugins |
| `pronounce` | Speaks correct pronunciation |
Tech Stack
| Layer | Technology |
|---|---|
| Desktop | Tauri v2 (Rust + WebView) |
| Frontend | React 19 + Vite + TypeScript |
| Voice | OpenAI Realtime API (WebRTC) |
| Agent Framework | @openai/agents |
| Vision | GPT-4o Vision |
| Plugin Runtime | `new Function()` + secrets injection |
| Song Audio | yt-dlp + HTML5 Audio |
| Lyrics | LRCLIB + YouTube oEmbed |
| Animation | Rive |
| Screen Capture | Peekaboo + macOS screencapture |
| Audio Capture | ScreenCaptureKit (Swift), PID-level filtering |
Quick Start
Prerequisites
- macOS 14+ (Sonoma or later)
- Node.js 20+ and Rust (rustup.rs)
- OpenAI API key with Realtime API + GPT-4o access
- yt-dlp (`brew install yt-dlp`) for song features
Install
```bash
brew install steipete/tap/peekaboo yt-dlp
git clone https://github.com/sambuild04/reading-ai-agent.git
cd reading-ai-agent
npm install

# Build the Swift audio-capture helper
swiftc -o src-tauri/helpers/record-audio src-tauri/helpers/record-audio.swift \
  -framework ScreenCaptureKit -framework AVFoundation -framework CoreMedia

# Store your OpenAI API key
echo '{"apiKey": "sk-..."}' > ~/.books-reader.json
```
Grant Screen Recording permission: System Settings → Privacy & Security → Screen Recording → add Peekaboo + Samuel.
```bash
npm run tauri:dev
```
Say "Hey Samuel" and start talking.
API Costs
| Mode | Approx. cost |
|---|---|
| Wake word (always listening) | ~$0.006/min |
| Ambient assistance (screen + audio) | ~$0.02–0.05/min |
| Auto card mode (Samuel review) | ~$0.01/review cycle |
| Plugin code generation | ~$0.001/plugin |
| Voice conversation | Standard Realtime API pricing |
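For scale: at the ambient rate of ~$0.02–0.05/min, an hour of continuous screen + audio assistance works out to roughly $1.20–$3.00, on top of standard Realtime API charges for any conversation during that hour.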
Limitations
- macOS only — depends on ScreenCaptureKit, Peekaboo, and macOS APIs
- Plugins are not OS-sandboxed — `new Function()` has full JS access; the approval flow is the current security boundary
- Dynamic plugins are JS only — new native capabilities (Swift/Rust) still require a rebuild
- LRCLIB coverage — not all songs have synced lyrics; Whisper transcription is the fallback
- Always-on costs — ambient mode runs continuously; costs accumulate while active
Roadmap
- Plugin marketplace — share and install community plugins
- General monitoring mode — "watch this meeting and flag errors" as a first-class feature
- SRS scheduling for scene flashcards (spaced repetition on real clips)
- Anki export
- OS-level sandboxing for dynamic plugins
- Local on-device wake word (zero API cost)
- Windows + Linux support
- iOS / Android companion app
Contributing
Issues and PRs welcome — especially for plugin ideas, new tool capabilities, and cross-platform support.
License
MIT
Built by Sam Feng