Senior AI Engineer – Agentic Systems & LLM Evaluations

Location: Bay Area / Remote (with flexibility for global collaboration)

Type: Full-time

Why this role is exciting

Own the end‑to‑end intelligence layer at Peer, designing autonomous, tool‑using agents and building trustworthy evaluation pipelines for frontier‑scale language models. You’ll move faster than at a big tech lab while working on problems that are just as technically rich.

About Peer

Peer is a Bay‑Area‑founded, remote‑friendly GenAI platform accelerating pharma clinical‑research innovation. Our engineering team ships production‑grade AI features that shorten the path from idea to life‑saving trials.

The life sciences industry is at a major inflection point, with soaring content demands and growing pressure for speed and precision. Peer’s AI platform is already live with top-tier pharma customers and is delivering impact at the cutting edge of this shift. Our SaaS platform has been shown to deliver 55–94% efficiency gains while maintaining quality standards across protocols, clinical study reports (CSRs), investigational new drug (IND) applications, and safety narratives, serving a $15B addressable market for regulatory documentation services.

We’re uniquely positioned to lead in this space—and we’re backed by top-tier investors in AI and life sciences who share our vision. If you're excited to shape the future of drug development and work at the intersection of cutting-edge tech and human health, we’d love to meet you.

Our Vision: At Peer, we are using AI-powered solutions to clear the path for important scientific and medical discoveries, ensuring a brighter, healthier future for all.

Our Values:

  • Drive Impact: We focus on delivering real results for our users—making their work easier, better, and more impactful.

  • Be the Expert: We lead with deep expertise, curiosity, and honesty to guide others toward the best outcomes.

  • Go for Great: We take pride in pushing past “good enough” to deliver standout work in every detail.

  • Win as a Team: We succeed together—building trust, sharing ownership, and helping each other grow every day.

Key Responsibilities

  • Architect next‑generation autonomous agents that can plan, reason, and act with minimal oversight.

  • Build modular LLM pipelines and retrieval techniques to push the limits of domain‑specific reasoning.

  • Establish rigorous evaluation & benchmarking dashboards that track helpfulness, harmlessness, and robustness.

  • Correlate evaluation metrics with real‑world user satisfaction and product performance.

  • Blend human‑in‑the‑loop review with automated evaluation to accelerate iteration.

  • Lead red‑teaming & risk analysis for frontier‑scale models.

  • Mentor a growing engineering squad and foster a culture of research‑grade rigor plus startup velocity.

  • Interact directly with end-user medical writers and use their feedback and experience to improve the intelligence of the platform.

How We Work

  • Small, senior squads. Two-week build cycles, async daily stand-ups, and lightweight design docs keep us shipping fast without ceremony.

  • AI-assisted by default. Tools like Copilot, Claude Code, and Cursor power every pull request; we share impact metrics at sprint retros.

  • Trunk-based development & one-click deploys. Green tests merge to main; GitHub Actions auto-promotes to staging and production behind feature flags.

  • Observability everywhere. Real-time dashboards track latency, cost, and hallucination rates—engineers own what they ship.

  • Demo-driven culture. We have a 20-minute show-and-tell every week—demos over decks, learning over perfection.

Must‑Have Qualifications

  • 5+ years building production ML or distributed systems; 2+ years hands‑on with LLMs or NLP pipelines.

  • Proven record designing autonomous/agentic frameworks.

  • Fluency in evaluation tooling (OpenAI Evals, TruLens, or custom harnesses) and metrics design.

  • Ability to ship secure, observable services on AWS with CI/CD and IaC best practices.

Nice‑to‑Have

  • Experience red‑teaming or safety‑testing frontier models.

  • Contributions to open‑source agentic tooling or published papers on LLM evaluation.

  • Background in healthcare or other regulated domains.

  • Deep Python expertise; experience with Hugging Face and vector DBs (FAISS, Pinecone, Weaviate).

What We Offer

  • Competitive base compensation plus meaningful equity.

  • Remote‑first setup with quarterly on‑sites.

  • Comprehensive health coverage, 12 weeks paid parental leave, monthly home‑office stipends.

  • Explicit time to publish or contribute to open source.

  • A mission that directly accelerates life‑saving clinical research.

How to Apply

Send a résumé or LinkedIn profile to careers@getpeer.ai plus one short paragraph describing an agentic system or evaluation framework you’ve built and what you’d improve if you did it again. We review applications continuously and aim to complete the process (intro call → tech deep dive → team chat) within two weeks.


