Senior Software Engineer, AI Evaluation & Benchmarks (Python)

Miami

$166.4k-$208k / year

contract · senior · Remote · ai-ml

💼 About This Role

You'll design and build coding benchmarks and evaluation pipelines to test frontier AI models on real software engineering work. Your work will directly shape how model coding ability is measured and improved.

🎯 What You'll Do

  • Design coding benchmarks for frontier AI models on real-world programming tasks.
  • Build and maintain scalable data pipelines for evaluation workflows.
  • Analyze model-generated code for correctness, reliability, and edge-case failures.
  • Construct structured evaluation scenarios across large repos and multi-language environments.

📋 Requirements

  • 4+ years of professional software engineering experience.
  • Expert Python skills with clean, performant, well-tested code.
  • Hands-on experience in large, complex codebases.
  • Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines.

✨ Nice to Have

  • Senior or Lead-level profile with a history of technical ownership.
  • Proficiency in additional languages: JavaScript, Go, C++.
  • CI/CD experience and robust unit testing (pytest, Mocha, JUnit).

🎁 Benefits & Perks

  • 🌍 Fully remote: work from anywhere in accepted locations.
  • 💵 Competitive hourly rate of $80-$100/hr, based on location and seniority.
  • 📆 Weekly payments via PayPal or Stripe.
  • 🔁 Potential extension beyond the initial 3-month contract.

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Application review · 1-2 weeks
  2. Technical interview · 1 hour
  3. Offer · 1 week

🚩 Heads Up

  • Contract role with variable hours; not suitable as sole income.
  • Requires identity verification and proof of valid work documentation.
  • No visa sponsorship; incompatible with F-1 OPT or STEM OPT.