1d ago
Senior Software Engineer, AI Evaluation & Benchmarks (Python)
Miami
$166.4k-$208k / year
contract · senior · Remote · ai-ml
About This Role
You'll design and build coding benchmarks and evaluation pipelines to test frontier AI models on real software engineering work. Your work will directly shape how model coding ability is measured and improved.
What You'll Do
- Design coding benchmarks for frontier AI models on real-world programming tasks.
- Build and maintain scalable data pipelines for evaluation workflows.
- Analyze model-generated code for correctness, reliability, and edge-case failures.
- Construct structured evaluation scenarios across large repos and multi-language environments.
Requirements
- 4+ years of professional software engineering experience.
- Expert Python skills with clean, performant, well-tested code.
- Hands-on experience in large, complex codebases.
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines.
Nice to Have
- Senior or Lead-level profile with a history of technical ownership.
- Proficiency in additional languages: JavaScript, Go, C++.
- CI/CD experience and robust unit testing (pytest, Mocha, JUnit).
Benefits & Perks
- Fully remote: work from anywhere in accepted locations.
- Competitive hourly rate of $80-$100/hr, based on location and seniority.
- Weekly payments via PayPal or Stripe.
- Potential extension beyond the initial 3-month contract.
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Application review · 1-2 weeks
- 2. Technical interview · 1 hour
- 3. Offer · 1 week
Heads Up
- Contract role with variable hours; not suitable as sole income.
- Requires identity verification and proof of valid work documentation.
- No visa sponsorship; incompatible with F-1 OPT or STEM OPT.