22h ago
Research Engineer - Data
Menlo Park
$350k-$400k / year
full-time · ai-ml · Visa Sponsor
About This Role
You'll build and drive the data foundation for our research efforts, owning data strategy end-to-end from sourcing datasets to integrating experimental data into the training stack. You'll work closely with researchers to understand model needs and build pipelines to get the right data in the right shape. This role sits at the intersection of data engineering, research infrastructure, and strategy.
What You'll Do
- Own data strategy across the training stack
- Source, evaluate, and procure external datasets
- Build and maintain robust data ingestion pipelines
- Design data quality systems for deduplication and filtering
- Integrate experimental data into the training stack
Requirements
- Large-scale data pipelines for LLM pretraining or midtraining
- Data quality techniques like MinHash, perplexity filtering, classifier scoring
- Scientific data formats (papers, patents, databases, lab exports)
- Distributed processing with Spark, Ray, or Dask at petabyte scale
- Python engineering in a production research environment
Nice to Have
- Scientific dataset curation for domain-adaptive continued pretraining
- Synthetic data generation methods and pipelines
- Physical science background (chemistry, physics, materials)
- Multimodal data integration (text, numerical, molecular, spectral)
Benefits & Perks
- Flexible location (Menlo Park or San Francisco preferred)
- Visa sponsorship available
- Competitive base salary $350k-$400k
- Cutting-edge AI research environment
- World-class team and investors
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
1. Recruiter Screen · 30 min
2. Technical Interview · 60 min
3. Team Interview · 60 min
4. Final Round · 60 min