22h ago

Research Engineer - Data

Menlo Park

$350k-$400k / year

full-time · ai-ml · Visa Sponsor

💼 About This Role

You'll build and drive the data foundation for our research efforts, owning data strategy end-to-end from sourcing datasets to integrating experimental data into the training stack. You'll work closely with researchers to understand model needs and build pipelines to get the right data in the right shape. This role sits at the intersection of data engineering, research infrastructure, and strategy.

🎯 What You'll Do

  • Own data strategy across the training stack
  • Source, evaluate, and procure external datasets
  • Build and maintain robust data ingestion pipelines
  • Design data quality systems for deduplication and filtering
  • Integrate experimental data into the training stack

📋 Requirements

  • Large-scale data pipelines for LLM pretraining or midtraining
  • Data quality techniques like MinHash, perplexity filtering, classifier scoring
  • Scientific data formats (papers, patents, databases, lab exports)
  • Distributed processing with Spark, Ray, or Dask at petabyte scale
  • Python engineering in a production research environment

✨ Nice to Have

  • Scientific dataset curation for domain-adaptive continued pretraining
  • Synthetic data generation methods and pipelines
  • Physical science background (chemistry, physics, materials)
  • Multimodal data integration (text, numerical, molecular, spectral)

🎁 Benefits & Perks

  • 🚀 Flexible location (Menlo Park or San Francisco preferred)
  • 💼 Visa sponsorship available
  • 💰 Competitive base salary $350k-$400k
  • 🔬 Cutting-edge AI research environment
  • 🌟 World-class team and investors

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Team Interview · 60 min
  4. Final Round · 60 min