Call for Research Engineer: Machine Learning, Scaling

TL;DR - Apply here to join CRFM as a Research Engineer working on training large-scale foundation models! Applications are due December 15th, 2021, and are reviewed on a rolling basis. This role supports remote work and comes with a competitive salary.

The Stanford Center for Research on Foundation Models (CRFM) is an interdisciplinary initiative, part of the Institute for Human-Centered AI (HAI), that aims to study and advance the responsible development of foundation models – large-scale self-supervised models that can be adapted to a wide range of downstream tasks (e.g., GPT-3, CLIP, DALL-E). One of the key activities of CRFM is building open-source, easy-to-use tools that enable the broader ML community to train and perform research on foundation models.

**We are seeking strong large-scale machine learning systems research engineers to join CRFM and aid in this effort. You will work closely with other members of the development team and have the opportunity to collaborate on various research projects.**

Critically, you will play a fundamental role in “opening up” these models, making training more robust and accessible, tackling bleeding-edge engineering and scientific problems along the way.

Job activities will include:

  • Scaling Mistral, CRFM’s training infrastructure, to support 10-200B parameter models.
  • Supporting novel, multimodal foundation model training (e.g., video and text).
  • Supporting interdisciplinary efforts to train models for applications in varied fields, e.g., law, medicine, and robotics.
  • Actively communicating with an international, interdisciplinary community working to facilitate foundation model training and research, including contacts at other academic institutions, industry labs, and open-source communities.
  • Participating in academic research and writing publications, as well as preparing blog posts and giving talks to diverse audiences.
  • Running experiments and debugging training infrastructure both on local, on-premises clusters and on GCP and Azure (see the illustrative sketch after this list).
  • Working closely with the other members of the development team: attending meetings, communicating results, reviewing code, etc.
  • Publicly releasing code and model artifacts.
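
To give a flavor of the distributed-training work described above, here is a minimal sketch of a multi-node, data-parallel PyTorch training loop launched with torchrun. This is purely illustrative and is not CRFM's Mistral code; the toy model, the rendezvous endpoint (node0:29500), and the launch flags are hypothetical stand-ins.

```python
# Illustrative sketch only - not CRFM's Mistral codebase.
# Hypothetical launch on 2 nodes x 8 GPUs (rendezvous endpoint is a placeholder):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#       --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a large transformer.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Scaling this basic pattern up to the 10-200B parameter models mentioned above is where the interesting engineering begins.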

We are looking for applicants with the following qualifications:

  • Bachelor’s degree (required) or Master’s/PhD degree (desired) in computer science or a related field, or equivalent experience.
  • Strong software engineering background with 3-5 years of industry experience in large-scale machine learning engineering and distributed systems (required).
  • Strong PyTorch, TensorFlow, or JAX experience (required; a history of open-source contributions to the PyTorch or TensorFlow ecosystem is a plus)
  • Experience with multi-node, distributed GPU workloads (desired)
    • Experience with Kubernetes and deploying Kubernetes on cloud platforms (desired)
    • Prometheus/Grafana or other monitoring experience (desired)
  • Experience with CUDA and GPU kernel programming for optimizing training (desired)

We especially encourage candidates from traditionally underrepresented backgrounds, such as BIPOC (Black, Indigenous, and people of color), women, and members of the LGBT+ communities, to apply. This position allows for remote working arrangements and comes with a competitive salary.

Apply here by December 15th, 2021!