# Role Deep Dive: Azure Data Scientist

---

## Role Overview

Azure Data Scientists build and deploy machine learning models on Azure. They experiment with algorithms, train models at scale, and operationalize ML solutions. They combine data science expertise with Azure ML platform knowledge.

**Alternative Titles:** ML Engineer, Applied Data Scientist, Cloud Data Scientist

**Typical Salary Range:** $110,000 – $175,000 (US)

---

## Core Responsibilities

### 1. Model Development & Experimentation (35% of role)
- Explore and prepare data
- Design feature engineering pipelines
- Train and evaluate ML models
- Track experiments with MLflow
- Perform hyperparameter tuning
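
A minimal sketch of random-sampling hyperparameter search, the idea a sweep job automates across many compute nodes. Everything here is an assumption made for illustration: the search space, the `toy_validation_loss` objective (a stand-in for a real training run), and the function names. It uses only the standard library and does not call Azure ML.

```python
import math
import random

# Hypothetical search space: each entry maps a hyperparameter name to a sampler.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),  # log-uniform draw
    "max_depth": lambda rng: rng.choice([3, 5, 7, 9]),
}


def toy_validation_loss(params):
    """Stand-in for 'train a model and return its validation loss'.

    Assumption: the loss is lowest at learning_rate=0.01 and max_depth=5.
    """
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    depth_term = ((params["max_depth"] - 5) / 2) ** 2
    return lr_term + depth_term


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations and return the best (params, loss) pair."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(toy_validation_loss, SEARCH_SPACE, n_trials=50)
print(best_params, round(best_loss, 4))
```

Grid sampling would enumerate every combination instead of drawing at random, and Bayesian sampling would use earlier results to choose the next configuration; both are left out of this sketch.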

**Granular Tasks:**
- Data preparation in Databricks notebooks (PySpark, pandas)
- Feature engineering: create derived features, handle missing values, encode categoricals, scale numericals
- Model training: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow
- Experiment tracking: log parameters, metrics, artifacts with MLflow
- Hyperparameter tuning: Azure ML SweepJob (random, grid, Bayesian sampling)
- Cross-validation: stratified k-fold for classification, time-series split for forecasting
- Model evaluation: accuracy, precision, recall, F1, AUC-ROC, RMSE, MAE
- Interpretability: SHAP values, feature importance, partial dependence plots
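
The two cross-validation schemes in the list above can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation: the function names are hypothetical, the stratified version deals samples round-robin without shuffling, and the time-series version drops any trailing samples that do not fill a whole fold.

```python
from collections import defaultdict


def stratified_kfold_indices(labels, k):
    """Assign each sample index to one of k folds, keeping class ratios similar.

    Indices of each class are dealt round-robin across folds, so every fold
    receives roughly len(class) / k samples of that class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return [sorted(fold) for fold in folds]


def time_series_splits(n_samples, n_splits):
    """Yield (train, test) index lists where training data is always earlier."""
    fold_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


labels = [0] * 8 + [1] * 4  # imbalanced toy labels: 8 negatives, 4 positives
folds = stratified_kfold_indices(labels, k=4)
splits = list(time_series_splits(n_samples=10, n_splits=4))

for fold in folds:
    print(fold, "positives:", sum(labels[i] for i in fold))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each stratified fold keeps the 2:1 class ratio of the full toy dataset, and each time-series split tests only on samples that come after everything it trained on, which avoids leaking future information into a forecasting model.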

### 2. Azure Machine Learning Platform (25% of role)
- Create and manage AML workspaces
- Configure compute (compute instances, clusters, attached compute)
- Manage data (datastores, datasets, data labeling)
- Register and version models
- Create and run pipelines
- Use AutoML for baseline models

**Granular Tasks:**
- AML Workspace: central resource for all ML artifacts
- Compute Instance: dev environment (Jupyter, VS Code integrated)
- Compute Cluster: auto-scaling training cluster (GPU/CPU), set min/max nodes
- Datastores: connect to Blob, ADLS, SQL, PostgreSQL
- Datasets (called data assets in SDK v2): versioned data references (Tabular and File in SDK v1; `uri_file`, `uri_folder`, `mltable` in SDK v2)
- Model registration: register trained model with version, description, tags
- AML Pipelines: orchestrate training workflow (data prep → train → evaluate → register)
- AutoML: automated algorithm selection + hyperparameter tuning + feature engineering
- Responsible AI: assess fairness, interpretability, error analysis
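
To make the "version, description, tags" idea behind model registration concrete, here is a toy in-memory registry. It is a hypothetical sketch, not the Azure ML API: `ToyModelRegistry`, `RegisteredModel`, and the model name are invented for illustration, and the only behavior it assumes is that each new registration under the same name gets the next integer version.

```python
from dataclasses import dataclass, field


@dataclass
class RegisteredModel:
    name: str
    version: int
    description: str = ""
    tags: dict = field(default_factory=dict)


class ToyModelRegistry:
    """Hypothetical in-memory registry; not the Azure ML API."""

    def __init__(self):
        self._models = {}  # name -> list of RegisteredModel, oldest first

    def register(self, name, description="", tags=None):
        """Store a new version; versions start at 1 and increase by 1 per name."""
        versions = self._models.setdefault(name, [])
        model = RegisteredModel(name, len(versions) + 1, description, dict(tags or {}))
        versions.append(model)
        return model

    def get(self, name, version=None):
        """Return a specific version, or the latest when version is None."""
        versions = self._models[name]
        if version is None:
            return versions[-1]
        return versions[version - 1]


registry = ToyModelRegistry()
registry.register("churn-classifier", "baseline logistic regression", {"stage": "dev"})
latest = registry.register("churn-classifier", "gradient boosting", {"stage": "candidate"})
print(latest.version, registry.get("churn-classifier", version=1).description)
```

Keeping every version addressable is what makes rollback and champion/challenger comparison possible later in the lifecycle.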

### 3. Model Deployment & Operationalization (20% of role)
- Deploy models as real-time endpoints
- Deploy models as batch endpoints
- Implement A/B testing and traffic splitting
- Monitor model performance and drift
- Implement MLOps practices
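
Traffic splitting can be pictured as weighted random routing between deployments. The sketch below simulates that locally under stated assumptions: the deployment names `blue` and `green`, the staged percentages, and both function names are hypothetical, and no endpoint is called.

```python
import random
from collections import Counter


def pick_deployment(traffic, rng):
    """Choose a deployment name with probability proportional to its weight."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(traffic, n_requests, seed=0):
    """Route n_requests and count how many each deployment received."""
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    rng = random.Random(seed)
    return Counter(pick_deployment(traffic, rng) for _ in range(n_requests))


# Hypothetical rollout: shift traffic from "blue" (v1) to "green" (v2) in stages.
rollout_stages = [
    {"blue": 90, "green": 10},
    {"blue": 50, "green": 50},
    {"blue": 0, "green": 100},
]
counts_per_stage = [simulate(stage, n_requests=10_000) for stage in rollout_stages]

for stage, counts in zip(rollout_stages, counts_per_stage):
    print(stage, dict(counts))
```

Rolling back amounts to restoring an earlier entry of `rollout_stages`, which is why the old deployment is kept running until the new one has taken all traffic without problems.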

**Granular Tasks:**
- **Real-time Endpoint (Managed Online Endpoint):**
  - Deploy model with scoring script (entry_script), environment (conda/Docker), compute
  - Blue/green deployment: deploy v2 alongside v1, shift traffic gradually
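  - Auto-scaling: Azure Monitor autoscale rules on metrics such as CPU utilization or request rate, or on a schedule
  - Authentication: key, Azure ML token, or Microsoft Entra ID (formerly Azure AD) token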

- **Batch Endpoint:**
  - For batch scoring: invoke with input dataset, output to datastore
  - Schedule with AML pipeline or Data Factory
  - Use cases: daily predictions, bulk scoring, large datasets

- **MLOps:**
  - CI: lint code, run unit tests, validate training script
  - CD: evaluate the trained model, register it if it improves on the current one, deploy
  - CT (Continuous Training): trigger retraining on data drift or schedule
  - Azure DevOps / GitHub Actions pipeline for ML lifecycle
  - Model registry: version, approve (manual gate for production)

- **Monitoring:**
  - Data drift: monitor input feature distribution vs training data
  - Prediction drift: monitor output distribution changes
  - Performance: latency, throughput, error rate
  - Retrain trigger: when drift exceeds threshold
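
One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI). The sketch below is an illustration with assumptions called out: it is not how Azure ML computes drift, the equal-width binning is one choice among several, and `DRIFT_THRESHOLD = 0.2` is an illustrative cutoff rather than a platform default.

```python
import math
import random


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bin edges are equal-width over the baseline's range; values outside that
    range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # baseline feature values
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # new data, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # new data, mean shifted

DRIFT_THRESHOLD = 0.2  # assumption: illustrative cutoff, tune per feature
needs_retrain = psi(training, shifted) > DRIFT_THRESHOLD
print(round(psi(training, same_dist), 4), round(psi(training, shifted), 4), needs_retrain)
```

The same function applied to model outputs instead of an input feature gives a simple prediction-drift signal.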

### 4. Responsible AI & Ethics (10% of role)
- Assess model fairness across demographic groups
- Explain model predictions
- Identify and mitigate bias
- Document model cards

**Granular Tasks:**
- Fairness assessment: compare model performance across groups (gender, race, age)
- SHAP explanations: show which features influenced each prediction
- Error analysis: identify subgroups with higher error rates
- Model cards: document intended use, limitations, performance, ethical considerations
- Privacy: differential privacy, data anonymization
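
A fairness assessment usually starts by computing the same metric per group and looking at the gap. The following is a minimal sketch using made-up labels, predictions, and group names; the function names are hypothetical, and libraries such as Fairlearn offer fuller versions of these metrics.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return per-group selection rate and recall (true positive rate)."""
    stats = defaultdict(lambda: {"n": 0, "selected": 0, "positives": 0, "true_pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["selected"] += pred
        s["positives"] += truth
        s["true_pos"] += truth * pred
    return {
        group: {
            "selection_rate": s["selected"] / s["n"],
            "recall": s["true_pos"] / s["positives"] if s["positives"] else float("nan"),
        }
        for group, s in stats.items()
    }


def demographic_parity_difference(metrics):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = [m["selection_rate"] for m in metrics.values()]
    return max(rates) - min(rates)


# Made-up binary labels and predictions for two hypothetical groups.
y_true = [1, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["A"] * 4 + ["B"] * 4

metrics = group_metrics(y_true, y_pred, groups)
print(metrics, demographic_parity_difference(metrics))
```

In this toy data the model finds every positive in group A but only half of those in group B, the kind of subgroup gap that error analysis is meant to surface.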

### 5. Advanced Analytics (10% of role)
- Build recommendation systems
- Implement NLP solutions
- Build computer vision models
- Work with Azure OpenAI for LLM-based solutions

**Granular Tasks:**
- Recommendations: collaborative filtering, content-based filtering, Azure AI Search, Azure AI Personalizer
- NLP: text classification, NER, sentiment analysis (Language Service), custom models
- Computer Vision: image classification, object detection (Custom Vision, Vision Studio)
- LLM: Azure OpenAI for text generation, summarization, RAG (Retrieval-Augmented Generation)
- RAG architecture: Azure OpenAI + AI Search (index documents) → ground LLM responses in your data
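
The retrieve-then-ground flow of RAG can be shown without any Azure service. In this sketch, term-count cosine similarity stands in for the AI Search index, and the assembled prompt is what would be sent to an Azure OpenAI chat model; the documents, the prompt wording, and all function names are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=2):
    """Return up to top_k documents that share terms with the query, best first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine_similarity(query_vec, Counter(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_grounded_prompt(query, context_docs):
    """Assemble the prompt that would be sent to the chat model."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


documents = [
    "Managed online endpoints support blue/green deployments with traffic splitting.",
    "Batch endpoints score large datasets and write results to a datastore.",
    "Compute clusters scale between a minimum and maximum number of nodes.",
]
query = "How do batch endpoints handle large datasets?"
prompt = build_grounded_prompt(query, retrieve(query, documents, top_k=1))
print(prompt)
```

A production setup would typically replace the term-count vectors with embeddings or hybrid search, but the shape of the flow (retrieve, assemble context, generate) stays the same.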

---

## Certification Path

| Certification | Level | Focus |
|---|---|---|
| **DP-900** | Foundational | Data fundamentals |
| **AI-900** | Foundational | AI fundamentals |
| **DP-100** | Associate | **Core cert** — Azure Data Scientist |
| **AI-102** | Associate | Azure AI Engineer (complement) |

### DP-100 Exam Breakdown (approximate weights; verify against the current official skills outline)
| Domain | Weight |
|---|---|
| Set up an Azure Machine Learning workspace | 5-10% |
| Manage data in Azure Machine Learning | 5-10% |
| Run experiments and train models | 20-25% |
| Optimize and manage models | 15-20% |
| Deploy and consume models | 25-30% |

---

## Interview Focus Areas

1. **Walk me through an ML project lifecycle on Azure.**
   → Define problem → Explore data (Databricks) → Feature engineering → Train models (AML + MLflow) → Evaluate → Register → Deploy (managed endpoint) → Monitor (drift) → Retrain

2. **How do you handle model drift?**
   → Monitor the input feature distribution and the prediction distribution against the training baseline. When drift exceeds a threshold, trigger the retraining pipeline. Compare the new model with the champion model on the same held-out test set. Deploy only if it improves.

3. **How do you deploy a model with zero downtime?**
   → Blue/green deployment on managed online endpoint. Deploy v2 alongside v1. Shift traffic gradually (10% → 50% → 100%). Rollback = route traffic back to v1.

4. **Explain MLOps.**
   → ML lifecycle automation: CI (test code), CD (deploy model), CT (continuous training). Pipeline: data change → retrain → evaluate → register → deploy. Model registry with approval gates.

5. **What is RAG and how do you implement it on Azure?**
   → Retrieval-Augmented Generation. Index documents in AI Search. When a user queries: retrieve relevant docs → pass them as context to Azure OpenAI → generate a grounded response. Grounding reduces, but does not eliminate, hallucinations.
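
For question 2, the "deploy if improved" step is often written as an explicit promotion gate. A minimal sketch, assuming both metrics were computed on the same held-out test set; the function name, the example scores, and the `min_improvement` margins are hypothetical.

```python
def should_promote(champion_metric, challenger_metric, min_improvement=0.01, higher_is_better=True):
    """Return True when the challenger beats the champion by at least min_improvement."""
    delta = challenger_metric - champion_metric
    if not higher_is_better:
        delta = -delta  # for error metrics such as RMSE, a drop is an improvement
    return delta >= min_improvement


# AUC-style metric: higher is better.
print(should_promote(champion_metric=0.85, challenger_metric=0.88))
print(should_promote(champion_metric=0.85, challenger_metric=0.855))

# RMSE-style metric: lower is better.
print(should_promote(12.0, 11.5, min_improvement=0.1, higher_is_better=False))
```

Requiring a margin instead of any improvement at all guards against promoting a model whose gain is within evaluation noise.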
