DeepSeek R1 Distill: Complete Tutorial for Deployment & Fine-Tuning
Are you eager to explore the capabilities of the DeepSeek R1 Distill model? This guide provides a comprehensive, step-by-step approach to deploying the uncensored DeepSeek R1 Distill model to Google Cloud Run with GPU support, and also walks you through a practical fine-tuning process. The tutorial is broken down into the following sections:

- Environment Setup
- FastAPI Inference Server
- Docker Configuration
- Google Cloud Run Deployment
- Fine-Tuning Pipeline
Let’s dive in.
1. Environment Setup
Before deploying and fine-tuning, make sure you have the required tools installed and configured.
1.1 Install Required Tools
- Python 3.9+
- pip: For Python package installation
- Docker: For containerization
- Google Cloud CLI: For deployment
Install Google Cloud CLI (Ubuntu/Debian):
Follow the official Google Cloud CLI installation guide to install gcloud.
1.2 Authenticate with Google Cloud
Run the following commands to initialize and authenticate with Google Cloud:
gcloud init
gcloud auth application-default login
Ensure you have an active Google Cloud project with Cloud Run, Compute Engine, and Container Registry/Artifact Registry enabled.
2. FastAPI Inference Server
We’ll create a minimal FastAPI application that serves two main endpoints:
- /v1/inference: For model inference.
- /v1/finetune: For uploading fine-tuning data (JSONL).
Create a file named main.py with the following content:
# main.py
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from pydantic import BaseModel
import json
import litellm  # Unified client for many LLM APIs

app = FastAPI()


class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512


@app.post("/v1/inference")
async def inference(request: InferenceRequest):
    """
    Inference endpoint using deepseek-r1-distill-7b (uncensored).
    """
    response = litellm.completion(
        model="deepseek/deepseek-r1-distill-7b",
        messages=[{"role": "user", "content": request.prompt}],
        max_tokens=request.max_tokens
    )
    # litellm returns an OpenAI-style response object; return just the generated text
    return JSONResponse(content={"response": response.choices[0].message.content})


@app.post("/v1/finetune")
async def finetune(file: UploadFile = File(...)):
    """
    Fine-tune endpoint that accepts a JSONL file.
    """
    if not file.filename.endswith('.jsonl'):
        return JSONResponse(
            status_code=400,
            content={"error": "Only .jsonl files are accepted for fine-tuning"}
        )
    # Read and parse each non-empty line of the uploaded JSONL file
    raw = await file.read()
    data = [json.loads(line) for line in raw.splitlines() if line.strip()]
    # Perform or schedule a fine-tuning job here (simplified placeholder);
    # you can integrate with the training pipeline from Section 5.
    return JSONResponse(content={"status": "Fine-tuning request received", "samples": len(data)})
3. Docker Configuration
To containerize the application, create a requirements.txt file:
fastapi
uvicorn
litellm
pydantic
transformers
datasets
accelerate
trl
torch
And create a Dockerfile:
# Dockerfile
FROM nvidia/cuda:12.0.0-base-ubuntu22.04
# Install basic dependencies
RUN apt-get update && apt-get install -y python3 python3-pip
# Create app directory
WORKDIR /app
# Copy requirements and install
COPY requirements.txt .
RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy code
COPY . .
# Expose port 8080 for Cloud Run
EXPOSE 8080
# Start server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
4. Deploy to Google Cloud Run with GPU
4.1 Enable GPU on Cloud Run
Make sure your Google Cloud project has GPU quota available for Cloud Run in your chosen region (for example, for nvidia-l4 GPUs).
4.2 Build and Deploy
Run this command from your project directory to deploy the application to Cloud Run:
gcloud run deploy deepseek-uncensored \
--source . \
--region us-central1 \
--platform managed \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 4 \
--allow-unauthenticated
This command builds the container image from your source, deploys it to Cloud Run with one nvidia-l4 GPU, allocates 16 GiB of memory and 4 CPU cores, and exposes the service publicly (no authentication). Depending on your gcloud version, GPU deployment may require the beta command (gcloud beta run deploy) and CPU always allocated (--no-cpu-throttling), so check the current Cloud Run GPU documentation if these flags are rejected.
5. Fine-Tuning Pipeline
This section will guide you through a basic four-stage fine-tuning pipeline similar to DeepSeek R1’s training approach.
5.1 Directory Structure Example
.
├── main.py
├── finetune_pipeline.py
├── cold_start_data.jsonl
├── reasoning_data.jsonl
├── data_collection.jsonl
├── final_data.jsonl
├── requirements.txt
└── Dockerfile
Replace the .jsonl files with your actual training data.
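Each line in these files is a single JSON object with "prompt" and "completion" keys. If you just want to smoke-test the pipeline, a tiny helper like the following (hypothetical make_sample_data.py) writes a couple of records in that format:

# make_sample_data.py (hypothetical helper to create a minimal cold_start_data.jsonl)
import json

samples = [
    {"prompt": "What is 2 + 2?", "completion": "2 + 2 equals 4."},
    {"prompt": "Name the capital of France.", "completion": "The capital of France is Paris."},
]

with open("cold_start_data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")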
5.2 Fine-Tuning Code: finetune_pipeline.py
Create a finetune_pipeline.py file with the following code. Note that the PPO stages assume the classic trl PPOTrainer interface; the PPO API has changed in newer trl releases, so pin a compatible trl version in requirements.txt if you follow this code as-is.
# finetune_pipeline.py
import json

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead


# 1. Cold Start Phase
def cold_start_finetune(
    base_model="deepseek-ai/deepseek-r1-distill-7b",
    train_file="cold_start_data.jsonl",
    output_dir="cold_start_finetuned_model"
):
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # Some causal-LM tokenizers ship without a pad token; fall back to EOS so padding works
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load dataset
    dataset = load_dataset("json", data_files=train_file, split="train")

    # Build "prompt\ncompletion" texts; with batched=True the map function receives lists
    def tokenize_function(examples):
        texts = [p + "\n" + c for p, c in zip(examples["prompt"], examples["completion"])]
        return tokenizer(texts, truncation=True, max_length=512)

    dataset = dataset.map(tokenize_function, batched=True,
                          remove_columns=dataset.column_names)
    dataset = dataset.shuffle()

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        save_steps=50,
        logging_steps=50,
        learning_rate=5e-5
    )

    # Trainer with a causal-LM collator so labels are derived from input_ids
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
# 2. Reasoning RL Training
def reasoning_rl_training(
    cold_start_model_dir="cold_start_finetuned_model",
    train_file="reasoning_data.jsonl",
    output_dir="reasoning_rl_model"
):
    # Config for PPO
    config = PPOConfig(
        batch_size=16,
        learning_rate=1e-5,
        log_with=None,  # or 'wandb'
        mini_batch_size=4
    )
    # Load model and tokenizer
    model = AutoModelForCausalLMWithValueHead.from_pretrained(cold_start_model_dir)
    tokenizer = AutoTokenizer.from_pretrained(cold_start_model_dir)

    # Create a PPO trainer (classic trl PPO API; argument names differ in newer trl releases)
    ppo_trainer = PPOTrainer(
        config,
        model,
        tokenizer=tokenizer,
    )

    # Load dataset
    dataset = load_dataset("json", data_files=train_file, split="train")

    # Simple RL loop (simplified for brevity): accumulate one PPO batch at a time
    queries, responses, rewards = [], [], []
    for sample in dataset:
        prompt = sample["prompt"]
        desired_answer = sample["completion"]  # For reward calculation

        # Generate a response for this prompt (1-D query tensor, as trl expects)
        query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
        response_tensor = ppo_trainer.generate(query_tensor, max_new_tokens=50)[0]
        response_text = tokenizer.decode(response_tensor, skip_special_tokens=True)

        # Calculate reward (simplistic: does the response contain the desired answer?)
        reward = 1.0 if desired_answer in response_text else -1.0

        queries.append(query_tensor)
        responses.append(response_tensor)
        rewards.append(torch.tensor(reward))

        # Run a PPO step once a full batch is collected (any leftover partial batch is dropped)
        if len(queries) == config.batch_size:
            ppo_trainer.step(queries, responses, rewards)
            queries, responses, rewards = [], [], []

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
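# (Optional, not part of the original pipeline) A slightly smoother reward than exact
# containment: the fraction of desired-answer tokens that appear in the response.
# Hypothetical helper; swap it into the RL loops above if you want graded rewards.
def overlap_reward(desired_answer: str, response_text: str) -> float:
    desired_tokens = set(desired_answer.lower().split())
    response_tokens = set(response_text.lower().split())
    if not desired_tokens:
        return 0.0
    return len(desired_tokens & response_tokens) / len(desired_tokens)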
# 3. Data Collection
def collect_data(
    rl_model_dir="reasoning_rl_model",
    num_samples=1000,
    output_file="data_collection.jsonl"
):
    """
    Example data collection: generate completions from the RL model.
    This simple version cycles through a small list of prompts (or a prompt file).
    """
    tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
    model = AutoModelForCausalLM.from_pretrained(rl_model_dir)

    # Suppose we have some prompts:
    prompts = [
        "Explain quantum entanglement",
        "Summarize the plot of 1984 by George Orwell",
        # ... add or load from a prompt file ...
    ]

    collected = []
    for i in range(num_samples):
        prompt = prompts[i % len(prompts)]
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=50)
        completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
        collected.append({"prompt": prompt, "completion": completion})

    # Save to JSONL (one JSON object per line)
    with open(output_file, "w") as f:
        for item in collected:
            f.write(json.dumps(item) + "\n")

    return output_file
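# (Optional) Load prompts from a plain-text file, one prompt per line, instead of the
# hard-coded list above. Hypothetical helper; pass its result into collect_data's loop.
def load_prompts(path="prompts.txt"):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]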
# 4. Final RL Phase
def final_rl_phase(
    rl_model_dir="reasoning_rl_model",
    final_data="final_data.jsonl",
    output_dir="final_rl_model"
):
    """
    Another RL phase using a new dataset or adding human feedback.
    This is a simplified approach, structured like the reasoning RL training step.
    """
    config = PPOConfig(
        batch_size=16,
        learning_rate=1e-5,
        log_with=None,
        mini_batch_size=4
    )
    model = AutoModelForCausalLMWithValueHead.from_pretrained(rl_model_dir)
    tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
    ppo_trainer = PPOTrainer(config, model, tokenizer=tokenizer)

    dataset = load_dataset("json", data_files=final_data, split="train")

    queries, responses, rewards = [], [], []
    for sample in dataset:
        prompt = sample["prompt"]
        desired_answer = sample["completion"]

        query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
        response_tensor = ppo_trainer.generate(query_tensor, max_new_tokens=50)[0]
        response_text = tokenizer.decode(response_tensor, skip_special_tokens=True)
        reward = 1.0 if desired_answer in response_text else 0.0

        queries.append(query_tensor)
        responses.append(response_tensor)
        rewards.append(torch.tensor(reward))

        if len(queries) == config.batch_size:
            ppo_trainer.step(queries, responses, rewards)
            queries, responses, rewards = [], [], []

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
# END-TO-END PIPELINE EXAMPLE
if __name__ == "__main__":
    # 1) Cold Start
    cold_start_out = cold_start_finetune(
        base_model="deepseek-ai/deepseek-r1-distill-7b",
        train_file="cold_start_data.jsonl",
        output_dir="cold_start_finetuned_model"
    )

    # 2) Reasoning RL
    reasoning_rl_out = reasoning_rl_training(
        cold_start_model_dir=cold_start_out,
        train_file="reasoning_data.jsonl",
        output_dir="reasoning_rl_model"
    )

    # 3) Data Collection
    data_collection_out = collect_data(
        rl_model_dir=reasoning_rl_out,
        num_samples=100,
        output_file="data_collection.jsonl"
    )

    # 4) Final RL Phase
    final_rl_out = final_rl_phase(
        rl_model_dir=reasoning_rl_out,
        final_data="final_data.jsonl",
        output_dir="final_rl_model"
    )

    print("All done! Final model stored in:", final_rl_out)
Usage Overview
- Upload Your Data:
  - Prepare cold_start_data.jsonl, reasoning_data.jsonl, final_data.jsonl, etc.
  - Each line should be a JSON object with "prompt" and "completion" keys.
- Run the Pipeline Locally:
  python3 finetune_pipeline.py
  This creates directories such as cold_start_finetuned_model, reasoning_rl_model, and final_rl_model.
- Deploy:
  - Build and deploy via gcloud run deploy (see Section 4).
- Inference:
  - After deployment, send a POST request to your Cloud Run service:

  import requests

  url = "https://<YOUR-CLOUD-RUN-URL>/v1/inference"
  data = {"prompt": "Tell me about quantum physics", "max_tokens": 100}
  response = requests.post(url, json=data)
  print(response.json())

- Fine-Tuning via Endpoint:
  - Upload new data for fine-tuning:

  import requests

  url = "https://<YOUR-CLOUD-RUN-URL>/v1/finetune"
  with open("new_training_data.jsonl", "rb") as f:
      r = requests.post(url, files={"file": ("new_training_data.jsonl", f)})
  print(r.json())
This tutorial has provided an end-to-end pipeline for deploying and fine-tuning the DeepSeek R1 Distill model. You’ve learned how to:
- Deploy a FastAPI server with Docker and GPU support on Google Cloud Run.
- Fine-tune the model in four stages: Cold Start, Reasoning RL, Data Collection, and Final RL.
- Use TRL (PPO) for basic RL-based training loops.
Disclaimer: Deploying uncensored models has ethical and legal implications. Make sure to comply with relevant laws, policies, and usage guidelines.
- TRL (PPO) GitHub
- Hugging Face Transformers Docs
- Google Cloud Run GPU Docs
- DeepSeek R1 Project
- FastAPI File Upload Docs
- Deploying FastAPI on Google Cloud Run
This comprehensive guide should equip you with the knowledge to start deploying and fine-tuning the DeepSeek R1 Distill model.