DeepSeek R1 Distill: Complete Tutorial for Deployment & Fine-Tuning
Are you eager to explore the capabilities of the DeepSeek R1 Distill model? This guide provides a comprehensive, step-by-step approach to deploying the uncensored DeepSeek R1 Distill model to Google Cloud Run with GPU support, and also walks you through a practical fine-tuning process. The tutorial is broken down into the following sections:

- Environment Setup
- FastAPI Inference Server
- Docker Configuration
- Google Cloud Run Deployment
- Fine-Tuning Pipeline
Let’s get started.
1. Environment Setup
Before deploying and fine-tuning, make sure you have the required tools installed and configured.
1.1 Install Required Tools
- Python 3.9+
- pip: For Python package installation
- Docker: For containerization
- Google Cloud CLI: For deployment
Install Google Cloud CLI (Ubuntu/Debian):
Follow the official Google Cloud CLI installation guide to install gcloud.
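At the time of writing, the Debian/Ubuntu steps from that guide look roughly like this (verify against the official guide, since package names and signing keys can change):
sudo apt-get update && sudo apt-get install apt-transport-https ca-certificates gnupg curl
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get update && sudo apt-get install google-cloud-cli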
1.2 Authenticate with Google Cloud
Run the following commands to initialize and authenticate with Google Cloud:
gcloud init
gcloud auth application-default login
Ensure you have an active Google Cloud project with Cloud Run, Compute Engine, and Container Registry/Artifact Registry enabled.
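If needed, you can enable these APIs from the command line (using the standard Google Cloud service names):
gcloud services enable run.googleapis.com compute.googleapis.com artifactregistry.googleapis.com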
2. FastAPI Inference Server
We’ll create a minimal FastAPI application that serves two main endpoints:
- /v1/inference: For model inference.
- /v1/finetune: For uploading fine-tuning data (JSONL).
Create a file named main.py with the following content:
# main.py
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from pydantic import BaseModel
import json

import litellm  # Minimalistic LLM library

app = FastAPI()


class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512


@app.post("/v1/inference")
async def inference(request: InferenceRequest):
    """
    Inference endpoint using deepseek-r1-distill-7b (uncensored).
    """
    response = litellm.completion(
        # Model string as used throughout this tutorial; adjust it to match
        # how you actually serve the weights.
        model="deepseek/deepseek-r1-distill-7b",
        messages=[{"role": "user", "content": request.prompt}],
        max_tokens=request.max_tokens
    )
    # litellm returns an OpenAI-style response object; convert it to a plain
    # dict so FastAPI can serialize it.
    return JSONResponse(content=response.model_dump())


@app.post("/v1/finetune")
async def finetune(file: UploadFile = File(...)):
    """
    Fine-tune endpoint that accepts a JSONL file.
    """
    if not file.filename.endswith(".jsonl"):
        return JSONResponse(
            status_code=400,
            content={"error": "Only .jsonl files are accepted for fine-tuning"}
        )

    # Read lines from the uploaded file
    data = [json.loads(line) for line in file.file]

    # Perform or schedule a fine-tuning job here (simplified placeholder).
    # You can integrate with the training pipeline below.
    return JSONResponse(content={"status": "Fine-tuning request received", "samples": len(data)})
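Before containerizing, you can smoke-test the server locally; this assumes litellm can actually reach the configured model from your machine (API keys or local weights in place):
uvicorn main:app --host 0.0.0.0 --port 8080
curl -X POST http://localhost:8080/v1/inference -H "Content-Type: application/json" -d '{"prompt": "Hello", "max_tokens": 32}'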
3. Docker Configuration
To containerize the application, create a requirements.txt file:
fastapi
uvicorn
litellm
pydantic
transformers
datasets
accelerate
trl
torch
And create a Dockerfile:
# Dockerfile
FROM nvidia/cuda:12.0.0-base-ubuntu22.04
# Install basic dependencies
RUN apt-get update && apt-get install -y python3 python3-pip
# Create app directory
WORKDIR /app
# Copy requirements and install
COPY requirements.txt .
RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy code
COPY . .
# Expose port 8080 for Cloud Run
EXPOSE 8080
# Start server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
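To test the container locally before deploying (GPU passthrough requires the NVIDIA Container Toolkit; drop --gpus all to run CPU-only):
docker build -t deepseek-r1-distill .
docker run --gpus all -p 8080:8080 deepseek-r1-distill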
4. Deploy to Google Cloud Run with GPU
4.1 Enable GPU on Cloud Run
Make sure your Google Cloud project has GPU quota for Cloud Run in your chosen region (for example, the nvidia-l4 type). Cloud Run GPUs are only available in select regions and may require a quota increase request.
4.2 Build and Deploy
Run this command from your project directory to deploy the application to Cloud Run:
gcloud run deploy deepseek-uncensored \
    --source . \
    --region us-central1 \
    --platform managed \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --memory 16Gi \
    --cpu 4 \
    --allow-unauthenticated
This command builds the Docker image, deploys it to Cloud Run with one nvidia-l4 GPU, allocates 16 GiB memory and 4 CPU cores, and exposes the service publicly (no authentication).
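When the deploy finishes, gcloud prints the service URL; you can also retrieve it later (assuming the same service name and region):
gcloud run services describe deepseek-uncensored --region us-central1 --format "value(status.url)"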
5. Fine-Tuning Pipeline
This section will guide you through a basic four-stage fine-tuning pipeline similar to DeepSeek R1’s training approach.
5.1 Directory Structure Example
.
├── main.py
├── finetune_pipeline.py
├── cold_start_data.jsonl
├── reasoning_data.jsonl
├── data_collection.jsonl
├── final_data.jsonl
├── requirements.txt
└── Dockerfile
Replace the .jsonl files with your actual training data.
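Each line of these files should be a standalone JSON object with "prompt" and "completion" keys, for example (illustrative samples only):
{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "Explain photosynthesis in one sentence.", "completion": "Photosynthesis is the process by which plants use light, water, and carbon dioxide to produce glucose and oxygen."}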
5.2 Fine-Tuning Code: finetune_pipeline.py
Create a finetune_pipeline.py file with the following code:
# finetune_pipeline.py
import json

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# Note: this script targets the "classic" TRL PPO API; newer TRL releases
# changed the PPOTrainer interface, so pin your trl version accordingly.


# 1. Cold Start Phase
def cold_start_finetune(
    # Adjust to the exact Hugging Face model id you intend to use
    # (e.g. deepseek-ai/DeepSeek-R1-Distill-Qwen-7B).
    base_model="deepseek-ai/deepseek-r1-distill-7b",
    train_file="cold_start_data.jsonl",
    output_dir="cold_start_finetuned_model"
):
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # needed for padding/collation

    # Load dataset
    dataset = load_dataset("json", data_files=train_file, split="train")

    # Simple tokenization function (batched: fields arrive as lists of strings)
    def tokenize_function(examples):
        texts = [p + "\n" + c for p, c in zip(examples["prompt"], examples["completion"])]
        return tokenizer(texts, truncation=True, max_length=512)

    dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
    dataset = dataset.shuffle()

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        save_steps=50,
        logging_steps=50,
        learning_rate=5e-5
    )

    # Trainer (the causal-LM collator sets labels = input_ids)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )

    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# 2. Reasoning RL Training
def reasoning_rl_training(
    cold_start_model_dir="cold_start_finetuned_model",
    train_file="reasoning_data.jsonl",
    output_dir="reasoning_rl_model"
):
    # Config for PPO
    config = PPOConfig(
        batch_size=16,
        learning_rate=1e-5,
        log_with=None,  # or 'wandb'
        mini_batch_size=4
    )

    # Load model and tokenizer
    model = AutoModelForCausalLMWithValueHead.from_pretrained(cold_start_model_dir)
    tokenizer = AutoTokenizer.from_pretrained(cold_start_model_dir)

    # Create a PPO trainer
    ppo_trainer = PPOTrainer(
        config,
        model,
        tokenizer=tokenizer,
    )

    # Load dataset
    dataset = load_dataset("json", data_files=train_file, split="train")

    # Simple RL loop (simplified for brevity; see the TRL docs for batched loops)
    for sample in dataset:
        prompt = sample["prompt"]
        desired_answer = sample["completion"]  # For reward calculation

        # Generate response (1-D query tensor, as TRL expects)
        query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
        response_tensors = ppo_trainer.generate(query_tensor, max_new_tokens=50)
        response_text = tokenizer.decode(response_tensors[0], skip_special_tokens=True)

        # Calculate reward (simplistic: measure overlap or correctness)
        reward = 1.0 if desired_answer in response_text else -1.0

        # Run a PPO step (rewards are passed as tensors)
        ppo_trainer.step([query_tensor], [response_tensors[0]], [torch.tensor(reward)])

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# 3. Data Collection
def collect_data(
    rl_model_dir="reasoning_rl_model",
    num_samples=1000,
    output_file="data_collection.jsonl"
):
    """
    Example data collection: generate completions from the RL model.
    This is a simple version that just uses random prompts or a given file of prompts.
    """
    tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
    model = AutoModelForCausalLM.from_pretrained(rl_model_dir)

    # Suppose we have some random prompts:
    prompts = [
        "Explain quantum entanglement",
        "Summarize the plot of 1984 by George Orwell",
        # ... add or load from a prompt file ...
    ]

    collected = []
    for i in range(num_samples):
        prompt = prompts[i % len(prompts)]
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=50)
        completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
        collected.append({"prompt": prompt, "completion": completion})

    # Save to JSONL (one JSON object per line)
    with open(output_file, "w") as f:
        for item in collected:
            f.write(json.dumps(item) + "\n")

    return output_file


# 4. Final RL Phase
def final_rl_phase(
    rl_model_dir="reasoning_rl_model",
    final_data="final_data.jsonl",
    output_dir="final_rl_model"
):
    """
    Another RL phase using a new dataset or adding human feedback.
    This is a simplified approach similar to the reasoning RL training step.
    """
    config = PPOConfig(
        batch_size=16,
        learning_rate=1e-5,
        log_with=None,
        mini_batch_size=4
    )
    model = AutoModelForCausalLMWithValueHead.from_pretrained(rl_model_dir)
    tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
    ppo_trainer = PPOTrainer(config, model, tokenizer=tokenizer)

    dataset = load_dataset("json", data_files=final_data, split="train")

    for sample in dataset:
        prompt = sample["prompt"]
        desired_answer = sample["completion"]

        query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
        response_tensors = ppo_trainer.generate(query_tensor, max_new_tokens=50)
        response_text = tokenizer.decode(response_tensors[0], skip_special_tokens=True)

        reward = 1.0 if desired_answer in response_text else 0.0
        ppo_trainer.step([query_tensor], [response_tensors[0]], [torch.tensor(reward)])

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# END-TO-END PIPELINE EXAMPLE
if __name__ == "__main__":
    # 1) Cold Start
    cold_start_out = cold_start_finetune(
        base_model="deepseek-ai/deepseek-r1-distill-7b",
        train_file="cold_start_data.jsonl",
        output_dir="cold_start_finetuned_model"
    )

    # 2) Reasoning RL
    reasoning_rl_out = reasoning_rl_training(
        cold_start_model_dir=cold_start_out,
        train_file="reasoning_data.jsonl",
        output_dir="reasoning_rl_model"
    )

    # 3) Data Collection
    data_collection_out = collect_data(
        rl_model_dir=reasoning_rl_out,
        num_samples=100,
        output_file="data_collection.jsonl"
    )

    # 4) Final RL Phase
    final_rl_out = final_rl_phase(
        rl_model_dir=reasoning_rl_out,
        final_data="final_data.jsonl",
        output_dir="final_rl_model"
    )

    print("All done! Final model stored in:", final_rl_out)
Usage Overview
- Upload Your Data:
- Prepare cold_start_data.jsonl, reasoning_data.jsonl, final_data.jsonl, etc.
- Each line should be a JSON object with “prompt” and “completion” keys.
- Run the Pipeline Locally:
python3 finetune_pipeline.py
This creates directories like cold_start_finetuned_model, reasoning_rl_model, and final_rl_model.
- Deploy:
- Build and push via gcloud run deploy.
- Inference:
- After deployment, send a POST request to your Cloud Run service:
import requests
url = "https://<YOUR-CLOUD-RUN-URL>/v1/inference"
data = {"prompt": "Tell me about quantum physics", "max_tokens": 100}
response = requests.post(url, json=data)
print(response.json())
- Fine-Tuning via Endpoint:
- Upload new data for fine-tuning:
import requests
url = "https://<YOUR-CLOUD-RUN-URL>/v1/finetune"
with open("new_training_data.jsonl", "rb") as f:
    r = requests.post(url, files={"file": ("new_training_data.jsonl", f)})
print(r.json())
This tutorial has provided an end-to-end pipeline for deploying and fine-tuning the DeepSeek R1 Distill model. You’ve learned how to:
- Deploy a FastAPI server with Docker and GPU support on Google Cloud Run.
- Fine-tune the model in four stages: Cold Start, Reasoning RL, Data Collection, and Final RL.
- Use TRL (PPO) for basic RL-based training loops.
Disclaimer: Deploying uncensored models has ethical and legal implications. Make sure to comply with relevant laws, policies, and usage guidelines.
Resources:
- TRL (PPO) GitHub
- Hugging Face Transformers Docs
- Google Cloud Run GPU Docs
- DeepSeek R1 Project
- FastAPI File Upload Docs
- Deploying FastAPI on Google Cloud Run
This comprehensive guide should equip you with the knowledge to start deploying and fine-tuning the DeepSeek R1 Distill model.