Deploy an uncensored DeepSeek R1 model on Google Cloud Run

DeepSeek R1 Distill: Complete Tutorial for Deployment & Fine-Tuning

Are you eager to explore the capabilities of the DeepSeek R1 Distill model? This guide provides a comprehensive, step-by-step approach to deploying the uncensored DeepSeek R1 Distill model to Google Cloud Run with GPU support, and also walks you through a practical fine-tuning process. The tutorial is broken down into the following sections:

  • Environment Setup
  • FastAPI Inference Server
  • Docker Configuration
  • Google Cloud Run Deployment
  • Fine-Tuning Pipeline

Let’s dive in and get started.

1. Environment Setup

Before deploying and fine-tuning, make sure you have the required tools installed and configured.

1.1 Install Required Tools

  • Python 3.9+
  • pip: For Python package installation
  • Docker: For containerization
  • Google Cloud CLI: For deployment

Install Google Cloud CLI:
Follow the official Google Cloud CLI installation guide for your platform (for example, Ubuntu/Debian) to install gcloud.

1.2 Authenticate with Google Cloud

Run the following commands to initialize and authenticate with Google Cloud:

gcloud init
gcloud auth application-default login

Ensure you have an active Google Cloud project with Cloud Run, Compute Engine, and Container Registry/Artifact Registry enabled.

2. FastAPI Inference Server

We’ll create a minimal FastAPI application that serves two main endpoints:

  • /v1/inference: For model inference.
  • /v1/finetune: For uploading fine-tuning data (JSONL).

Create a file named main.py with the following content:

# main.py
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from pydantic import BaseModel
import json

import litellm  # Unified client for calling many LLM providers through one interface

app = FastAPI()

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512

@app.post("/v1/inference")
async def inference(request: InferenceRequest):
    """
    Inference endpoint using deepseek-r1-distill-7b (uncensored).
    """
    response = litellm.completion(
        model="deepseek/deepseek-r1-distill-7b",
        messages=[{"role": "user", "content": request.prompt}],
        max_tokens=request.max_tokens
    )
    return JSONResponse(content=response)

@app.post("/v1/finetune")
async def finetune(file: UploadFile = File(...)):
    """
    Fine-tune endpoint that accepts a JSONL file.
    """
    if not file.filename.endswith('.jsonl'):
        return JSONResponse(
            status_code=400,
            content={"error": "Only .jsonl files are accepted for fine-tuning"}
        )

    # Read and parse the uploaded JSONL file, skipping blank lines
    raw = await file.read()
    data = [json.loads(line) for line in raw.splitlines() if line.strip()]

    # Perform or schedule a fine-tuning job here (simplified placeholder)
    # You can integrate with your training pipeline below.
    
    return JSONResponse(content={"status": "Fine-tuning request received", "samples": len(data)})

3. Docker Configuration

To containerize the application, create a requirements.txt file:

fastapi
uvicorn
python-multipart
litellm
pydantic
transformers
datasets
accelerate
trl
torch

And create a Dockerfile:

# Dockerfile
FROM nvidia/cuda:12.0.0-base-ubuntu22.04

# Install basic dependencies
RUN apt-get update && apt-get install -y python3 python3-pip

# Create app directory
WORKDIR /app

# Copy requirements and install
COPY requirements.txt .
RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy code
COPY . .

# Expose port 8080 for Cloud Run
EXPOSE 8080

# Start server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

4. Deploy to Google Cloud Run with GPU

4.1 Enable GPU on Cloud Run

Make sure your Google Cloud project has GPU quota for Cloud Run in your target region (for example, NVIDIA L4 quota in us-central1).

4.2 Build and Deploy

Run this command from your project directory to deploy the application to Cloud Run:

gcloud run deploy deepseek-uncensored \
    --source . \
    --region us-central1 \
    --platform managed \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --memory 16Gi \
    --cpu 4 \
    --allow-unauthenticated

This command builds the Docker image from source, deploys it to Cloud Run with one NVIDIA L4 GPU, allocates 16 GiB of memory and 4 CPUs, and exposes the service publicly (no authentication). Note that Cloud Run GPU support is limited to certain regions and may require a recent gcloud release (or the beta command group, gcloud beta run deploy); check the current Cloud Run GPU documentation if the flags are not recognized.

5. Fine-Tuning Pipeline

This section will guide you through a basic four-stage fine-tuning pipeline similar to DeepSeek R1’s training approach.

5.1 Directory Structure Example

.
├── main.py
├── finetune_pipeline.py
├── cold_start_data.jsonl
├── reasoning_data.jsonl
├── data_collection.jsonl
├── final_data.jsonl
├── requirements.txt
└── Dockerfile

Replace the .jsonl files with your actual training data.
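
Each .jsonl file holds one JSON object per line with "prompt" and "completion" keys (see the Usage Overview below). For illustration, a couple of training lines might look like this:

{"prompt": "Explain quantum entanglement", "completion": "Quantum entanglement is a phenomenon where two particles share a linked quantum state, so measuring one immediately constrains the state of the other..."}
{"prompt": "Summarize the plot of 1984 by George Orwell", "completion": "Winston Smith secretly rebels against a totalitarian regime that rewrites history and punishes independent thought..."}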

5.2 Fine-Tuning Code: finetune_pipeline.py

Create a finetune_pipeline.py file with the following code:

# finetune_pipeline.py

import json

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead


# 1. Cold Start Phase
def cold_start_finetune(
    base_model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    train_file="cold_start_data.jsonl",
    output_dir="cold_start_finetuned_model"
):
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # needed so padded batches can be built

    # Load dataset
    dataset = load_dataset("json", data_files=train_file, split="train")

    # Tokenize each prompt/completion pair as one causal-LM training sequence
    def tokenize_function(example):
        return tokenizer(
            example["prompt"] + "\n" + example["completion"],
            truncation=True,
            max_length=512
        )

    # batched=False: the function above works on one example (plain strings) at a time
    dataset = dataset.map(tokenize_function, batched=False, remove_columns=dataset.column_names)
    dataset = dataset.shuffle()

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        save_steps=50,
        logging_steps=50,
        learning_rate=5e-5
    )

    # Trainer with a causal-LM collator (the collator copies input_ids into labels)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator
    )

    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# 2. Reasoning RL Training
def reasoning_rl_training(
    cold_start_model_dir="cold_start_finetuned_model",
    train_file="reasoning_data.jsonl",
    output_dir="reasoning_rl_model"
):
    # Config for PPO
    config = PPOConfig(
        batch_size=16,
        learning_rate=1e-5,
        log_with=None,  # or 'wandb'
        mini_batch_size=4
    )

    # Load model and tokenizer
    model = AutoModelForCausalLMWithValueHead.from_pretrained(cold_start_model_dir)
    tokenizer = AutoTokenizer.from_pretrained(cold_start_model_dir)

    # Create a PPO trainer
    ppo_trainer = PPOTrainer(
        config,
        model,
        tokenizer=tokenizer,
    )

    # Load dataset
    dataset = load_dataset("json", data_files=train_file, split="train")

    # Simplified RL loop (illustrative; a real run would batch queries and use a learned reward model)
    for sample in dataset:
        prompt = sample["prompt"]
        desired_answer = sample["completion"]  # Used for a naive reward signal

        # PPOTrainer expects 1-D query/response tensors and tensor-valued rewards
        query_tensor = tokenizer.encode(prompt, return_tensors="pt")[0]
        response_tensor = ppo_trainer.generate(query_tensor, return_prompt=False, max_new_tokens=50)[0]
        response_text = tokenizer.decode(response_tensor, skip_special_tokens=True)

        # Naive reward: +1 if the desired answer appears in the generation, otherwise -1
        reward = torch.tensor(1.0) if desired_answer in response_text else torch.tensor(-1.0)

        # Run a single PPO optimization step on this (query, response, reward) triple
        ppo_trainer.step([query_tensor], [response_tensor], [reward])

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# 3. Data Collection
def collect_data(
    rl_model_dir="reasoning_rl_model",
    num_samples=1000,
    output_file="data_collection.jsonl"
):
    """
    Example data collection: generate completions from the RL model.
    This is a simple version that just uses random prompts or a given file of prompts.
    """
    tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
    model = AutoModelForCausalLM.from_pretrained(rl_model_dir)

    # Suppose we have some random prompts:
    prompts = [
        "Explain quantum entanglement",
        "Summarize the plot of 1984 by George Orwell",
        # ... add or load from a prompt file ...
    ]

    collected = []
    for i in range(num_samples):
        prompt = prompts[i % len(prompts)]
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=50)
        completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
        collected.append({"prompt": prompt, "completion": completion})

    # Save to JSONL (one JSON object per line)
    with open(output_file, "w") as f:
        for item in collected:
            f.write(json.dumps(item) + "\n")

    return output_file


# 4. Final RL Phase
def final_rl_phase(
    rl_model_dir="reasoning_rl_model",
    final_data="final_data.jsonl",
    output_dir="final_rl_model"
):
    """
    Another RL phase using a new dataset or adding human feedback.
    This is a simplified approach similar to the reasoning RL training step.
    """
    config = PPOConfig(
        batch_size=16,
        learning_rate=1e-5,
        log_with=None,
        mini_batch_size=4
    )

    model = AutoModelForCausalLMWithValueHead.from_pretrained(rl_model_dir)
    tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
    ppo_trainer = PPOTrainer(config, model, tokenizer=tokenizer)

    dataset = load_dataset("json", data_files=final_data, split="train")

    for sample in dataset:
        prompt = sample["prompt"]
        desired_answer = sample["completion"]
        query_tensor = tokenizer.encode(prompt, return_tensors="pt")[0]
        response_tensor = ppo_trainer.generate(query_tensor, return_prompt=False, max_new_tokens=50)[0]
        response_text = tokenizer.decode(response_tensor, skip_special_tokens=True)

        # Naive reward: 1.0 if the desired answer appears in the generation, otherwise 0.0
        reward = torch.tensor(1.0) if desired_answer in response_text else torch.tensor(0.0)
        ppo_trainer.step([query_tensor], [response_tensor], [reward])

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# END-TO-END PIPELINE EXAMPLE
if __name__ == "__main__":
    # 1) Cold Start
    cold_start_out = cold_start_finetune(
        base_model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        train_file="cold_start_data.jsonl",
        output_dir="cold_start_finetuned_model"
    )

    # 2) Reasoning RL
    reasoning_rl_out = reasoning_rl_training(
        cold_start_model_dir=cold_start_out,
        train_file="reasoning_data.jsonl",
        output_dir="reasoning_rl_model"
    )

    # 3) Data Collection
    data_collection_out = collect_data(
        rl_model_dir=reasoning_rl_out,
        num_samples=100,
        output_file="data_collection.jsonl"
    )

    # 4) Final RL Phase
    final_rl_out = final_rl_phase(
        rl_model_dir=reasoning_rl_out,
        final_data="final_data.jsonl",
        output_dir="final_rl_model"
    )

    print("All done! Final model stored in:", final_rl_out)

Usage Overview

  1. Upload Your Data:
    • Prepare cold_start_data.jsonl, reasoning_data.jsonl, final_data.jsonl, etc.
    • Each line should be a JSON object with “prompt” and “completion” keys.
  2. Run the Pipeline Locally:
python3 finetune_pipeline.py

This creates directories like cold_start_finetuned_model, reasoning_rl_model, and final_rl_model.

  3. Deploy:
    • Build and push via gcloud run deploy.
  4. Inference:
    • After deployment, send a POST request to your Cloud Run service:
import requests

url = "https://<YOUR-CLOUD-RUN-URL>/v1/inference"
data = {"prompt": "Tell me about quantum physics", "max_tokens": 100}
response = requests.post(url, json=data)
print(response.json())

Fine-Tuning via Endpoint:

  • Upload new data for fine-tuning:
import requests

url = "https://<YOUR-CLOUD-RUN-URL>/v1/finetune"
with open("new_training_data.jsonl", "rb") as f:
    r = requests.post(url, files={"file": ("new_training_data.jsonl", f)})
print(r.json())

This tutorial has provided an end-to-end pipeline for deploying and fine-tuning the DeepSeek R1 Distill model. You’ve learned how to:

  • Deploy a FastAPI server with Docker and GPU support on Google Cloud Run.
  • Fine-tune the model in four stages: Cold Start, Reasoning RL, Data Collection, and Final RL.
  • Use TRL (PPO) for basic RL-based training loops.

Disclaimer: Deploying uncensored models has ethical and legal implications. Make sure to comply with relevant laws, policies, and usage guidelines.

This comprehensive guide should equip you with the knowledge to start deploying and fine-tuning the DeepSeek R1 Distill model.

Author’s Bio

Vineet Tiwari

Vineet Tiwari is an accomplished Solution Architect with over 5 years of experience in AI, ML, Web3, and Cloud technologies. Specializing in Large Language Models (LLMs) and blockchain systems, he excels in building secure AI solutions and custom decentralized platforms tailored to unique business needs.

Vineet’s expertise spans cloud-native architectures, data-driven machine learning models, and innovative blockchain implementations. Passionate about leveraging technology to drive business transformation, he combines technical mastery with a forward-thinking approach to deliver scalable, secure, and cutting-edge solutions. With a strong commitment to innovation, Vineet empowers businesses to thrive in an ever-evolving digital landscape.
