    Deploy an uncensored DeepSeek R1 model on Google Cloud Run

    DeepSeek R1 Distill: Complete Tutorial for Deployment & Fine-Tuning

    Are you eager to explore the capabilities of the DeepSeek R1 Distill model? This guide provides a comprehensive, step-by-step approach to deploying the uncensored DeepSeek R1 Distill model to Google Cloud Run with GPU support, and also walks you through a practical fine-tuning process. The tutorial is broken down into the following sections:

    • Environment Setup
    • FastAPI Inference Server
    • Docker Configuration
    • Google Cloud Run Deployment
    • Fine-Tuning Pipeline

    Let’s dive in and get started.

    1. Environment Setup

    Before deploying and fine-tuning, make sure you have the required tools installed and configured.

    1.1 Install Required Tools

    • Python 3.9+
    • pip: For Python package installation
    • Docker: For containerization
    • Google Cloud CLI: For deployment

    Install Google Cloud CLI (Ubuntu/Debian):
    Follow the official Google Cloud CLI installation guide to install gcloud.
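
    On Ubuntu/Debian, the guide boils down to adding Google's apt repository and installing the package. A sketch of those steps follows, but defer to the official docs for the current keyring handling:

    # Add Google Cloud's apt repository and signing key, then install the CLI
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
        | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
        | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
    sudo apt-get update && sudo apt-get install -y google-cloud-cli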

    1.2 Authenticate with Google Cloud

    Run the following commands to initialize and authenticate with Google Cloud:

    gcloud init
    gcloud auth application-default login

    Ensure you have an active Google Cloud project with Cloud Run, Compute Engine, and Container Registry/Artifact Registry enabled.
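
    You can enable these APIs from the gcloud CLI as well; for example, using the standard service identifiers:

    gcloud services enable run.googleapis.com \
        compute.googleapis.com \
        artifactregistry.googleapis.com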

    2. FastAPI Inference Server

    We’ll create a minimal FastAPI application that serves two main endpoints:

    • /v1/inference: For model inference.
    • /v1/finetune: For uploading fine-tuning data (JSONL).

    Create a file named main.py with the following content:

    # main.py
    from fastapi import FastAPI, File, UploadFile
    from fastapi.responses import JSONResponse
    from pydantic import BaseModel
    import json
    
    import litellm  # Lightweight client that calls many LLM providers through one interface
    
    app = FastAPI()
    
    class InferenceRequest(BaseModel):
        prompt: str
        max_tokens: int = 512
    
    @app.post("/v1/inference")
    async def inference(request: InferenceRequest):
        """
        Inference endpoint using deepseek-r1-distill-7b (uncensored).
        """
        response = litellm.completion(
            model="deepseek/deepseek-r1-distill-7b",
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens
        )
        # litellm returns a ModelResponse (a pydantic model), which is not
        # directly JSON-serializable; dump it to a plain dict first
        return JSONResponse(content=response.model_dump())
    
    @app.post("/v1/finetune")
    async def finetune(file: UploadFile = File(...)):
        """
        Fine-tune endpoint that accepts a JSONL file.
        """
        if not file.filename.endswith('.jsonl'):
            return JSONResponse(
                status_code=400,
                content={"error": "Only .jsonl files are accepted for fine-tuning"}
            )
    
        # Read the uploaded file and parse each non-empty line as JSON
        raw = await file.read()
        data = [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]
    
        # Perform or schedule a fine-tuning job here (simplified placeholder)
        # You can integrate with your training pipeline below.
        
        return JSONResponse(content={"status": "Fine-tuning request received", "samples": len(data)})

    3. Docker Configuration

    To containerize the application, create a requirements.txt file:

    fastapi
    uvicorn
    litellm
    pydantic
    transformers
    datasets
    accelerate
    trl
    torch

    And create a Dockerfile:

    # Dockerfile
    FROM nvidia/cuda:12.0.0-base-ubuntu22.04
    
    # Install basic dependencies
    RUN apt-get update && apt-get install -y python3 python3-pip
    
    # Create app directory
    WORKDIR /app
    
    # Copy requirements and install
    COPY requirements.txt .
    RUN pip3 install --upgrade pip
    RUN pip3 install --no-cache-dir -r requirements.txt
    
    # Copy code
    COPY . .
    
    # Expose port 8080 for Cloud Run
    EXPOSE 8080
    
    # Start server
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

    4. Deploy to Google Cloud Run with GPU

    4.1 Enable GPU on Cloud Run

    Make sure your Google Cloud project has GPU quota available for Cloud Run in your chosen region (for example, nvidia-l4 GPUs).

    4.2 Build and Deploy

    Run this command from your project directory to deploy the application to Cloud Run:

    gcloud run deploy deepseek-uncensored \
        --source . \
        --region us-central1 \
        --platform managed \
        --gpu 1 \
        --gpu-type nvidia-l4 \
        --memory 16Gi \
        --cpu 4 \
        --allow-unauthenticated

    This command builds the Docker image, deploys it to Cloud Run with one nvidia-l4 GPU, allocates 16 GiB memory and 4 CPU cores, and exposes the service publicly (no authentication).
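
    Once deployment completes, gcloud prints the service URL. You can smoke-test the endpoint with curl, substituting your own URL:

    curl -X POST https://<YOUR-CLOUD-RUN-URL>/v1/inference \
        -H "Content-Type: application/json" \
        -d '{"prompt": "Say hello", "max_tokens": 32}'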

    5. Fine-Tuning Pipeline

    This section walks through a basic four-stage fine-tuning pipeline loosely modeled on DeepSeek R1’s training approach.

    5.1 Directory Structure Example

    .
    ├── main.py
    ├── finetune_pipeline.py
    ├── cold_start_data.jsonl
    ├── reasoning_data.jsonl
    ├── data_collection.jsonl
    ├── final_data.jsonl
    ├── requirements.txt
    └── Dockerfile

    Replace the .jsonl files with your actual training data.
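
    Each line is a standalone JSON object with “prompt” and “completion” keys; for example (illustrative samples):

    {"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
    {"prompt": "Explain photosynthesis in one sentence.", "completion": "Photosynthesis is how plants convert light, water, and carbon dioxide into chemical energy and oxygen."}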

    5.2 Fine-Tuning Code: finetune_pipeline.py

    Create a finetune_pipeline.py file with the following code:

    # finetune_pipeline.py

    import json
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
    
    
    # 1. Cold Start Phase
    def cold_start_finetune(
        base_model="deepseek-ai/deepseek-r1-distill-7b",
        train_file="cold_start_data.jsonl",
        output_dir="cold_start_finetuned_model"
    ):
        # Load model and tokenizer
        model = AutoModelForCausalLM.from_pretrained(base_model)
        tokenizer = AutoTokenizer.from_pretrained(base_model)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token  # padding is needed when batching
    
        # Load dataset
        dataset = load_dataset("json", data_files=train_file, split="train")
    
        # Tokenize prompt + completion as one causal-LM training example.
        # With batched=True the fields arrive as lists, so zip them together.
        def tokenize_function(examples):
            texts = [p + "\n" + c for p, c in
                     zip(examples["prompt"], examples["completion"])]
            return tokenizer(texts, truncation=True, max_length=512)

        dataset = dataset.map(tokenize_function, batched=True,
                              remove_columns=dataset.column_names)
        dataset = dataset.shuffle()
    
        # Define training arguments
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=1,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            save_steps=50,
            logging_steps=50,
            learning_rate=5e-5
        )
    
        # Trainer: the collator pads each batch and copies input_ids to
        # labels, which is what yields the causal-LM loss
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
        )
    
        trainer.train()
        trainer.save_model(output_dir)
        tokenizer.save_pretrained(output_dir)
        return output_dir
    
    
    # 2. Reasoning RL Training
    def reasoning_rl_training(
        cold_start_model_dir="cold_start_finetuned_model",
        train_file="reasoning_data.jsonl",
        output_dir="reasoning_rl_model"
    ):
        # Config for PPO; batch sizes of 1 match the single-sample
        # ppo_trainer.step() calls in the loop below
        config = PPOConfig(
            batch_size=1,
            mini_batch_size=1,
            learning_rate=1e-5,
            log_with=None  # or 'wandb'
        )
    
        # Load model and tokenizer
        model = AutoModelForCausalLMWithValueHead.from_pretrained(cold_start_model_dir)
        tokenizer = AutoTokenizer.from_pretrained(cold_start_model_dir)
    
        # Create a PPO trainer
        ppo_trainer = PPOTrainer(
            config,
            model,
            tokenizer=tokenizer,
        )
    
        # Load dataset
        dataset = load_dataset("json", data_files=train_file, split="train")
    
        # Simplified per-sample RL loop (illustrative; a real run would batch)
        for sample in dataset:
            prompt = sample["prompt"]
            desired_answer = sample["completion"]  # For reward calculation
    
            # Generate a response; trl's PPOTrainer.generate expects a 1-D
            # query tensor (or a list of them)
            query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
            response_tensor = ppo_trainer.generate(query_tensor, max_new_tokens=50)
            response_text = tokenizer.decode(response_tensor[0], skip_special_tokens=True)

            # Calculate reward (simplistic: +1 if the desired answer appears)
            reward = 1.0 if desired_answer in response_text else -1.0

            # Run a PPO step; queries and responses are lists of 1-D tensors,
            # rewards a list of scalar tensors
            ppo_trainer.step([query_tensor], [response_tensor[0]], [torch.tensor(reward)])
    
        model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)
        return output_dir
    
    
    # 3. Data Collection
    def collect_data(
        rl_model_dir="reasoning_rl_model",
        num_samples=1000,
        output_file="data_collection.jsonl"
    ):
        """
        Example data collection: generate completions from the RL model.
        This is a simple version that just uses random prompts or a given file of prompts.
        """
        tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
        model = AutoModelForCausalLM.from_pretrained(rl_model_dir)
    
        # A small, fixed set of example prompts:
        prompts = [
            "Explain quantum entanglement",
            "Summarize the plot of 1984 by George Orwell",
            # ... add or load from a prompt file ...
        ]
    
        collected = []
        for i in range(num_samples):
            prompt = prompts[i % len(prompts)]
            inputs = tokenizer(prompt, return_tensors="pt")
            outputs = model.generate(**inputs, max_new_tokens=50)
            completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
            collected.append({"prompt": prompt, "completion": completion})
    
        # Save to JSONL; use json.dumps so the output is valid JSON,
        # not Python dict repr
        with open(output_file, "w") as f:
            for item in collected:
                f.write(json.dumps(item) + "\n")
    
        return output_file
    
    
    # 4. Final RL Phase
    def final_rl_phase(
        rl_model_dir="reasoning_rl_model",
        final_data="final_data.jsonl",
        output_dir="final_rl_model"
    ):
        """
        Another RL phase using a new dataset or adding human feedback.
        This is a simplified approach similar to the reasoning RL training step.
        """
        config = PPOConfig(
            batch_size=1,       # single-sample step() calls below
            mini_batch_size=1,
            learning_rate=1e-5,
            log_with=None
        )
    
        model = AutoModelForCausalLMWithValueHead.from_pretrained(rl_model_dir)
        tokenizer = AutoTokenizer.from_pretrained(rl_model_dir)
        ppo_trainer = PPOTrainer(config, model, tokenizer=tokenizer)
    
        dataset = load_dataset("json", data_files=final_data, split="train")
    
        for sample in dataset:
            prompt = sample["prompt"]
            desired_answer = sample["completion"]
            query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
            response_tensor = ppo_trainer.generate(query_tensor, max_new_tokens=50)
            response_text = tokenizer.decode(response_tensor[0], skip_special_tokens=True)

            reward = 1.0 if desired_answer in response_text else 0.0
            ppo_trainer.step([query_tensor], [response_tensor[0]], [torch.tensor(reward)])
    
        model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)
        return output_dir
    
    
    # END-TO-END PIPELINE EXAMPLE
    if __name__ == "__main__":
        # 1) Cold Start
        cold_start_out = cold_start_finetune(
            base_model="deepseek-ai/deepseek-r1-distill-7b",
            train_file="cold_start_data.jsonl",
            output_dir="cold_start_finetuned_model"
        )
    
        # 2) Reasoning RL
        reasoning_rl_out = reasoning_rl_training(
            cold_start_model_dir=cold_start_out,
            train_file="reasoning_data.jsonl",
            output_dir="reasoning_rl_model"
        )
    
        # 3) Data Collection
        data_collection_out = collect_data(
            rl_model_dir=reasoning_rl_out,
            num_samples=100,
            output_file="data_collection.jsonl"
        )
    
        # 4) Final RL Phase
        final_rl_out = final_rl_phase(
            rl_model_dir=reasoning_rl_out,
            final_data="final_data.jsonl",
            output_dir="final_rl_model"
        )
    
        print("All done! Final model stored in:", final_rl_out)

    Usage Overview

    1. Upload Your Data:
      • Prepare cold_start_data.jsonl, reasoning_data.jsonl, final_data.jsonl, etc.
      • Each line should be a JSON object with “prompt” and “completion” keys.
    2. Run the Pipeline Locally:
    python3 finetune_pipeline.py

    This creates directories like cold_start_finetuned_model, reasoning_rl_model, and final_rl_model.

    3. Deploy:
      • Build and push via gcloud run deploy.
    4. Inference:
      • After deployment, send a POST request to your Cloud Run service:
    import requests
    
    url = "https://<YOUR-CLOUD-RUN-URL>/v1/inference"
    data = {"prompt": "Tell me about quantum physics", "max_tokens": 100}
    response = requests.post(url, json=data)
    print(response.json())

    Fine-Tuning via Endpoint:

    • Upload new data for fine-tuning:
    import requests
    
    url = "https://<YOUR-CLOUD-RUN-URL>/v1/finetune"
    with open("new_training_data.jsonl", "rb") as f:
        r = requests.post(url, files={"file": ("new_training_data.jsonl", f)})
    print(r.json())

    This tutorial has provided an end-to-end pipeline for deploying and fine-tuning the DeepSeek R1 Distill model. You’ve learned how to:

    • Deploy a FastAPI server with Docker and GPU support on Google Cloud Run.
    • Fine-tune the model in four stages: Cold Start, Reasoning RL, Data Collection, and Final RL.
    • Use TRL (PPO) for basic RL-based training loops.

    Disclaimer: Deploying uncensored models has ethical and legal implications. Make sure to comply with relevant laws, policies, and usage guidelines.

    From here, you can swap in your own datasets, reward logic, and deployment settings to adapt the pipeline to your use case.