Nirajan Paudel | ML Engineer & Researcher

Featured

Tensor: Agentic RAG Academic AI

Advanced RAG conversational AI platform engineered for engineering students, solving fragmented academic resource navigation by providing centralized access to past exam papers, syllabuses, and technical notes with intelligent question answering.

FastAPI React Pinecone Redis

The Challenge

Engineering students at the Institute of Engineering, Nepal struggle to navigate fragmented academic resources—thousands of pages of legacy PDFs, multi-page syllabuses, and complex technical notes. Traditional search tools fail when students ask year-specific or chapter-specific questions (e.g., "Show me 8-mark numericals on Sorting from 2076") because engineering documents contain both semantic meaning and rigid metadata.

The Solution

Tensor is an advanced Retrieval-Augmented Generation (RAG) conversational AI platform that provides a centralized, intelligent interface capable of answering specific technical questions, retrieving relevant past papers, and providing study guidance through a Multi-Stage Agentic RAG pipeline.

Challenge 1: Semantic Search in Highly Structured Technical Data

Problem: Traditional search tools fail with year-specific or chapter-specific queries because engineering documents contain both semantic meaning and rigid metadata.

Solution: Implemented a Hybrid Query Architecture that combines:

AI Metadata Extraction (Smart Filtering): Using GPT-4o-mini to dynamically extract filters (year, subject, chapter, marks) from natural language queries
Vector Search: Using OpenAI/Gemini Embeddings and Pinecone Serverless for high-precision semantic retrieval
Metadata Filtering: Engineered a layer that converts natural language into complex Pinecone filter objects, ensuring zero-noise results for year-specific queries
Intent Routing: An LLM-based router classifies queries into four categories: Structured (Syllabus/Marks), Semantic (Explanations), Past Question (Historical papers), or Hybrid

Challenge 2: Precision in Complex Engineering Papers

Problem: Engineering questions often involve diagrams, formulas, and specific marking schemes that basic PDF parsers fail to capture.

Solution: Developed a specialized Ingestion Pipeline:

Multimodal PDF Parsing: Used LlamaParse for high-fidelity parsing to preserve diagrams, tables, mathematical notations, and structured content
Question-Boundary Aware Chunking: Instead of arbitrary character splits, the system detects question boundaries (e.g., "1a", "2b") to ensure context remains intact and respects document hierarchy (Chapters > Sections > Paragraphs)
Two-Stage Reranking: Implemented a retrieval pool of 50 candidates, followed by a Reranker Service using Cross-Encoder Model (ms-marco-MiniLM-L-6-v2) to narrow down the top 10 most relevant chunks for the LLM
Automated Syllabus Mapping: Built a custom curriculum engine that maps 10+ engineering programs (BCT, BCE, BEL, etc.) across 8 semesters, ensuring data is tagged with precise metadata (Program, Year, Semester, Subject Code)
Markdown Cleanup Pipeline: Custom pipeline that strips non-semantic headers/footers and identifies Page Context to inject metadata into every chunk, allowing the AI to cite specific pages in responses

Challenge 3: Cost and Latency in LLM Workflows

Problem: Frequent RAG lookups and LLM calls can lead to high latency (>10s) and ballooning API costs.

Solution: Built a Multi-Layer Distributed Caching System using Async Redis (Upstash):

Result Caching: Immediate response for identical queries, delivering instant answers for frequent student queries
Intent/Metadata Caching: Caching the AI-extracted filters themselves, preventing re-classifying similar questions and reducing LLM "thought" time
Embedding Cache: Saves 100% of LLM costs for repeated query vectors
Streaming Architecture: Implemented Server-Sent Events (SSE) for real-time response streaming

Outcome: Achieved a 60%+ cache hit rate, reducing total query costs by approximately $0.001 per query and cutting P95 response times to under 3 seconds.

Engineering Challenges & Solutions

Memory Constraints on Free-Tier Hosting: Loading the Cross-Encoder model into memory required ~400MB of RAM, causing the 512MB Render server to experience Out-Of-Memory crashes. Solution: Implemented Feature Toggling and Lazy Loading with a configuration-driven switch that detects the hosting environment and toggles the heavy ML model off in favor of optimized similarity search when resources are low, maintaining 99.9% uptime
Python 3.11 Environment Stability: Modern f-string features caused syntax errors in certain server-side environments. Solution: Performed a codebase-wide refactor of the formatting logic, replacing complex f-strings with robust string concatenation and pathlib for file routing, ensuring 100% environmental portability

Key Metrics & Impact

Data Scale: Ingested and indexed 25+ comprehensive engineering past papers (Mathematics, C Programming) with dynamic metadata tagging
Performance: Optimized response latency to under 3 seconds (P95) through streaming architecture and Redis caching, with less than 200ms TTFT (Time To First Token)
Efficiency: Reduced OpenAI/Gemini API dependency by establishing a 60%+ cache hit rate for common student queries, reducing LLM API overhead by 70%
Scalability: Designed a system capable of handling 100,000+ vectors using Pinecone Serverless architecture
Precision: Achieved 85%+ query accuracy through dual-stage "Retrieve & Rerank" pipeline, with reranking layer improving Top-1 retrieval accuracy by 40% compared to standard vector search
Architecture: Managed 7+ micro-services/mixins in the backend (RAG, Reranker, Intent Router, Multi-Doc Chunker) to ensure modularity

Tech Stack

Backend: Python, FastAPI, SQLAlchemy, Pydantic

Frontend: React, Vite, Tailwind CSS, Vercel AI SDK (for streaming)

AI/ML: OpenAI (GPT-4o/mini), Gemini (Embeddings), Pinecone (Vector DB), LlamaParse, Sentence-Transformers

Infrastructure: Redis (Async/Upstash Caching), Vercel (Frontend Deployment), Railway/Local (Backend), GitHub Actions (CI/CD)

Integrations: Multi-LLM provider support (Groq, Gemini, OpenAI)

Skills Demonstrated

This project demonstrates expertise in RAG systems, vector databases, distributed caching strategies, production ML deployment, and building scalable AI architectures that solve real-world problems in academic settings.

Featured

Zenco: AI Code Analysis Platform

Open-source CLI tool leveraging LLMs to automate code documentation, refactoring, and quality enhancement across 5 languages. Reduces dev time by 20-30% and API costs by 30-50%.

Python Tree-sitter LLM APIs CI/CD

The Challenge

Software developers spend 20-30% of their time on repetitive tasks like writing docstrings, adding type hints, and identifying dead code. These essential maintenance tasks are time-consuming and expensive when using LLM APIs inefficiently on large codebases.

The Solution

Zenco provides an intelligent code analysis pipeline with a breakthrough execution priority system that processes dead code detection first, then skips dead code in subsequent analysis stages—achieving 30-50% reduction in LLM API calls and processing time.

Technical Architecture

Multi-Language Support: Implemented Tree-sitter for universal parsing across Python, JavaScript, Java, C++, and Go with language-specific documentation formatters
Modular Design: Refactored monolithic 2,057-line codebase into 4 independent processors (DeadCodeProcessor, DocstringProcessor, TypeHintProcessor, MagicNumberProcessor), reducing core module to 603 lines—a 70% reduction
Design Patterns: Applied Strategy pattern for pluggable LLM providers, Factory pattern for dynamic instantiation, Visitor pattern for AST traversal, and Adapter pattern for unified LLM interfaces
LLM Integration: Unified API interface supporting Groq, OpenAI, Anthropic, and Google Gemini with intelligent error handling and fallback strategies

DevOps & Quality Assurance

Comprehensive CI/CD: GitHub Actions pipeline with matrix testing across 3 operating systems (Ubuntu, macOS, Windows) and 4 Python versions (3.9-3.12)—12 platform combinations
Test Coverage: Achieved 95.5% test pass rate with unit and integration tests, ensuring production reliability
Package Distribution: Published on PyPI with automated versioning and release management
Developer Experience: Rich terminal UI with color-coded output, interactive configuration wizard, and comprehensive documentation

Performance & Impact

Cost Efficiency: 30-50% reduction in LLM API costs through dead code optimization
Code Quality: 95.5% test coverage with production-ready reliability
Maintainability: 70% reduction in core module complexity
Scalability: Processes codebases of any size with configurable batch processing

VS Code Extension

Developed an official Visual Studio Code extension that brings Zenco's AI-powered code analysis directly into the IDE, providing seamless integration for developers.

IDE Integration: Built a TypeScript-based VS Code extension that integrates the Zenco CLI tool directly into the editor workflow, eliminating the need to switch between terminal and IDE
Core Features: Provides all CLI functionality including file refactoring, automatic docstring generation (Google, NumPy, RST styles), type hint addition, magic number detection, and dead code removal
User Experience: Implemented a diff view for side-by-side comparison of proposed changes before applying, giving developers full control over code modifications
Automatic CLI Management: Engineered automatic installation and management of the Zenco CLI tool, with smart path resolution across Windows, macOS, and Linux platforms
Multi-Provider Support: Configured extension settings for seamless integration with Groq, OpenAI, Anthropic, and Google Gemini LLM providers
Developer Workflow: Streamlined access through status bar integration, allowing developers to trigger code analysis with a single click without leaving the editor
Cross-Platform Compatibility: Ensured robust error handling and cross-platform compatibility for CLI detection and execution

Skills Demonstrated

Software Engineering: Advanced Python, TypeScript, OOP, SOLID principles, design patterns, API integration, async programming

AI/ML: LLM API integration, prompt engineering, context-aware systems, cost optimization

DevOps: CI/CD pipelines, GitHub Actions, multi-platform testing, PyPI distribution

IDE Development: VS Code Extension API, TypeScript, cross-platform CLI integration, developer tooling

Featured

Serverless vs Kubernetes Benchmark

Production-grade infrastructure comparing AWS Lambda and EKS for DistilBERT sentiment analysis. Full IaC, CI/CD, and observability.

AWS Terraform Kubernetes FastAPI

The Challenge

Compare serverless and containerized deployments for ML inference, analyzing cold start behavior, latency, scalability, and cost trade-offs. Key technical constraint: DistilBERT model load time (~60s) exceeds API Gateway's 29-second timeout limit.

Solution Architecture

AWS Lambda (Serverless): Container-based Lambda function (3008 MB RAM) with API Gateway REST API integration and pay-per-request pricing
Kubernetes (EKS): EKS cluster with managed node groups, 2-replica deployment for high availability, and Classic Load Balancer for external access
FastAPI Backend: Request router directing traffic to Lambda or Kubernetes with exponential backoff retry for Lambda cold starts and Prometheus metrics
Streamlit Dashboard: Real-time side-by-side comparison with Locust load testing integration and AI-powered analysis using Groq API (Llama 3)

Infrastructure as Code

Terraform Modules: VPC with public/private subnets across 2 AZs, EKS cluster with managed node groups (t3.small), EC2 instances, ECR repositories, IAM roles and policies
One-Click Deployment: Automated scripts handle Terraform initialization, ECR imports, image verification, and full infrastructure provisioning
CI/CD Pipeline: GitHub Actions for automated Docker builds, AWS authentication, and ECR pushes

Key Technical Achievements

Cold Start Solution: Implemented exponential backoff retry—first request warms Lambda (loads model), automatic retry succeeds. Achieved 100% success rate despite timeout limitations.
Secure ECR Access: IAM Instance Profiles with AmazonEC2ContainerRegistryReadOnly policy—no credentials stored on disk
Async Resource Handling: Proper Terraform state management for Kubernetes LoadBalancer provisioning (2-5 min delay)
Comprehensive Testing: pytest with mocks, integration tests, Locust load testing, custom benchmarking (P50, P95), and AI-powered SRE reports

Results & Impact

Deployment Time: Reduced from manual 2+ hours to automated 15 minutes
Infrastructure Reliability: 100% reproducible deployments via Terraform
Code Quality: 90%+ test coverage with unit, integration, and load tests
Cost Analysis: Quantified cost trade-offs between serverless and containerized approaches

Skills Demonstrated

Cloud & DevOps: AWS (Lambda, EKS, EC2, ECR, API Gateway, VPC, IAM), Terraform, Kubernetes, Docker, GitHub Actions CI/CD

Backend: FastAPI, Python, Prometheus metrics, exponential backoff patterns, microservices architecture

Testing: pytest, Locust load testing, Groq API integration for automated analysis

Audio RAG Assistant

Transcribes audio with OpenAI Whisper and enables intelligent Q&A using Retrieval-Augmented Generation.

Whisper RAG Python

digitalME Personal Assistant

Personal AI chatbot for natural conversations, document retrieval, and database queries.

Python NLP RAG

PDF Chat Assistant

Upload PDFs and chat with them using Mistral-7B-Instruct for natural language document interaction.

Mistral LLM Python

Nepali Image Captioning

Transformer-based model generating paragraph-length Nepali captions with Inception V3 feature extraction.

Transformer CNN NLP

Tour Recommender - Pokhara

Content-based recommendation system for Pokhara destinations using descriptions, genres, and keywords.

Python ML NLP

Music Separation as a Service

Scalable microservices on GKE for AI-driven music source separation. Processes MP3s into 4 instrumental stems with async pipeline.

GKE Redis Flask Demucs

Overview

The system processes user-uploaded MP3s through an asynchronous pipeline, separating them into four distinct instrumental stems (vocals, drums, bass, other) for download.

Key Achievements

Infrastructure Migration: Spearheaded critical migration from self-hosted MinIO to Google Cloud Storage (GCS), boosting data durability from ephemeral state to 99.99%. Eliminated single point of failure and reduced operational overhead.
Asynchronous Processing: Engineered async processing queue using Redis, decoupling lightweight Flask REST API from resource-intensive ML worker. Reduced API response time to under 200ms while handling ML inference tasks averaging 2-5 minutes per song.
Scalability & Cost-Efficiency: Designed worker pool for horizontal scaling on GKE. Architecture capable of processing hundreds of songs per hour by dynamically adjusting worker pod replicas based on queue depth.
Resource Management: Implemented precise Kubernetes resource requests and limits (6Gi RAM for Demucs worker pods). Prevented pod evictions from memory spikes and ensured predictable performance.
Real-time Frontend: Developed dynamic frontend with JavaScript providing real-time job status updates. UI transitions from upload → processing → download links seamlessly.

Skills Demonstrated

Cloud: Google Kubernetes Engine (GKE), Google Cloud Storage, container orchestration

Backend: Flask, Redis, asynchronous processing, microservices architecture

ML/AI: Demucs model deployment, resource optimization for ML workloads

// About Me

Research Focus

// Projects

Tensor: Agentic RAG Academic AI

The Challenge

The Solution

Challenge 1: Semantic Search in Highly Structured Technical Data

Challenge 2: Precision in Complex Engineering Papers

Challenge 3: Cost and Latency in LLM Workflows

Engineering Challenges & Solutions

Key Metrics & Impact

Tech Stack

Skills Demonstrated

Zenco: AI Code Analysis Platform

The Challenge

The Solution

Technical Architecture

DevOps & Quality Assurance

Performance & Impact

VS Code Extension

Skills Demonstrated

Serverless vs Kubernetes Benchmark

The Challenge

Solution Architecture

Infrastructure as Code

Key Technical Achievements

Results & Impact

Skills Demonstrated

Audio RAG Assistant

digitalME Personal Assistant

PDF Chat Assistant

Nepali Image Captioning

Tour Recommender - Pokhara

Music Separation as a Service

Overview

Key Achievements

Skills Demonstrated

// Skills & Courses

Languages

AI & ML

Cloud & DevOps

Tools & Frameworks

Graduate Coursework

// Publications

Drowsiness and Crash Detection Mobile Application for Vehicle's Safety

Nepali Image Captioning: Generating Coherent Paragraph-Length Descriptions Using Transformer

// Blog