Tensor: Agentic RAG Academic AI
Advanced RAG conversational AI platform engineered for engineering students, solving fragmented academic resource navigation by providing centralized access to past exam papers, syllabuses, and technical notes with intelligent question answering.
The Challenge
Engineering students at the Institute of Engineering, Nepal struggle to navigate fragmented academic resources—thousands of pages of legacy PDFs, multi-page syllabuses, and complex technical notes. Traditional search tools fail when students ask year-specific or chapter-specific questions (e.g., "Show me 8-mark numericals on Sorting from 2076") because engineering documents contain both semantic meaning and rigid metadata.
The Solution
Tensor is an advanced Retrieval-Augmented Generation (RAG) conversational AI platform that provides a centralized, intelligent interface capable of answering specific technical questions, retrieving relevant past papers, and providing study guidance through a Multi-Stage Agentic RAG pipeline.
Challenge 1: Semantic Search in Highly Structured Technical Data
Problem: Traditional search tools fail with year-specific or chapter-specific queries because engineering documents contain both semantic meaning and rigid metadata.
Solution: Implemented a Hybrid Query Architecture that combines:
- AI Metadata Extraction (Smart Filtering): Using GPT-4o-mini to dynamically extract filters (year, subject, chapter, marks) from natural language queries
- Vector Search: Using OpenAI/Gemini Embeddings and Pinecone Serverless for high-precision semantic retrieval
- Metadata Filtering: Engineered a layer that converts natural language into complex Pinecone filter objects, ensuring zero-noise results for year-specific queries
- Intent Routing: An LLM-based router classifies queries into four categories: Structured (Syllabus/Marks), Semantic (Explanations), Past Question (Historical papers), or Hybrid
Challenge 2: Precision in Complex Engineering Papers
Problem: Engineering questions often involve diagrams, formulas, and specific marking schemes that basic PDF parsers fail to capture.
Solution: Developed a specialized Ingestion Pipeline:
- Multimodal PDF Parsing: Used LlamaParse for high-fidelity parsing to preserve diagrams, tables, mathematical notations, and structured content
- Question-Boundary Aware Chunking: Instead of arbitrary character splits, the system detects question boundaries (e.g., "1a", "2b") to ensure context remains intact and respects document hierarchy (Chapters > Sections > Paragraphs)
- Two-Stage Reranking: Implemented a retrieval pool of 50 candidates, followed by a Reranker Service using Cross-Encoder Model (ms-marco-MiniLM-L-6-v2) to narrow down the top 10 most relevant chunks for the LLM
- Automated Syllabus Mapping: Built a custom curriculum engine that maps 10+ engineering programs (BCT, BCE, BEL, etc.) across 8 semesters, ensuring data is tagged with precise metadata (Program, Year, Semester, Subject Code)
- Markdown Cleanup Pipeline: Custom pipeline that strips non-semantic headers/footers and identifies Page Context to inject metadata into every chunk, allowing the AI to cite specific pages in responses
Challenge 3: Cost and Latency in LLM Workflows
Problem: Frequent RAG lookups and LLM calls can lead to high latency (>10s) and ballooning API costs.
Solution: Built a Multi-Layer Distributed Caching System using Async Redis (Upstash):
- Result Caching: Immediate response for identical queries, delivering instant answers for frequent student queries
- Intent/Metadata Caching: Caching the AI-extracted filters themselves, preventing re-classifying similar questions and reducing LLM "thought" time
- Embedding Cache: Saves 100% of LLM costs for repeated query vectors
- Streaming Architecture: Implemented Server-Sent Events (SSE) for real-time response streaming
Outcome: Achieved a 60%+ cache hit rate, reducing total query costs by approximately $0.001 per query and cutting P95 response times to under 3 seconds.
Engineering Challenges & Solutions
- Memory Constraints on Free-Tier Hosting: Loading the Cross-Encoder model into memory required ~400MB of RAM, causing the 512MB Render server to experience Out-Of-Memory crashes. Solution: Implemented Feature Toggling and Lazy Loading with a configuration-driven switch that detects the hosting environment and toggles the heavy ML model off in favor of optimized similarity search when resources are low, maintaining 99.9% uptime
- Python 3.11 Environment Stability: Modern f-string features caused syntax errors in certain server-side environments. Solution: Performed a codebase-wide refactor of the formatting logic, replacing complex f-strings with robust string concatenation and pathlib for file routing, ensuring 100% environmental portability
Key Metrics & Impact
- Data Scale: Ingested and indexed 25+ comprehensive engineering past papers (Mathematics, C Programming) with dynamic metadata tagging
- Performance: Optimized response latency to under 3 seconds (P95) through streaming architecture and Redis caching, with less than 200ms TTFT (Time To First Token)
- Efficiency: Reduced OpenAI/Gemini API dependency by establishing a 60%+ cache hit rate for common student queries, reducing LLM API overhead by 70%
- Scalability: Designed a system capable of handling 100,000+ vectors using Pinecone Serverless architecture
- Precision: Achieved 85%+ query accuracy through dual-stage "Retrieve & Rerank" pipeline, with reranking layer improving Top-1 retrieval accuracy by 40% compared to standard vector search
- Architecture: Managed 7+ micro-services/mixins in the backend (RAG, Reranker, Intent Router, Multi-Doc Chunker) to ensure modularity
Tech Stack
Backend: Python, FastAPI, SQLAlchemy, Pydantic
Frontend: React, Vite, Tailwind CSS, Vercel AI SDK (for streaming)
AI/ML: OpenAI (GPT-4o/mini), Gemini (Embeddings), Pinecone (Vector DB), LlamaParse, Sentence-Transformers
Infrastructure: Redis (Async/Upstash Caching), Vercel (Frontend Deployment), Railway/Local (Backend), GitHub Actions (CI/CD)
Integrations: Multi-LLM provider support (Groq, Gemini, OpenAI)
Skills Demonstrated
This project demonstrates expertise in RAG systems, vector databases, distributed caching strategies, production ML deployment, and building scalable AI architectures that solve real-world problems in academic settings.