Autodesk Agentic RAG Chatbot

Agentic RAG Hybrid Retrieval Chroma BM25 Weighted RRF Cross-Encoder Reranking LangSmith Streamlit Hugging Face Spaces Docker Supabase Monitoring

This project is an agentic retrieval-augmented generation application for answering Autodesk product and workflow questions with grounded local evidence, optional web verification, source citations, and conservative no-answer behavior when evidence is insufficient.

The system moves beyond a basic chatbot by combining corpus cleaning, dense and lexical retrieval, reciprocal rank fusion, source caps, same-document context expansion, cross-encoder reranking, adequacy checks, prompt-injection hardening, LangSmith evaluation, monitoring hooks, and a public Docker deployment on Hugging Face Spaces.

Background & Problem Statement

Autodesk product information spans product pages, support content, feature descriptions, system requirements, licensing pages, industry workflows, and comparison pages. A user may ask a natural-language question about Fusion, AutoCAD, Revit, Inventor, Maya, or related workflows, but the answer often requires retrieving the right product-specific evidence and avoiding unsupported generalizations.

Problem Statement: Can a hosted RAG assistant answer Autodesk product and workflow questions with concise, cited responses while combining local corpus evidence, official Autodesk.com verification, open-web corroboration when selected, and strict safeguards against unsupported or prompt-injected answers?

Streamlit App Interface

The Streamlit interface is organized around four review-friendly areas: Ask, Settings & Eval, Monitoring, and About the App. Users can submit Autodesk questions, switch search modes, inspect cached evaluation metrics, and review runtime monitoring summaries when Supabase is configured.

Live Demo Walkthrough

The demo video shows the hosted application flow, including asking Autodesk product questions, selecting evidence policies, and reviewing how the assistant presents sourced answers.

Corpus Cleaning & Indexing

The data pipeline converts raw Autodesk HTML pages into RAG-ready Markdown and retains metadata needed for source-grounded retrieval. The cleaning stage removes boilerplate, drops very small or non-English pages after enrichment, and writes per-file diagnostics for review.

  • Raw HTML files found: 1,218 Autodesk pages.
  • Retained cleaned documents: 974 Markdown files after purge rules.
  • Raw-to-cleaned size reduction: 303.27 MB of HTML to 4.37 MB of Markdown.
  • Character reduction: 317,446,072 raw characters to 3,574,578 cleaned characters, approximately 98.87% smaller.
  • Indexed chunks: 19,611 chunks stored in both Chroma and BM25 artifacts.
  • Embedding model: OpenAI text-embedding-3-small with 1,536-dimensional vectors.

Agentic RAG Architecture

The runtime agent treats the user question, retrieved local excerpts, and web snippets as untrusted inputs. It applies deterministic safety checks, routes the query, retrieves local and optional web evidence, reranks the combined candidates, checks answerability, and only then generates a concise answer with citations.

  • Security screen: blocks obvious jailbreak, prompt-reveal, secret-exfiltration, and unsafe cyber requests before the router LLM.
  • Query routing: identifies Autodesk relevance, current/latest/pricing needs, and compare/contrast questions.
  • Hybrid retrieval: searches Chroma vectors and BM25 keyword indexes in parallel.
  • Rank fusion: merges dense and lexical results with weighted reciprocal rank fusion and per-source caps.
  • Compare-aware retrieval: extracts product entities, retrieves focused subqueries, deduplicates, and balances evidence across compared products.
  • Context expansion: adds neighboring chunks from the same source document to reduce chunk-boundary failures.
  • Reranking: uses cross-encoder/ms-marco-MiniLM-L6-v2 to rerank local and web evidence together.
  • Adequacy gate: returns a conservative no-answer response when evidence does not explicitly support an answer.

Search Modes & Retrieval Engineering

The app exposes three search policies so reviewers can compare a controlled local-corpus baseline against modes that add official Autodesk.com evidence or broader open-web corroboration.

Runtime Search Modes

Option Search Mode Behavior Best Use
1 Local Document Search Uses only local Autodesk corpus chunks from Chroma and BM25. Fast controlled baseline.
2 Local Document Search + Autodesk.com Combines local retrieval with official Autodesk.com web evidence. Preferred demo mode for current official Autodesk facts.
3 Local Document Search + Open Web Search Combines local retrieval with capped open-web evidence. Broader corroboration when official-only search may miss context.

Evaluation & Monitoring

The project uses a fixed 50-question golden dataset and LangSmith experiments to score faithfulness, answer relevance, context precision, context recall, and latency for each search mode. The Streamlit evaluation tab can create or reuse the LangSmith dataset, run selected modes, and cache the resulting metrics for display.

Evaluation Results

All three search modes were evaluated against the same 50-question test set. Option 2 is the strongest overall portfolio demo mode because it improves answer relevance and context recall with official Autodesk.com evidence while keeping source authority more controlled than open-web search.

LangSmith Evaluation Summary

Option Search Mode Faithfulness Answer Relevance Context Precision Context Recall Avg Latency P50 P99
1 Local Document Search 0.91 0.66 0.72 0.58 6.31s 6.55s 15.11s
2 Local + Autodesk.com 0.87 0.85 0.81 0.74 12.99s 12.43s 22.87s
3 Local + Open Web Search 0.84 0.78 0.86 0.62 14.11s 12.34s 73.17s

Key Insights

  • Official web evidence improved answer quality: Autodesk.com search raised answer relevance and context recall over local-only retrieval.
  • Local-only retrieval remained the fastest and most faithful: Option 1 had the lowest latency and highest faithfulness score.
  • Open-web search added context but increased uncertainty: Option 3 had the best context precision, but weaker authority control and a much higher P99 latency.
  • Corpus cleaning was essential: aggressive HTML cleanup reduced raw text volume by approximately 98.87% while preserving product-specific Markdown evidence.
  • RAG safety needs multiple layers: the app combines deterministic security screening, sanitized retrieval queries, untrusted-context handling, and an adequacy gate.

Applied AI Engineering Value

This project demonstrates how to build a public, reviewable RAG application for a realistic product-support domain where answers need to be concise, grounded, current when needed, and inspectable by technical reviewers.

  • Production-minded retrieval: combines local vectors, lexical search, rank fusion, source caps, context expansion, and reranking.
  • Operational deployment: packages the app with Docker for Hugging Face Spaces and keeps raw local artifacts out of the hosted build context.
  • Evaluation discipline: uses a fixed golden dataset and LangSmith metrics instead of relying on ad hoc demo questions.
  • Monitoring path: supports Supabase-backed logging of latency, source counts, web fallback behavior, answerability outcome, and model metadata.

Project Presentation

The presentation below summarizes the project motivation, architecture, retrieval strategy, evaluation, and deployment path.

GitHub Repository & Live Demo

The full implementation and hosted Hugging Face Space are available through the links below.

View Project Repository on GitHub

Open Live Demo on Hugging Face Spaces

The repository includes the Streamlit app, core agent workflow, retrieval modules, corpus cleaning scripts, index-building notebooks, LangSmith evaluation workflow, monitoring hooks, Docker deployment files, and documentation for reproducing the pipeline.

Disclaimer: this is a portfolio demonstration and educational project. Autodesk product details, pricing, licensing, and availability may change, so users should verify official requirements directly with Autodesk.