Case study

FalconGraph Search

Finding the right BGSU resource means digging through dozens of disconnected pages and PDFs. FalconGraph turns that sprawl into one graph you can ask questions of.

Role: Full-stack (crawler, RAG backend, and UI)
When: 2025 · BGSU Hackathon
The problem

University resources live scattered across hundreds of pages, PDFs, and DOCX files with no unifying index. A keyword search returns links; it doesn't answer the question.

The goal was an answer engine: crawl those resources, model how they link together, and let the LLM respond with grounded citations instead of hallucinations, so every answer points back to a real source.

How it works

C++ crawler

A multi-threaded crawler using OpenMP reads a central pipeline.json (seed URLs, thread count, output paths) and deposits raw HTML/PDFs, using thread-safe shared queues to avoid redundant fetches across cores.
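The deduplication idea behind the shared queue can be sketched in a few lines. The real crawler is C++ with OpenMP; this is a Python illustration only, and the config keys (seed_urls, threads, output_dir), the Frontier class, and the example.edu URLs are all assumptions, since the write-up does not show the actual pipeline.json schema.

```python
import json
import threading
from collections import deque


class Frontier:
    """Thread-safe URL queue that hands out each URL at most once."""

    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._seen = set()
        self._queue = deque()
        for url in seeds:
            self.push(url)

    def push(self, url):
        """Enqueue url unless it was already seen; return True if enqueued."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._queue.append(url)
            return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None


# Hypothetical config shape; the real pipeline.json keys are not shown here.
config = json.loads("""
{
  "seed_urls": ["https://example.edu/"],
  "threads": 4,
  "output_dir": "data/raw"
}
""")

frontier = Frontier(config["seed_urls"])
frontier.push("https://example.edu/")            # duplicate, ignored
frontier.push("https://example.edu/admissions")  # new, enqueued
```

Marking a URL as seen at enqueue time, rather than at fetch time, is what keeps two workers from fetching the same page when both discover the same link.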

Clean & graph

A Python pass extracts text from HTML/PDF/DOCX into checkpointed clean_nodes.json and clean_edges.json, then reconstructs a directed link graph of how resources reference one another.
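As a rough sketch of the graph step, the snippet below rebuilds a directed link graph from node and edge records. The field names (id, url, text, source, target) and the sample records are assumptions made for illustration; the actual layout of clean_nodes.json and clean_edges.json may differ.

```python
from collections import defaultdict

# Hypothetical records in the spirit of clean_nodes.json / clean_edges.json.
nodes = [
    {"id": "n1", "url": "https://example.edu/aid", "text": "Financial aid overview"},
    {"id": "n2", "url": "https://example.edu/aid/forms.pdf", "text": "Aid forms"},
    {"id": "n3", "url": "https://example.edu/registrar", "text": "Registrar"},
]
edges = [
    {"source": "n1", "target": "n2"},
    {"source": "n3", "target": "n1"},
    {"source": "n1", "target": "n9"},  # points at a page that was never cleaned
]


def build_graph(nodes, edges):
    """Return (outgoing, incoming) adjacency maps, skipping dangling edges."""
    known = {node["id"] for node in nodes}
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for edge in edges:
        src, dst = edge["source"], edge["target"]
        if src in known and dst in known:
            outgoing[src].add(dst)
            incoming[dst].add(src)
    return outgoing, incoming


outgoing, incoming = build_graph(nodes, edges)
```

Keeping both directions makes it cheap to show what a page links to and what links to it, which is the kind of context a graph preview needs.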

Embed & index

Node content is chunked and embedded (all-MiniLM-L6-v2 / text-embedding-3-small) into a FAISS index, so retrieval can rank by semantic similarity rather than keyword overlap.
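To make the chunk-then-rank idea concrete, here is a minimal sketch under loud assumptions: brute-force cosine search stands in for the FAISS index, the vectors are hand-made toys rather than model embeddings, and the chunk size and overlap are illustrative values, not the project's settings.

```python
import math


def chunk_words(text, size=200, overlap=40):
    """Split text into word windows of `size` that overlap by `overlap` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones would come from the embedding model.
chunk_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
best = top_k([1.0, 0.1, 0.0], chunk_vecs, k=2)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.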

RAG answer

A FastAPI /search endpoint retrieves and ranks the top snippets via FAISS, then has the LLM generate a grounded summary with inline citations back to source pages.
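One way the grounding step could look, independent of the web framework: number the retrieved snippets in the prompt, ask for [n] markers, and map the markers in the answer back to URLs. The function names, prompt wording, and sample data below are assumptions for illustration, not the project's actual prompt or endpoint code.

```python
import re


def build_grounded_prompt(question, snippets):
    """Build a prompt that restricts the model to numbered sources."""
    lines = [
        "Answer using only the numbered sources below.",
        "Cite sources inline as [n]. If the sources do not cover the question, say so.",
        "",
        "Sources:",
    ]
    for n, snippet in enumerate(snippets, start=1):
        lines.append(f"[{n}] {snippet['url']}")
        lines.append(snippet["text"])
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def cited_sources(answer, snippets):
    """Return URLs for the [n] markers in the answer, in first-use order.

    Markers that do not match a retrieved snippet are dropped, so the
    source list never contains a link the retriever did not supply.
    """
    urls = []
    for match in re.findall(r"\[(\d+)\]", answer):
        index = int(match) - 1
        if 0 <= index < len(snippets) and snippets[index]["url"] not in urls:
            urls.append(snippets[index]["url"])
    return urls


# Made-up snippets and a made-up model answer, for illustration only.
snippets = [
    {"url": "https://example.edu/aid", "text": "Overview of financial aid."},
    {"url": "https://example.edu/aid/forms.pdf", "text": "List of aid forms."},
]
prompt = build_grounded_prompt("Where are the aid forms?", snippets)
sources = cited_sources("The forms are listed in [2]; see also [1] and [7].", snippets)
```

Validating markers after generation is a cheap guard: a citation the model invents, like [7] above, never reaches the source list.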

Next.js UI

The frontend renders markdown answers, the list of cited sources, and a graph preview of the surrounding link context, so the answer is auditable, not a black box.

Key decisions

C++ for the crawler, Python for everything else

The crawl is the throughput bottleneck, so it got a multi-threaded C++ binary with OpenMP. The cleaning, graph, and RAG layers prioritize iteration speed, so they stayed in Python. Each language does the job it's best at.

Checkpointed cleaning

Crawls and parsing get interrupted. Writing clean_nodes and clean_edges with checkpoints means a failed run resumes where it stopped instead of starting over, which matters under a hackathon clock.
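A minimal sketch of resumable cleaning, assuming the checkpoint is a single JSON object keyed by document id. The helper names and the checkpoint layout are hypothetical, and str.strip stands in for the real text extraction.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return already-cleaned documents, or an empty dict on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path, done):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(done, f)
    os.replace(tmp, path)


def clean_all(raw_docs, path, clean=str.strip):
    """Clean every document not yet in the checkpoint; return (done, newly_processed)."""
    done = load_checkpoint(path)
    processed = 0
    for doc_id, raw in raw_docs.items():
        if doc_id in done:
            continue  # finished in an earlier run
        done[doc_id] = clean(raw)
        save_checkpoint(path, done)
        processed += 1
    return done, processed


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "clean_nodes.json")
raw = {"a": "  Alpha ", "b": " Beta"}

# First call sees only one document, simulating a run that was cut short.
_, first = clean_all({"a": raw["a"]}, ckpt)
# Second call sees everything but only processes what is missing.
done, second = clean_all(raw, ckpt)
```

Saving after every document is the simplest form of this; a larger corpus would likely save every N documents to avoid rewriting the whole file each time.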

Citations as a hard requirement

Every generated summary is grounded in retrieved snippets and links back to its source. The graph view exists so a reader can trust the answer by tracing it, not just reading it.

Outcome
End-to-end pipeline from raw crawl to cited answer running on a hackathon timeline.
Three languages (C++ / Python / TypeScript) integrated through a single config-driven pipeline.
Answers stay auditable: every response carries its sources and graph context.

Built with

C++ (OpenMP) · Python · FastAPI · OpenAI API · FAISS · all-MiniLM-L6-v2 · Next.js · Tailwind CSS · DaisyUI