How Scope Analyzes a Codebase in 2 Minutes

10 min read

When you connect a GitHub repo (or sync a local codebase via scope_sync over MCP), Scope takes about two minutes to build a complete structural model of your codebase: entities, relationships, endpoints, conventions, and domain architecture. Here's how the five-layer pipeline works under the hood.

The pipeline at a glance

Scope's codebase analyzer runs five layers, each building on the previous:

  1. Tree-sitter AST extraction (free, deterministic)
  2. Code graph + PageRank (free, deterministic)
  3. Schema-driven extraction (free, deterministic)
  4. LLM semantic interpretation (AI-powered)
  5. Domain intelligence (AI-powered, optional)

The first three layers are completely deterministic — no AI costs, no variability. The last two use Claude Sonnet for semantic understanding. This hybrid approach keeps costs low while producing rich, accurate results.

Layer 1: Tree-sitter AST extraction

Tree-sitter parses source files into abstract syntax trees without executing any code. Scope extracts:

  • Classes and methods — with parameter types and return types
  • Functions — standalone and exported
  • Route definitions — framework-specific patterns (Express, Rails, FastAPI, etc.)
  • Import graphs — which files depend on which

This works across languages: TypeScript, Python, Ruby, Go, Rust, Java, and more. The output is a structured map of every symbol in the codebase.
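To make the idea concrete, here is a minimal sketch of AST-based symbol extraction. Scope uses tree-sitter across many languages; this example uses Python's standard-library `ast` module as a stand-in, since it shows the same principle: the source is parsed, never executed, and every class, function, and import is collected into a structured map.

```python
import ast

def extract_symbols(source: str) -> dict:
    """Walk a parsed AST and collect classes, functions, and imports
    without executing any code (the same idea as Scope's Layer 1)."""
    tree = ast.parse(source)
    symbols = {"classes": [], "functions": [], "imports": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            symbols["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            symbols["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            symbols["imports"].append(node.module or "")
    return symbols

sample = """
import os
from typing import List

class UserService:
    def find(self, user_id: int) -> dict: ...

def main() -> None: ...
"""
print(extract_symbols(sample))
```

A real tree-sitter pass additionally records parameter types, return types, and framework-specific route patterns, but the shape of the output is the same: a symbol map per file.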

Layer 2: Code graph + PageRank

The import graph from Layer 1 becomes a dependency graph. Scope runs PageRank on this graph to identify the most important symbols — the "hubs" of your codebase.

A file that's imported by 20 other files ranks higher than a utility imported by 2. This ranking helps Layer 4 focus LLM analysis on the code that matters most. The top 50 ranked symbols get priority in the LLM prompt.
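The ranking step can be sketched in a few lines. This is a textbook iterative PageRank over a hypothetical import graph (not Scope's internal implementation): rank flows from importer to imported, so a file that many others depend on accumulates a high score.

```python
def pagerank(graph: dict[str, list[str]],
             damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Iterative PageRank. `graph` maps each file to the files it
    imports; rank flows from importer to imported, so widely-imported
    files end up with the highest scores."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for src in nodes:
            targets = graph.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: spread its rank evenly across all nodes
                for v in nodes:
                    new[v] += damping * rank[src] / n
        rank = new
    return rank

# Illustrative import graph: two entry points funnel into a service,
# which depends on the models module.
imports = {
    "routes.py": ["services.py"],
    "jobs.py": ["services.py"],
    "services.py": ["models.py"],
    "models.py": [],
}
ranks = pagerank(imports)
top = sorted(ranks, key=ranks.get, reverse=True)
print(top)
```

Sorting by score yields the hub ordering; in Scope's pipeline, the top 50 of this list is what gets prioritized in the Layer 4 prompt.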

Layer 3: Schema-driven extraction

This is the ground truth layer. Scope has dedicated parsers for:

  • Rails schema.rb — tables, columns, types, indices, foreign keys
  • Prisma schemas — models, fields, relations, enums
  • GraphQL schemas — types, queries, mutations, subscriptions
  • SQL migrations — DDL statements for any framework

Schema-extracted entities are treated as ground truth. When the LLM in Layer 4 produces its interpretation, a post-merge step guarantees that every schema-derived entity and field is present in the final output. The LLM can add behavioral context, but it cannot override schema facts.
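As a rough illustration of what a schema parser does, here is a deliberately minimal sketch that pulls model names and (field, type) pairs out of a Prisma-style schema with a regex. Scope's dedicated parsers are far more thorough (relations, enums, attributes, and the other schema formats listed above), but the output shape is the same: entities and fields that become ground truth.

```python
import re

def parse_prisma_models(schema: str) -> dict[str, list[tuple[str, str]]]:
    """Extract model names and (field, type) pairs from a Prisma-style
    schema. A minimal sketch: a real parser also handles relations,
    enums, and field attributes."""
    models = {}
    for match in re.finditer(r"model\s+(\w+)\s*\{([^}]*)\}", schema):
        name, body = match.group(1), match.group(2)
        fields = []
        for line in body.strip().splitlines():
            parts = line.split()
            if len(parts) >= 2 and not parts[0].startswith("@"):
                fields.append((parts[0], parts[1]))
        models[name] = fields
    return models

schema = """
model User {
  id    Int     @id @default(autoincrement())
  email String  @unique
  posts Post[]
}

model Post {
  id       Int    @id
  title    String
  authorId Int
}
"""
print(parse_prisma_models(schema))
```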

Layer 4: LLM semantic interpretation

Layers 1–3 tell us what exists. Layer 4 tells us what it means. Claude Sonnet receives:

  • The tree-sitter extraction (classes, functions, routes)
  • The top-50 PageRank symbols
  • The schema entities and associations
  • File contents for the most important files

The LLM produces:

  • Business logic descriptions — what each entity does in domain terms
  • User flows — how users interact with the system
  • Tech stack summary — framework choices with reasoning
  • Conventions — naming patterns, file organization, API styles

After LLM output, merge_schema_entities() runs to ensure all schema-derived entities are preserved. The LLM enriches — it never overrides ground truth.
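The guarantee behind that merge can be sketched as a union where schema facts always win. The function below is a simplified illustration of the idea, not Scope's actual merge_schema_entities() (the entity shapes here are hypothetical): the LLM may add descriptions and new entities, but every schema-derived field survives with its schema type.

```python
def merge_schema_entities(schema_entities: dict, llm_entities: dict) -> dict:
    """Schema output is ground truth: every schema entity and field
    survives the merge. The LLM may add descriptions or extra entities,
    but cannot remove or retype schema-derived fields.
    (A sketch of the idea; the real merge is more involved.)"""
    merged = {name: dict(entity) for name, entity in llm_entities.items()}
    for name, entity in schema_entities.items():
        base = merged.setdefault(name, {"fields": {}, "description": ""})
        base.setdefault("fields", {})
        # Schema fields are always present and win on type conflicts.
        base["fields"] = {**base["fields"], **entity["fields"]}
    return merged

schema = {"User": {"fields": {"id": "Int", "email": "String"}}}
llm = {
    "User": {"fields": {"email": "Text"},  # wrong type: schema overrides
             "description": "An account holder"},
    "Session": {"fields": {"token": "String"},
                "description": "A login session"},
}
out = merge_schema_entities(schema, llm)
print(out["User"])
```

Here the LLM's enrichment (the description, the extra Session entity) is kept, while its mistaken `email: Text` is corrected back to the schema's `String` and the schema-only `id` field is restored.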

Layer 5: Domain intelligence

When a codebase has 3 or more entities, Layer 5 runs an additional LLM pass to produce:

  • Domain groupings — which entities belong to which bounded context
  • Architectural patterns — repository pattern, service layer, event sourcing, etc.
  • Key files — the files that matter most for understanding the codebase

This layer is optional and only runs when there's enough complexity to warrant it.

File fetching: what gets analyzed

Scope doesn't download your entire repo. It uses a priority-based file selection system:

  • Priority 0: Manifests — package.json, Cargo.toml, schema.prisma
  • Priority 1: Schema files — schema.rb, migrations
  • Priority 2: Model files
  • Priority 3: API and business logic — routes, controllers, services
  • Priority 4: Frontend — pages, components
  • Priority 5–7: Config, docs, everything else

Maximum 500 files, 100KB per file, 1.5MB total. Files are fetched in parallel (20 concurrent) and processed in memory — no code is persisted to Scope's database.
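A selection pass like this can be sketched as a priority lookup followed by a budgeted walk. The glob patterns below are illustrative stand-ins for the tiers above, not Scope's actual rules; the caps match the stated limits.

```python
from fnmatch import fnmatch

# Hypothetical glob -> priority table mirroring the tiers above.
PRIORITIES = [
    (0, ["package.json", "Cargo.toml", "schema.prisma"]),
    (1, ["db/schema.rb", "*/migrations/*"]),
    (2, ["*/models/*"]),
    (3, ["*/routes/*", "*/controllers/*", "*/services/*"]),
    (4, ["*/pages/*", "*/components/*"]),
]

def priority_of(path: str) -> int:
    for prio, patterns in PRIORITIES:
        if any(fnmatch(path, p) for p in patterns):
            return prio
    return 5  # config, docs, everything else

def select_files(files: list[tuple[str, int]],
                 max_files: int = 500,
                 max_file_bytes: int = 100_000,
                 max_total_bytes: int = 1_500_000) -> list[str]:
    """Pick (path, size) entries in priority order under the hard caps."""
    selected, total = [], 0
    for path, size in sorted(files, key=lambda f: priority_of(f[0])):
        if len(selected) >= max_files:
            break
        if size > max_file_bytes or total + size > max_total_bytes:
            continue  # oversized or over budget: skip this file
        selected.append(path)
        total += size
    return selected

candidates = [
    ("README.md", 1_000),
    ("app/models/user.rb", 2_000),
    ("package.json", 500),
    ("assets/bundle.js", 200_000),  # exceeds the per-file cap
]
print(select_files(candidates))
```

Manifests land first, models next, and anything over the per-file cap is dropped regardless of priority. The parallel fetching and in-memory processing happen downstream of this selection.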

The output

After ~2 minutes, you get a structured model of your codebase:

  • Every entity with fields, types, and relationships
  • Every endpoint with methods, paths, and connected entities
  • User flows with step-by-step breakdowns
  • Tech stack with framework details
  • Naming conventions, file organization, and patterns

This model is stored as vectors in Qdrant and served via MCP to any AI coding tool. When Claude Code calls get_context(scope: "entities"), it gets this pre-analyzed output — not raw file contents.

Try it

Connect a GitHub repo or sync your local codebase via MCP at within-scope.com and see what Scope finds. The analysis runs once, and the structured context is available to every AI tool in your workflow.