Inside the Permitlify AI Classifier: From Messy City Data to 22 Clean Categories
Building permits are public record. They are also one of the messiest public datasets in the United States. There are over 20,000 independent permit-issuing jurisdictions, each with its own field names, its own coding system, and its own conventions for describing the same physical work. A roof replacement in Fort Worth might be filed as "ROOF REPL". The same job in San Diego comes through as "Re-roof - composition shingle, full tear-off". In a small Iowa county it shows up as "RES-RR-22".
If you are a roofing contractor, all three of those rows describe a job you would happily quote on. But no naive keyword filter would catch all three. That is the problem the Permitlify AI classifier exists to solve.
Why we picked 22 categories
We benchmarked our taxonomy against the work types that contractors actually price differently. Twenty-two turned out to be the sweet spot — granular enough that a roofer never gets HVAC permits in their feed, broad enough that you do not need to memorize a hundred sub-types to use the product.
The full list: roofing, HVAC, plumbing, electrical, solar, ADU, kitchen remodel, bathroom remodel, addition, new construction (single-family), new construction (multi-family), commercial new, demolition, pool, fence, deck/patio, foundation, garage, siding, windows/doors, generator, and "other". A new permit gets exactly one primary category and may get up to two secondary tags.
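To make the "one primary, up to two secondary" rule concrete, here is a minimal sketch of how a classification record could be modeled. The identifier spellings, the `Classification` class, and the assumption that secondary tags come from the same 22 categories are illustrative, not a description of Permitlify's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

# The 22 categories listed above; the identifier spellings are illustrative.
CATEGORIES = (
    "roofing", "hvac", "plumbing", "electrical", "solar", "adu",
    "kitchen_remodel", "bathroom_remodel", "addition",
    "new_construction_single_family", "new_construction_multi_family",
    "commercial_new", "demolition", "pool", "fence", "deck_patio",
    "foundation", "garage", "siding", "windows_doors", "generator", "other",
)

MAX_SECONDARY_TAGS = 2


@dataclass(frozen=True)
class Classification:
    """One primary category plus at most two secondary tags."""
    primary: str
    secondary: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Assumption: secondary tags are drawn from the same category list.
        for name in (self.primary, *self.secondary):
            if name not in CATEGORIES:
                raise ValueError(f"unknown category: {name!r}")
        if len(self.secondary) > MAX_SECONDARY_TAGS:
            raise ValueError("a permit may carry at most two secondary tags")


# A multi-trade permit: a re-roof that also adds solar panels.
example = Classification(primary="roofing", secondary=("solar",))
```

Rejecting a third secondary tag at construction time keeps the rule in one place instead of scattering checks through the pipeline.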
Three layers, in order
The classifier is a cascade of three layers, ordered from cheapest to most expensive. Each permit stops at the first layer that can label it. Most permits are resolved by the first or second layer; the third is the safety net.
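The control flow of a cascade like this can be sketched in a few lines. This is a simplified illustration under our own assumptions: the `classify` function, the `Layer` type, and the stub layers are hypothetical names, and each layer is modeled as a function that returns a category or `None` when it cannot decide.

```python
from typing import Callable, Optional, Sequence, Tuple

# A layer takes a permit description and returns a category, or None to pass.
Layer = Callable[[str], Optional[str]]


def classify(description: str, layers: Sequence[Layer],
             fallback: str = "other") -> Tuple[str, int]:
    """Try each layer in order (cheapest first) and stop at the first answer.

    Returns (category, layer_number). layer_number is 0 when no layer
    answered and the fallback category was used.
    """
    for number, layer in enumerate(layers, start=1):
        category = layer(description)
        if category is not None:
            return category, number
    return fallback, 0


# Stub layers, for illustration only.
def stub_pattern_layer(description: str) -> Optional[str]:
    return "roofing" if "ROOF" in description.upper() else None


def stub_similarity_layer(description: str) -> Optional[str]:
    return None  # pretend nothing similar has been seen before


def stub_llm_layer(description: str) -> Optional[str]:
    return "other"  # the last layer always answers


LAYERS = [stub_pattern_layer, stub_similarity_layer, stub_llm_layer]
```

Returning the layer number alongside the category makes it easy to track what share of permits each layer resolves.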
Layer 1: Direct pattern match. Roughly 55% of incoming permits match one of about 1,400 hand-curated regex patterns we have built up over the years. "ROOF REPL", "RE-ROOF", "REROOFCOMP", and "RR-2" all map to roofing in microseconds, with no LLM call. This is the fastest and cheapest path, and we lean on it heavily.
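A pattern layer of this kind might look like the sketch below. The two patterns are made up for illustration and are not taken from the real library of about 1,400; `PATTERNS` and `match_pattern` are hypothetical names.

```python
import re
from typing import Optional

# Hypothetical patterns, checked in order. Each maps a compiled regex to a
# category. The real pattern library is far larger and hand-curated.
PATTERNS = [
    (re.compile(r"\bRE-?ROOF|\bROOF\s*REPL|\bRR-\d+", re.IGNORECASE), "roofing"),
    (re.compile(r"\bHVAC\b", re.IGNORECASE), "hvac"),
]


def match_pattern(description: str) -> Optional[str]:
    """Return the category of the first matching pattern, or None."""
    for pattern, category in PATTERNS:
        if pattern.search(description):
            return category
    return None


for text in ["ROOF REPL", "RE-ROOF", "REROOFCOMP", "RR-2", "interior modification"]:
    print(f"{text!r} -> {match_pattern(text)}")
```

Compiling the patterns once at import time, rather than per permit, is what keeps this path cheap.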
Layer 2: Semantic similarity search. About 35% of permits do not match any pattern but are close in meaning to permits we have classified before. We embed the permit description with a small text-embedding model and search a Faiss index of 600,000 previously classified permits. If the nearest neighbor is similar enough to clear a confidence threshold, the new permit inherits that label; otherwise it falls through to layer 3. This costs roughly 1/40th of a full LLM call and runs in under 5 milliseconds per permit.
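The label-inheritance idea can be shown with a toy version. This sketch swaps in simpler pieces so it runs with the standard library alone: token counts stand in for a real embedding model, a brute-force scan stands in for the Faiss index, and cosine similarity with a threshold of 0.5 is an assumed scoring rule. The example permits and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for a text-embedding model: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical previously classified permits (a real index would hold many more).
LABELED = [
    ("replace composition shingle roof full tear-off", "roofing"),
    ("install tankless water heater", "plumbing"),
]
INDEX = [(embed(text), label) for text, label in LABELED]


def nearest_label(description: str, threshold: float = 0.5) -> Optional[str]:
    """Inherit the nearest neighbor's label only if it is similar enough."""
    query = embed(description)
    score, label = max(
        ((cosine(query, vector), label) for vector, label in INDEX),
        key=lambda pair: pair[0],
    )
    return label if score >= threshold else None
```

The threshold is the important design choice: set it too low and the layer inherits wrong labels, set it too high and too many permits fall through to the expensive layer.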
Layer 3: LLM with structured output. The remaining ~10% (usually weird, ambiguous, or first-of-their-kind descriptions) go to an LLM with a very tight prompt: "Here is the permit description. Pick exactly one of these 22 categories. If you cannot, return OTHER." We constrain the output with a JSON schema, so the model cannot return anything outside the category list. Whatever it returns gets cached, so future permits with the same description skip layer 3 entirely.
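Here is a rough sketch of the two mechanics in this layer: an enum-constrained JSON schema and a description-keyed cache. The LLM call itself is left as a caller-supplied function (`call_llm` is a hypothetical placeholder, not a real client API), and the shape of the schema is an assumption about what such a constraint could look like.

```python
import json
from typing import Callable, Dict, Sequence


def category_schema(categories: Sequence[str]) -> dict:
    """JSON Schema that only admits {"category": <one of categories>}."""
    return {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": list(categories)},
        },
        "required": ["category"],
        "additionalProperties": False,
    }


def classify_with_llm(
    description: str,
    categories: Sequence[str],
    call_llm: Callable[[str, dict], str],  # hypothetical: (prompt, schema) -> JSON text
    cache: Dict[str, str],
) -> str:
    """Ask the LLM for one category, caching the answer per description."""
    if description in cache:
        return cache[description]  # same description seen before: no LLM call

    prompt = (
        f"Here is the permit description: {description}\n"
        f"Pick exactly one of these {len(categories)} categories. "
        "If you cannot, return other."
    )
    raw = call_llm(prompt, category_schema(categories))

    # Defensive check: re-validate locally even though the schema should
    # already have constrained the model's output.
    try:
        category = json.loads(raw)["category"]
    except (json.JSONDecodeError, KeyError, TypeError):
        category = "other"
    if category not in categories:
        category = "other"

    cache[description] = category
    return category
```

Re-validating the answer locally is optional when the schema is enforced upstream, but it makes the function safe to use with any backend.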
How we measure quality
Every week we run a contractor-led review queue: 200 randomly sampled permits get hand-checked by a domain expert (a real roofer, a real plumber, etc.). Last quarter our agreement rate against the human reviewers was 96.4% on the primary category. The misses are concentrated in two corners, multi-trade permits ("re-roof and add solar panels") and ambiguous remodel permits ("interior modification"), and we are working on both.
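The agreement rate is simply the share of reviewed permits where the classifier's primary category matches the reviewer's. A small sketch of that calculation, which also tallies where the disagreements fall, follows. The `review_summary` function and the four sample rows are made up for illustration and are not real review data.

```python
from collections import Counter
from typing import Iterable, Tuple


def review_summary(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, Counter]:
    """Return (agreement rate, counts of (predicted, reviewed) disagreements)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no reviewed permits")
    misses = Counter((predicted, reviewed)
                     for predicted, reviewed in pairs
                     if predicted != reviewed)
    agreed = len(pairs) - sum(misses.values())
    return agreed / len(pairs), misses


# Made-up sample: (classifier's primary category, reviewer's primary category).
sample = [
    ("roofing", "roofing"),
    ("roofing", "solar"),  # a multi-trade permit the reviewer filed differently
    ("plumbing", "plumbing"),
    ("kitchen_remodel", "kitchen_remodel"),
]
rate, misses = review_summary(sample)
print(f"agreement: {rate:.1%}")
print(misses.most_common())
```

Counting disagreements by category pair is what surfaces clusters like the multi-trade and ambiguous-remodel misses.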
Why this matters for your dashboard
Every classification you see in Permitlify is the output of this pipeline. When you toggle "show me roofing only", you are trusting that "RES-RR-22" got bucketed correctly. When the system tells you a $24,000 permit is "kitchen remodel", not "addition", that affects how you price the lead. The quality of the underlying classifier is the quality of the product.
This is the boring infrastructure that you do not see — and it is the difference between a permit dataset you can build a sales process on and a spreadsheet that wastes your team's morning.
Stop reading about leads — go get them
Start your free trial and have real permit leads in your dashboard tomorrow morning at 6 AM.