A transformer-architecture neural network trained on very large text corpora to predict the next token in a sequence, producing fluent natural-language output across a wide range of tasks.
A Large Language Model is a transformer-based neural network with anywhere from a few billion to a few hundred billion parameters, trained to predict the next token in a sequence. In production enterprise use the practical question is rarely "which LLM" but "which LLM at which size, served where, with what guardrails". For sovereign deployments MindMap typically serves Llama 3.3 70B or Qwen 2.5 72B for high-quality general workloads, and Llama 3.1 8B or Mistral 7B for high-throughput specialised workloads. Closed-source frontier models (GPT-4, Claude, Gemini) cannot be deployed on-prem and are therefore ruled out for regulated buyers who cannot send data to external APIs.
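As a minimal sketch of what "predict the next token" means, the snippet below turns a set of raw scores (logits) into a probability distribution with a softmax and picks the most likely token. The four-token vocabulary and the scores are made-up illustrative values, not output from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(vocab, logits):
    """Return the highest-probability token and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

# Hypothetical vocabulary and scores for the prompt "The bank approved the"
vocab = ["loan", "river", "memo", "cat"]
logits = [3.2, 0.1, 1.5, -1.0]
token, probs = greedy_next_token(vocab, logits)
```

Real models do the same thing over a vocabulary of tens of thousands of tokens, and usually sample from the distribution rather than always taking the top entry.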
A model trained on a broad corpus that can be adapted (via prompting, fine-tuning, or retrieval) to many downstream tasks rather than being purpose-built for one.
A foundation model is an LLM, vision model, or multimodal model trained on enough breadth that it can be adapted to many downstream tasks without retraining from scratch. The economic argument for foundation models is amortisation: the very expensive pretraining cost is paid once by the model maker, and the cheap adaptation cost is paid per use case by the enterprise. In enterprise practice this means a single sovereign-deployed Llama 3.3 70B can serve a bank's customer-support, internal compliance Q&A, and credit-memo summarisation use cases concurrently, with the per-use-case work confined to retrieval, prompting, and evaluation.
A purpose-built or distilled model in the 1–8 billion parameter range, optimised for a specific domain or task, typically served on commodity GPUs.
A Small Language Model is a 1–8B parameter LLM that has either been pretrained on a narrow domain (clinical notes, financial filings, code) or distilled from a larger model. SLMs trade a small amount of general capability for a large reduction in inference cost and a meaningful improvement in domain-specific accuracy. In sovereign enterprise deployments SLMs are the workhorse for high-volume routing, classification, and document-extraction workloads, with the larger 70B-class model reserved for the long-tail complex queries. A well-tuned 8B model on a single A100 can match or outperform a 70B model on the customer's narrow benchmark while costing around 10× less to serve, although the result has to be confirmed on that benchmark rather than assumed.
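The small-model-first pattern described above can be sketched as a simple router. The confidence threshold, the tier labels, and the two model callables are hypothetical stand-ins for illustration; a real deployment would calibrate the threshold against its own evaluation set.

```python
from typing import Callable, Tuple

def route(query: str,
          small_model: Callable[[str], Tuple[str, float]],
          large_model: Callable[[str], str],
          threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the small model when it is confident, else escalate.

    Returns (answer, tier) where tier is "small" or "large".
    """
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    return large_model(query), "large"

# Toy stand-ins: this "small model" is confident only on short queries.
def toy_small(query: str) -> Tuple[str, float]:
    confidence = 0.95 if len(query.split()) <= 5 else 0.4
    return "small:" + query, confidence

def toy_large(query: str) -> str:
    return "large:" + query

easy = route("classify this invoice", toy_small, toy_large)
hard = route(
    "compare the covenant terms across these three credit agreements",
    toy_small,
    toy_large,
)
```

The design choice is that the cheap model sees every request and the expensive model sees only the escalations, so serving cost tracks the share of hard queries rather than total volume.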
Adapting a pretrained model to a specific domain or task by continuing training on a smaller, curated dataset.
Fine-tuning is the process of taking a pretrained foundation model and continuing training on a smaller dataset specific to a domain (clinical, legal, financial) or a task (classification, structured extraction, style transfer). Modern fine-tuning is almost always parameter-efficient: LoRA or QLoRA adapters add a small set of trainable weights while the base model stays frozen, which means a customer can host one base model and dozens of domain adapters on the same GPU. For most enterprise use cases retrieval-augmented generation outperforms fine-tuning at lower cost and higher auditability; fine-tuning earns its keep when the customer needs the model to adopt a specific output format, voice, or domain vocabulary that prompting alone cannot reliably enforce.
Parameter-efficient fine-tuning that trains a small adapter rather than the full model, allowing many specialised variants to share the same base weights.
Low-Rank Adaptation (LoRA) fine-tunes a model by injecting small trainable matrices into specific transformer layers rather than updating the full parameter set. QLoRA is the quantised variant (base weights frozen at 4-bit precision, adapters trained at 16-bit), which lets a 70B model be fine-tuned on a single 80 GB A100. The architectural payoff for enterprise is that one base model can serve dozens of LoRA adapters at once (one per business unit, one per language, one per document type), swappable at inference time. This collapses the model-management problem from "thirty separate fine-tunes" to "one base model and thirty adapter files".
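To make the "small trainable matrices" concrete, here is a rough sketch of the arithmetic. For one weight matrix of shape d_out × d_in, a rank-r adapter trains two factors B (d_out × r) and A (r × d_in) instead of the full matrix, and the effective weight is W + scale · (B · A). The matrix sizes, the rank, and the function names are illustrative assumptions, not taken from any particular model or library.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Compare trainable parameters: full update vs. rank-r adapter.

    The adapter trains B (d_out x rank) and A (rank x d_in),
    i.e. rank * (d_out + d_in) values in total.
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter

def matmul(x, y):
    """Plain-Python matrix product, used only for the tiny demo below."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, b, a, scale=1.0):
    """Frozen base weight plus the scaled low-rank update: W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)]
            for wr, dr in zip(w, delta)]

# Illustrative sizes: one 8192 x 8192 projection with a rank-16 adapter.
full, adapter = lora_param_counts(8192, 8192, 16)
ratio = adapter / full  # fraction of this matrix's parameters that are trained

# Tiny 2 x 2 example with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
b = [[1.0], [2.0]]             # 2 x 1
a = [[0.5, 0.5]]               # 1 x 2
w_eff = lora_effective_weight(w, b, a)
```

With these sizes the adapter trains under half a percent of the matrix's parameters, which is why an adapter file is small enough to keep many of them alongside one shared base model.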
Training a smaller "student" model to imitate the input-output behaviour of a larger "teacher" model.
Distillation trains a small model to imitate the behaviour of a larger one by minimising the divergence between their output distributions on a shared corpus. The practical effect is a 5–20× reduction in inference cost at a single-digit-percent quality loss on the distilled task. For enterprises, distillation is the path from "we proved this works with a 70B model" to "we can afford to serve this at production volume on an 8B model". MindMap routinely distils a customer's domain-specific 70B prototype into an 8B production model once the eval suite confirms behavioural parity within the customer's acceptance threshold.
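The phrase "minimising the divergence between their output distributions" can be illustrated with a KL-divergence loss over a single next-token distribution. This is a simplified sketch: the logits are invented, the temperature value is an assumption, and practical recipes often add scaling terms or mix in a standard cross-entropy loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """How far the student's distribution is from the teacher's on one position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)

# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student_far = [0.5, 1.0, 4.0]    # disagrees with the teacher
student_near = [3.8, 1.1, 0.4]   # close to the teacher

loss_far = distillation_loss(teacher, student_far)
loss_near = distillation_loss(teacher, student_near)
loss_same = distillation_loss(teacher, teacher)
```

Training drives this loss toward zero across the shared corpus; the loss is zero only when the student reproduces the teacher's distribution exactly.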
A dense numeric vector representation of a piece of text (or image, or audio) such that semantically similar inputs produce vectors close in the embedding space.
An embedding is a fixed-length vector — typically 384 to 1024 dimensions — produced by an embedding model from a piece of text. The geometric property that makes embeddings useful is that semantically similar inputs produce vectors close in the embedding space, so a similarity search becomes a nearest-neighbour lookup. In retrieval-augmented generation, document chunks are embedded once at ingestion time, query embeddings are computed at query time, and the nearest chunks are passed to the LLM as context. For sovereign deployments MindMap uses nomic-embed-text for English-primary corpora and BGE-M3 for multilingual workloads — both open-weights and locally deployable.
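The nearest-neighbour lookup can be sketched with cosine similarity over a handful of vectors. The three-dimensional vectors and chunk names below are toy values chosen for readability; real embeddings come from an embedding model and have hundreds of dimensions, and a production system would use a vector index rather than a linear scan.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy "embeddings", computed once at ingestion time in a real pipeline.
chunks = {
    "kyc-policy": [0.9, 0.1, 0.0],
    "leave-policy": [0.0, 0.2, 0.9],
    "aml-procedure": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # toy embedding of the user's question
nearest = top_k(query, chunks, k=2)
```

The chunks returned by `top_k` are what would be passed to the LLM as context.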
The maximum number of tokens a model can attend to in a single inference call (prompt plus output combined).
The context window is the maximum number of tokens an LLM can process in one inference call, counting both the prompt and the generated output. Modern open-weights models offer 8K to 128K-token windows, with some research models pushing to 1M+. The practical engineering trap is that quality typically degrades as the context fills: in the well-known "lost in the middle" effect, models use information placed in the middle of a long prompt less reliably than information near the start or end. A 128K context is therefore not a free pass to dump every relevant document in. Better-engineered RAG with tight retrieval and re-ranking beats long-context prompt-stuffing on most enterprise workloads, both in answer quality and in inference cost.
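Because prompt and output share one budget, a retrieval pipeline has to decide how many chunks it can afford to include. The sketch below assumes token counts have already been computed by the model's tokenizer and that chunks arrive ranked best-first; the window size, overhead, and chunk sizes are illustrative numbers only.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

def select_chunks(ranked_chunks, prompt_overhead, max_output_tokens, context_window):
    """Keep the highest-ranked chunks that fit in the remaining token budget.

    ranked_chunks: list of (chunk_id, token_count), best match first.
    prompt_overhead: tokens used by the system prompt and the user question.
    """
    budget = context_window - prompt_overhead - max_output_tokens
    kept, used = [], 0
    for chunk_id, n_tokens in ranked_chunks:
        if used + n_tokens > budget:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk_id)
        used += n_tokens
    return kept, used

# Illustrative numbers: an 8K window, 500 tokens of instructions, 1024 reserved for output.
ranked = [("a", 3000), ("b", 3000), ("c", 1000)]
kept, used = select_chunks(ranked, prompt_overhead=500,
                           max_output_tokens=1024, context_window=8192)
ok = fits_context(500 + used, 1024, 8192)
```

Stopping at the first overflow keeps the selection strictly in rank order, which favours a few highly relevant chunks over filling the window.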