Artificial Intelligence & Machine Learning Keywords
Browse over 300 keywords that organize our 40,000+ AI research paper summaries. This hub gives you quick access to models, methods, tasks, metrics, core concepts, data topics, and optimization techniques across modern machine learning. Use the table of contents to jump to the explanations for each category, or scroll to the complete keyword index. Each keyword links to its own archive page, which aggregates related paper summaries. We keep terminology consistent with current literature so researchers, practitioners, and learners can navigate quickly. Start with the category overviews to understand scope, then dive into the full list below.
Models & Architectures
This category covers the major neural network families and model blueprints that power modern AI systems. It includes transformer-based language models, convolutional and recurrent networks for perception, and graph neural networks for structured data. Generative architectures such as diffusion models, variational autoencoders, and GANs also live here, reflecting their central role in synthesis and representation learning. We highlight canonical variants (e.g., BERT, GPT, ResNet, U-Net, Vision Transformers) to anchor terminology to widely used designs. Understanding these architectures clarifies capability, compute requirements, and common failure modes. When you recognize the model class, you can predict training dynamics, data needs, and suitable evaluation strategies.
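To make the terminology concrete, here is a minimal sketch of the scaled dot-product self-attention operation at the heart of transformer architectures. The toy shapes, random weights, and single head are assumptions for illustration; production layers add multi-head projections, masking, residual connections, and normalization.

```python
# Minimal single-head self-attention sketch (toy sizes, illustrative only).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # similarity between every pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ v                               # each token becomes a weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # 5 tokens with 16-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (5, 8)
```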
Methods & Training Techniques
Methods and training techniques describe how models learn from data and how we adapt them efficiently. This includes attention mechanisms, optimization routines, curriculum and continual learning, and regularization tools like dropout and batch normalization. Modern adaptation approaches—fine-tuning, instruction tuning, LoRA, quantization, pruning, and distillation—appear here because they change compute and data economics. Transfer learning, domain adaptation, and generalization strategies determine how knowledge moves across tasks and distributions. We also include supervision regimes (supervised, unsupervised, self-/semi-supervised, few/one/zero-shot) that dictate labeling needs. Mastering these techniques lets you scale models responsibly and make them practical under real-world constraints.
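As one concrete illustration of parameter-efficient adaptation, the sketch below shows the low-rank update idea behind LoRA: freeze the pre-trained weight matrix and train only two small factors. All sizes and values are toy assumptions, not a reference implementation.

```python
# LoRA-style low-rank update sketch: W stays frozen; only A and B are trained.
import numpy as np

d_out, d_in, rank = 64, 64, 4                    # toy sizes; real layers are far larger
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))               # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01         # small trainable factor
B = np.zeros((d_out, rank))                      # starts at zero, so the adapted model equals the original

def adapted_forward(x):
    return W @ x + B @ (A @ x)                   # same output shape, far fewer trainable parameters

print("trainable:", A.size + B.size, "vs full fine-tuning:", W.size)
```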
Tasks & Applications
Tasks and applications map model capabilities to real problems across NLP, vision, speech, and multimodal settings. Classic tasks include classification, regression, clustering, detection, segmentation, and tracking. Application-specific goals like question answering, summarization, translation, image captioning, and speech recognition reflect end-user value. We also include advanced perception tasks such as optical flow, pose estimation, face recognition, and scene understanding. Organizing research by task clarifies datasets, metrics, baselines, and failure patterns. Picking the right task framing often matters as much as picking the right model.
Metrics & Evaluation
Metrics translate model behavior into quantitative evidence and enable rigorous comparisons. Classification metrics like precision, recall, F1, ROC, and AUC capture trade-offs under different thresholds. For generation and sequence tasks, measures such as BLEU, ROUGE, perplexity, and log-likelihood assess fluency, fidelity, and calibration. Ranking and detection rely on mean average precision and related area-based summaries. Understanding metric sensitivity, dataset bias, and statistical uncertainty prevents overclaiming and supports reproducible science. Robust evaluation is how we separate genuine progress from overfitting and hype.
Metric keywords in this category: Precision, Recall, F1 score, ROC curve, AUC, BLEU, ROUGE, Perplexity, Log likelihood, Cross entropy, CER, MAE, MSE, Mean average precision, Confusion matrix, Likelihood.
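As a quick illustration of how the classification metrics above trade off against each other, the snippet below computes precision, recall, and F1 from raw confusion-matrix counts; the counts themselves are made up for the example.

```python
# Precision, recall, and F1 from raw counts (illustrative numbers).
tp, fp, fn = 40, 10, 20                       # true positives, false positives, false negatives

precision = tp / (tp + fp)                    # of predicted positives, how many were correct
recall = tp / (tp + fn)                       # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.67 f1=0.73
```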
Core Concepts
Core concepts are the foundational ideas that appear across models, methods, and tasks. They include probabilistic and statistical viewpoints, representation learning, and the geometry of latent/vector spaces. We cover tokens and tokenization, similarity measures, and common mathematical operators found in deep networks. Generalization, scaling laws, under/overfitting, and regularization principles explain why models succeed—or fail—beyond the training set. Energy-based and discriminative/generative formulations provide complementary perspectives on learning. Grasping these concepts accelerates reading new papers and integrating results across subfields.
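Several of these core concepts—embeddings, vector spaces, and similarity measures—come together in a single operation: comparing two vectors by the angle between them. The snippet below is a minimal sketch with made-up toy vectors.

```python
# Cosine similarity between two toy embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.9, 0.1, 0.4])
b = np.array([0.85, 0.15, 0.5])
print(round(cosine_similarity(a, b), 3))   # close to 1.0: the vectors point in similar directions
```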
Data & Features
Data and features determine the ceiling on model performance before any algorithmic tweaks. This category includes data augmentation, labeling quality, and dataset curation strategies that improve robustness and coverage. Feature engineering and extraction—classical and deep—shape what information is available to learners. We also include ensembles, bootstrapping, and bagging/boosting as data-centric stability techniques. Knowledge bases and graphs connect symbols with structure, enabling retrieval and reasoning. When data pipelines are healthy, models train faster, evaluate fairly, and transfer more reliably.
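To ground the data-augmentation idea, here is a minimal sketch that flips an image horizontally at random and adds mild pixel noise. The random "image" and parameters are assumptions; real pipelines typically rely on dedicated augmentation libraries.

```python
# Two simple augmentations on an HxWxC uint8 image array (toy example).
import numpy as np

def augment(image, rng):
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                                  # random horizontal flip
    noise = rng.normal(0, 5, size=out.shape)                   # mild Gaussian pixel noise
    return np.clip(out.astype(float) + noise, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a real photo
print(augment(image, rng).shape)                                # (32, 32, 3)
```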
Optimization & Regularization
Optimization converts objectives into learned parameters using gradient-based and related methods. Stochastic gradient descent and its variants remain the workhorses, but practical training requires careful schedules and stability tricks. Loss functions, kernels, activations, and temperature scaling shape inductive biases and calibration. Regularization—explicit or implicit—controls complexity to improve generalization and safety under distribution shift. We also highlight parameter-efficient training that reduces compute without sacrificing performance. A solid optimization toolbox turns promising architectures into dependable systems.
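The sketch below shows the basic loop this category describes: stochastic gradient descent on a toy linear regression, with an L2 penalty (weight decay) acting as explicit regularization. Data, learning rate, and decay strength are made-up values for illustration.

```python
# SGD with L2 regularization (weight decay) on a toy linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=200)   # known true weights plus noise

w, lr, weight_decay = np.zeros(3), 0.05, 1e-3
for epoch in range(50):
    for i in rng.permutation(len(X))[:32]:        # a small random subset of samples each epoch
        grad = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of the squared error for one sample
        w -= lr * (grad + weight_decay * w)       # the L2 term gently shrinks the weights

print(np.round(w, 2))                             # roughly [2.0, -1.0, 0.5]
```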
Other Concepts
This catch-all gathers important adjacent methods from statistics, signal processing, and classical machine learning. Bayesian inference and graphical models offer principled uncertainty handling and structure. Traditional learners—trees, random forests, XGBoost, logistic/linear regression—remain strong baselines and production workhorses. Dimensionality reduction techniques like PCA, t-SNE, and UMAP aid visualization and preprocessing. We also include linguistic tools (syntax, semantics, stemming, lemmatization) and pattern matching for text pipelines. These ideas integrate with deep learning to deliver robust, interpretable, and efficient solutions.
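As an example of the classical toolkit in action, the sketch below performs PCA via the singular value decomposition: center the data, find the principal directions, and project onto the leading component. The correlated 2D toy data is an assumption for illustration.

```python
# PCA via SVD on correlated 2D toy data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated features

centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S**2 / np.sum(S**2)                  # fraction of variance per component
projected = centered @ Vt[:1].T                  # keep only the first principal component

print(np.round(explained, 3), projected.shape)   # most variance lies along component 1; (100, 1)
```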
All Keywords (A–Z)
Each row lists a keyword (linked to its archive), the number of paper summaries matched, and a short, beginner-friendly definition.
Keyword | # Papers | Definition |
---|---|---|
1 shot | 11 | Training or evaluating with only one labeled example per class, stressing extreme data efficiency. |
Active learning | 218 | The model selects the most informative unlabeled samples for annotation to cut labeling effort. |
Activity recognition | 87 | Detects and labels human or object activities from video or sensor time-series. |
Alignment | 1406 | Techniques to make AI behavior match human goals, safety norms, and values. |
Anchor box | 0 | Predefined rectangles that object detectors use to suggest likely box sizes and positions. |
Anomaly detection | 447 | Finds rare or unusual patterns, such as fraud, defects, or system failures. |
Artificial intelligence | 39763 | Broad field focused on systems that perform tasks requiring human-like intelligence. |
Attention | 2123 | Lets models focus on the most relevant parts of the input when making predictions. |
AUC | 177 | Area Under the ROC Curve; threshold-free measure of how well a classifier separates classes. |
Autoencoder | 288 | Neural net that compresses data into a latent code and reconstructs it; useful for denoising and embeddings. |
Autoregressive | 350 | Models that predict the next token/value using previous outputs, generating sequences step by step. |
Backpropagation | 154 | Core algorithm that computes gradients so neural networks can learn via gradient descent. |
Bag of words | 14 | Simple text representation that counts word occurrences while ignoring order. |
Bagging | 20 | Ensemble method that averages models trained on bootstrap samples to reduce variance. |
Batch normalization | 51 | Normalizes activations per mini-batch to stabilize and speed up training. |
Bayesian inference | 126 | Updates beliefs about parameters using observed data and Bayes’ rule. |
Bayesian network | 27 | Probabilistic graphical model where directed edges encode conditional dependencies. |
BERT | 384 | Bidirectional transformer pre-trained on masked tokens; strong for text classification and QA. |
BLEU | 100 | Machine translation metric comparing n-gram overlap between system output and references. |
Boosting | 191 | Builds a strong learner by training weak learners sequentially, focusing on mistakes. |
Bootstrapping | 39 | Resampling with replacement to estimate uncertainty or create data for ensembles. |
Bounding box | 36 | Rectangle that localizes an object in an image for detection and tracking. |
Causal language model | 6 | Left-to-right LLM trained to predict the next token, enabling fluent text generation. |
CER | 15 | Character Error Rate; character-level edit distance normalized by reference length. |
Classification | 3095 | Predicts a discrete label (e.g., spam/not-spam) from features, text, or images. |
Claude | 135 | Anthropic’s family of LLMs designed for helpful, honest, and harmless dialogue. |
Clustering | 759 | Groups similar data points without labels to discover structure. |
CNN | 485 | Convolutional Neural Network specialized for grid-like data (images) using shared filters. |
Confusion matrix | 12 | Table of predicted vs. true labels that reveals false positives/negatives. |
Context length | 57 | The maximum number of tokens an LLM can consider at once. |
Context window | 44 | The sliding token window a model attends over; longer windows retain more prior content. |
Continual learning | 345 | Learning new tasks over time while minimizing catastrophic forgetting. |
Contrastive loss | 99 | Pulls related representations together and pushes unrelated ones apart. |
Convolutional network | 90 | Another term for CNN; extracts spatial features via learned convolution filters. |
Coreference | 17 | Detects when different mentions refer to the same entity (e.g., “the CEO… she”). |
Cosine similarity | 67 | Angle-based similarity between vectors; common for comparing embeddings. |
Cross attention | 205 | Allows one sequence (decoder) to attend to another (encoder) to guide generation. |
Cross entropy | 114 | Standard classification loss comparing predicted probabilities to true labels. |
Curriculum learning | 101 | Trains from easy to hard examples to stabilize and speed up learning. |
Data augmentation | 454 | Creates varied training examples (e.g., flips, noise) to improve robustness. |
Data labeling | 18 | Assigning correct tags to data; crucial for supervised learning quality. |
Decision tree | 87 | Interpretable model that splits features into regions to make predictions. |
Decoder | 394 | Network block that generates outputs, often attending to encoder states. |
Deep learning | 2874 | Uses multi-layer neural networks to learn complex representations from data. |
Density estimation | 70 | Modeling the probability distribution of data (explicitly or implicitly). |
Dependency parsing | 9 | Analyzes grammatical structure by linking words via typed dependencies. |
Depth estimation | 79 | Predicts scene depth from images or video for 3D understanding. |
Diffusion | 1617 | Generative process that learns to denoise data step-by-step to sample new content. |
Diffusion model | 546 | Model trained to reverse a noising process; state-of-the-art in image generation. |
Dimensionality reduction | 120 | Compresses features while preserving structure (e.g., PCA, UMAP). |
Discourse | 53 | Studies language beyond sentences, such as coherence and topic flow. |
Discriminative model | 4 | Models p(y|x) (decision boundaries) rather than how data is generated. |
Distillation | 471 | Trains a smaller “student” model to mimic a larger “teacher” model. |
Doc2Vec | 2 | Learns fixed-length vector representations of documents for similarity and retrieval. |
Domain adaptation | 282 | Makes models trained in one domain work well in a different domain. |
Domain generalization | 117 | Trains models that perform well on unseen domains without access to them during training. |
Dot product | 24 | Basic vector operation used in similarity and attention scoring. |
Dropout | 125 | Randomly drops units during training to reduce overfitting. |
Early stopping | 46 | Stops training when validation performance plateaus to avoid overfitting. |
Embedding | 896 | Dense vector representation capturing meaning of words, items, or images. |
Embedding space | 147 | The geometric space where embeddings live; distances encode similarity. |
Encoder | 724 | Reads inputs and produces hidden representations for downstream tasks. |
Encoder decoder | 164 | Two-part sequence-to-sequence architecture for tasks like translation. |
Energy based model | 12 | Assigns low “energy” to likely configurations, enabling flexible objectives. |
Ensemble model | 33 | Combines multiple models’ predictions to boost accuracy and robustness. |
Entity linking | 26 | Maps text mentions to entries in a knowledge base (e.g., Wikipedia). |
Euclidean distance | 20 | Straight-line distance in a vector space; a classic similarity measure. |
Event detection | 21 | Identifies and timestamps meaningful occurrences in streams or text. |
Extreme gradient boosting | 24 | Boosting approach popularized by XGBoost for high-performance tabular prediction. |
F1 score | 436 | Harmonic mean of precision and recall; balances false positives and negatives. |
Face recognition | 60 | Identifies or verifies people from images or video frames. |
Fast rcnn | 0 | RCNN variant that reuses shared feature maps to speed up detection. |
Faster rcnn | 13 | Adds a Region Proposal Network to accelerate and improve detection. |
FastText | 9 | Efficient word vectors and text classifiers that use subword information. |
Feature engineering | 70 | Crafting useful input features from raw data to aid learning. |
Feature extraction | 288 | Automatically deriving informative signals, often via CNNs or transformers. |
Feature map | 35 | The activation grid produced by a convolutional layer. |
Feature pyramid | 8 | Multi-scale feature hierarchy used in detection and segmentation. |
Feature selection | 181 | Choosing the most predictive features to improve accuracy and speed. |
Federated learning | 1067 | Trains models across devices/servers without centralizing raw data. |
Feedforward network | 3 | Basic network where information flows from inputs to outputs without loops. |
Few shot | 704 | Learning or prompting with only a handful of labeled examples. |
Fine tuning | 2474 | Adapting a pre-trained model to a specific task or dataset. |
GAN | 177 | Generative Adversarial Network where a generator and discriminator compete to create realistic data. |
GCN | 109 | Graph Convolutional Network that generalizes convolution to graphs. |
Gemini | 194 | Google’s multimodal LLM family that handles text, images, and more. |
Generalization | 2125 | How well a model performs on new, unseen data beyond training. |
Generative adversarial network | 80 | Two-network setup (generator vs. discriminator) to synthesize realistic samples. |
Generative model | 294 | Models the data distribution to synthesize, impute, or score samples. |
Gesture recognition | 16 | Detects hand/body gestures from video or sensors for interaction. |
GloVe | 7 | Pre-trained word embeddings learned from global word co-occurrences. |
GNN | 582 | Graph Neural Networks that propagate information along edges to reason over graphs. |
GPT | 1515 | Generative Pre-trained Transformers; powerful LLMs for generation and reasoning. |
Gradient descent | 376 | Iteratively updates parameters in the direction that reduces loss. |
Graph attention network | 38 | Uses attention on graph neighbors to weight information flow. |
Graph neural network | 339 | Neural architectures that operate directly on graph-structured data. |
Grid search | 25 | Systematic hyperparameter search across a predefined parameter grid. |
Grounding | 215 | Connecting language/symbols to real-world data, images, or actions. |
Hallucination | 293 | When a model confidently generates content that is false or unfounded. |
Hidden markov model | 8 | Probabilistic model for sequences with hidden states and observed outputs. |
Hierarchical clustering | 21 | Builds a tree of clusters without pre-choosing the number of clusters. |
Hinge loss | 9 | Margin-based loss used in SVMs to separate classes with a gap. |
Hyperparameter | 352 | A configuration value set before training (e.g., learning rate, depth). |
Image captioning | 100 | Generates descriptive sentences for images. |
Image classification | 557 | Assigns labels to images (e.g., cat vs. dog). |
Image denoising | 16 | Removes noise from images while preserving details. |
Image generation | 465 | Synthesizes new images from text prompts, sketches, or noise. |
Image inpainting | 21 | Fills in missing or masked regions of an image realistically. |
Image segmentation | 165 | Assigns a class label to each pixel to delineate objects or regions. |
Image synthesis | 104 | Another term for generating artificial images with models. |
Inference | 2757 | Running a trained model to make predictions or generate outputs. |
Instance segmentation | 61 | Segments each object instance separately, not just the class. |
Instruction tuning | 255 | Fine-tunes LLMs on instruction–response pairs to follow prompts better. |
Intent detection | 6 | Identifies the user’s goal in a query or utterance. |
K means | 93 | Classic clustering algorithm that partitions data into k groups by proximity. |
Kernel trick | 6 | Maps data into higher-dimensional spaces implicitly for linear separation. |
Knowledge base | 94 | Structured repository of facts and entities used for reasoning or retrieval. |
Knowledge distillation | 344 | Transfers knowledge from a large teacher to a smaller student model. |
Knowledge graph | 330 | Graph of entities and relations enabling structured reasoning. |
Language model | 1109 | Predicts next tokens and models text, forming the basis of LLMs. |
Language understanding | 242 | Interprets meaning and intent in text for tasks like classification and QA. |
Large language model | 1119 | Very large transformer models capable of versatile text and reasoning tasks. |
Latent space | 386 | Compressed feature space where models represent data. |
Lemmatization | 4 | Reduces words to dictionary base forms (e.g., “running”→“run”). |
Likelihood | 393 | Probability of data under model parameters; central in many objectives. |
Linear regression | 146 | Fits a linear relationship between features and a numeric outcome. |
LLaMA | 627 | Efficient open LLM family widely used for research and fine-tuning. |
Log likelihood | 53 | Log of the likelihood; turns products into sums for stable optimization. |
Logistic regression | 143 | Linear classifier that models class probability with a sigmoid. |
Logits | 76 | Raw, unnormalized scores before softmax or sigmoid makes probabilities. |
Lora | 367 | Low-Rank Adaptation: parameter-efficient fine-tuning via small trainable matrices. |
Loss function | 615 | Quantifies errors; training minimizes it to improve performance. |
Low rank adaptation | 172 | Factorizes weight updates to reduce training cost and memory. |
MAE | 150 | Mean Absolute Error; average absolute difference between predictions and truth. |
MSE | 113 | Mean Squared Error; penalizes larger errors more strongly than MAE. |
Machine learning | 6101 | Systems learn patterns from data to make predictions or decisions. |
Manifold learning | 27 | Finds low-dimensional structure embedded in high-dimensional data. |
Markov model | 4 | Assumes the next state depends only on the current state, not full history. |
Mask | 260 | Binary/soft map indicating which positions or pixels to attend or train on. |
Masked language model | 7 | Learns to predict masked tokens, building bidirectional text understanding. |
Mean average precision | 55 | Ranking/detection metric averaging precision across recall levels or classes. |
Meta learning | 226 | “Learning to learn” so models adapt quickly to new tasks with few examples. |
Mixture model | 57 | Represents data as coming from a mixture of simpler distributions. |
Mixture of experts | 245 | Routes inputs to specialized sub-models to scale capacity efficiently. |
Model compression | 77 | Shrinks models via pruning, quantization, or distillation to run faster. |
Multi head attention | 68 | Uses several attention “heads” to capture different relationships in parallel. |
Multi modal | 586 | Models that handle multiple data types (text, images, audio) together. |
Multi task | 375 | Trains one model to solve several tasks, sharing representations. |
N gram | 30 | Sequence of N tokens; basic unit in classic language models. |
N shot | 3 | Few-shot style where each class has N labeled examples. |
Naive bayes | 34 | Simple probabilistic classifier assuming feature independence. |
Named entity recognition | 108 | Finds and labels entities like people, places, and organizations in text. |
Natural language processing | 967 | Field focused on understanding and generating human language. |
Nearest neighbor | 81 | Predicts by finding the most similar examples in a feature space. |
NER | 81 | Shorthand for Named Entity Recognition. |
Neural network | 2183 | Layers of neurons that learn representations from data. |
NLP | 621 | Abbreviation for Natural Language Processing (see above). |
Novelty detection | 9 | Identifies new, previously unseen types of data or behaviors. |
Object detection | 459 | Locates and classifies objects within images by drawing boxes. |
Object tracking | 48 | Follows object identities frame-by-frame through video. |
Objective function | 155 | The quantity a model optimizes during training (the “goal”). |
One hot | 21 | Represents categories as vectors with a single 1 and all other entries 0. |
One shot | 113 | Learning from only a single example per class; extreme data efficiency. |
Online learning | 191 | Updates the model incrementally as new data arrives. |
Optical flow | 53 | Estimates pixel-wise motion between consecutive frames. |
Optimization | 3402 | Adjusting parameters to minimize loss and improve performance. |
Outlier detection | 44 | Finds data points that deviate strongly from the norm. |
Overfitting | 413 | When a model memorizes training data and performs poorly on new data. |
PaLM | 30 | Google’s family of large language models for advanced language tasks. |
Parameter efficient | 285 | Approaches (like LoRA) that fine-tune models with few extra parameters. |
Parsing | 60 | Analyzes text structure (syntactic or semantic) to understand meaning. |
Pattern matching | 9 | Finds predefined patterns in data, like regex in text. |
Pattern recognition | 50 | Identifying regularities in data signals; broad umbrella term. |
PCA | 107 | Principal Component Analysis; rotates data to uncorrelated axes to reduce dimension. |
Perplexity | 174 | Language-modeling metric; lower is better (model is less “surprised”). |
Pose estimation | 86 | Predicts positions of key body joints or object poses. |
Positional encoding | 67 | Adds order information to tokens for transformer models. |
Precision | 800 | Of predicted positives, the fraction that are actually correct. |
Pretraining | 487 | Generic training on large data before fine-tuning on a target task. |
Principal component analysis | 92 | The full name for PCA; the same dimensionality-reduction method. |
Probabilistic model | 52 | Uses probability distributions to model uncertainty and data generation. |
Probability | 889 | The math of uncertainty; underpins many ML objectives. |
Prompt | 1208 | Input text or instructions given to an LLM to guide behavior. |
Prompting | 733 | Techniques for crafting prompts to get better model outputs. |
Pruning | 483 | Removes less important weights/neurons to shrink and speed models. |
Quantization | 499 | Uses lower-precision numbers (e.g., int8) to make models smaller and faster. |
Question answering | 857 | Systems that answer questions from documents or knowledge sources. |
RAG | 386 | Retrieval-Augmented Generation: fetches documents to ground an LLM’s answer. |
Random forest | 230 | Ensemble of decision trees that averages predictions for stability. |
RCNN | 9 | Region-based CNN family for object detection using region proposals. |
Recall | 386 | Of all actual positives, the fraction the model successfully finds. |
Recurrent network | 4 | Processes sequences using hidden state that carries over time. |
Region proposal | 3 | Candidate object regions suggested before classification/refinement. |
Regression | 1026 | Predicts a continuous numeric value (e.g., price or temperature). |
Regularization | 741 | Techniques that control model complexity to improve generalization. |
Reinforcement learning | 2527 | Agents learn to act by maximizing rewards through trial and error. |
Reinforcement learning from human feedback | 169 | Uses human preferences to shape or fine-tune model behavior. |
Relu | 171 | Popular activation that passes positives and zeros out negatives. |
Representation learning | 569 | Automatically learns useful feature spaces from raw data. |
Residual network | 13 | Uses skip connections to ease optimization of deep networks. |
Resnet | 186 | A widely used residual network family for image recognition. |
Retrieval augmented generation | 421 | Retrieves documents to ground LLM responses with real context. |
RLHF | 222 | Short for Reinforcement Learning from Human Feedback. |
RNN | 125 | Recurrent Neural Networks process sequences with shared weights over time. |
Roc curve | 12 | Plots true positive rate vs. false positive rate across thresholds. |
Rouge | 100 | Summarization metric comparing overlap with reference summaries. |
SAM | 159 | Segment Anything Model: segments arbitrary objects in images from prompts such as points or boxes. |
Scaling laws | 95 | Empirical rules showing how performance scales with model/data/compute. |
Scene understanding | 62 | Interprets complete scenes: objects, layout, and relationships. |
Self attention | 367 | Tokens attend to one another within the same sequence. |
Self supervised | 768 | Learns from unlabeled data by creating pretext tasks (e.g., predicting masked parts). |
Self training | 75 | Uses a model’s confident predictions as pseudo-labels to improve itself. |
Semantic parsing | 18 | Maps text into structured meaning like logical forms or SQL. |
Semantic segmentation | 249 | Assigns a class label to every pixel in an image. |
Semantics | 370 | The study of meaning in language and learned representations. |
Semi supervised | 324 | Learns from a small labeled set plus many unlabeled examples. |
Semi supervision | 2 | Another phrasing of semi-supervised learning. |
Seq2seq | 20 | Sequence-to-sequence models that map input sequences to output sequences. |
Sequence model | 22 | Any model designed for ordered data like text or time series. |
Siamese network | 14 | Twin networks that compare inputs via a shared embedding space. |
Sigmoid | 39 | S-shaped activation mapping real numbers to (0,1) probabilities. |
Signal processing | 67 | Analyzing and transforming time-series and sensor data. |
Softmax | 134 | Converts logits into a probability distribution across classes. |
Spatiotemporal | 203 | Data varying over space and time (e.g., video, geo-temporal data). |
Spectral clustering | 25 | Uses eigenvectors of a similarity graph to cluster data that is not linearly separable. |
Statistical model | 23 | Assumes a probabilistic form for data to enable inference and testing. |
Stemming | 15 | Chops words down to crude roots (e.g., “studies”→“studi”) to normalize vocabulary. |
Stochastic gradient descent | 222 | Optimizes using small random batches for efficiency and generalization. |
Stopword | 2 | Common words (like “the”, “and”) often removed in classic NLP pipelines. |
Student model | 106 | The smaller model trained via distillation from a teacher model. |
Style transfer | 51 | Applies the visual style of one image to the content of another. |
Summarization | 375 | Condenses long text into shorter, salient summaries. |
Super resolution | 100 | Upscales images to higher resolution with learned detail. |
Supervised | 1081 | Learning from labeled input-output pairs. |
Support vector machine | 65 | Margin-based classifier that finds a separating hyperplane. |
Syntax | 66 | Structural rules of language that govern how words combine. |
Synthetic data | 542 | Artificially generated datasets used to augment or replace real data. |
T5 | 84 | Text-to-Text Transfer Transformer that frames many NLP tasks as text generation. |
Tanh | 18 | Activation squashing values to (-1, 1), centered around zero. |
Teacher model | 104 | The larger or stronger model whose behavior guides a student via distillation. |
Temperature | 243 | Sampling knob that controls randomness in generation (higher = more random). |
Template matching | 3 | Finds regions in an image similar to a given template patch. |
Temporal model | 8 | Any model designed to capture time-dependent patterns. |
Text classification | 202 | Assigns categories or labels to pieces of text. |
Text generation | 259 | Produces new text, often guided by prompts or constraints. |
TF IDF | 18 | Weights terms by frequency in a document vs. across the corpus to highlight salient words. |
Time series | 1222 | Data observed over time (e.g., sensors, finance) requiring sequence models. |
Token | 810 | The atomic unit a model reads/writes (word pieces, bytes, or characters). |
Tokenization | 130 | Splits text into tokens the model can process. |
Tokenizer | 85 | The algorithm/model that performs tokenization (e.g., BPE, WordPiece). |
Tracking | 361 | Following targets (objects, features) across frames or time steps. |
Transfer learning | 407 | Reuses knowledge from one task/domain to improve another. |
Transferability | 196 | How well learned features or skills move across tasks or datasets. |
Transformer | 2234 | Attention-based architecture that powers modern NLP and multimodal AI. |
Translation | 665 | Converts text from one language to another automatically. |
Triplet loss | 15 | Trains embeddings by pulling anchors toward positives and pushing away negatives. |
Tsne | 2 | Nonlinear technique for visualizing high-dimensional data in 2D/3D. |
Umap | 25 | Fast nonlinear dimensionality reduction preserving global/local structure. |
Underfitting | 11 | When a model is too simple to capture patterns, performing poorly on train and test. |
Unet | 60 | Encoder-decoder with skip connections, strong for medical image segmentation. |
Unsupervised | 815 | Learns patterns from unlabeled data (clustering, representation learning). |
Variational autoencoder | 141 | Probabilistic autoencoder that learns a latent distribution for generation. |
Vector space | 30 | Mathematical space where embeddings live and can be compared. |
Vectorization | 19 | Converting items (words, images) into numeric vectors for modeling. |
Vision transformer | 179 | Applies transformer architecture directly to image patches. |
VIT | 257 | Abbreviation for Vision Transformer (see above). |
Word2Vec | 18 | Classic method that learns word embeddings by predicting words from their surrounding context (skip-gram/CBOW). |
XGBoost | 134 | High-performance gradient-boosted trees for tabular data. |
Yolo | 61 | “You Only Look Once” family of real-time object detectors. |
Zero shot | 1058 | Performs a task without seeing labeled examples for that specific task. |