Artificial Intelligence & Machine Learning Keywords
Browse over 300 keywords that organize our 40,000+ AI research paper summaries. This hub gives you quick access to models, methods, tasks, metrics, core concepts, data topics, and optimization techniques across modern machine learning. Use the table of contents to jump to the explanations for each category, or scroll to the complete keyword index. Each keyword links to its own archive page, which aggregates related paper summaries. We keep terminology consistent with current literature so researchers, practitioners, and learners can navigate quickly. Start with the category overviews to understand scope, then dive into the full list below.
Models & Architectures
This category covers the major neural network families and model blueprints that power modern AI systems. It includes transformer-based language models, convolutional and recurrent networks for perception, and graph neural networks for structured data. Generative architectures such as diffusion models, variational autoencoders, and GANs also live here, reflecting their central role in synthesis and representation learning. We highlight canonical variants (e.g., BERT, GPT, ResNet, U-Net, Vision Transformers) to anchor terminology to widely used designs. Understanding these architectures clarifies capability, compute requirements, and common failure modes. When you recognize the model class, you can predict training dynamics, data needs, and suitable evaluation strategies.
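To make the terminology concrete, here is a minimal sketch of the scaled dot-product self-attention operation at the heart of transformer architectures. The toy shapes, random weights, and single head are assumptions for illustration; production layers add multi-head projections, masking, residual connections, and normalization.

```python
# Minimal single-head self-attention sketch (toy sizes, illustrative only).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # similarity between every pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ v                               # each token becomes a weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # 5 tokens with 16-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (5, 8)
```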
Methods & Training Techniques
Methods and training techniques describe how models learn from data and how we adapt them efficiently. This includes attention mechanisms, optimization routines, curriculum and continual learning, and regularization tools like dropout and batch normalization. Modern adaptation approaches—fine-tuning, instruction tuning, LoRA, quantization, pruning, and distillation—appear here because they change compute and data economics. Transfer learning, domain adaptation, and generalization strategies determine how knowledge moves across tasks and distributions. We also include supervision regimes (supervised, unsupervised, self-/semi-supervised, few/one/zero-shot) that dictate labeling needs. Mastering these techniques lets you scale models responsibly and make them practical under real-world constraints.
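As one concrete illustration of parameter-efficient adaptation, the sketch below shows the low-rank update idea behind LoRA: freeze the pre-trained weight matrix and train only two small factors. All sizes and values are toy assumptions, not a reference implementation.

```python
# LoRA-style low-rank update sketch: W stays frozen; only A and B are trained.
import numpy as np

d_out, d_in, rank = 64, 64, 4                    # toy sizes; real layers are far larger
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))               # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01         # small trainable factor
B = np.zeros((d_out, rank))                      # starts at zero, so the adapted model equals the original

def adapted_forward(x):
    return W @ x + B @ (A @ x)                   # same output shape, far fewer trainable parameters

print("trainable:", A.size + B.size, "vs full fine-tuning:", W.size)
```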
Tasks & Applications
Tasks and applications map model capabilities to real problems across NLP, vision, speech, and multimodal settings. Classic tasks include classification, regression, clustering, detection, segmentation, and tracking. Application-specific goals like question answering, summarization, translation, image captioning, and speech recognition reflect end-user value. We also include advanced perception tasks such as optical flow, pose estimation, face recognition, and scene understanding. Organizing research by task clarifies datasets, metrics, baselines, and failure patterns. Picking the right task framing often matters as much as picking the right model.
Metrics & Evaluation
Metrics translate model behavior into quantitative evidence and enable rigorous comparisons. Classification metrics like precision, recall, F1, ROC, and AUC capture trade-offs under different thresholds. For generation and sequence tasks, measures such as BLEU, ROUGE, perplexity, and log-likelihood assess fluency, fidelity, and calibration. Ranking and detection rely on mean average precision and related area-based summaries. Understanding metric sensitivity, dataset bias, and statistical uncertainty prevents overclaiming and supports reproducible science. Robust evaluation is how we separate genuine progress from overfitting and hype.
Metric keywords in this category: Precision, Recall, F1 score, ROC curve, AUC, BLEU, ROUGE, Perplexity, Log likelihood, Cross entropy, CER, MAE, MSE, Mean average precision, Confusion matrix, Likelihood.
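As a quick illustration of how the classification metrics above trade off against each other, the snippet below computes precision, recall, and F1 from raw confusion-matrix counts; the counts themselves are made up for the example.

```python
# Precision, recall, and F1 from raw counts (illustrative numbers).
tp, fp, fn = 40, 10, 20                       # true positives, false positives, false negatives

precision = tp / (tp + fp)                    # of predicted positives, how many were correct
recall = tp / (tp + fn)                       # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.67 f1=0.73
```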
Core Concepts
Core concepts are the foundational ideas that appear across models, methods, and tasks. They include probabilistic and statistical viewpoints, representation learning, and the geometry of latent/vector spaces. We cover tokens and tokenization, similarity measures, and common mathematical operators found in deep networks. Generalization, scaling laws, under/overfitting, and regularization principles explain why models succeed—or fail—beyond the training set. Energy-based and discriminative/generative formulations provide complementary perspectives on learning. Grasping these concepts accelerates reading new papers and integrating results across subfields.
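Several of these core concepts—embeddings, vector spaces, and similarity measures—come together in a single operation: comparing two vectors by the angle between them. The snippet below is a minimal sketch with made-up toy vectors.

```python
# Cosine similarity between two toy embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.9, 0.1, 0.4])
b = np.array([0.85, 0.15, 0.5])
print(round(cosine_similarity(a, b), 3))   # close to 1.0: the vectors point in similar directions
```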
Data & Features
Data and features determine the ceiling on model performance before any algorithmic tweaks. This category includes data augmentation, labeling quality, and dataset curation strategies that improve robustness and coverage. Feature engineering and extraction—classical and deep—shape what information is available to learners. We also include ensembles, bootstrapping, and bagging/boosting as data-centric stability techniques. Knowledge bases and graphs connect symbols with structure, enabling retrieval and reasoning. When data pipelines are healthy, models train faster, evaluate fairly, and transfer more reliably.
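To ground the data-augmentation idea, here is a minimal sketch that flips an image horizontally at random and adds mild pixel noise. The random "image" and parameters are assumptions; real pipelines typically rely on dedicated augmentation libraries.

```python
# Two simple augmentations on an HxWxC uint8 image array (toy example).
import numpy as np

def augment(image, rng):
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                                  # random horizontal flip
    noise = rng.normal(0, 5, size=out.shape)                   # mild Gaussian pixel noise
    return np.clip(out.astype(float) + noise, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a real photo
print(augment(image, rng).shape)                                # (32, 32, 3)
```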
Optimization & Regularization
Optimization converts objectives into learned parameters using gradient-based and related methods. Stochastic gradient descent and its variants remain the workhorses, but practical training requires careful schedules and stability tricks. Loss functions, kernels, activations, and temperature scaling shape inductive biases and calibration. Regularization—explicit or implicit—controls complexity to improve generalization and safety under distribution shift. We also highlight parameter-efficient training that reduces compute without sacrificing performance. A solid optimization toolbox turns promising architectures into dependable systems.
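The sketch below shows the basic loop this category describes: stochastic gradient descent on a toy linear regression, with an L2 penalty (weight decay) acting as explicit regularization. Data, learning rate, and decay strength are made-up values for illustration.

```python
# SGD with L2 regularization (weight decay) on a toy linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=200)   # known true weights plus noise

w, lr, weight_decay = np.zeros(3), 0.05, 1e-3
for epoch in range(50):
    for i in rng.permutation(len(X))[:32]:        # a small random subset of samples each epoch
        grad = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of the squared error for one sample
        w -= lr * (grad + weight_decay * w)       # the L2 term gently shrinks the weights

print(np.round(w, 2))                             # roughly [2.0, -1.0, 0.5]
```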
Other Concepts
This catch-all gathers important adjacent methods from statistics, signal processing, and classical machine learning. Bayesian inference and graphical models offer principled uncertainty handling and structure. Traditional learners—trees, random forests, XGBoost, logistic/linear regression—remain strong baselines and production workhorses. Dimensionality reduction techniques like PCA, t-SNE, and UMAP aid visualization and preprocessing. We also include linguistic tools (syntax, semantics, stemming, lemmatization) and pattern matching for text pipelines. These ideas integrate with deep learning to deliver robust, interpretable, and efficient solutions.
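As an example of the classical toolkit in action, the sketch below performs PCA via the singular value decomposition: center the data, find the principal directions, and project onto the leading component. The correlated 2D toy data is an assumption for illustration.

```python
# PCA via SVD on correlated 2D toy data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated features

centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S**2 / np.sum(S**2)                  # fraction of variance per component
projected = centered @ Vt[:1].T                  # keep only the first principal component

print(np.round(explained, 3), projected.shape)   # most variance lies along component 1; (100, 1)
```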
All Keywords (A–Z)
Each row lists a keyword (linked to its archive), the number of paper summaries matched, and a short, beginner-friendly definition.
Keyword | # Papers | Definition |
---|---|---|
1 shot | 11 | Training or evaluating with only one labeled example per class, stressing extreme data efficiency. |
Active learning | 218 | The model selects the most informative unlabeled samples for annotation to cut labeling effort. |
Activity recognition | 87 | Detects and labels human or object activities from video or sensor time-series. |
Alignment | 1406 | Techniques to make AI behavior match human goals, safety norms, and values. |
Anchor box | 0 | Predefined rectangles that object detectors use to suggest likely box sizes and positions. |
Anomaly detection | 447 | Finds rare or unusual patterns, such as fraud, defects, or system failures. |
Artificial intelligence | 39763 | Broad field focused on systems that perform tasks requiring human-like intelligence. |
Attention | 2123 | Lets models focus on the most relevant parts of the input when making predictions. |
AUC | 177 | Area Under the ROC Curve; threshold-free measure of how well a classifier separates classes. |
Autoencoder | 288 | Neural net that compresses data into a latent code and reconstructs it; useful for denoising and embeddings. |
Autoregressive | 350 | Models that predict the next token/value using previous outputs, generating sequences step by step. |
Backpropagation | 154 | Core algorithm that computes gradients so neural networks can learn via gradient descent. |
Bag of words | 14 | Simple text representation that counts word occurrences while ignoring order. |
Bagging | 20 | Ensemble method that averages models trained on bootstrap samples to reduce variance. |
Batch normalization | 51 | Normalizes activations per mini-batch to stabilize and speed up training. |
Bayesian inference | 126 | Updates beliefs about parameters using observed data and Bayes’ rule. |
Bayesian network | 27 | Probabilistic graphical model where directed edges encode conditional dependencies. |
BERT | 384 | Bidirectional transformer pre-trained on masked tokens; strong for text classification and QA. |
BLEU | 100 | Machine translation metric comparing n-gram overlap between system output and references. |
Boosting | 191 | Builds a strong learner by training weak learners sequentially, focusing on mistakes. |
Bootstrapping | 39 | Resampling with replacement to estimate uncertainty or create data for ensembles. |
Bounding box | 36 | Rectangle that localizes an object in an image for detection and tracking. |
Causal language model | 6 | Left-to-right LLM trained to predict the next token, enabling fluent text generation. |
CER | 15 | Character Error Rate; character-level edit distance normalized by reference length. |
Classification | 3095 | Predicts a discrete label (e.g., spam/not-spam) from features, text, or images. |
Claude | 135 | Anthropic’s family of LLMs designed for helpful, honest, and harmless dialogue. |
Clustering | 759 | Groups similar data points without labels to discover structure. |
CNN | 485 | Convolutional Neural Network specialized for grid-like data (images) using shared filters. |
Confusion matrix | 12 | Table of predicted vs. true labels that reveals false positives/negatives. |
Context length | 57 | The maximum number of tokens an LLM can consider at once. |
Context window | 44 | The sliding token window a model attends over; longer windows retain more prior content. |
Continual learning | 345 | Learning new tasks over time while minimizing catastrophic forgetting. |
Contrastive loss | 99 | Pulls related representations together and pushes unrelated ones apart. |
Convolutional network | 90 | Another term for CNN; extracts spatial features via learned convolution filters. |
Coreference | 17 | Detects when different mentions refer to the same entity (e.g., “the CEO… she”). |
Cosine similarity | 67 | Angle-based similarity between vectors; common for comparing embeddings. |
Cross attention | 205 | Allows one sequence (decoder) to attend to another (encoder) to guide generation. |
Cross entropy | 114 | Standard classification loss comparing predicted probabilities to true labels. |
Curriculum learning | 101 | Trains from easy to hard examples to stabilize and speed up learning. |
Data augmentation | 454 | Creates varied training examples (e.g., flips, noise) to improve robustness. |
Data labeling | 18 | Assigning correct tags to data; crucial for supervised learning quality. |
Decision tree | 87 | Interpretable model that splits features into regions to make predictions. |
Decoder | 394 | Network block that generates outputs, often attending to encoder states. |
Deep learning | 2874 | Uses multi-layer neural networks to learn complex representations from data. |
Density estimation | 70 | Modeling the probability distribution of data (explicitly or implicitly). |
Dependency parsing | 9 | Analyzes grammatical structure by linking words via typed dependencies. |
Depth estimation | 79 | Predicts scene depth from images or video for 3D understanding. |
Diffusion | 1617 | Generative process that learns to denoise data step-by-step to sample new content. |
Diffusion model | 546 | Model trained to reverse a noising process; state-of-the-art in image generation. |
Dimensionality reduction | 120 | Compresses features while preserving structure (e.g., PCA, UMAP). |
Discourse | 53 | Studies language beyond sentences, such as coherence and topic flow. |
Discriminative model | 4 | Models p(y|x) (decision boundaries) rather than how data is generated. |
Distillation | 471 | Trains a smaller “student” model to mimic a larger “teacher” model. |
Doc2Vec | 2 | Learns fixed-length vector representations of documents for similarity and retrieval. |
Domain adaptation | 282 | Makes models trained in one domain work well in a different domain. |
Domain generalization | 117 | Trains models that perform well on unseen domains without access to them during training. |
Dot product | 24 | Basic vector operation used in similarity and attention scoring. |
Dropout | 125 | Randomly drops units during training to reduce overfitting. |
Early stopping | 46 | Stops training when validation performance plateaus to avoid overfitting. |
Embedding | 896 | Dense vector representation capturing meaning of words, items, or images. |
Embedding space | 147 | The geometric space where embeddings live; distances encode similarity. |
Encoder | 724 | Reads inputs and produces hidden representations for downstream tasks. |
Encoder decoder | 164 | Two-part sequence-to-sequence architecture for tasks like translation. |
Energy based model | 12 | Assigns low “energy” to likely configurations, enabling flexible objectives. |
Ensemble model | 33 | Combines multiple models’ predictions to boost accuracy and robustness. |
Entity linking | 26 | Maps text mentions to entries in a knowledge base (e.g., Wikipedia). |
Euclidean distance | 20 | Straight-line distance in a vector space; a classic similarity measure. |
Event detection | 21 | Identifies and timestamps meaningful occurrences in streams or text. |
Extreme gradient boosting | 24 | Boosting approach popularized by XGBoost for high-performance tabular prediction. |
F1 score | 436 | Harmonic mean of precision and recall; balances false positives and negatives. |
Face recognition | 60 | Identifies or verifies people from images or video frames. |
Fast rcnn | 0 | RCNN variant that reuses shared feature maps to speed up detection. |
Faster rcnn | 13 | Adds a Region Proposal Network to accelerate and improve detection. |
FastText | 9 | Efficient word vectors and text classifiers that use subword information. |
Feature engineering | 70 | Crafting useful input features from raw data to aid learning. |
Feature extraction | 288 | Automatically deriving informative signals, often via CNNs or transformers. |
Feature map | 35 | The activation grid produced by a convolutional layer. |
Feature pyramid | 8 | Multi-scale feature hierarchy used in detection and segmentation. |
Feature selection | 181 | Choosing the most predictive features to improve accuracy and speed. |
Federated learning | 1067 | Trains models across devices/servers without centralizing raw data. |
Feedforward network | 3 | Basic network where information flows from inputs to outputs without loops. |
Few shot | 704 | Learning or prompting with only a handful of labeled examples. |
Fine tuning | 2474 | Adapting a pre-trained model to a specific task or dataset. |
GAN | 177 | Generative Adversarial Network where a generator and discriminator compete to create realistic data. |
GCN | 109 | Graph Convolutional Network that generalizes convolution to graphs. |
Gemini | 194 | Google’s multimodal LLM family that handles text, images, and more. |
Generalization | 2125 | How well a model performs on new, unseen data beyond training. |
Generative adversarial network | 80 | Two-network setup (generator vs. discriminator) to synthesize realistic samples. |
Generative model | 294 | Models the data distribution to synthesize, impute, or score samples. |
Gesture recognition | 16 | Detects hand/body gestures from video or sensors for interaction. |
GloVe | 7 | Pre-trained word embeddings learned from global word co-occurrences. |
GNN | 582 | Graph Neural Networks that propagate information along edges to reason over graphs. |
GPT | 1515 | Generative Pre-trained Transformers; powerful LLMs for generation and reasoning. |
Gradient descent | 376 | Iteratively updates parameters in the direction that reduces loss. |
Graph attention network | 38 | Uses attention on graph neighbors to weight information flow. |
Graph neural network | 339 | Neural architectures that operate directly on graph-structured data. |
Grid search | 25 | Systematic hyperparameter search across a predefined parameter grid. |
Grounding | 215 | Connecting language/symbols to real-world data, images, or actions. |
Hallucination | 293 | When a model confidently generates content that is false or unfounded. |
Hidden markov model | 8 | Probabilistic model for sequences with hidden states and observed outputs. |
Hierarchical clustering | 21 | Builds a tree of clusters without pre-choosing the number of clusters. |
Hinge loss | 9 | Margin-based loss used in SVMs to separate classes with a gap. |
Hyperparameter | 352 | A configuration value set before training (e.g., learning rate, depth). |
Image captioning | 100 | Generates descriptive sentences for images. |
Image classification | 557 | Assigns labels to images (e.g., cat vs. dog). |
Image denoising | 16 | Removes noise from images while preserving details. |
Image generation | 465 | Synthesizes new images from text prompts, sketches, or noise. |
Image inpainting | 21 | Fills in missing or masked regions of an image realistically. |
Image segmentation | 165 | Assigns a class label to each pixel to delineate objects or regions. |
Image synthesis | 104 | Another term for generating artificial images with models. |
Inference | 2757 | Running a trained model to make predictions or generate outputs. |
Instance segmentation | 61 | Segments each object instance separately, not just the class. |
Instruction tuning | 255 | Fine-tunes LLMs on instruction–response pairs to follow prompts better. |
Intent detection | 6 | Identifies the user’s goal in a query or utterance. |
K means | 93 | Classic clustering algorithm that partitions data into k groups by proximity. |
Kernel trick | 6 | Maps data into higher-dimensional spaces implicitly for linear separation. |
Knowledge base | 94 | Structured repository of facts and entities used for reasoning or retrieval. |
Knowledge distillation | 344 | Transfers knowledge from a large teacher to a smaller student model. |
Knowledge graph | 330 | Graph of entities and relations enabling structured reasoning. |
Language model | 1109 | Predicts next tokens and models text, forming the basis of LLMs. |
Language understanding | 242 | Interprets meaning and intent in text for tasks like classification and QA. |
Large language model | 1119 | Very large transformer models capable of versatile text and reasoning tasks. |
Latent space | 386 | Compressed feature space where models represent data. |
Lemmatization | 4 | Reduces words to dictionary base forms (e.g., “running”→“run”). |
Likelihood | 393 | Probability of data under model parameters; central in many objectives. |
Linear regression | 146 | Fits a linear relationship between features and a numeric outcome. |
LLaMA | 627 | Efficient open LLM family widely used for research and fine-tuning. |
Log likelihood | 53 | Log of the likelihood; turns products into sums for stable optimization. |
Logistic regression | 143 | Linear classifier that models class probability with a sigmoid. |
Logits | 76 | Raw, unnormalized scores before softmax or sigmoid makes probabilities. |
Lora | 367 | Low-Rank Adaptation: parameter-efficient fine-tuning via small trainable matrices. |
Loss function | 615 | Quantifies errors; training minimizes it to improve performance. |
Low rank adaptation | 172 | Factorizes weight updates to reduce training cost and memory. |
MAE | 150 | Mean Absolute Error; average absolute difference between predictions and truth. |
MSE | 113 | Mean Squared Error; penalizes larger errors more strongly than MAE. |
Machine learning | 6101 | Systems learn patterns from data to make predictions or decisions. |
Manifold learning | 27 | Finds low-dimensional structure embedded in high-dimensional data. |
Markov model | 4 | Assumes the next state depends only on the current state, not full history. |
Mask | 260 | Binary/soft map indicating which positions or pixels to attend or train on. |
Masked language model | 7 | Learns to predict masked tokens, building bidirectional text understanding. |
Mean average precision | 55 | Ranking/detection metric averaging precision across recall levels or classes. |
Meta learning | 226 | “Learning to learn” so models adapt quickly to new tasks with few examples. |
Mixture model | 57 | Represents data as coming from a mixture of simpler distributions. |
Mixture of experts | 245 | Routes inputs to specialized sub-models to scale capacity efficiently. |
Model compression | 77 | Shrinks models via pruning, quantization, or distillation to run faster. |
Multi head attention | 68 | Uses several attention “heads” to capture different relationships in parallel. |
Multi modal | 586 | Models that handle multiple data types (text, images, audio) together. |
Multi task | 375 | Trains one model to solve several tasks, sharing representations. |
N gram | 30 | Sequence of N tokens; basic unit in classic language models. |
N shot | 3 | Few-shot style where each class has N labeled examples. |
Naive bayes | 34 | Simple probabilistic classifier assuming feature independence. |
Named entity recognition | 108 | Finds and labels entities like people, places, and organizations in text. |
Natural language processing | 967 | Field focused on understanding and generating human language. |
Nearest neighbor | 81 | Predicts by finding the most similar examples in a feature space. |
NER | 81 | Shorthand for Named Entity Recognition. |
Neural network | 2183 | Layers of neurons that learn representations from data. |
NLP | 621 | Abbreviation for Natural Language Processing (see above). |
Novelty detection | 9 | Identifies new, previously unseen types of data or behaviors. |
Object detection | 459 | Locates and classifies objects within images by drawing boxes. |
Object tracking | 48 | Follows object identities frame-by-frame through video. |
Objective function | 155 | The quantity a model optimizes during training (the “goal”). |
One hot | 21 | Represents categories as vectors with a single 1 and all other entries 0. |
One shot | 113 | Learning from only a single example per class; extreme data efficiency. |
Online learning | 191 | Updates the model incrementally as new data arrives. |
Optical flow | 53 | Estimates pixel-wise motion between consecutive frames. |
Optimization | 3402 | Adjusting parameters to minimize loss and improve performance. |
Outlier detection | 44 | Finds data points that deviate strongly from the norm. |
Overfitting | 413 | When a model memorizes training data and performs poorly on new data. |
PaLM | 30 | Google’s family of large language models for advanced language tasks. |
Parameter efficient | 285 | Approaches (like LoRA) that fine-tune models with few extra parameters. |
Parsing | 60 | Analyzes text structure (syntactic or semantic) to understand meaning. |
Pattern matching | 9 | Finds predefined patterns in data, like regex in text. |
Pattern recognition | 50 | Identifying regularities in data signals; broad umbrella term. |
PCA | 107 | Principal Component Analysis; rotates data to uncorrelated axes to reduce dimension. |
Perplexity | 174 | Language-modeling metric; lower is better (model is less “surprised”). |
Pose estimation | 86 | Predicts positions of key body joints or object poses. |
Positional encoding | 67 | Adds order information to tokens for transformer models. |
Precision | 800 | Of predicted positives, the fraction that are actually correct. |
Pretraining | 487 | Generic training on large data before fine-tuning on a target task. |
Principal component analysis | 92 | The full name for PCA; the same dimensionality-reduction method. |
Probabilistic model | 52 | Uses probability distributions to model uncertainty and data generation. |
Probability | 889 | The math of uncertainty; underpins many ML objectives. |
Prompt | 1208 | Input text or instructions given to an LLM to guide behavior. |
Prompting | 733 | Techniques for crafting prompts to get better model outputs. |
Pruning | 483 | Removes less important weights/neurons to shrink and speed models. |
Quantization | 499 | Uses lower-precision numbers (e.g., int8) to make models smaller and faster. |
Question answering | 857 | Systems that answer questions from documents or knowledge sources. |
RAG | 386 | Retrieval-Augmented Generation: fetches documents to ground an LLM’s answer. |
Random forest | 230 | Ensemble of decision trees that averages predictions for stability. |
RCNN | 9 | Region-based CNN family for object detection using region proposals. |
Recall | 386 | Of all actual positives, the fraction the model successfully finds. |
Recurrent network | 4 | Processes sequences using hidden state that carries over time. |
Region proposal | 3 | Candidate object regions suggested before classification/refinement. |
Regression | 1026 | Predicts a continuous numeric value (e.g., price or temperature). |
Regularization | 741 | Techniques that control model complexity to improve generalization. |
Reinforcement learning | 2527 | Agents learn to act by maximizing rewards through trial and error. |
Reinforcement learning from human feedback | 169 | Uses human preferences to shape or fine-tune model behavior. |
Relu | 171 | Popular activation that passes positives and zeros out negatives. |
Representation learning | 569 | Automatically learns useful feature spaces from raw data. |
Residual network | 13 | Uses skip connections to ease optimization of deep networks. |
Resnet | 186 | A widely used residual network family for image recognition. |
Retrieval augmented generation | 421 | Retrieves documents to ground LLM responses with real context. |
RLHF | 222 | Short for Reinforcement Learning from Human Feedback. |
RNN | 125 | Recurrent Neural Networks process sequences with shared weights over time. |
Roc curve | 12 | Plots true positive rate vs. false positive rate across thresholds. |
Rouge | 100 | Summarization metric comparing overlap with reference summaries. |
SAM | 159 | Segment Anything Model: segments arbitrary objects in images from prompts such as points or boxes. |
Scaling laws | 95 | Empirical rules showing how performance scales with model/data/compute. |
Scene understanding | 62 | Interprets complete scenes: objects, layout, and relationships. |
Self attention | 367 | Tokens attend to one another within the same sequence. |
Self supervised | 768 | Learns from unlabeled data by creating pretext tasks (e.g., predicting masked parts). |
Self training | 75 | Uses a model’s confident predictions as pseudo-labels to improve itself. |
Semantic parsing | 18 | Maps text into structured meaning like logical forms or SQL. |
Semantic segmentation | 249 | Assigns a class label to every pixel in an image. |
Semantics | 370 | The study of meaning in language and learned representations. |
Semi supervised | 324 | Learns from a small labeled set plus many unlabeled examples. |
Semi supervision | 2 | Another phrasing of semi-supervised learning. |
Seq2seq | 20 | Sequence-to-sequence models that map input sequences to output sequences. |
Sequence model | 22 | Any model designed for ordered data like text or time series. |
Siamese network | 14 | Twin networks that compare inputs via a shared embedding space. |
Sigmoid | 39 | S-shaped activation mapping real numbers to (0,1) probabilities. |
Signal processing | 67 | Analyzing and transforming time-series and sensor data. |
Softmax | 134 | Converts logits into a probability distribution across classes. |
Spatiotemporal | 203 | Data varying over space and time (e.g., video, geo-temporal data). |
Spectral clustering | 25 | Uses eigenvectors of a similarity graph to cluster data that is not linearly separable. |
Statistical model | 23 | Assumes a probabilistic form for data to enable inference and testing. |
Stemming | 15 | Chops words down to crude roots (e.g., “studies”→“studi”) to normalize vocabulary. |
Stochastic gradient descent | 222 | Optimizes using small random batches for efficiency and generalization. |
Stopword | 2 | Common words (like “the”, “and”) often removed in classic NLP pipelines. |
Student model | 106 | The smaller model trained via distillation from a teacher model. |
Style transfer | 51 | Applies the visual style of one image to the content of another. |
Summarization | 375 | Condenses long text into shorter, salient summaries. |
Super resolution | 100 | Upscales images to higher resolution with learned detail. |
Supervised | 1081 | Learning from labeled input-output pairs. |
Support vector machine | 65 | Margin-based classifier that finds a separating hyperplane. |
Syntax | 66 | Structural rules of language that govern how words combine. |
Synthetic data | 542 | Artificially generated datasets used to augment or replace real data. |
T5 | 84 | Text-to-Text Transfer Transformer that frames many NLP tasks as text generation. |
Tanh | 18 | Activation squashing values to (-1, 1), centered around zero. |
Teacher model | 104 | The larger or stronger model whose behavior guides a student via distillation. |
Temperature | 243 | Sampling knob that controls randomness in generation (higher = more random). |
Template matching | 3 | Finds regions in an image similar to a given template patch. |
Temporal model | 8 | Any model designed to capture time-dependent patterns. |
Text classification | 202 | Assigns categories or labels to pieces of text. |
Text generation | 259 | Produces new text, often guided by prompts or constraints. |
TF IDF | 18 | Weights terms by frequency in a document vs. across the corpus to highlight salient words. |
Time series | 1222 | Data observed over time (e.g., sensors, finance) requiring sequence models. |
Token | 810 | The atomic unit a model reads/writes (word pieces, bytes, or characters). |
Tokenization | 130 | Splits text into tokens the model can process. |
Tokenizer | 85 | The algorithm/model that performs tokenization (e.g., BPE, WordPiece). |
Tracking | 361 | Following targets (objects, features) across frames or time steps. |
Transfer learning | 407 | Reuses knowledge from one task/domain to improve another. |
Transferability | 196 | How well learned features or skills move across tasks or datasets. |
Transformer | 2234 | Attention-based architecture that powers modern NLP and multimodal AI. |
Translation | 665 | Converts text from one language to another automatically. |
Triplet loss | 15 | Trains embeddings by pulling anchors toward positives and pushing away negatives. |
Tsne | 2 | Nonlinear technique for visualizing high-dimensional data in 2D/3D. |
Umap | 25 | Fast nonlinear dimensionality reduction preserving global/local structure. |
Underfitting | 11 | When a model is too simple to capture patterns, performing poorly on train and test. |
Unet | 60 | Encoder-decoder with skip connections, strong for medical image segmentation. |
Unsupervised | 815 | Learns patterns from unlabeled data (clustering, representation learning). |
Variational autoencoder | 141 | Probabilistic autoencoder that learns a latent distribution for generation. |
Vector space | 30 | Mathematical space where embeddings live and can be compared. |
Vectorization | 19 | Converting items (words, images) into numeric vectors for modeling. |
Vision transformer | 179 | Applies transformer architecture directly to image patches. |
VIT | 257 | Abbreviation for Vision Transformer (see above). |
Word2Vec | 18 | Classic method that learns word embeddings by predicting words from their surrounding context (skip-gram/CBOW). |
XGBoost | 134 | High-performance gradient-boosted trees for tabular data. |
Yolo | 61 | “You Only Look Once” family of real-time object detectors. |
Zero shot | 1058 | Performs a task without seeing labeled examples for that specific task. |