AI / Deep Learning

My Kaggle: https://www.kaggle.com/ckkamaraj

▶️ Paradigm Perspective: Learning Approach → Architecture

Supervised Learning

├── Classification — predict discrete labels from input data

└── Regression — predict continuous values from input data

Unsupervised Learning

├── Clustering — group similar data points without labels (e.g., k-means, DBSCAN)

├── Dimensionality Reduction / Feature Learning

│ ├── Autoencoder (AE) — compress input into latent space and reconstruct

│ └── Variational Autoencoder (VAE) — probabilistic latent space; regularized reconstruction

└── Anomaly Detection

  ├── Autoencoder (AE) — detect anomalies via high reconstruction error

  └── Variational Autoencoder (VAE) — detect anomalies via low likelihood or high reconstruction error
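
The clustering entry above names k-means. As a rough illustration of how it groups points without labels, here is a minimal sketch of Lloyd's algorithm in plain Python. The toy points, the `kmeans` helper, and the choice to seed centroids with the first `k` points are assumptions made for this example, not part of the original outline.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    # Assumption: seed centroids with the first k points so the sketch is deterministic.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical toy data: two loose groups near (0, 0) and (10, 10).
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.2), (0.1, 0.4), (10.3, 9.8)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Real libraries usually pick starting centroids more carefully (for example with k-means++), since a poor seed can lead to a poor local optimum.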

Self-Supervised Learning

└── Learn representations using the input itself as supervision (e.g., contrastive learning, masked autoencoding)

Generative Modeling

├── Variational Autoencoder (VAE) — generate data by sampling probabilistic latent space

├── Generative Adversarial Networks (GANs) — generator vs. discriminator to create realistic data

└── Diffusion Models — gradually denoise random noise to generate data

Reinforcement Learning

└── See “Reinforcement Learning (Control / Planning / Agents)” below

▶️ Problem Perspective: Domain → Architecture → Models

Natural Language Processing (Text / Documents / Code)

│

├── Embedding Models — map words/token sequences into vector spaces

│ ├─ word2vec — learns embeddings by predicting words from their local context window

│ └─ GloVe — learns embeddings by factorizing global word co-occurrence statistics

│

├── RNN / LSTM / GRU — sequential processing with hidden state

│ └─ seq2seq — encoder-decoder for translation / summarization

│

├── Transformer Encoder-Only — contextual understanding

│ ├─ BERT — masked-token self-supervision

│ ├─ RoBERTa — optimized BERT training recipe (more data, dynamic masking, no next-sentence prediction)

│ └─ DeBERTa — disentangled attention over content and position

│

├── Transformer Decoder-Only — next-token generation

│ ├─ GPT series — scalable generative language modeling

│ ├─ LLaMA — efficient open LLM family

│ ├─ Mistral / Mixtral — small, fast, high-quality open models

│ └─ Claude — reasoning-focused long-context models

│

└── Encoder–Decoder Transformers — input→output structured transformation

  ├─ T5 / FLAN-T5 — unify tasks as “text-to-text”

  └─ BART — denoising autoencoder for sequence reconstruction

Text understanding and generation: Machine translation, sentiment analysis, question answering, summarization, and chatbots.

Transformers: Attention-based architectures for sequences, built on self-attention.
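
To make the self-attention idea concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes identity projections (queries, keys, and values are all the raw token vectors), whereas a real transformer layer learns separate projection matrices and uses several heads. The `tokens` values are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption of this sketch)."""
    d = len(x[0])
    # Each token scores every token, scaled by sqrt(d) to keep values moderate.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors.
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]
    return weights, out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3 tokens, dimension 2
weights, out = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output token is a convex combination of the inputs; this is what lets a token draw context from any position in the sequence.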

GPT: Generative Pre-trained Transformer

Autoencoders and Variational Autoencoders (VAEs): For unsupervised learning; used in dimensionality reduction, denoising, and generative tasks.

AlphaFold: DeepMind's protein structure prediction system, built on attention-based networks.

Computer Vision

│

├── CNNs — learn spatial features via convolution + feature hierarchy

│ ├─ VGG — deep stack of small 3×3 convolutions

│ ├─ ResNet — skip-connections to stabilize deep training

│ └─ EfficientNet — compound scaling of network width, depth, and input resolution

│

├── Vision Transformers — model image patches with attention instead of convolution

│ ├─ ViT — pure transformer patch attention

│ ├─ Swin — hierarchical local attention windows

│ └─ DeiT — data-efficient ViT training

│

├── GANs (Adversarial Training) — generator tries to fool discriminator

│ ├─ DCGAN — convolutional generator/discriminator framework

│ ├─ StyleGAN — style-based latent control of image attributes

│ └─ BigGAN — high-fidelity class-conditional generation

│

└── Diffusion Models — generate by iteratively denoising noise

  ├─ Stable Diffusion — latent diffusion for efficient text→image

  ├─ Imagen — high-fidelity text→image

  └─ DALL·E 2 / DALL·E 3 — prompt-driven image synthesis

Image/video analysis: Image classification, object detection, semantic segmentation, facial recognition, and pose estimation.

Convolutional Neural Networks (CNNs): Feature extraction via convolutions and pooling.
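
As an illustration of "convolutions and pooling", here is a small sketch in plain Python: a valid, stride-1 convolution (implemented as cross-correlation, which is what most deep learning libraries compute) followed by 2×2 max pooling. The toy image and the hand-written edge kernel are assumptions for the example; in a trained CNN the kernel values are learned.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

features = conv2d(image, edge_kernel)  # 4x4 feature map
pooled = max_pool2d(features)          # 2x2 map after pooling
print(features)
print(pooled)
```

The feature map is nonzero only in the column where the edge sits, and pooling keeps that response while halving the spatial resolution.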

Generative Adversarial Networks (GANs): Two-network setup (generator vs. discriminator); for realistic data generation, like images or videos.

Diffusion Models: Probabilistic models for iterative denoising; high-quality generation in images, audio, and video.
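
Iterative denoising is trained to reverse a fixed forward process that gradually adds noise. The sketch below shows only that forward process, assuming a DDPM-style formulation with a linear noise schedule; the source does not name a specific formulation, and the schedule values and step count here are illustrative defaults. The learned denoising network is not shown.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (illustrative values)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): how much of the original signal survives at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Sample x_t from N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one shot."""
    scale, sigma = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
abar = alpha_bars(linear_beta_schedule(1000))
x0 = [1.0, -1.0, 0.5]  # hypothetical "clean" data point

slightly_noisy = q_sample(x0, 10, abar, rng)   # still close to x0
nearly_noise = q_sample(x0, 999, abar, rng)    # almost pure Gaussian noise
print(abar[10], abar[999])
```

Generation runs the other way: start from pure noise and repeatedly apply a trained network that estimates and removes the noise at each step.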

Midjourney: Proprietary diffusion-based model for artistic image generation.

Sora: OpenAI's text-to-video diffusion model.

Speech & Audio

│

├── Representation Models — learn latent acoustic features

│ ├─ wav2vec 2.0 — self-supervised speech embeddings

│ └─ Whisper — robust multilingual speech recognition

│

└── Generative Audio — generate speech/music from text or embeddings

  ├─ VALL-E — discrete token voice cloning

  ├─ Bark — expressive text-to-speech

  └─ MusicGen — text→music generation

Sound processing: Speech recognition, text-to-speech synthesis, speaker identification, and audio classification.

Multi-Modal (Text + Image + Video + Code)

│

├── Joint Embedding Models — align cross-modal representations

│ └─ CLIP — shared embedding space for text ↔ image similarity

│

├── Vision-Language Transformers — jointly attend across modalities

│ ├─ Flamingo — few-shot multimodal reasoning

│ ├─ BLIP / BLIP-2 — captioning + visual question answering alignment

│ └─ LLaVA — conversational image understanding via LLM

│

└── Unified Multimodal Foundation Models — broad cross-modal reasoning

  ├─ GPT-4V — visual + language integration

  ├─ Gemini — multimodal + planning capabilities

  └─ Claude 3 — vision + language reasoning

Combines multiple data types (e.g., text + image): Visual question answering, captioning, and cross-modal retrieval.
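
Cross-modal retrieval with a joint embedding model (the CLIP entry above) comes down to comparing vectors in a shared space. The sketch below ranks images against a text query by cosine similarity. All embeddings and file names are hand-written, hypothetical values; in practice they would come from trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical image embeddings in a 3-dimensional shared space.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.1, 0.9, 0.1],
    "beach.jpg": [0.0, 0.2, 0.9],
}
# Stands in for the encoded caption "a photo of a dog".
text_embedding = [0.8, 0.2, 0.1]

ranked = sorted(
    image_embeddings,
    key=lambda name: cosine(text_embedding, image_embeddings[name]),
    reverse=True,
)
print(ranked)
```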

Recommendation / Personalization

│

├── Collaborative Filtering — learn user/item latent preferences

│ ├─ Matrix Factorization — latent user × item space

│ └─ Neural Collaborative Filtering — deep learned embeddings

│

└── Sequence-based Recommenders — model user behavior over time

  └─ SASRec — transformer-based sequential recommender

Predicts user preferences: Used in e-commerce, content streaming, and personalized ads.
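
The matrix factorization entry above can be sketched as follows: each user and each item gets a short latent vector, and a rating is approximated by their dot product. This is a minimal stochastic gradient descent version in plain Python. The ratings, the `factorize` and `mse` helpers, and every hyperparameter are assumptions chosen for illustration.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.01, epochs=1000, seed=0):
    """SGD matrix factorization: rating(u, i) is approximated by dot(P[u], Q[i])."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def mse(P, Q, ratings):
    """Mean squared error over the observed ratings only."""
    errs = [(r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings]
    return sum(errs) / len(errs)

# Hypothetical (user, item, rating) triples; unobserved pairs are simply absent.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4),
    (3, 0, 1), (3, 2, 4), (3, 3, 5),
]

before = mse(*factorize(ratings, 4, 4, epochs=0), ratings)  # untrained factors
P, Q = factorize(ratings, 4, 4)
after = mse(P, Q, ratings)
print(before, after)

# Predicted score for a pair that was never rated (user 0, item 3).
print(sum(p * q for p, q in zip(P[0], Q[3])))
```

Neural collaborative filtering keeps the same user and item embeddings but replaces the dot product with a learned network.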

Graph Neural Networks (GNNs): For non-Euclidean data like graphs; applied in social networks, molecular modeling, and recommendation.

Reinforcement Learning (Control / Planning / Agents)

│

├── Value-Based — learn expected return per state/action

│ └─ DQN — deep Q-learning from visual state input

│

├── Policy Gradient / Actor-Critic — directly optimize decision policy

│ ├─ A2C / A3C — advantage actor-critic with parallel workers (A2C synchronous, A3C asynchronous)

│ └─ PPO — stable policy updates (used widely in RLHF for LLMs)

│

└── Search + Self-Play — combine planning with learned models

  ├─ AlphaGo — neural policy/value + tree search

  ├─ AlphaZero — general self-play learning

  └─ MuZero — learns environment dynamics implicitly

Decision-making in environments: Game playing, robotics control, autonomous driving, and optimization problems.

Q-Networks, Policy Gradients: Methods for learning from agent-environment interaction, including DQN, PPO, and actor-critic methods.
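
As a small illustration of the value-based family, the sketch below runs tabular Q-learning on a made-up five-state corridor where the agent earns a reward for reaching the rightmost state. DQN uses the same update target but replaces the table with a neural network; the environment, the `step` and `train` helpers, and the hyperparameters are all assumptions for this example.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical corridor environment: reward 1.0 on reaching the goal state."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action choice, with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = train()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Policy-gradient methods such as PPO skip the value table and adjust the action probabilities directly, usually with a learned value estimate as a baseline.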
