AI / Deep Learning
My Kaggle: https://www.kaggle.com/ckkamaraj
│ ├── Autoencoder (AE) — compress input into latent space and reconstruct
│ └── Variational Autoencoder (VAE) — probabilistic latent space; regularized reconstruction
├── Autoencoder (AE) — detect anomalies via high reconstruction error
└── Variational Autoencoder (VAE) — detect anomalies via low likelihood or high reconstruction error
├── Learn representations using input itself as supervision (e.g., contrastive learning, masked autoencoding)
├── (refer below)
│
│ ├─ word2vec — embeddings learned by predicting words from their local context
│ └─ GloVe — embeddings learned by factorizing global co-occurrence statistics
│
│ └─ seq2seq — encoder-decoder for translation / summarization
│
│ ├─ BERT — masked-token self-supervision
│ ├─ RoBERTa — optimized BERT training recipe (dynamic masking, no next-sentence prediction, more data)
│ └─ DeBERTa — disentangled attention for cleaner context
│
│ ├─ GPT series — scalable generative language modeling
│ ├─ LLaMA — efficient open LLM family
│ ├─ Mistral / Mixtral — small, fast, high-quality open models
│ └─ Claude — reasoning-focused long-context models
│
├─ T5 / FLAN-T5 — unify tasks as “text-to-text”
└─ BART — denoising autoencoder for sequence reconstruction
Text understanding and generation: Machine translation, sentiment analysis, question answering, summarization, and chatbots.
Transformers: Attention-based architectures for sequences, built on self-attention.
GPT: Generative Pre-trained Transformer
Autoencoders and Variational Autoencoders (VAEs): For unsupervised learning; used in dimensionality reduction, denoising, and generative tasks.
AlphaFold: DeepMind's protein structure prediction system, built on attention-based (transformer-style) networks.
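
The self-attention mechanism named in the Transformers entry above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full transformer layer: the function names, the random weights, and the shapes (`seq_len=4`, `d_model=8`) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended values (seq_len, d_head) and the
    attention weights (seq_len, seq_len).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every position scores every other position; scaling keeps the softmax well-behaved.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical toy input: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` sums to 1, so every output token is a weighted average of the value vectors. Real transformers add multiple heads, masking, residual connections, and layer normalization on top of this core step.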
│
│ ├─ VGG — deep stack of small 3×3 convolutions
│ ├─ ResNet — skip-connections to stabilize deep training
│ └─ EfficientNet — compound scaling of network width, depth, and input resolution
│
│ ├─ ViT — pure transformer patch attention
│ ├─ Swin — hierarchical local attention windows
│ └─ DeiT — data-efficient ViT training
│
│ ├─ DCGAN — convolutional generator/discriminator framework
│ ├─ StyleGAN — style-based latent control of image attributes
│ └─ BigGAN — high-fidelity class-conditional generation
│
├─ Stable Diffusion — latent diffusion for efficient text→image
├─ Imagen — high-fidelity text→image
└─ DALL·E 2 / DALL·E 3 — prompt-driven image synthesis
Image/video analysis: Image classification, object detection, semantic segmentation, facial recognition, and pose estimation.
Convolutional Neural Networks (CNNs): Feature extraction via convolutions and pooling.
Generative Adversarial Networks (GANs): Two-network setup (generator vs. discriminator); for realistic data generation, like images or videos.
Diffusion Models: Probabilistic models for iterative denoising; high-quality generation in images, audio, and video.
Midjourney: Proprietary diffusion-based model for artistic image generation.
Sora: OpenAI's text-to-video diffusion model.
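
The "feature extraction via convolutions and pooling" step from the CNN entry above can be illustrated with plain NumPy. This is a rough sketch under simplifying assumptions (single channel, stride 1, no padding, a hand-picked vertical-edge kernel); the helper names `conv2d_valid` and `max_pool2d` are made up for the example, and real CNNs learn their kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Sliding-window cross-correlation (the operation deep learning
    libraries call convolution), with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill
    a complete window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked kernel that responds to dark-to-bright vertical edges.
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

features = conv2d_valid(image, edge_kernel)   # shape (4, 4)
pooled = max_pool2d(features, size=2)         # shape (2, 2)
print(features)
print(pooled)
```

The feature map is nonzero only in the columns where the window straddles the edge, and pooling keeps the strongest response in each 2×2 block while halving the resolution.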
│
│ ├─ wav2vec 2.0 — self-supervised speech embeddings
│ └─ Whisper — robust multilingual speech recognition
│
├─ VALL-E — discrete token voice cloning
├─ Bark — expressive text-to-speech
└─ MusicGen — text→music generation
Sound processing: Speech recognition, text-to-speech synthesis, speaker identification, and audio classification.
│
│ └─ CLIP — shared embedding space for text ↔ image similarity
│
│ ├─ Flamingo — few-shot multimodal reasoning
│ ├─ BLIP / BLIP-2 — captioning + visual question answering alignment
│ └─ LLaVA — conversational image understanding via LLM
│
├─ GPT-4V — visual + language integration
├─ Gemini — multimodal + planning capabilities
└─ Claude 3 — vision + language reasoning
Combines multiple data types (e.g., text + image): Visual question answering, captioning, and cross-modal retrieval.
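
Cross-modal retrieval in a shared embedding space, as in the CLIP entry above, reduces to ranking by cosine similarity once both modalities are embedded. The sketch below assumes the embeddings already exist: the three "image" vectors and the "text" vector are hypothetical toy values standing in for encoder outputs, and no real CLIP model is loaded.

```python
import numpy as np

def l2_normalize(x):
    # After normalization, a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(text_emb, image_embs):
    """Rank images by cosine similarity to a text embedding.

    Returns (ranking, sims): image indices from best to worst match,
    and the similarity score of each image.
    """
    sims = l2_normalize(image_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims), sims

# Hypothetical embeddings (stand-ins for image/text encoder outputs).
image_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
text_emb = np.array([0.9, 0.1, 0.0])

ranking, sims = retrieve(text_emb, image_embs)
print(ranking, sims)
```

Here image 0 ranks first because its direction is closest to the text vector. Contrastive training is what makes matching text and image pairs end up close together in such a space.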
│
│ ├─ Matrix Factorization — latent user × item space
│ └─ Neural Collaborative Filtering — deep learned embeddings
│
└─ SASRec — transformer-based sequential recommender
Predicts user preferences: Used in e-commerce, content streaming, and personalized ads.
Graph Neural Networks (GNNs): For non-Euclidean data like graphs; applied in social networks, molecular modeling, and recommendation.
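
The "latent user × item space" behind the Matrix Factorization entry above can be sketched as follows. This is a minimal example under stated assumptions: the 4×4 rating matrix is invented, a 0 marks an unobserved rating, and plain full-batch gradient descent is used rather than any particular library's solver.

```python
import numpy as np

def factorize(ratings, mask, n_factors=2, lr=0.01, reg=0.01, epochs=2000, seed=0):
    """Approximate ratings ~ P @ Q.T using only the observed entries.

    P: (n_users, n_factors) user embeddings.
    Q: (n_items, n_factors) item embeddings.
    history: mean squared error on observed entries at each epoch.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    history = []
    for _ in range(epochs):
        err = mask * (ratings - P @ Q.T)   # error on observed entries only
        history.append(float((err ** 2).sum() / mask.sum()))
        grad_P = -err @ Q + reg * P        # gradient plus L2 regularization
        grad_Q = -err.T @ P + reg * Q
        P -= lr * grad_P
        Q -= lr * grad_Q
    return P, Q, history

# Hypothetical ratings: two groups of users with opposite tastes; 0 = unobserved.
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)
mask = (ratings > 0).astype(float)

P, Q, history = factorize(ratings, mask)
predicted = P @ Q.T   # includes estimates for the unobserved cells
print(history[0], history[-1])
```

The zero cells in `ratings` never contribute to the loss, so the values `predicted` holds there are the model's recommendations. Neural Collaborative Filtering replaces the dot product `P @ Q.T` with a learned network over the same kind of embeddings.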
│
│ └─ DQN — deep Q-learning from visual state input
│
│ ├─ A2C / A3C — advantage actor-critic; A3C runs parallel asynchronous workers, A2C is the synchronous variant
│ └─ PPO — stable policy updates (used widely in RLHF for LLMs)
│
├─ AlphaGo — neural policy/value + tree search
├─ AlphaZero — general self-play learning
└─ MuZero — learns a latent dynamics model for planning, without being given the environment's rules
Decision-making in environments: Game playing, robotics control, autonomous driving, and optimization problems.
Q-Networks, Policy Gradients: For agent-environment interactions; include DQN, PPO, and actor-critic methods.
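
The Q-learning update that DQN builds on can be shown with a table instead of a neural network. The sketch below is a simplified stand-in, not DQN itself: the 5-state chain environment, the `step` helper, and the uniformly random behavior policy are all assumptions made for the example (Q-learning is off-policy, so it can learn the greedy policy from random exploration).

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # actions: 0 = move left, 1 = move right
GOAL = N_STATES - 1          # reaching the rightmost state ends the episode

def step(state, action):
    """Hypothetical chain environment: reward 1.0 only on reaching GOAL."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=200, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Behavior policy: uniformly random actions.
            action = int(rng.integers(N_ACTIONS))
            nxt, reward, done = step(state, action)
            # TD target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * Q[nxt].max())
            Q[state, action] += alpha * (target - Q[state, action])
            state = nxt
            if done:
                break
    return Q

Q = q_learning()
greedy_policy = Q[:GOAL].argmax(axis=1)   # best action in each non-terminal state
print(Q.round(3))
print(greedy_policy)
```

After training, the greedy policy moves right in every non-terminal state. DQN keeps the same TD target but estimates `Q` with a neural network and adds a replay buffer and a target network for stability.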