Questions#

Machine Learning#

Machine Learning Concepts
How would you describe the concept of machine learning in your own words?

Machine learning focuses on creating systems that improve their performance on a task by learning patterns from data rather than relying on explicit programming.

Can you give a few examples of real-world areas where machine learning is particularly effective?

Machine learning is especially valuable for solving complex problems without clear rule-based solutions, automating decision-making instead of hand-crafted logic, adapting to changing environments, and extracting insights from large datasets.

What are some typical problems addressed with unsupervised learning methods?

Typical unsupervised learning tasks include clustering, data visualization, dimensionality reduction, and association rule mining.

Would detecting spam emails be treated as a supervised or unsupervised learning problem, and why?

Spam filtering is an example of a supervised learning problem because the model learns from examples of emails labeled as "spam" or "not spam".

What does the term ‘out-of-core learning’ refer to in machine learning?

Out-of-core learning enables training on datasets too large to fit in memory by processing them in smaller chunks (mini-batches) and updating the model incrementally.
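A rough sketch of out-of-core training with scikit-learn's `partial_fit` (the file name `huge_dataset.csv` and label column `y` are assumptions for illustration):

```python
# Minimal out-of-core sketch: stream a too-big-for-RAM CSV in chunks and
# update an incremental learner one mini-batch at a time.
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")   # supports incremental (mini-batch) updates
classes = [0, 1]                         # all labels must be declared for partial_fit

for chunk in pd.read_csv("huge_dataset.csv", chunksize=10_000):
    X, y = chunk.drop(columns=["y"]), chunk["y"]
    model.partial_fit(X, y, classes=classes)   # one incremental update per mini-batch
```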

How can you distinguish between model parameters and hyperparameters?
  • Model parameters define how the model behaves and are learned during training (e.g., weights in linear regression).

  • Hyperparameters are external settings chosen before training, such as the learning rate or regularization strength.
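A small illustration with scikit-learn on synthetic data: `alpha` is a hyperparameter chosen before training, while `coef_` and `intercept_` are the parameters learned during training.

```python
# Hyperparameter vs. learned parameters, illustrated on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

model = Ridge(alpha=1.0)   # alpha (regularization strength): chosen before training
model.fit(X, y)

print(model.coef_, model.intercept_)   # parameters: learned from the data during fit()
```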

What are some major difficulties or limitations commonly faced when building machine learning systems?
Key challenges in machine learning include:

- insufficient or low-quality data
- poor feature selection
- non-representative samples
- models that either underfit (too simple) or overfit (too complex)

If a model performs well on training data but poorly on unseen data, what issue is occurring, and how might you address it?

When a model performs well on training data but poorly on unseen examples, it’s overfitting. This can be mitigated by collecting more diverse data, simplifying the model, applying regularization, or cleaning noisy data.

What is a test dataset used for, and why is it essential in evaluating a model’s performance?

A test set provides an unbiased estimate of how well a model will perform on new, real-world data before deployment.

What role does a validation set play during the model development process?

A validation set helps compare multiple models and tune hyperparameters, ensuring better generalization to unseen data.

What is a train-dev dataset, in what situations would you create one, and how is it applied during model evaluation?

The train-dev set is a small portion of the training data set aside to identify mismatches between the training distribution and the validation/test distributions. You use it when you suspect that your production data may differ from your training data. The model is trained on most of the training data and evaluated on the train-dev set to detect overfitting or data mismatch before comparing results on the validation set.
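A minimal sketch of carving out the different sets (the split ratios here are arbitrary choices, not a recommendation):

```python
# Train / validation / test split, plus a small "train-dev" slice held out
# from the training portion to separate data mismatch from plain overfitting.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
X_train, X_traindev, y_train, y_traindev = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
# test: final unbiased estimate; val: model selection and hyperparameter tuning;
# train-dev: same distribution as training data, used to diagnose data mismatch
```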

Why is it problematic to adjust hyperparameters based on test set performance?

If you tune hyperparameters using the test set, you risk overfitting to that specific test data, making your performance results misleadingly high. As a result, the model might perform worse in real-world scenarios because the test set is no longer an unbiased measure of generalization.

LLM Fundamentals#

Explain bias-variance tradeoff. How does it manifest in LLMs?
  • Bias: Error from incorrect assumptions in the model

    • High bias leads to underfitting, where the model fails to capture patterns in the training data
  • Variance: Error from sensitivity to small fluctuations in the training data

    • High variance leads to overfitting, where the model memorizes noise instead of learning generalizable patterns

The bias-variance tradeoff in ML describes the tension between a model's ability to fit the training data and its ability to generalize to new data.

Bias-variance in LLMs:

  • Model Parameters: Capacity vs. Overfitting

    • Too few parameters: A model with insufficient capacity (e.g., a small transformer) cannot capture complex patterns in the data, leading to high bias.

      A small LLM might fail to understand language or generate coherent long texts.

    • Too many parameters: A model with excessive capacity risks overfitting to training data, memorizing noise and details instead of learning generalizable patterns

      A large LLM fine-tuned on a small dataset may generate text that is statistically similar to the training data but lacks coherence and factual accuracy (e.g., hallucinations).

    • Balancing Act:

      More parameters reduce bias by enabling the model to capture complex patterns, but they increase variance if not regularized. Regularization techniques (e.g., dropout, weight decay) help mitigate overfitting in high-parameter models.

  • Training Epochs: Learning Duration vs. Overfitting

    • Too few epochs: The model hasn't learned enough from the data, leading to high bias.

      A transformer trained for only 1 epoch may fail to capture meaningful relationships in the text.

    • Too many epochs: The model starts memorizing training data, increasing variance. This is common in transformers with high capacity trained on small datasets.

      A transformer fine-tuned on a medical dataset for 100 epochs may overfit to rare cases, leading to poor generalization.

    • Tradeoff in Transformers

      Training loss decreases with epochs (low bias), but validation loss eventually increases (high variance).

      Early stopping is critical for transformers to avoid overfitting, especially when training on small or noisy datasets (see the sketch after this list).

  • Data Quality: Noise vs. Representativeness
    • Low-quality data: Noisy, biased, or incomplete data prevents the model from learning accurate patterns, increasing bias.

      A transformer trained on a dataset with limited examples of rare diseases may fail to diagnose them accurately.

    • Noisy/unrepresentative data: The model learns inconsistent patterns, increasing variance.

      A dataset with duplicate or corrupted text may cause the model to overfit, and a transformer trained on biased political content may generate polarized outputs. Data augmentation (e.g., paraphrasing, back-translation) increases diversity, mitigating overfitting.
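As referenced above, a minimal early-stopping sketch on a tiny synthetic task (PyTorch); the patience and threshold values are arbitrary choices:

```python
# Early stopping: keep training while validation loss improves, stop once it stalls.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(500):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()    # training loss keeps falling (bias goes down)
    opt.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()  # held-out loss rising signals variance
    if val < best_val - 1e-4:
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:
            break                                   # stop before the model starts memorizing
```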

What is the difference between L1 and L2 regularization? When would you use elastic net in an LLM fine-tune?

Regularization adds a penalty term to the loss function so that the optimizer favours simpler or smoother solutions. In practice it is usually added to a model‑level loss (cross‑entropy, MSE, …) as a separate scalar that scales with the weights.

\[ \text{Loss}_{\text{regularized}} = \text{Loss}_{\text{original}} + \lambda \cdot \text{Penalty}(w) \]
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Weight behavior | Many → 0 (sparse) | All → small, non-zero |
| Feature selection | Yes | No |
| Solution | Not always unique | Always unique |
| Robust to outliers | Less | More |

Key Insight:

  • L1 regularization is more robust to outliers in the data (target outliers)
  • L2 regularization is more robust to outliers in the features (collinearity)

L1/L2 in LLMs:

  • Use L2 by default. Use L1 if you want sparse, interpretable updates.
  • L2 keeps updates smooth. L1 keeps updates minimal — and that’s often better for deployment.
  • Use L2 to win benchmarks. Use L1 to ship to users.
    1. Sparse LoRA = Tiny Adapters
    2. Faster Inference (Real Speedup!)
    3. Better Generalization (Less Overfitting)
    4. Interpretable Fine-Tuning
    5. Clean Model Merging
  • Decision guide when fine-tuning an LLM:
    • Large, clean data? → use L2 (weight_decay=0.01)
    • Otherwise, need max accuracy? → use L2
    • Otherwise, want a small/fast model? → use L1 (+ pruning)
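A hedged sketch of how the two penalties typically enter a fine-tuning loop in PyTorch; the module, learning rate, and coefficients below are placeholder choices, not a specific LLM recipe:

```python
# L2 via AdamW's decoupled weight decay; L1 as an explicit penalty added to the loss.
import torch
import torch.nn as nn

adapter = nn.Linear(768, 768)   # stand-in for trainable adapter / LoRA weights

# L2 (Ridge-like): the usual default for LLM fine-tuning
optimizer = torch.optim.AdamW(adapter.parameters(), lr=2e-5, weight_decay=0.01)

def l1_penalty(module: nn.Module, lam: float = 1e-5) -> torch.Tensor:
    # L1 (Lasso-like): pushes many update weights toward exactly zero (sparse adapters)
    return lam * sum(p.abs().sum() for p in module.parameters())

# inside one training step; task_loss would normally be the cross-entropy on a batch
task_loss = adapter(torch.randn(4, 768)).pow(2).mean()   # placeholder objective for illustration
loss = task_loss + l1_penalty(adapter)
loss.backward()
optimizer.step()
```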
Prove that dropout is equivalent to an ensemble during inference (hint: geometric distribution).
  • Where dropout appears in a Transformer
    • Attention dropout
    • Feedforward dropout
    • Residual dropout
  • The ensemble view of dropout in a Transformer
    • Each layer (and even each neuron) may be dropped independently.
    • A particular dropout mask \(m = (m^{(1)}, m^{(2)}, \dots, m^{(L)})\) defines one specific subnetwork (one “member” of the ensemble).

During training: Randomly turn off some neurons (like flipping a coin for each one). This forces the network to learn many different "sub-networks" — each time you train, a different combination of neurons is active.

During testing (inference): Instead of picking one sub-network, we use all neurons, but scale down their strength (usually by half if dropout rate is 50%). This is the "mean network."

Why this is like an ensemble: Imagine you could run the model 1,000 times (or \(2^N\) times for N neurons), each time with a different random set of neurons turned off, and then average all their predictions. That would be a huge ensemble of sub-networks: very accurate, but way too slow. Dropout's trick: using the scaled "mean network" at test time approximates the geometric mean of the predictions of all those possible sub-networks (exactly so for a single softmax layer, approximately for deep networks).

Dropout = training lots of sub-networks, inference = using their collective average — fast and smart.
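A small numerical check of this claim (note that PyTorch implements "inverted" dropout, so the scaling happens at training time and `eval()` already gives the mean network); layer sizes and the sample count are arbitrary:

```python
# Compare the deterministic "mean network" against an explicit average over many
# randomly sampled dropout sub-networks.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 1))
x = torch.randn(8, 16)

net.eval()                                    # all units kept, activations properly scaled
with torch.no_grad():
    mean_network = net(x)

net.train()                                   # dropout stays active: each pass is one sub-network
with torch.no_grad():
    ensemble = torch.stack([net(x) for _ in range(2000)]).mean(dim=0)

print((mean_network - ensemble).abs().max())  # small: the mean network approximates the ensemble
```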

What is the curse of dimensionality? How do positional encodings mitigate it in Transformers?

Higher dimensions → sparser data → harder to learn meaningful relationships.

The curse of dimensionality refers to the set of problems that arise when data or model representations exist in high-dimensional spaces.

  • Data sparsity: Points become exponentially sparse — distances between points tend to concentrate, making similarity less meaningful.
  • Combinatorial explosion: The volume of the space grows exponentially, \(O(k^d)\), so covering it requires exponentially more data.
  • Poor generalization: Models struggle to learn smooth mappings because there’s too little data to constrain the high-dimensional space.

Tokens in Transformers: Transformers process tokens as vectors in a high-dimensional embedding space (e.g., 768 or 4096 dimensions). However, self-attention treats each token as a set element rather than a sequence element. The attention mechanism itself has no built-in sense of order; the model only knows "content similarity," not which token came first or last.

Without order, the model would need to learn positional relationships implicitly across high-dimensional embeddings. That’s hard — and it exacerbates the curse of dimensionality because:

  • There’s no geometric bias for position.
  • Each token embedding can drift freely in a massive space.
  • The model must infer ordering purely from statistical co-occurrence — requiring more data and more parameters.

How Positional Encodings Help

Positional encodings (PEs) inject structured, low-dimensional information about sequence order directly into the embeddings.

  • Adds a geometric bias to embeddings: nearby positions have nearby encodings.
  • Reduces the effective search space: positions are no longer independent random vectors.
  • Enables extrapolation: the sinusoidal pattern generalizes beyond training positions.
  • The model can compute relative positions via linear operations (e.g., dot products of PEs reflect distance).
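A minimal sketch of the sinusoidal positional encodings from the original Transformer, illustrating the "nearby positions have nearby encodings" bias (dimensions chosen arbitrarily):

```python
# Sinusoidal positional encodings: fixed sin/cos patterns over positions and dimensions.
import math
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)           # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
# dot products of encodings reflect relative distance: close positions are more similar
print((pe[10] @ pe[11]).item(), (pe[10] @ pe[100]).item())
```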

Explain maximum likelihood estimation for language modeling.

Training a neural LM (like a Transformer) by minimizing the negative log-likelihood (NLL) is the same as maximizing the likelihood:

\[\boxed{ \text{Maximizing likelihood} \;\; \Leftrightarrow \;\; \text{Maximizing log-likelihood} \;\; \Leftrightarrow \;\; \text{Minimizing negative log-likelihood} }\]

Example

Sentence: "The cat sat on the mat."

The MLE objective trains the model to maximize: \(P(\text{The}) \cdot P(\text{cat}|\text{The}) \cdot P(\text{sat}|\text{The cat}) \cdot P(\text{on}|\text{The cat sat}) \cdot \dots\)

What is negative log-likelihood? Write the per-token loss for GPT.
\[\boxed{ \ell(\theta) = \sum_{t=1}^T \log P(x_t \mid x_{<t}; \theta) } \]
\[\boxed{ \text{NLL}(\theta) = -\ell(\theta) } \]
\[\boxed{ \text{NLL}(\theta) = -\sum_{t=1}^T \log P(x_t \mid x_{<t}; \theta) } \]

where \(x_{<t}\) denotes all tokens before \(t\): \(x_1, \dots, x_{t-1}\).

This is the heart of autoregressive language modeling — like GPT!
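A hedged sketch of how this loss is computed for a GPT-style (decoder-only) model in PyTorch; the logits tensor is random here, standing in for `model(input_ids).logits`, and in practice the sum is averaged over the \(T\) tokens:

```python
# Next-token NLL: logits at position t are scored against the token at position t+1.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
input_ids = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)         # stand-in for model(input_ids).logits

shift_logits = logits[:, :-1, :]                     # predictions for positions 2..T
shift_labels = input_ids[:, 1:]                      # the tokens those positions must predict

nll = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(nll)   # mean negative log-likelihood per token; exp(nll) would be the perplexity
```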

Compare cross-entropy, perplexity, and BLEU. When is perplexity misleading?
  1. Cross-Entropy: Cross-entropy measures how well a probabilistic model predicts a target distribution — in LM, how well the model assigns high probability to the correct next tokens.
  2. Perplexity: Perplexity (PPL) is simply the exponentiation of the cross-entropy.
  3. BLEU (Bilingual Evaluation Understudy): BLEU is an n-gram overlap metric for evaluating machine translation or text generation quality against reference texts.

Perplexity rephrases cross-entropy in a more intuitive, human-readable form: roughly, how many tokens the model is effectively "choosing between" at each step.
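Concretely, for a sequence of \(T\) tokens:

\[\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log P(x_t \mid x_{<t})\right) = \exp(\text{cross-entropy per token})\]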

Perplexity = "How predictable is the language?"

BLEU = "How much does the output match a reference?"

Example:

Reference: "The cat is on the mat."

Model output: "The dog is on the mat."

→ Low perplexity (grammatical, fluent)

→ Low BLEU (wrong content)

BLEU is non-probabilistic and reference-based, unlike cross-entropy and perplexity.

⚠️ When Perplexity Is Misleading

Perplexity only measures how well the model predicts tokens probabilistically — not how meaningful or correct the generated text is.

  • Different tokenizations or vocabularies
    • A model with smaller tokens or subwords might have lower perplexity just because predictions are more granular, not actually better linguistically.
  • Domain mismatch
    • A model trained on Wikipedia might have low perplexity on Wikipedia text but produce incoherent answers to questions — it knows probabilities, not task structure.
  • Human-aligned vs statistical objectives
    • A model can assign high likelihood to grammatical but dull continuations (e.g., “The cat sat on the mat”) while rejecting creative or rare but correct continuations — good perplexity, poor real-world usefulness.
  • Non-autoregressive or non-likelihood models
    • For encoder-decoder or retrieval-augmented systems, perplexity may not correlate with generation quality because these models are not optimized purely for next-token prediction.
  • Overfitting
    • A model with very low perplexity on training data may memorize text but generalize poorly (BLEU or human eval drops).

Why is label smoothing used in LLMs? Derive its modified loss.

Label smoothing is used in LLMs to prevent overconfidence and improve generalization.

Instead of training on a one-hot target (where the correct token has probability 1 and all others 0), a small portion ε of that probability is spread across all other tokens.

So the true token gets (1 − ε) probability, and the rest share ε uniformly.

This changes the loss from the usual “−log p(correct token)” to a mix of:

  • (1 − ε) × loss for the correct token, and
  • ε × average loss over all tokens.
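Written out (using one common convention that smooths uniformly over the whole vocabulary of size \(V\)), the one-hot target is replaced by \(q(k) = (1-\varepsilon)\,\mathbb{1}[k=y] + \varepsilon/V\), and the per-token loss becomes:

\[\mathcal{L}_{\text{LS}} = -\sum_{k=1}^{V} q(k)\log p_\theta(k) = (1-\varepsilon)\bigl(-\log p_\theta(y)\bigr) + \frac{\varepsilon}{V}\sum_{k=1}^{V}\bigl(-\log p_\theta(k)\bigr)\]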
What is the difference between hard and soft attention?
  • Hard attention → discrete, selective, non-differentiable.
  • Soft attention → continuous, weighted, differentiable.

Fundamentals of Large Language Models (LLMs)#

Question Bank

  • Fundamentals of Large Language Models (LLMs)
LLM Basics
What are the main open-source LLM families currently available?
  • Llama: Decoder-Only
  • Mistral: Decoder-Only (MoE in Mixtral)
  • Gemma: Decoder-Only
  • Phi: Decoder-Only
  • Qwen: Decoder-Only (dense + MoE)
  • DeepSeek: Decoder-Only (MoE in V2)
  • Falcon: Decoder-Only
  • OLMo: Decoder-Only
What’s the difference between prefix decoder, causal decoder, and encoder-decoder architectures?
  • Causal Decoder (Decoder-Only): Autoregressive model that generates text left-to-right, attending only to previous tokens.
  • Prefix Decoder (PrefixLM): Causal decoder with a bidirectional prefix (input context) followed by autoregressive generation.
  • Encoder-Decoder (Seq2Seq): Two separate Transformer stacks (encoder and decoder); the decoder cross-attends to the encoder's outputs.
Causal Decoder
  • Prompt

    Translate to French: The cat is on the mat.

  • Generation (autoregressive, causal mask):

    Le [only sees "Le"]

    Le chat [sees "Le chat"]

    Le chat est [sees "Le chat est"]

    Le chat est sur [sees up to "sur"]

    Le chat est sur le [sees up to "le"]

    Le chat est sur le tapis. [final]

  • Summary

    Cannot see future tokens

    Cannot see full input bidirectionally — but works via prompt engineering

Prefix Decoder
  • Input Format

    [Prefix] The cat is on the mat. [SEP] Translate to French: [Generate] Le chat est sur le tapis.

  • Attention

    Prefix (The cat is on the mat. [SEP] Translate to French:) → bidirectional

    Generation (Le chat est sur le tapis.) → causal

Encoder-Decoder
  • The encoder reads the full input bidirectionally; the decoder generates the output autoregressively while cross-attending to the encoder outputs.
What is the training objective of large language models?

LLMs are trained to predict the next token in a sequence.

Why are most modern LLMs decoder-only architectures?

Most modern LLMs are decoder-only because this architecture is the simplest, fastest, and most flexible for large-scale text generation. The main reasons:

  • Decoder-only naturally matches the training objective
  • Simpler architecture → easier scaling
  • Better for long-context generation
  • Fits universal multitask learning with a single text stream
  • Aligns with inference needs
    • streaming output
    • token-by-token generation
    • low latency
    • high throughput
    • continuous prompts
Explain the difference between encoder-only, decoder-only, and encoder-decoder models.
  • Encoder-only Models (BERT, RoBERTa, DeBERTa, ELECTRA)
    • classification (sentiment, fraud detection)
    • named entity recognition
    • sentence similarity
    • search / embeddings
    • anomaly or pattern detection
  • Decoder-only Models (GPT, Llama, Mixtral, Gemma, Qwen)
    • Text generation
    • Multi-task language modeling
    • Anything that treats tasks as text → text in one stream
  • Encoder–Decoder (Seq2Seq) Models (T5, FLAN-T5, BART, mT5, early Transformer models)
    • Translation
    • Summarization
    • Text-to-text tasks with clear input → output mapping
What’s the difference between prefix LM and causal LM?
  • Causal LM: every token can only attend to previous tokens.
  • Prefix LM: the prefix can be fully bidirectional, while the rest is generated causally.
| Feature | Causal LM | Prefix LM |
|---|---|---|
| Attention | Strictly left-to-right | Prefix: full; generation: causal |
| Use case | Free-form generation | Conditional generation, prefix tuning |
| Examples | GPT, Llama, Mixtral | T5 (prefix mode), UL2, some prompt-tuning models |
| Future access? | No | Only inside prefix |
| Mask complexity | Simple | Mixed masks |
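A small illustration of the two masks (sequence length and prefix length are arbitrary; `True` means "may attend"):

```python
# Causal mask vs. prefix-LM mask for a length-6 sequence whose first 3 tokens are the prefix.
import torch

T, prefix_len = 6, 3

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))   # token t attends only to tokens <= t

prefix_lm = causal.clone()
prefix_lm[:prefix_len, :prefix_len] = True                 # prefix tokens attend bidirectionally within the prefix

print(causal.int())
print(prefix_lm.int())
```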
Layer Normalization Variants
Comparison of LayerNorm vs BatchNorm vs RMSNorm?
| Norm | Formula | Pros | Cons |
|---|---|---|---|
| BatchNorm | Normalize across the batch | Great for CNNs | Bad for variable batch sizes / autoregressive decoding |
| LayerNorm | Normalize across the hidden dim | Stable for Transformers | Slightly more compute than RMSNorm |
| RMSNorm | Normalize scale only | Faster, more stable in LLMs | No centering → sometimes slightly less expressive |
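A minimal RMSNorm implementation for reference (illustrative, not any particular library's version):

```python
# RMSNorm: rescale by the root-mean-square of the activations; no mean-centering, no bias.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learned gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)                 # scale normalization, unlike LayerNorm's center + scale

print(RMSNorm(16)(torch.randn(2, 5, 16)).shape)
```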
What’s the core idea of DeepNorm?

DeepNorm keeps the Transformer stable at extreme depths by up-scaling the residual connections with a depth-dependent constant (a fractional power of the number of layers) before applying LayerNorm.

What are the advantages of DeepNorm?

DeepNorm = deep models that actually train and perform well, without tricks.

  • Enables Extremely Deep Transformers (1,000+ layers)
  • Superior Training Stability
  • Improved Optimization Landscape
  • Better Performance on Downstream Tasks
  • No Architectural Overhead
  • Robust Across Scales and Tasks
What are the differences when applying LayerNorm at different positions in LLMs?
  • Post-Norm (Original Transformer, 2017): Normalizes after adding the residual.
    • Pros:
      • Fairly stable for shallow models (<12 layers)
      • Works well in classic NMT models
    • Cons:
      • Fails to train deep models (vanishing/exploding gradients)
      • Poor gradient flow
      • Not used in modern LLMs
  • Pre-Norm (Current Standard in GPT/LLaMA): Normalize before attention or feed-forward
    • Pros:
      • Much more stable for deep Transformers
      • Great training stability up to hundreds of layers
      • Works well with small batch sizes
      • Default in GPT-2/3, LLaMA, Mistral, Gemma, Phi-3, Qwen2
    • Cons:
      • Residual stream grows in magnitude unless controlled (→ RMSNorm or DeepNorm often added)
      • Slightly diminished expressive capacity compared to Post-Norm (but negligible in practice)
  • Sandwich-Norm: LayerNorm applied before AND after sublayers.
    • Pros:
      • Extra stability & smoothness
      • Improved optimization in some NMT models
    • Cons:
      • Expensive (two norms per sublayer)
      • Rarely used in large decoder-only LLMs

🧠 Why LayerNorm position matters

1. Training Stability
    • Pre-Norm prevents exploding residuals
    • Post-Norm accumulates errors → unstable for deep models
2. Gradient Flow
    • Residuals in Pre-Norm allow gradients to bypass the sublayers directly.
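The difference in ordering, sketched with a placeholder sublayer (which would be self-attention or the FFN in a real block):

```python
# Post-Norm vs. Pre-Norm residual ordering.
import torch
import torch.nn as nn

d_model = 16
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # placeholder for attention / FFN
x = torch.randn(2, 4, d_model)

post_norm = norm(x + sublayer(x))        # original Transformer (2017): normalize after the residual add
pre_norm = x + sublayer(norm(x))         # GPT/LLaMA style: normalize inside the residual branch
print(post_norm.shape, pre_norm.shape)
```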
Which normalization method is used in different LLM architectures?

Large decoder-only LLMs almost universally use RMSNorm + Pre-Norm.

Activation Functions in LLMs
What’s the formula for the FFN (Feed-Forward Network) block?
  • Standard FFN Formula

    \[\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2\]
    \[W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}, \quad b_1 \in \mathbb{R}^{d_{\text{ff}}}, \quad W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}, \quad b_2 \in \mathbb{R}^{d_{\text{model}}}\]

    where \(\sigma\) is the activation function (ReLU in the original Transformer, GeLU in GPT, SwiGLU/GeLU-variants in modern LLMs).
  • Gated FFN in LLMs

    \[\text{FFN}(x) = W_3 \left( \text{Swish}(W_1x) \odot W_2x \right)\]
    \[\text{Swish}(u) = u \cdot \sigma(u)\]
    \[W_1, W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\]
    \[W_3 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\]
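A sketch of that gated FFN in PyTorch (names and sizes are illustrative; this mirrors the SwiGLU-style block used in Llama-family models):

```python
# Gated FFN: FFN(x) = W3( Swish(W1 x) ⊙ W2 x ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate branch
        self.w2 = nn.Linear(d_model, d_ff, bias=False)   # value branch
        self.w3 = nn.Linear(d_ff, d_model, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))  # silu == Swish

print(SwiGLUFFN(16, 64)(torch.randn(2, 4, 16)).shape)
```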
What’s the GeLU formula?

Gaussian Error Linear Unit (GeLU)

\[\text{GeLU}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\]
\[\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt\]
What’s the Swish formula?

Swish is a smooth, non-monotonic activation.

\[\text{Swish}(x) = \frac{x}{1 + e^{-x}}\]
What’s the formula of an FFN block with GLU (Gated Linear Unit)?
What’s the formula of a GLU block using GeLU?
What’s the formula of a GLU block using Swish?
Which activation functions do popular LLMs use?
What are the differences between Adam and SGD optimizers?
Attention Mechanisms — Advanced Topics
What are the problems with traditional attention?
What are the directions of improvement for attention?
What are the attention variants?
What issues exist in multi-head attention?
Explain Multi-Query Attention (MQA).
Compare Multi-head, Multi-Query, and Grouped-Query Attention.
What are the benefits of MQA?
Which models use MQA or GQA?
Why was FlashAttention introduced? Briefly explain its core idea.
What are FlashAttention advantages?
Which models implement FlashAttention?
What is parallel transformer block?
What’s the computational complexity of attention and how can it be improved?
Compare MHA, GQA, and MQA — what are their key differences?
Cross-Attention
Why do we need Cross-Attention?
Explain Cross-Attention.
Compare Cross-Attention and Self-Attention — similarities and differences.
Provide a code implementation of Cross-Attention.
What are its application scenarios?
What are the advantages and challenges of Cross-Attention?
Transformer Operations
How to load a BERT model using transformers?
How to output a specific hidden_state from BERT using transformers?
How to get the final or intermediate layer vector outputs of BERT?
LLM Loss Functions
What is KL divergence?
Write the cross-entropy loss and explain its meaning.
What’s the difference between KL divergence and cross-entropy?
How to handle large loss differences in multi-task learning?
Why is cross-entropy preferred over MSE for classification tasks?
What is information gain?
How to compute softmax and cross-entropy loss (and binary cross-entropy)?
What if the exponential term in softmax overflows the float limit?
Similarity & Contrastive Learning
Besides cosine similarity, what other similarity metrics exist?
What is contrastive learning?
How important are negative samples in contrastive learning, and how to handle costly negative sampling?
  • Advanced Topics in LLMs
Advanced LLM
What is a generative large model?
How do LLMs make generated text diverse and non-repetitive?
What is the repetition problem (LLM echo problem)? Why does it happen? How can it be mitigated?
Can LLaMA handle infinitely long inputs? Explain why or why not.
When should you use BERT vs. LLaMA / ChatGLM models?
Do different domains require their own domain-specific LLMs? Why?
How to enable an LLM to process longer texts?
  • Fine-Tuning Large Models
General Fine-Tuning
Why does the loss drop suddenly in the second epoch during SFT?
How much VRAM is needed for full fine-tuning?
Why do models seem dumber after SFT?
How to construct instruction fine-tuning datasets?
How to improve prompt representativeness?
How to increase prompt data volume?
How to select domain data for continued pretraining?
How to prevent forgetting general abilities after domain tuning?
How to make the model learn more knowledge during pretraining?
When performing SFT, should the base model be Chat or Base?
What’s the input/output format for domain fine-tuning?
How to build a domain evaluation set?
Is vocabulary expansion necessary? Why?
How to train your own LLM?
What are the benefits of instruction fine-tuning?
During which stage — pretraining or fine-tuning — is knowledge injected?
SFT Tricks
What’s the typical SFT workflow?
What are key aspects of training data?
How to choose between large and small models?
How to ensure multi-task training balance?
Can SFT learn knowledge at all?

How to select datasets effectively?

Training Experience
How to choose a distributed training framework?
What are key LLM training tips?
How to choose model size?
How to select GPU accelerators?
  • LangChain and Agent-Based Systems
LangChain Core
What is LangChain?
What are its core concepts?
Components and Chains
Prompt Templates and Values
Example Selectors
Output Parsers
Indexes and Retrievers
Chat Message History
Agents and Toolkits
Long-Term Memory in Multi-Turn Conversations
How can Agents access conversation context?
Retrieve full history
Use sliding window for recent context

Practical RAG Q&A using LangChain
(Practical implementation questions about RAG apps in LangChain)
  • Retrieval-Augmented Generation (RAG)
RAG Basics
Why do LLMs need an external (vector) knowledge base?
What’s the overall workflow of LLM+VectorDB document chat?
What are the core technologies?
How to build an effective prompt template?
RAG Concepts
What are the limitations of base LLMs that RAG solves?
What is RAG?
How to obtain accurate semantic representations?
How to align query/document semantic spaces?
How to match retrieval model output with LLM preferences?
How to improve results via post-retrieval processing?
How to optimize generator adaptation to inputs?
What are the benefits of using RAG?
RAG Layout Analysis
Why is PDF parsing necessary?
What are common methods and their differences?
What problems exist in PDF parsing?
Why is table recognition important?
What are the main methods?
Traditional methods
pdfplumber extraction techniques
Why do we need text chunking?
What are common chunking strategies (regex, Spacy, LangChain, etc.)?
RAG Retrieval Strategies
Why use LLMs to assist recall?
HYDE approach: idea and issues
FLARE approach: idea and recall strategies
Why construct hard negative samples?
Random sampling vs. Top-K hard negative sampling
RAG Evaluation
Why evaluate RAG?
What are the evaluation methods, metrics, and frameworks?
RAG Optimization
What are the optimization strategies for retrieval and generation modules?
How to enhance context using knowledge graphs (KGs)?
What are the problems with vector-based context augmentation?
How can KG-based methods improve it?
What are the main pain points in RAG and their solutions?
Content missing
Top-ranked docs missed
Context loss
Failure to extract answers
Explain RAG-Fusion: why it's needed, core technologies, workflow, and advantages.
Graph RAG
Why do we need Graph RAG?
What is Graph RAG and how does it work? Show a code example and use case.
How to improve ranking optimization in Graph RAG?
  • Parameter-Efficient Fine-Tuning (PEFT)
PEFT Fundamentals
What is fine-tuning, and how is it performed?
Why do we need PEFT?
What is PEFT and its advantages?
Adapter Tuning
Why use adapter-tuning?
What’s the core idea behind adapter-tuning?
How does it differ from full fine-tuning?