Questions#

Machine Learning#

Machine Learning Concepts
How would you describe the concept of machine learning in your own words?

Machine learning focuses on creating systems that improve their performance on a task by learning patterns from data rather than relying on explicit programming.

Can you give a few examples of real-world areas where machine learning is particularly effective?

Machine learning is especially valuable for solving complex problems without clear rule-based solutions, automating decision-making instead of hand-crafted logic, adapting to changing environments, and extracting insights from large datasets.

What are some typical problems addressed with unsupervised learning methods?

Typical unsupervised learning tasks include clustering, data visualization, dimensionality reduction, and association rule mining.

Would detecting spam emails be treated as a supervised or unsupervised learning problem, and why?

Spam filtering is an example of a supervised learning problem because the model learns from examples of emails labeled as "spam" or "not spam".

What does the term ‘out-of-core learning’ refer to in machine learning?

Out-of-core learning enables training on datasets too large to fit in memory by processing them in smaller chunks (mini-batches) and updating the model incrementally.
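A rough sketch of out-of-core training with scikit-learn's `partial_fit` (the file name `huge_dataset.csv` and label column `y` are assumptions for illustration):

```python
# Minimal out-of-core sketch: stream a too-big-for-RAM CSV in chunks and
# update an incremental learner one mini-batch at a time.
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")   # supports incremental (mini-batch) updates
classes = [0, 1]                         # all labels must be declared for partial_fit

for chunk in pd.read_csv("huge_dataset.csv", chunksize=10_000):
    X, y = chunk.drop(columns=["y"]), chunk["y"]
    model.partial_fit(X, y, classes=classes)   # one incremental update per mini-batch
```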

How can you distinguish between model parameters and hyperparameters?
  • Model parameters define how the model behaves and are learned during training (e.g., weights in linear regression).

  • Hyperparameters are external settings chosen before training, such as the learning rate or regularization strength.
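A small illustration with scikit-learn on synthetic data: `alpha` is a hyperparameter chosen before training, while `coef_` and `intercept_` are the parameters learned during training.

```python
# Hyperparameter vs. learned parameters, illustrated on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

model = Ridge(alpha=1.0)   # alpha (regularization strength): chosen before training
model.fit(X, y)

print(model.coef_, model.intercept_)   # parameters: learned from the data during fit()
```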

What are some major difficulties or limitations commonly faced when building machine learning systems?
Key challenges in machine learning include:

- insufficient or low-quality data
- poor feature selection
- non-representative samples
- models that either underfit (too simple) or overfit (too complex)

If a model performs well on training data but poorly on unseen data, what issue is occurring, and how might you address it?

When a model performs well on training data but poorly on unseen examples, it’s overfitting. This can be mitigated by collecting more diverse data, simplifying the model, applying regularization, or cleaning noisy data.

What is a test dataset used for, and why is it essential in evaluating a model’s performance?

A test set provides an unbiased estimate of how well a model will perform on new, real-world data before deployment.

What role does a validation set play during the model development process?

A validation set helps compare multiple models and tune hyperparameters, ensuring better generalization to unseen data.

What is a train-dev dataset, in what situations would you create one, and how is it applied during model evaluation?

The train-dev set is a small portion of the training data set aside to identify mismatches between the training distribution and the validation/test distributions. You use it when you suspect that your production data may differ from your training data. The model is trained on most of the training data and evaluated on the train-dev set to detect overfitting or data mismatch before comparing results on the validation set.
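A minimal sketch of carving out the different sets (the split ratios here are arbitrary choices, not a recommendation):

```python
# Train / validation / test split, plus a small "train-dev" slice held out
# from the training portion to separate data mismatch from plain overfitting.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
X_train, X_traindev, y_train, y_traindev = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
# test: final unbiased estimate; val: model selection and hyperparameter tuning;
# train-dev: same distribution as training data, used to diagnose data mismatch
```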

Why is it problematic to adjust hyperparameters based on test set performance?

If you tune hyperparameters using the test set, you risk overfitting to that specific test data, making your performance results misleadingly high. As a result, the model might perform worse in real-world scenarios because the test set is no longer an unbiased measure of generalization.

LLM Fundamentals#

Explain bias-variance tradeoff. How does it manifest in LLMs?
  • Bias: Error from incorrect assumptions in the model

    • High bias leads to underfitting, where the model fails to capture patterns in the training data
  • Variance: Error from sensitivity to small fluctuations in the training data

    • High variance leads to overfitting, where the model memorizes noise instead of learning generalizable patterns

The bias-variance tradeoff in ML describes the tension between a model's ability to fit the training data and its ability to generalize to new data.

Bias-variance in LLMs:

  • Model Parameters: Capacity vs. Overfitting

    • Too few parameters: A model with insufficient capacity (e.g., a small transformer) cannot capture complex patterns in the data, leading to high bias.

      A small LLM might fail to understand language or generate coherent long texts.

    • Too many parameters: A model with excessive capacity risks overfitting to training data, memorizing noise and details instead of learning generalizable patterns

      A large LLM fine-tuned on a small dataset may generate text that is statistically similar to the training data but lacks coherence and factual accuracy (e.g., hallucinations).

    • Balancing Act:

      More parameters reduce bias by enabling the model to capture complex patterns, but they increase variance if not regularized. Regularization techniques (e.g., dropout, weight decay) help mitigate overfitting in high-parameter models.

  • Training Epochs: Learning Duration vs. Overfitting

    • Too few epochs: The model hasn't learned enough from the data, leading to high bias.

      A transformer trained for only 1 epoch may fail to capture meaningful relationships in the text.

    • Too many epochs: The model starts memorizing training data, increasing variance. This is common in transformers with high capacity trained on small datasets.

      A transformer fine-tuned on a medical dataset for 100 epochs may overfit to rare cases, leading to poor generalization.

    • Tradeoff in Transformers

      Training loss decreases with epochs (low bias), but validation loss eventually increases (high variance).

      Early stopping is critical for transformers to avoid overfitting, especially when training on small or noisy datasets (see the sketch after this list).

  • Data Quality: Noise vs. Representativeness
    • Low-quality data: Noisy, biased, or incomplete data prevents the model from learning accurate patterns, increasing bias.

      A transformer trained on a dataset with limited examples of rare diseases may fail to diagnose them accurately.

    • Noisy/unrepresentative data: The model learns inconsistent patterns, increasing variance.

      A dataset with duplicate or corrupted text may cause the model to overfit, and a transformer trained on biased political content may generate polarized outputs. Data augmentation (e.g., paraphrasing, back-translation) increases diversity, mitigating overfitting.
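As referenced above, a minimal early-stopping sketch on a tiny synthetic task (PyTorch); the patience and threshold values are arbitrary choices:

```python
# Early stopping: keep training while validation loss improves, stop once it stalls.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(500):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()    # training loss keeps falling (bias goes down)
    opt.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()  # held-out loss rising signals variance
    if val < best_val - 1e-4:
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:
            break                                   # stop before the model starts memorizing
```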

What is the difference between L1 and L2 regularization? When would you use elastic net in an LLM fine-tune?

Regularization adds a penalty term to the loss function so that the optimizer favours simpler or smoother solutions. In practice it is usually added to a model‑level loss (cross‑entropy, MSE, …) as a separate scalar that scales with the weights.

\[ \text{Loss}_{\text{regularized}} = \text{Loss}_{\text{original}} + \lambda \cdot \text{Penalty}(w) \]
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Weight behavior | Many → 0 (sparse) | All → small, non-zero |
| Feature selection | Yes | No |
| Solution | Not always unique | Always unique |
| Robust to outliers | Less | More |

Key Insight:

  • L1 regularization is more robust to outliers in the data (target outliers)
  • L2 regularization is more robust to outliers in the features (collinearity)

L1/L2 in LLMs:

  • Use L2 by default. Use L1 if you want sparse, interpretable updates.
  • L2 keeps updates smooth. L1 keeps updates minimal — and that’s often better for deployment.
  • Use L2 to win benchmarks. Use L1 to ship to users.
    1. Sparse LoRA = Tiny Adapters
    2. Faster Inference (Real Speedup!)
    3. Better Generalization (Less Overfitting)
    4. Interpretable Fine-Tuning
    5. Clean Model Merging
  • Decision guide when fine-tuning an LLM:
    • Large, clean data? → use L2 (weight_decay=0.01)
    • Otherwise, need max accuracy? → use L2
    • Otherwise, want a small/fast model? → use L1 (+ pruning)
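A hedged sketch of how the two penalties typically enter a fine-tuning loop in PyTorch; the module, learning rate, and coefficients below are placeholder choices, not a specific LLM recipe:

```python
# L2 via AdamW's decoupled weight decay; L1 as an explicit penalty added to the loss.
import torch
import torch.nn as nn

adapter = nn.Linear(768, 768)   # stand-in for trainable adapter / LoRA weights

# L2 (Ridge-like): the usual default for LLM fine-tuning
optimizer = torch.optim.AdamW(adapter.parameters(), lr=2e-5, weight_decay=0.01)

def l1_penalty(module: nn.Module, lam: float = 1e-5) -> torch.Tensor:
    # L1 (Lasso-like): pushes many update weights toward exactly zero (sparse adapters)
    return lam * sum(p.abs().sum() for p in module.parameters())

# inside one training step; task_loss would normally be the cross-entropy on a batch
task_loss = adapter(torch.randn(4, 768)).pow(2).mean()   # placeholder objective for illustration
loss = task_loss + l1_penalty(adapter)
loss.backward()
optimizer.step()
```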
Prove that dropout is equivalent to an ensemble during inference (hint: geometric distribution).
  • Where dropout appears in a Transformer
    • Attention dropout
    • Feedforward dropout
    • Residual dropout
  • The ensemble view of dropout in a Transformer
    • Each layer (and even each neuron) may be dropped independently.
    • A particular dropout mask \(m = (m^{(1)}, m^{(2)}, \dots, m^{(L)})\) defines one specific subnetwork (one “member” of the ensemble).

During training: Randomly turn off some neurons (like flipping a coin for each one). This forces the network to learn many different "sub-networks" — each time you train, a different combination of neurons is active.

During testing (inference): Instead of picking one sub-network, we use all neurons, but scale down their strength (usually by half if dropout rate is 50%). This is the "mean network."

Why this is like an ensemble: Imagine you could run the model 1,000 times (or \(2^N\) times for N neurons), each time with a different random set of neurons turned off, and then average all their predictions. That would be a huge ensemble of sub-networks: very accurate, but way too slow. Dropout's trick: using the scaled "mean network" at test time approximates the geometric mean of the predictions of all those possible sub-networks (exactly so for a single softmax layer, approximately for deep networks).

Dropout = training lots of sub-networks, inference = using their collective average — fast and smart.
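A small numerical check of this claim (note that PyTorch implements "inverted" dropout, so the scaling happens at training time and `eval()` already gives the mean network); layer sizes and the sample count are arbitrary:

```python
# Compare the deterministic "mean network" against an explicit average over many
# randomly sampled dropout sub-networks.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 1))
x = torch.randn(8, 16)

net.eval()                                    # all units kept, activations properly scaled
with torch.no_grad():
    mean_network = net(x)

net.train()                                   # dropout stays active: each pass is one sub-network
with torch.no_grad():
    ensemble = torch.stack([net(x) for _ in range(2000)]).mean(dim=0)

print((mean_network - ensemble).abs().max())  # small: the mean network approximates the ensemble
```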

What is the curse of dimensionality? How do positional encodings mitigate it in Transformers?

Higher dimensions → sparser data → harder to learn meaningful relationships.

The curse of dimensionality refers to the set of problems that arise when data or model representations exist in high-dimensional spaces.

  • Data sparsity: Points become exponentially sparse — distances between points tend to concentrate, making similarity less meaningful.
  • Combinatorial explosion: The volume of the space grows exponentially, \(O(k^d)\), so covering it requires exponentially more data.
  • Poor generalization: Models struggle to learn smooth mappings because there’s too little data to constrain the high-dimensional space.

Tokens in Transformers: Transformers process tokens as vectors in a high-dimensional embedding space (e.g., 768 or 4096 dimensions). However, self-attention treats each token as a set element rather than a sequence element. The attention mechanism itself has no built-in sense of order; the model only knows "content similarity," not which token came first or last.

Without order, the model would need to learn positional relationships implicitly across high-dimensional embeddings. That’s hard — and it exacerbates the curse of dimensionality because:

  • There’s no geometric bias for position.
  • Each token embedding can drift freely in a massive space.
  • The model must infer ordering purely from statistical co-occurrence — requiring more data and more parameters.

How Positional Encodings Help

Positional encodings (PEs) inject structured, low-dimensional information about sequence order directly into the embeddings.

  • Adds a geometric bias to embeddings: nearby positions have nearby encodings.
  • Reduces the effective search space: positions are no longer independent random vectors.
  • Enables extrapolation: the sinusoidal pattern generalizes beyond training positions.
  • The model can compute relative positions via linear operations (e.g., dot products of PEs reflect distance).
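A minimal sketch of the sinusoidal positional encodings from the original Transformer, illustrating the "nearby positions have nearby encodings" bias (dimensions chosen arbitrarily):

```python
# Sinusoidal positional encodings: fixed sin/cos patterns over positions and dimensions.
import math
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)           # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
# dot products of encodings reflect relative distance: close positions are more similar
print((pe[10] @ pe[11]).item(), (pe[10] @ pe[100]).item())
```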

Explain maximum likelihood estimation for language modeling.

Training a neural LM (like a Transformer) by minimizing the negative log-likelihood (NLL) is the same as maximizing the likelihood:

\[\boxed{ \text{Maximizing likelihood} \;\; \Leftrightarrow \;\; \text{Maximizing log-likelihood} \;\; \Leftrightarrow \;\; \text{Minimizing negative log-likelihood} }\]

Example

Sentence: "The cat sat on the mat."

The MLE objective trains the model to maximize: \(P(\text{The}) \cdot P(\text{cat}|\text{The}) \cdot P(\text{sat}|\text{The cat}) \cdot P(\text{on}|\text{The cat sat}) \cdot \dots\)

What is negative log-likelihood? Write the per-token loss for GPT.
\[\boxed{ \ell(\theta) = \sum_{t=1}^T \log P(x_t \mid x_{<t}; \theta) } \]
\[\boxed{ \text{NLL}(\theta) = -\ell(\theta) } \]
\[\boxed{ \text{NLL}(\theta) = -\sum_{t=1}^T \log P(x_t \mid x_{<t}; \theta) } \]

where \(x_{<t}\) denotes all tokens before \(t\): \(x_1, \dots, x_{t-1}\).

This is the heart of autoregressive language modeling — like GPT!
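A hedged sketch of how this loss is computed for a GPT-style (decoder-only) model in PyTorch; the logits tensor is random here, standing in for `model(input_ids).logits`, and in practice the sum is averaged over the \(T\) tokens:

```python
# Next-token NLL: logits at position t are scored against the token at position t+1.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
input_ids = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)         # stand-in for model(input_ids).logits

shift_logits = logits[:, :-1, :]                     # predictions for positions 2..T
shift_labels = input_ids[:, 1:]                      # the tokens those positions must predict

nll = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(nll)   # mean negative log-likelihood per token; exp(nll) would be the perplexity
```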

Compare cross-entropy, perplexity, and BLEU. When is perplexity misleading?
  1. Cross-Entropy: Cross-entropy measures how well a probabilistic model predicts a target distribution — in LM, how well the model assigns high probability to the correct next tokens.
  2. Perplexity: Perplexity (PPL) is simply the exponentiation of the cross-entropy.
  3. BLEU (Bilingual Evaluation Understudy): BLEU is an n-gram overlap metric for evaluating machine translation or text generation quality against reference texts.

Perplexity rephrases cross-entropy in a more intuitive, human-readable form: roughly, how many tokens the model is effectively "choosing between" at each step.
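Concretely, for a sequence of \(T\) tokens:

\[\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log P(x_t \mid x_{<t})\right) = \exp(\text{cross-entropy per token})\]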

Perplexity = "How predictable is the language?"

BLEU = "How much does the output match a reference?"

Example:

Reference: "The cat is on the mat."

Model output: "The dog is on the mat."

→ Low perplexity (grammatical, fluent)

→ Low BLEU (wrong content)

BLEU is non-probabilistic and reference-based, unlike cross-entropy and perplexity.

⚠️ When Perplexity Is Misleading

Perplexity only measures how well the model predicts tokens probabilistically — not how meaningful or correct the generated text is.

  • Different tokenizations or vocabularies
    • A model with smaller tokens or subwords might have lower perplexity just because predictions are more granular, not actually better linguistically.
  • Domain mismatch
    • A model trained on Wikipedia might have low perplexity on Wikipedia text but produce incoherent answers to questions — it knows probabilities, not task structure.
  • Human-aligned vs statistical objectives
    • A model can assign high likelihood to grammatical but dull continuations (e.g., “The cat sat on the mat”) while rejecting creative or rare but correct continuations — good perplexity, poor real-world usefulness.
  • Non-autoregressive or non-likelihood models
    • For encoder-decoder or retrieval-augmented systems, perplexity may not correlate with generation quality because these models are not optimized purely for next-token prediction.
  • Overfitting
    • A model with very low perplexity on training data may memorize text but generalize poorly (BLEU or human eval drops).

Why is label smoothing used in LLMs? Derive its modified loss.

Label smoothing is used in LLMs to prevent overconfidence and improve generalization.

Instead of training on a one-hot target (where the correct token has probability 1 and all others 0), a small portion ε of that probability is spread across all other tokens.

So the true token gets (1 − ε) probability, and the rest share ε uniformly.

This changes the loss from the usual “−log p(correct token)” to a mix of:

  • (1 − ε) × loss for the correct token, and
  • ε × average loss over all tokens.
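Written out (using one common convention that smooths uniformly over the whole vocabulary of size \(V\)), the one-hot target is replaced by \(q(k) = (1-\varepsilon)\,\mathbb{1}[k=y] + \varepsilon/V\), and the per-token loss becomes:

\[\mathcal{L}_{\text{LS}} = -\sum_{k=1}^{V} q(k)\log p_\theta(k) = (1-\varepsilon)\bigl(-\log p_\theta(y)\bigr) + \frac{\varepsilon}{V}\sum_{k=1}^{V}\bigl(-\log p_\theta(k)\bigr)\]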
What is the difference between hard and soft attention?
  • Hard attention → discrete, selective, non-differentiable.
  • Soft attention → continuous, weighted, differentiable.

Fundamentals of Large Language Models (LLMs)#

Question Bank

  • Fundamentals of Large Language Models (LLMs)
LLM Basics
What are the main open-source LLM families currently available?
  • Llama: Decoder-Only
  • Mistral: Decoder-Only (MoE in Mixtral)
  • Gemma: Decoder-Only
  • Phi: Decoder-Only
  • Qwen: Decoder-Only (dense + MoE)
  • DeepSeek: Decoder-Only (MoE in V2)
  • Falcon: Decoder-Only
  • OLMo: Decoder-Only
What’s the difference between prefix decoder, causal decoder, and encoder-decoder architectures?
  • Causal Decoder (Decoder-Only): Autoregressive model that generates text left-to-right, attending only to previous tokens.
  • Prefix Decoder (PrefixLM): Causal decoder with a bidirectional prefix (input context) followed by autoregressive generation.
  • Encoder-Decoder (Seq2Seq): Two separate Transformer stacks (encoder and decoder); the decoder cross-attends to the encoder's outputs.
Causal Decoder
  • Prompt

    Translate to French: The cat is on the mat.

  • Generation (autoregressive, causal mask):

    Le [only sees "Le"]

    Le chat [sees "Le chat"]

    Le chat est [sees "Le chat est"]

    Le chat est sur [sees up to "sur"]

    Le chat est sur le [sees up to "le"]

    Le chat est sur le tapis. [final]

  • Summary

    Cannot see future tokens

    Cannot see full input bidirectionally — but works via prompt engineering

Prefix Decoder
  • Input Format

    [Prefix] The cat is on the mat. [SEP] Translate to French: [Generate] Le chat est sur le tapis.

  • Attention

    Prefix (The cat is on the mat. [SEP] Translate to French:) → bidirectional

    Generation (Le chat est sur le tapis.) → causal

Encoder-Decoder
  • The encoder reads the full input bidirectionally; the decoder generates the output autoregressively while cross-attending to the encoder outputs.
What is the training objective of large language models?

LLMs are trained to predict the next token in a sequence.

Why are most modern LLMs decoder-only architectures?

Most modern LLMs are decoder-only because this architecture is the simplest, fastest, and most flexible for large-scale text generation. The main reasons:

  • Decoder-only naturally matches the training objective
  • Simpler architecture → easier scaling
  • Better for long-context generation
  • Fits universal multitask learning with a single text stream
  • Aligns with inference needs
    • streaming output
    • token-by-token generation
    • low latency
    • high throughput
    • continuous prompts
Explain the difference between encoder-only, decoder-only, and encoder-decoder models.
  • Encoder-only Models (BERT, RoBERTa, DeBERTa, ELECTRA)
    • classification (sentiment, fraud detection)
    • named entity recognition
    • sentence similarity
    • search / embeddings
    • anomaly or pattern detection
  • Decoder-only Models (GPT, Llama, Mixtral, Gemma, Qwen)
    • Text generation
    • Multi-task language modeling
    • Anything that treats tasks as text → text in one stream
  • Encoder–Decoder (Seq2Seq) Models (T5, FLAN-T5, BART, mT5, early Transformer models)
    • Translation
    • Summarization
    • Text-to-text tasks with clear input → output mapping
What’s the difference between prefix LM and causal LM?
  • Causal LM: every token can only attend to previous tokens.
  • Prefix LM: the prefix can be fully bidirectional, while the rest is generated causally.
| Feature | Causal LM | Prefix LM |
|---|---|---|
| Attention | Strictly left-to-right | Prefix: full; generation: causal |
| Use case | Free-form generation | Conditional generation, prefix tuning |
| Examples | GPT, Llama, Mixtral | T5 (prefix mode), UL2, some prompt-tuning models |
| Future access? | No | Only inside prefix |
| Mask complexity | Simple | Mixed masks |
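A small illustration of the two masks (sequence length and prefix length are arbitrary; `True` means "may attend"):

```python
# Causal mask vs. prefix-LM mask for a length-6 sequence whose first 3 tokens are the prefix.
import torch

T, prefix_len = 6, 3

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))   # token t attends only to tokens <= t

prefix_lm = causal.clone()
prefix_lm[:prefix_len, :prefix_len] = True                 # prefix tokens attend bidirectionally within the prefix

print(causal.int())
print(prefix_lm.int())
```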
Layer Normalization Variants
Comparison of LayerNorm vs BatchNorm vs RMSNorm?
| Norm | Formula | Pros | Cons |
|---|---|---|---|
| BatchNorm | Normalize across the batch | Great for CNNs | Bad for variable batch sizes / autoregressive decoding |
| LayerNorm | Normalize across the hidden dim | Stable for Transformers | Slightly more compute than RMSNorm |
| RMSNorm | Normalize scale only | Faster, more stable in LLMs | No centering → sometimes slightly less expressive |
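A minimal RMSNorm implementation for reference (illustrative, not any particular library's version):

```python
# RMSNorm: rescale by the root-mean-square of the activations; no mean-centering, no bias.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learned gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)                 # scale normalization, unlike LayerNorm's center + scale

print(RMSNorm(16)(torch.randn(2, 5, 16)).shape)
```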
What’s the core idea of DeepNorm?

DeepNorm keeps the Transformer stable at extreme depths by up-scaling the residual connections with a depth-dependent constant (a fractional power of the number of layers) before applying LayerNorm.

What are the advantages of DeepNorm?

DeepNorm = deep models that actually train and perform well, without tricks.

  • Enables Extremely Deep Transformers (1,000+ layers)
  • Superior Training Stability
  • Improved Optimization Landscape
  • Better Performance on Downstream Tasks
  • No Architectural Overhead
  • Robust Across Scales and Tasks
What are the differences when applying LayerNorm at different positions in LLMs?
  • Post-Norm (Original Transformer, 2017): Normalizes after adding the residual.
    • Pros:
      • Fairly stable for shallow models (<12 layers)
      • Works well in classic NMT models
    • Cons:
      • Fails to train deep models (vanishing/exploding gradients)
      • Poor gradient flow
      • Not used in modern LLMs
  • Pre-Norm (Current Standard in GPT/LLaMA): Normalize before attention or feed-forward
    • Pros:
      • Much more stable for deep Transformers
      • Great training stability up to hundreds of layers
      • Works well with small batch sizes
      • Default in GPT-2/3, LLaMA, Mistral, Gemma, Phi-3, Qwen2
    • Cons:
      • Residual stream grows in magnitude unless controlled (→ RMSNorm or DeepNorm often added)
      • Slightly diminished expressive capacity compared to Post-Norm (but negligible in practice)
  • Sandwich-Norm: LayerNorm applied before AND after sublayers.
    • Pros:
      • Extra stability & smoothness
      • Improved optimization in some NMT models
    • Cons:
      • Expensive (two norms per sublayer)
      • Rarely used in large decoder-only LLMs

🧠 Why LayerNorm position matters

1. Training Stability
    • Pre-Norm prevents exploding residuals
    • Post-Norm accumulates errors → unstable for deep models
2. Gradient Flow
    • Residuals in Pre-Norm allow gradients to bypass the sublayers directly.
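The difference in ordering, sketched with a placeholder sublayer (which would be self-attention or the FFN in a real block):

```python
# Post-Norm vs. Pre-Norm residual ordering.
import torch
import torch.nn as nn

d_model = 16
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # placeholder for attention / FFN
x = torch.randn(2, 4, d_model)

post_norm = norm(x + sublayer(x))        # original Transformer (2017): normalize after the residual add
pre_norm = x + sublayer(norm(x))         # GPT/LLaMA style: normalize inside the residual branch
print(post_norm.shape, pre_norm.shape)
```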
Which normalization method is used in different LLM architectures?

Large decoder-only LLMs almost universally use RMSNorm + Pre-Norm.

Activation Functions in LLMs
What’s the formula for the FFN (Feed-Forward Network) block?
  • Standard FFN Formula

    \[\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2\]
    \[W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}, \quad b_1 \in \mathbb{R}^{d_{\text{ff}}}, \quad W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}, \quad b_2 \in \mathbb{R}^{d_{\text{model}}}\]

    where \(\sigma\) is the activation function (ReLU in the original Transformer, GeLU in GPT, SwiGLU/GeLU-variants in modern LLMs).
  • Gated FFN in LLMs

    \[\text{FFN}(x) = W_3 \left( \text{Swish}(W_1x) \odot W_2x \right)\]
    \[\text{Swish}(u) = u \cdot \sigma(u)\]
    \[W_1, W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\]
    \[W_3 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\]
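A sketch of that gated FFN in PyTorch (names and sizes are illustrative; this mirrors the SwiGLU-style block used in Llama-family models):

```python
# Gated FFN: FFN(x) = W3( Swish(W1 x) ⊙ W2 x ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate branch
        self.w2 = nn.Linear(d_model, d_ff, bias=False)   # value branch
        self.w3 = nn.Linear(d_ff, d_model, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))  # silu == Swish

print(SwiGLUFFN(16, 64)(torch.randn(2, 4, 16)).shape)
```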
What’s the GeLU formula?

Gaussian Error Linear Unit (GeLU)

\[\text{GeLU}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\]
\[\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt\]
What’s the Swish formula?

Swish is a smooth, non-monotonic activation.

\[\text{Swish}(x) = \frac{x}{1 + e^{-x}}\]
What’s the formula of an FFN block with GLU (Gated Linear Unit)?
What’s the formula of a GLU block using GeLU?
What’s the formula of a GLU block using Swish?
Which activation functions do popular LLMs use?
What are the differences between Adam and SGD optimizers?
Attention Mechanisms — Advanced Topics
What are the problems with traditional attention?
What are the directions of improvement for attention?
What are the attention variants?
What issues exist in multi-head attention?
Explain Multi-Query Attention (MQA).
Compare Multi-head, Multi-Query, and Grouped-Query Attention.
What are the benefits of MQA?
Which models use MQA or GQA?
Why was FlashAttention introduced? Briefly explain its core idea.
What are FlashAttention advantages?
Which models implement FlashAttention?
What is parallel transformer block?
What’s the computational complexity of attention and how can it be improved?
Compare MHA, GQA, and MQA — what are their key differences?
Cross-Attention
Why do we need Cross-Attention?
Explain Cross-Attention.
Compare Cross-Attention and Self-Attention — similarities and differences.
Provide a code implementation of Cross-Attention.
What are its application scenarios?
What are the advantages and challenges of Cross-Attention?
Transformer Operations
How to load a BERT model using transformers?
How to output a specific hidden_state from BERT using transformers?
How to get the final or intermediate layer vector outputs of BERT?
LLM Loss Functions
What is KL divergence?
Write the cross-entropy loss and explain its meaning.
What’s the difference between KL divergence and cross-entropy?
How to handle large loss differences in multi-task learning?
Why is cross-entropy preferred over MSE for classification tasks?
What is information gain?
How to compute softmax and cross-entropy loss (and binary cross-entropy)?
What if the exponential term in softmax overflows the float limit?
Similarity & Contrastive Learning
Besides cosine similarity, what other similarity metrics exist?
What is contrastive learning?
How important are negative samples in contrastive learning, and how to handle costly negative sampling?
  • Advanced Topics in LLMs
Advanced LLM
What is a generative large model?
How do LLMs make generated text diverse and non-repetitive?
What is the repetition problem (LLM echo problem)? Why does it happen? How can it be mitigated?
Can LLaMA handle infinitely long inputs? Explain why or why not.
When should you use BERT vs. LLaMA / ChatGLM models?
Do different domains require their own domain-specific LLMs? Why?
How to enable an LLM to process longer texts?
  • Fine-Tuning Large Models
General Fine-Tuning
Why does the loss drop suddenly in the second epoch during SFT?
How much VRAM is needed for full fine-tuning?
Why do models seem dumber after SFT?
How to construct instruction fine-tuning datasets?
How to improve prompt representativeness?
How to increase prompt data volume?
How to select domain data for continued pretraining?
How to prevent forgetting general abilities after domain tuning?
How to make the model learn more knowledge during pretraining?
When performing SFT, should the base model be Chat or Base?
What’s the input/output format for domain fine-tuning?
How to build a domain evaluation set?
Is vocabulary expansion necessary? Why?
How to train your own LLM?
What are the benefits of instruction fine-tuning?
During which stage — pretraining or fine-tuning — is knowledge injected?
SFT Tricks
What’s the typical SFT workflow?
What are key aspects of training data?
How to choose between large and small models?
How to ensure multi-task training balance?
Can SFT learn knowledge at all?

How to select datasets effectively?

Training Experience
How to choose a distributed training framework?
What are key LLM training tips?
How to choose model size?
How to select GPU accelerators?
  • LangChain and Agent-Based Systems
LangChain Core
What is LangChain?
What are its core concepts?
Components and Chains
Prompt Templates and Values
Example Selectors
Output Parsers
Indexes and Retrievers
Chat Message History
Agents and Toolkits
Long-Term Memory in Multi-Turn Conversations
How can Agents access conversation context?
Retrieve full history
Use sliding window for recent context

Practical RAG Q&A using LangChain
(Practical implementation questions about RAG apps in LangChain)
  • Retrieval-Augmented Generation (RAG)
RAG Basics
Why do LLMs need an external (vector) knowledge base?
What’s the overall workflow of LLM+VectorDB document chat?
What are the core technologies?
How to build an effective prompt template?
RAG Concepts
What are the limitations of base LLMs that RAG solves?
What is RAG?
How to obtain accurate semantic representations?
How to align query/document semantic spaces?
How to match retrieval model output with LLM preferences?
How to improve results via post-retrieval processing?
How to optimize generator adaptation to inputs?
What are the benefits of using RAG?
RAG Layout Analysis
Why is PDF parsing necessary?
What are common methods and their differences?
What problems exist in PDF parsing?
Why is table recognition important?
What are the main methods?
Traditional methods
pdfplumber extraction techniques
Why do we need text chunking?
What are common chunking strategies (regex, Spacy, LangChain, etc.)?
RAG Retrieval Strategies
Why use LLMs to assist recall?
HYDE approach: idea and issues
FLARE approach: idea and recall strategies
Why construct hard negative samples?
Random sampling vs. Top-K hard negative sampling
RAG Evaluation
Why evaluate RAG?
What are the evaluation methods, metrics, and frameworks?
RAG Optimization
What are the optimization strategies for retrieval and generation modules?
How to enhance context using knowledge graphs (KGs)?
What are the problems with vector-based context augmentation?
How can KG-based methods improve it?
What are the main pain points in RAG and their solutions?
Content missing
Top-ranked docs missed
Context loss
Failure to extract answers
Explain RAG-Fusion: why it's needed, core technologies, workflow, and advantages.
Graph RAG
Why do we need Graph RAG?
What is Graph RAG and how does it work? Show a code example and use case.
How to improve ranking optimization in Graph RAG?
  • Parameter-Efficient Fine-Tuning (PEFT)
PEFT Fundamentals
What is fine-tuning, and how is it performed?
Why do we need PEFT?
What is PEFT and its advantages?
Adapter Tuning
Why use adapter-tuning?
What’s the core idea behind adapter-tuning?
How does it differ from full fine-tuning?