Archive year

2025

vllm throughput

In large-language-model (LLM) inference serving contexts, once the model compute becomes sufficiently fast, the performance bottleneck often shifts to...

FastMCP MCP Server Hub

MCP Server Hub Currently, our different projects are using various MCP servers. To streamline and unify the process, we plan to implement a HUB MCP...

How LLM Tools work

Tools in Large Language Models (LLMs) Tools enable large language models (LLMs) to interact with external systems, APIs, or data sources, extending...

MCP Transports

| Feature | stdio | sse (Server-Sent Events) | streamable-http | |--------------------------|------------------------------------------|--------------...

Text to SQL (Smolagents)

Out: None [Step 1: Duration 146.87 seconds| Input tokens: 2,113 | Output tokens: 923] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2...

RAG-Reranking

Retrieval-Augmented Generation (RAG) is a powerful approach that combines retrieval and generation to produce high-quality responses. However, the...

Local LLM Setup

If you find this in your VSCode, congratulations! You have successfully set up Ollama for code generation and assistance in Visual Studio Code. alt...