GLM-5.2 Demonstrated via OpenAI-Compatible API: A Practical Guide to Long-Context Reasoning and Tool Integration
The tutorial opens by defining a dictionary of provider options. Each entry lists a base URL, the model name, and the environment variable that holds the key. The options include Z.ai’s own endpoint, OpenRouter, Together, Requesty, and Hugging Face. The code then pulls the key from the chosen variable or prompts the user, creates an OpenAI client with the selected base URL, and sets up token‑cost tracking variables. A chat function wraps the OpenAI chat.completions.create call, accepting parameters that control GLM‑5.2’s unique features: a thinking flag to enable or disable internal reasoning, a reasoning_effort setting that can be “high” or “max”, and a tool_stream flag that lets the model stream the results of function calls.
The first demo, demo_basic, performs a sanity check by asking the model to describe what it does best. demo_effort then compares the same arithmetic problem under three conditions—thinking off, high effort, and max effort—showing how latency, token usage, and a hidden reasoning trace (extracted with get_reasoning) differ. demo_streaming streams a response that contains both a reasoning channel and an answer channel, letting users watch the model’s internal logic unfold.
Tool calling is highlighted in demo_tools and demo_agent. Two simple tools—a calculator that evaluates basic arithmetic expressions and a city‑population lookup that returns a metro population—are defined and registered in an OpenAI‑spec function‑calling schema. The run_tool_loop function sends a prompt to the model, captures any tool calls, executes the corresponding Python function, and feeds the result back to the model. demo_tools asks the model to compute the ratio of Tokyo’s population to Mexico City’s, while demo_agent performs a multi‑step task that ranks three cities by population and then sums the top two.
The notebook also covers structured JSON output. A helper function attempts to parse the model’s response as JSON and retries once if the first attempt fails. To test long‑context retrieval, a synthetic document containing a hidden “needle” is fed to the model, demonstrating its ability to pull exact text from a 1‑million‑token context window.
Cost accounting is handled by the cost_summary function. As the demos run, the notebook tracks input and output token counts and calculates an estimated spend using the pricing of the chosen provider. For example, on OpenRouter the cost is $1.40 per million input tokens and $4.40 per million output tokens. The summary prints the total number of calls, token counts, and the estimated dollar amount.
In closing, the tutorial recaps the workflow and suggests that the same pattern can be extended to build research assistants, document‑analysis tools, coding agents, or any application that demands long‑horizon reasoning. The code lives on GitHub, and readers are encouraged to experiment with different providers, effort settings, and maximum token limits.
GLM‑5.2 is an open‑weight model that offers a 1‑million‑token context window and a maximum output of 32,768 tokens. Marketed as a reasoning model capable of tackling long‑horizon tasks—such as building compilers or optimizing kernels—it is available through multiple third‑party providers, each with slightly different pricing. The tutorial demonstrates that, with only a few lines of Python, developers can integrate GLM‑5.2 into existing workflows, leverage its reasoning and tool‑calling features, and monitor usage and cost in real time.