ThoughtWeaver: Structured Chain-of-Thought in Markdown for Enhanced Reasoning

Vansh Vazirani·Jan 02, 2026·Reasoning models

Abstract

I present ThoughtWeaver, a fine-tuned language model that produces a structured chain‑of‑thought (CoT) in Markdown. By enforcing a fixed sequence of reasoning steps and deliberately omitting iterative self‑revision loops, ThoughtWeaver can explore many candidate solutions while consuming substantially fewer tokens than conventional reasoning models. The approach preserves the ability to perform deeper planning when required, but it emphasises breadth over depth, making it well‑suited for tasks where diverse ideas are more valuable than exhaustive verification. Experiments on arithmetic, logic puzzles, and open‑ended brainstorming prompts show that ThoughtWeaver exhibits decent accuracy across various tasks while reducing token usage and inference time.

In a preliminary proof‑of‑concept evaluation on a small set of linguistic questions, mathematical and logic puzzles, and open‑ended brainstorming prompts our experiments show that ThoughtWeaver dramatically reduces token usage (and thus inference time) relative to an unstructured CoT baseline for a large portion of tasks, while still exhibiting reasonably high response quality. A full accuracy and diversity assessment on standard arithmetic, logical‑reasoning, and open‑ended brainstorming benchmarks—once the model has undergone reinforcement‑learning fine‑tuning—is left to future work.

Introduction

Training models to generate Chain-of-Thought (CoT) reasoning has become a widely embraced technique for improving the reasoning ability of large language models. By designing models to synthesise intermediate reasoning, CoT typically raises accuracy on mathematical and scientific questions, code generation, and other deterministic tasks. The downside is that most CoT pipelines include unrestricted iterative self-revision: models often repeatedly re-examine their own reasoning multiple times, which inflates token consumption and delay.

In many practical settings, including evaluative brainstorming, design thinking and creative pursuits, the importance of breadth and volume of ideas often outweighs depth and exhaustive verification. When a large fraction of the token budget is spent on self-reflection, the model can only produce a handful of candidate solutions, limiting creativity.

To address this tradeoff I introduce the first version of ThoughtWeaver: ThoughtWeaver-8B-Reasoning-Exp, a fine-tuned LLM that produces reasoning in a predefined, Markdown-based CoT structure. ThoughtWeaver dynamically selects the magnitude of reasoning required for each input and classifies as either low, medium, or high. For low-complexity queries the model produces a short analysis and a handful of ideas; for medium-complexity prompts it reasons in separate stages; and for high-complexity prompts it activates the optional validation stage. By skipping unnecessary self-revision loops, ThoughtWeaver can allocate a larger share of its token budget to generating diverse candidates while still offering a verification step when the task demands higher confidence.

I publish the model weights for ThoughtWeaver-8B-Reasoning-Exp under the Apache 2.0 license .

The development of ThoughtWeaver-8B-Reasoning-Exp builds on extensive work studying language models. Here, I briefly outline certain works that directly contributed to or inspired the creation of this model.

Reasoning with RL

The introduction of Reasoning LLMs, such as OpenAI O1 and DeepSeek R1 has deeply contributed to improvements in reasoning capabilities by scaling test-time compute, allowing models to brainstorm and reflect before generating the final answer. This has led to exceptional increases in model performance across diverse tasks and intelligence benchmarks. The DeepSeek R1 paper achieved tremendous gains in reasoning performance by scaling reinforcement learning without preliminary SFT. Subsequent works often employ SFT to bootstrap the model with a scaffolding upon which RL improves.

Structured Reasoning Paradigms

In addition, certain works explore using an enforced structure for Chain-of-Thought, in the form of graphs, trees, and tables. Employing each structure offers its own advantages and limitations.

Motivating Problem

A large majority of reasoning models which natively produce unstructed Chain-of-Thought reflect on their own reasoning iteratively. Often, in the context of simpler problems, or tasks which do not require deep reasoning into singular aspects, this reasoning is limited and redundant. Such models are often too focused on needlessly verifying obvious parts of their reasoning chain, leading to wasted computational resources and a limited number of thoughts the model can explore.

With ThoughtWeaver, I seek to build a system that is capable of generating a wide range of candidate solutions and engaging in swift computation, without getting bogged down in endless self-critique.

Apart from reasoning in markdown (which encourages clear and interpretable thoughts), ThoughtWeaver's second key innovation lies in its dynamic reasoning depth control, similar to what is seen in models like OpenAI's GPT-OSS series.

Methodology

This section describes the methodology used to develop ThoughtWeaver, a fine-tuned language model that produces structured chain-of-thought (CoT) reasoning in Markdown. The approach consists of two main phases: data collection and model fine-tuning.

Data Collection

To create a comprehensive dataset for fine-tuning, I generated and collected a large corpus of structured reasoning traces using a diverse set of prompts and responses from existing datasets. The conversation records were classified into three categories of reasoning effort: low, medium, or high. The base datasets used include:

microsoft/NextCoderDataset
open-thoughts/OpenThoughts-114k
open-r1/OpenR1-Math-220k
alibaba-pai/OmniThought-0528
SoftAge-AI/multi-turn_dataset
Josephgflowers/Finance-Instruct-500k
NousResearch/Hermes-3-Dataset
QuixiAI/dolphin-r1
glaiveai/reasoning-v1-20m
HuggingFaceTB/smoltalk2 - Split multi_turn_reasoning_if_think
PrimeIntellect/SYNTHETIC-2-SFT-verified
inclusionAI/Ring-lite-sft-data

I utilised several language models to generate the reasoning traces, including:

moonshotai/kimi-k2-instruct-0905
moonshotai/kimi-k2-instruct
openai/gpt-oss-120b
openai/gpt-oss-20b
meta-llama/llama-4-maverick-17b-128e-instruct
meta-llama/llama-4-scout-17b-16e-instruct
qwen/qwen3-235b-a22b-instruct-2507
qwen/qwen3-32b

The resulting dataset consists of approximately 11,000 records, which were used for fine-tuning the model.

Model Fine-tuning

I fine-tuned the Ministral 3 8B Instruct 2512 model using Supervised Fine-Tuning (SFT) with Unsloth on the collected dataset. The following hyperparameters were used:

Number of epochs: 1
Learning rate: $3e-6$
Optimizer: AdamW, 8-bit precision
Learning rate scheduler: Cosine annealing
Packing enabled
Batch size: 8
Maximum sequence length: 32,768

The fine-tuned model is capable of producing structured CoT reasoning in Markdown, with the ability to dynamically select the magnitude of reasoning required for each input.

Results

To sample responses from the model, it was converted to GGUF using llama.cpp and hosted using llama-server. During testing, the model was prompted with several diverse queries and the responses were manually evaluated. Then, it was benchmarked against a corpus of mathematical questions.

Samples

I present three illustrative examples that exemplify ThoughtWeaver's reasoning style in distinct scenarios. These were sampled from the GGUF model.

Query: How many of the first one hundred positive integers are divisible by $3, 4,$ and $5?$

<think>

START OF THOUGHT PROCESS

This query requires low reasoning

Goal: Count numbers ≤ 100 divisible by 3, 4, and 5.
Key insight: A number divisible by all three must be divisible by their LCM.
Compute LCM(3,4,5):
- Prime factors: 3 = 3, 4 = 2², 5 = 5.
- LCM = 2² × 3 × 5 = 4 × 3 × 5 = 60.
Find multiples of 60 ≤ 100:
- 60 × 1 = 60, 60 × 2 = 120 (exceeds 100).
- Only 60 qualifies.
Conclusion: Exactly one number meets the criteria.

END OF THOUGHT PROCESS

</think>

To determine how many of the first one hundred positive integers are divisible by 3, 4, and 5, we need to find the numbers that are divisible by all three of these numbers simultaneously.

Step 1: Understanding Divisibility by Multiple Numbers

A number that is divisible by 3, 4, and 5 must be divisible by the least common multiple (LCM) of these three numbers. This is because the LCM is the smallest number that is a multiple of each of the given numbers.

Step 2: Calculating the LCM of 3, 4, and 5

Let's find the LCM of 3, 4, and 5.

Prime factorization:
- 3 = 3
- 4 = 2²
- 5 = 5

The LCM is the product of the highest powers of all primes present in the factorizations:

LCM = 2² × 3 × 5 = 4 × 3 × 5 = 60

So, the LCM of 3, 4, and 5 is 60.

Step 3: Finding Multiples of 60 Within the First 100 Positive Integers

Now, we need to find how many multiples of 60 are there in the range from 1 to 100.

60 × 1 = 60
60 × 2 = 120 (which is greater than 100)

So, the only multiple of 60 within the first 100 positive integers is 60.

Step 4: Conclusion

There is exactly one number (60) in the first 100 positive integers that is divisible by 3, 4, and 5.

Final Answer

\boxed{1}

In this example, ThoughtWeaver immediately identifies the LCM technique and provides a direct answer with minimal reasoning steps, while refraining from verifying obvious calculations.

Query: If humans could perceive additional dimensions, how would art and architecture evolve?

<think>

START OF THOUGHT PROCESS

This query requires high reasoning

Deconstructing the Prompt

The prompt asks how art and architecture would evolve if humans could perceive additional dimensions (beyond the 3D space we currently experience).
Key constraints to consider:
- Must focus on art and architecture specifically, not other creative fields.
- "Perceive additional dimensions" implies we'd sense dimensions beyond length, width, height (e.g., time, energy, spatial curvature).
- Evolution implies gradual adaptation rather than sudden revolution.
Core objectives:
- Explore conceptual shifts in artistic expression.
- Analyze structural and experiential changes in architecture.
- Consider societal/cultural implications of dimensional perception.
Critical nuance: The prompt doesn't specify which dimensions (e.g., 4D time, 5D energy, 6D spatial curvature). I'll assume a general "higher-dimensional" framework since specifics aren't provided.

Fleshing Out Brainstorm

Art Evolution Concepts

Initial idea: If we perceive time as a dimension, art might become "temporal sculptures" where viewers experience evolution over time.
- Why this works: Time is already a dimension in physics; making it tangible in art would feel natural.
- Potential flaw: Might be too literal. Need to explore metaphorical uses too.
Alternative angle: Energy dimensions could enable "vibrational art" where colors/paintings emit subtle energy fields.
- Why this works: Energy is abstract but perceptible in some forms (e.g., heat, light).
- Potential flaw: Requires physical technology to make energy visible—might not be purely perceptual.
Third option: Spatial curvature (e.g., 4D hyperspace) could allow "non-Euclidean canvases" where shapes bend in ways we can't currently visualize.
- Why this works: Challenges traditional 2D/3D constraints.
- Potential flaw: Hard to imagine without concrete examples.
Refining ideas:
- Combine temporal and spatial concepts: "Living murals" that change based on viewer movement and time.
- Consider sensory integration: Art that triggers multiple senses (e.g., sound + visual + energy) in higher dimensions.
- Decision: Prioritize temporal and spatial curvature ideas—they feel most transformative for art.

Architecture Evolution Concepts

Initial idea: Buildings could have "time layers" where different eras coexist in the same space.
- Why this works: Solves the "time travel" paradox by making it architectural.
- Potential flaw: Might feel gimmicky if not grounded in real physics.
Alternative angle: Energy dimensions could enable "self-regulating structures" that adjust to environmental energy flows.
- Why this works: Aligns with sustainable design trends.
- Potential flaw: Requires advanced materials—might not be purely perceptual.
Third option: Spatial curvature could allow "infinite corridors" where paths loop in higher dimensions.
- Why this works: Creates psychological depth beyond physical space.
- Potential flaw: Hard to visualize without math.
Refining ideas:
- Focus on experiential architecture: Spaces that make viewers feel higher dimensions (e.g., through sound, light, or motion).
- Consider "dimensional thresholds": Entrances that trigger perception of extra dimensions (e.g., a doorway that warps space).
- Decision: Prioritize experiential and threshold concepts—they feel most innovative for architecture.

Societal/Cultural Implications

Initial idea: Art and architecture would become more collaborative, requiring teams with different dimensional perceptions.
- Why this works: Mirrors how modern design teams work.
- Potential flaw: Might overcomplicate the prompt.
Alternative angle: New art forms would emerge from cultural exchanges between "dimensional communities."
- Why this works: Explores diversity in perception.
- Potential flaw: Speculative—hard to ground in reality.
Third option: Architecture would prioritize "dimensional accessibility" (e.g., buildings that adapt to different perception styles).
- Why this works: Aligns with inclusivity trends.
- Potential flaw: Requires societal consensus on "standard" dimensions.
Refining ideas:
- Focus on how perception changes individual creativity first, then societal norms.
- Consider "dimensional art movements" as cultural phenomena (e.g., "Temporalism" vs. "Energeticism").
- Decision: Prioritize individual creativity and cultural movements—they feel most relevant to the prompt.

Pruning ideas:
- Discard overly speculative concepts (e.g., "dimensional communities") since the prompt focuses on art and architecture, not society.
- Merge "temporal sculptures" and "living murals" into a single "dynamic art" category—both share the idea of time as a dimension.
Merging concepts:
- Combine "spatial curvature" and "infinite corridors" into "non-Euclidean architecture" to avoid redundancy.
- Link "energy dimensions" to "sustainable architecture" since energy perception could enable eco-friendly designs.
Prioritizing core themes:
- Art: Focus on time and spatial curvature as the most transformative dimensions for expression.
- Architecture: Prioritize experiential design and dimensional thresholds as the most innovative structural concepts.
- Societal impact: Emphasize how individual perception shapes cultural movements rather than broad societal changes.
Why this works:
- Keeps the response focused on the prompt's core (art/architecture evolution).
- Avoids overcomplicating with physics details (e.g., 4D math) that aren't necessary for creative exploration.
- Balances concrete examples (e.g., "living murals") with abstract ideas (e.g., "non-Euclidean architecture").

Final Review and Validation

Check for consistency:
- All ideas align with the prompt's focus on perception of additional dimensions, not just use of dimensions.
- Art and architecture concepts are distinct but interconnected (e.g., dynamic art could inspire experiential architecture).
Validate against constraints:
- Addresses evolution (gradual adaptation) rather than sudden revolution.
- Avoids over-speculation by grounding ideas in real-world trends (e.g., sustainability, inclusivity).
Correct any errors:
- Initially considered "energy dimensions" as purely physical, but refined to include perceptual energy (e.g., colors emitting energy fields).
- Avoided conflating "time" and "energy" dimensions—kept them separate in the final structure.
Ensure depth:
- Explored how perception changes art/architecture (e.g., "living murals" vs. static paintings).
- Considered why these changes would occur (e.g., societal shifts in creativity).
Final validation:
- The refined ideas are cohesive, avoid redundancy, and directly answer the prompt.
- No speculative concepts remain—all are grounded in plausible extensions of current trends.

END OF THOUGHT PROCESS

</think>

If humans could perceive additional dimensions, art and architecture would undergo profound transformations, reshaping how we experience and create these fields. Here’s how they might evolve:

Art Evolution

Dynamic and Interactive Artworks
- Art would become more interactive, responding to the viewer’s movement and time. For example, a painting might change colors or patterns as the viewer walks around it, creating a living, evolving piece of art.
Temporal Art
- Artists could create works that unfold over time, allowing viewers to experience the evolution of a piece. This could include time-based installations, where the art changes subtly or dramatically as time progresses.
Non-Euclidean Art
- With the ability to perceive higher dimensions, artists might explore non-Euclidean geometries, creating art that bends and curves in ways that challenge traditional 2D and 3D representations. This could lead to new forms of abstract art that incorporate spatial curvature.
Energy-Based Art
- Artworks could incorporate energy dimensions, allowing artists to create pieces that emit or absorb energy. For example, a painting might emit a subtle energy field that interacts with the viewer’s perception, creating a unique sensory experience.
Sensory Integration
- Art would likely integrate multiple senses, such as sound, touch, and even energy, to create immersive experiences. This could lead to new forms of multimedia art that transcend traditional visual boundaries.

Architecture Evolution

Experiential Architecture
- Buildings would be designed to create immersive experiences that make viewers feel as if they are moving through higher dimensions. This could include spaces that manipulate light, sound, and even energy to evoke a sense of depth and dimension.
Non-Euclidean Structures
- Architects might design buildings that incorporate non-Euclidean geometries, such as curved or warped structures that challenge traditional spatial constraints. This could lead to new forms of architecture that are both visually striking and functionally innovative.
Temporal Architecture
- Buildings could be designed to exist in multiple time layers, allowing different eras or functions to coexist within the same space. For example, a museum might have sections that transport visitors to different historical periods.
Energy-Efficient and Sustainable Designs
- With the ability to perceive energy dimensions, architects could design buildings that are highly energy-efficient and sustainable. This might include structures that absorb and redistribute energy, reducing the need for traditional power sources.
Dimensional Thresholds
- Buildings might feature entrances or thresholds that trigger the perception of higher dimensions, creating a sense of transition or transformation as visitors enter. This could lead to new forms of architectural storytelling and experience.

Societal and Cultural Impact

New Art Movements
- The ability to perceive additional dimensions would likely give rise to new art movements, such as "Temporalism" or "Energeticism," which would explore the creative possibilities of higher dimensions.
Collaborative Creativity
- Art and architecture would become more collaborative, with teams of artists and architects working together to create pieces that integrate multiple dimensions. This could lead to new forms of creative expression that are both individual and collective.
Cultural Exchange and Diversity
- Different cultures might develop unique ways of perceiving and interpreting higher dimensions, leading to a rich tapestry of artistic and architectural styles. This could foster greater cultural exchange and diversity in creative expression.
New Forms of Storytelling
- Art and architecture could become powerful tools for storytelling, allowing creators to convey complex narratives that span multiple dimensions. This could lead to new forms of visual and spatial storytelling that are both engaging and thought-provoking.

In summary, the evolution of art and architecture in a world where humans can perceive additional dimensions would likely involve a shift toward dynamic, interactive, and experiential forms of expression. This would not only change how we create and experience art and architecture but also how we perceive and interact with the world around us.

Here, ThoughtWeaver swiftly generates several diverse ideas, presenting them as separate sections within its response. It efficiently brainstorms numerous candidate ideas, then proceeds to prune less relevant points and weigh the rest.

These responses reveal that the model:

Demonstrates a clear prioritisation of practical utility over exhaustive verification. When relevant, it chooses to generate a quick plan before diving into deep reasoning.
Structures its reasoning traces in a way that facilitates decomposition and independent evaluation of each component. This contrasts with more monolithic CoT models where the entire chain of thought flows together seamlessly.
Attempts to identify possible flaws quickly without getting lost in excessive detail—a characteristic that could be both an advantage (for increased speed) and a limitation (due to the possibility of overlooking subtle errors).

Benchmarks

The following results were collected using the LM Evaluation Harness framework with the SGLang server using a repetition penalty of $1.2$ .

Tasks	n-shot	Metric	Value ± Stderr	Total Tokens	Avg Tokens/Ques
aime24	0	exact_match	0.1333 ± 0.0631	130,522	4,350.7
aime25	0	exact_match	0.1667 ± 0.0692	89,305	2,976.8

Table 1: AIME Evaluation Results

This version of ThoughtWeaver performs considerably worse on the AIME benchmarks compared to traditional reasoning models such as Qwen3 8B and even Ministral's own reasoning models. This is because the model has not yet undergone reinforcement learning.

In comparison to other reasoning models, though, it consumes far fewer total tokens. Artificial Analysis, which hosts benchmark details for several language models, shows that Ministral 3 8B Instruct spent a total of approximately 180,000 tokens on the AIME 2025 questions, and Qwen3 8B Reasoning spent a massive 540,000 tokens on reasoning alone. In contrast, ThoughtWeaver generated only 90,000 tokens across all questions, even though it selected "high reasoning" effort for each one.

Limitations

While this model introduces a novel approach to Chain-of-Thought architectures, several critical limitations must be acknowledged to provide context for its current performance and appropriate use cases.

Because ThoughtWeaver has not undergone further safety training, it is not guaranteed to reject offensive or dangerous queries when prompt engineered by skilled red teamers.
This version of the model is strictly unimodal (chat-only). It lacks numerous features found in contemporary frontier models, including multimodal inputs and agentic tool usage.
It is found that certain models often drastically forget previously learned concepts, a phenomenon called Catastrophic Forgetting. It is highly plausible that ThoughtWeaver may perform worse than its base model, Ministral, on certain tasks which it may have "forgotten" during fine-tuning.

Future Work: Reinforcement Learning

The resultant model from SFT can be fine-tuned with reinforcement learning using a verifiable rewards-based algorithm, such as Group Relative Policy Optimization (GRPO), Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), or Group Sequence Policy Optimization (GSPO). This step is crucial for further improving the model's performance and is left for future work. The reinforcement learning process will enable the model to learn from verifiable rewards, leading to enhanced reasoning accuracy.

To further improve ThoughtWeaver's usability in certain contexts, it can be reintegrated with the vision encoder from the Ministral base model and trained on extensive vision data. Furthermore, future versions of ThoughtWeaver can be trained on agentic tool usage with integrated reasoning.

Conclusion

ThoughtWeaver represents an initial step towards a more efficient and diverse approach to Chain-of-Thought reasoning in large language models. By deliberately limiting iterative self-revision loops and enforcing a structured Markdown CoT format, ThoughtWeaver demonstrates the potential to significantly reduce token consumption while still producing valuable candidate solutions—particularly beneficial for tasks which prioritise breadth over exhaustive verification. While current benchmarks reveal limitations in accuracy compared to more extensively trained models like Qwen3 8B and Ministral 3 8B, ThoughtWeaver's reduced resource usage offers a compelling tradeoff for applications where efficiency is paramount. Future work will focus on rigorously training and evaluating ThoughtWeaver across standard benchmarks with RL, integrating multimodal inputs, and exploring agentic tool usage. ThoughtWeaver's core contribution lies in a novel paradigm that prioritises diverse reasoning, paving the way for more adaptable and insightful reasoning capabilities.

For citations and references, view the paper below:

Feel free to download and use this model.
I'd love to hear your feedback.
You can contact me .

Thanks,
Vansh

ThoughtWeaver: Structured Chain-of-Thought in Markdown for Enhanced Reasoning

Abstract

Introduction

Reasoning with RL

Structured Reasoning Paradigms

Motivating Problem

Methodology

Data Collection

Model Fine-tuning

Results

Samples

START OF THOUGHT PROCESS

END OF THOUGHT PROCESS

Step 1: Understanding Divisibility by Multiple Numbers

Step 2: Calculating the LCM of 3, 4, and 5

Step 3: Finding Multiples of 60 Within the First 100 Positive Integers

Step 4: Conclusion

Final Answer

START OF THOUGHT PROCESS

Deconstructing the Prompt

Fleshing Out Brainstorm

Art Evolution Concepts

Architecture Evolution Concepts

Societal/Cultural Implications

Refinement

Final Review and Validation

END OF THOUGHT PROCESS

Art Evolution

Architecture Evolution

Societal and Cultural Impact

Benchmarks

Limitations

Future Work: Reinforcement Learning

Conclusion

Related Articles

ThoughtWeaver: Structured Chain-of-Thought in Markdown for Enhanced Reasoning

Abstract

Introduction

Related Work

Reasoning with RL

Structured Reasoning Paradigms

Motivating Problem

Methodology

Data Collection

Model Fine-tuning

Results

Samples

START OF THOUGHT PROCESS

END OF THOUGHT PROCESS

Step 1: Understanding Divisibility by Multiple Numbers

Step 2: Calculating the LCM of 3, 4, and 5

Step 3: Finding Multiples of 60 Within the First 100 Positive Integers

Step 4: Conclusion

Final Answer

START OF THOUGHT PROCESS

Deconstructing the Prompt

Fleshing Out Brainstorm

Art Evolution Concepts

Architecture Evolution Concepts

Societal/Cultural Implications

Refinement

Final Review and Validation

END OF THOUGHT PROCESS

Art Evolution

Architecture Evolution

Societal and Cultural Impact

Benchmarks

Limitations

Future Work: Reinforcement Learning

Conclusion

Related Articles