LLMs as Arbitrary Forecasters
The following preamble was written with AI, with minor edits; the rest of the blog is human written. I will be exploring something I know well but may need to write about vaguely: LLMs in time series. I built a niche product (now failed) which served billions of tokens over API. It failed because the value proposition was not viable; scaling and market positioning dominate. I share essentially everything you need for arbitrary forecasting directly, then I share a non-linear forecasting harness and some thoughts.
Preamble (LLMs and TS in literature)
Time-series forecasting is a domain where numerical history rarely contains the full set of constraints, regimes, calendar effects, operational scenarios, and causal structure that practitioners routinely use to produce reliable forecasts. Two recent studies make this gap explicit by treating a large language model as the primary forecasting substrate and asking what happens when we express both history and context directly in the prompt.
LLM Processes (Requeima et al., NeurIPS 2024) takes the most literal stance: an LLM can be elicited into behaving like a probabilistic regressor. Formally, given observed pairs $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^M$ and optional natural-language text $T$ encoding prior knowledge, they aim to extract the model’s predictive distribution over one or many targets $y^*$ at query inputs $x^*$. Practically, this means serializing numeric pairs into text and using the LLM’s own token distribution to (a) sample numeric continuations (rejecting non-numeric outputs) and (b) compute continuous(-ish) likelihoods by assigning probability mass to numeric string “bins” and masking out non-numeric tokens. The paper emphasizes that things we often dismiss as “prompt cosmetics”—how you delimit pairs, how you order training points, and how you sample—materially change the implied inductive bias and forecast quality.
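To make the serialization step concrete, here is a minimal sketch of rendering $(x, y)$ pairs plus a query input into a single prompt string. This is not the paper's code: the helper name, delimiter choices, and decimal precision are all illustrative assumptions, and they are exactly the "prompt cosmetics" the paper shows to matter.

```python
# Hypothetical sketch of LLMP-style pair serialization (not the paper's code).
# Delimiters and precision are illustrative; the paper shows such choices
# materially affect forecast quality.

def serialize_pairs(xs, ys, x_query, pair_sep=", ", field_sep=":", decimals=2):
    """Render training pairs plus a query input as one prompt string."""
    body = pair_sep.join(
        f"{x:.{decimals}f}{field_sep}{y:.{decimals}f}" for x, y in zip(xs, ys)
    )
    # The model continues after the trailing field separator; its numeric
    # completion (non-numeric outputs rejected) is treated as a sample
    # from the implied predictive distribution over y* at x*.
    return f"{body}{pair_sep}{x_query:.{decimals}f}{field_sep}"

prompt = serialize_pairs([0.0, 1.0, 2.0], [1.0, 1.5, 2.0], 3.0)
print(prompt)  # 0.00:1.00, 1.00:1.50, 2.00:2.00, 3.00:
```

Sampling the completion many times (and rejecting non-numeric strings) then gives an empirical predictive distribution rather than a point forecast.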
For time series specifically, LLM Processes frames multi-shot prompting as a nonparametric learning mechanism: you can feed the LLM additional historical exemplars “in context” and observe performance improvements consistent with in-context adaptation. In their precipitation experiment, the model forecasts a future window given a recent short history, and then receives 1–12 additional historical multi-year examples from the same location; adding examples improves performance up to a saturation point (and can degrade slightly beyond it), suggesting that multi-shot ICL can help the model lock onto repeating structure like seasonality.
If LLM Processes asks “what distribution does an LLM imply when you serialize the dataset?”, Context is Key (CiK) (Williams et al., arXiv:2410.18959; ICML 2025) asks the complementary question: “when is text context necessary for forecasting?” CiK is built around context-aided forecasting $P(X_F \mid X_H, C)$, where $C$ is natural-language context that is deliberately constructed to be essential—tasks are designed so the numerical history alone is insufficient, and accurate prediction requires integrating both modalities. The benchmark comprises 71 manually designed forecasting tasks spanning multiple real-world domains, with context types that encode constraints, scenarios, causal relationships, and other nontrivial information that “unlocks” good forecasts.
Methodologically, CiK is also a useful reality check on what “using an LLM for forecasting” actually entails in practice. Their Direct Prompt baseline makes the interface explicit: a rigid template with a <context> block plus a <history> block rendered as (timestamp, value) pairs, and a requested <forecast> block for specific future timestamps. They then use constrained decoding (regex enforcement via lm-format-enforcer) to keep generations on-format, highlighting a core tension: the best prompting strategies often depend as much on generation control and structure adherence as on “forecasting intelligence.” (Note from Matt: I find constrained decoding to not be necessary in practice. Good ICL is sufficient.) They also contrast Direct Prompt against an LLMP-style approach that is closer to pure autoregressive next-value continuation.
Together, these two studies bracket the emerging design space for “ICL + time series”:
- LLM Processes shows how far you can go by treating the prompt as an in-context dataset and reading the LLM as an implicit stochastic process—where prompt construction becomes a form of model specification.
- CiK shows that for many realistic forecasting problems the real signal is not in $X_H$ alone, and that rigorous evaluation requires tasks where text context is not decorative but identifiability-critical.
The rest of the post will use these two works as anchors to dig into in-context learning for forecasting: how multi-shot exemplars behave like a data-dependent prior, how numeric serialization and delimiter choices act like “tokenizer-level feature engineering,” and how adding natural-language context changes the effective conditional $P(X_F \mid X_H, C)$—sometimes dramatically improving forecasts, sometimes introducing brittle failure modes when the model misreads or over-trusts the text.
Normalized ICL
The general idea for time-series forecasting with ICL is to provide examples and then ask the LLM to continue the generation. I have found that base models are especially adept at this. You don’t even need any words per se. Provide upwards of 5 examples and then allow the model to predict the 6th output. ICL here is powerful but it is minimally divergent from the above experiments. This was an obvious path in 2023.
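The "no words needed" prompt shape above can be sketched directly. This is my own illustrative helper, not any paper's interface: each example series becomes one line of comma-separated values, and the target's history is left as an unfinished final line for a base model to continue.

```python
# Illustrative few-shot "continue the series" prompt for a base model.
# Helper name and formatting are my own; one series per line, and the
# trailing ", " invites the model to emit the next value of the target.

def icl_prompt(example_series, target_history, decimals=1):
    lines = [
        ", ".join(f"{v:.{decimals}f}" for v in series)
        for series in example_series
    ]
    lines.append(", ".join(f"{v:.{decimals}f}" for v in target_history) + ", ")
    return "\n".join(lines)

print(icl_prompt([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]], [3.0, 6.0]))
```

With five or so complete example lines, a base model will typically complete the final line in the same format, which is the whole trick.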
One overlooked aspect is that normalizing the input can be helpful. If you normalize your input well, you get the added benefit of domain overlap: time series from different datasets now lie on the same scale, which gives you a much stronger ICL selection process. The scaling is not trivial, though. If you leave no buffer, the model can produce clipped responses. There are other tidbits to address for optimal forecasting, but those are relatively easy to find and fix.
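A minimal sketch of what I mean by leaving a buffer: map history into, say, [0.1, 0.9] instead of [0, 1], so a forecast that exceeds the observed range is not clipped at the scale boundary. The specific buffer of 0.1 is my illustrative choice, not a recommendation.

```python
# Buffered min-max normalization (sketch; the [0.1, 0.9] band is an
# illustrative assumption). Returning the inverse transform lets you map
# model predictions back to the original scale.

def normalize_with_buffer(values, low=0.1, high=0.9):
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # guard against constant series
    scaled = [low + (high - low) * (v - vmin) / span for v in values]

    def denormalize(v):
        # Inverse of the forward transform above.
        return vmin + (v - low) * span / (high - low)

    return scaled, denormalize

scaled, denorm = normalize_with_buffer([10.0, 20.0, 30.0])
# scaled spans [0.1, 0.9]; a prediction of 1.0 maps back above the
# historical max instead of being pinned to it.
```

Because every normalized series lives on the same band, cross-dataset ICL examples become directly comparable, which is the domain-overlap benefit described above.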
The natural next question is how to select the ICL examples. Better still, the above studies use language as conditioning, which makes the selection process more difficult.
Selecting ICL
The boring answer is that you can use embedding models. The more specific answer is that you can combine time-series techniques with embeddings for a better selection process. For instance, cosine distance, correlation measures, and alignment techniques like dynamic time warping all work well. I’ve tested many methods and cosine similarity tends to win the effectiveness–time tradeoff.
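The cosine-similarity selector is a few lines. In the sketch below, embeddings are plain lists of floats; in practice they would come from an embedding model (over the text context, the serialized series, or both). Helper names are mine.

```python
# Sketch of ICL example selection by cosine similarity over embeddings.
# Embeddings are plain float lists here; a real system would produce them
# with an embedding model over the context and/or serialized series.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_emb, candidate_embs, k=5):
    """Indices of the k candidates most similar to the query."""
    ranked = sorted(
        range(len(candidate_embs)),
        key=lambda i: cosine(query_emb, candidate_embs[i]),
        reverse=True,
    )
    return ranked[:k]
```

Swapping `cosine` for a DTW distance (and flipping the sort order) gives the alignment-based variant; the harness around it stays identical, which is part of why cosine wins the effectiveness–time tradeoff.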
The point of the embedding is to use the language component. However, I’ve also found that embeddings work well out of the box with time series data. For instance, if you compare Jegadeesh–Titman momentum allocation strategies against a static embedding model selector with a linear head on stocks, the static model can perform better in naive backtests. Ideally you will need a positional encoding scheme and a few other quick edits, but the results are impressive. This does not mean it’s a valid stock picking strategy! It will fail in practice for a number of very obvious reasons. The point is the adaptability of language pretraining.
I’ve been successful with both pure embedding and numerical approaches for LLM forecasting ICL selection. The ideal choice depends on your end goal and your data specifically. If you are doing purely numerical forecasting without linguistic context, numerical methods are easier and very effective. If you have context and need to keep the time series paired with that context, embeddings will be better. Embedding approaches are even better when incorporated into RL retrieval harnesses.
Pitfalls
There are three major pitfalls with this approach: timescale, noise, and the locality of LLMs. Timescale is why you will never see an LLM employed as a forecaster at quant firms: it’s too slow. Locality is a challenge beyond time series; it is an artifact of the next-token-prediction training objective and is closely related to looping. Noise is always interesting in and of itself.
If your time-series task can be modelled linearly, you should not be using an LLM. The most public case of this is high-frequency trading (HFT). HFT systems assume linear approximations because they work at that timescale. It’s similar to the small-angle approximation, where the sine of an angle can be treated as equal to the angle itself: the higher-order terms do not matter at that scale. On HFT timescales, you are making linear approximations of non-linear systems because the difference is negligible. LLMs are too large and too overpowered to operate at this scale. Under small timescales and tight latency constraints, it makes much more sense to use a specialized, small model.
Noise is another constraint. If you try to forecast on very noisy data using an LLM and ICL, the model will pick up on the noise in the ICL examples and produce poor predictions. There is a noise threshold below which the system is viable. However, the largest models are incredible noise generators in noisy regimes. Limited in-domain training exacerbates this problem: models will fail to use any signal in the ICL examples. With multiple samples and filtering, few approaches produce higher-quality in-domain noise. There is a market for that noise, but for forecasting you tend to prefer the signal.
The final pitfall is locality. As hinted, this is not just related to time-series applications of LLMs. Looping happens because of a mismatch between training and inference, which leads the model to decide that the optimal local consistency is to loop. The model remains locally coherent, but globally the looping works against our goals. In time series predictions, this is primarily an issue with smaller models: as you perform inference, the model outputs the same value for each new time step (which is often multiple tokens). Larger models suffer from this less because they are more expressive. Expressivity is correlated with, but not a guarantee of, higher global coherence.
Looping in time series is funny because if you don’t plot the exact numbers and only look at eval summaries, you might think a looped model is doing very well. Instead of predicting wildly wrong values, it may just predict a straight-line average of what it has seen (since each loop is multiple tokens making up a single prediction point). I point this out as a warning to always look at your plots and individual values. The studies in our preamble chose their evaluation criteria precisely because of the looping issue.
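A cheap diagnostic alongside the plots is to flag forecasts whose values collapse to (near-)constant output. This sketch and its thresholds are my own ad hoc choices, not from either paper.

```python
# Ad hoc loop detector (thresholds are my own): flag a forecast whose
# predicted values are near-constant or repeat from a tiny set of values.

def looks_looped(predictions, tol=1e-6, max_unique_frac=0.1):
    if len(predictions) < 2:
        return False
    mean = sum(predictions) / len(predictions)
    variance = sum((p - mean) ** 2 for p in predictions) / len(predictions)
    unique = len({round(p, 6) for p in predictions})
    return variance < tol or unique / len(predictions) <= max_unique_frac

print(looks_looped([3.14] * 20))                 # True: the classic flat line
print(looks_looped([1.0, 2.0, 3.0, 2.0, 1.0]))  # False
```

A looped forecast can still score deceptively well on averaged error metrics, which is exactly why this check belongs next to, not instead of, the plots.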
The first two pitfalls are hard constraints on where these models can be deployed. However, local vs. global coherence can be addressed with training. One method is adding a Fourier loss term to next-token prediction. When fine-tuned under these conditions, the models perform much better at global coherence because frequencies are global: each token’s output can be evaluated against whole-sequence frequencies. The other solution is to use RL, which, when implemented well, also tends towards global coherence.
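The frequency-domain idea can be sketched as comparing magnitude spectra of the predicted and target sequences. This is an illustration of the principle, not the actual training code: in fine-tuning this term would be added with some weight to the next-token loss, and the plain-Python DFT below stands in for a framework FFT.

```python
# Sketch of a frequency-domain auxiliary loss (illustrative, not training
# code): mismatch between the magnitude spectra of predicted and target
# sequences penalizes per-token outputs against whole-sequence structure.
import cmath

def dft_magnitudes(x):
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def fourier_loss(pred, target):
    p, t = dft_magnitudes(pred), dft_magnitudes(target)
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

# A constant ("looped") prediction has zero energy at the target's
# oscillation frequency, so it is punished even if its mean is right.
target = [0.0, 1.0, 0.0, -1.0] * 4
print(fourier_loss([0.0] * 16, target) > fourier_loss(target, target))  # True
```

This is why the term fights looping specifically: a flat continuation can look fine token by token, but its spectrum is flagrantly wrong.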
Comparison with SoTA
I am unfortunately not at liberty to publish specific numbers from my own experience. That is alright! This paper (from our preamble) does a phenomenal job and hits on many of the important aspects: https://arxiv.org/abs/2410.18959. This is also why I still believe that research/OSS lags industry by about 12 months; the paper trailed my own experiences by roughly a year.
The “with context” results are not as interesting to me. The models effectively use context to improve their time series forecasting, which implies you already have more capable models. However, it’s not convincing on its own as a reason to use LLMs directly. For instance, one example the study gives is that if an ATM is going to be closed, it should have 0 customers that day. The LLM is given that closure information, which is significant lookahead. If you are giving that lookahead, then you should also auto-zero those dates for the numerical-only models.
This raises a major contention for me. If you can’t beat SoTA numerical methods, then the LLM should not be doing the forecasting and should instead be abstracted one step up. You can incorporate your non-numerical context and have the LLM oversee your bespoke time-series model. This idea is why I believe DeepSeek wants a powerful reasoning LLM. You can have an LLM allocated to your HFT system, adjusting it in a human-like manner without incurring the cost of applying LLMs directly to HFT.
This chart gets more interesting! I just wish they had used Llama-3.1-405B (base). That model was incredible for time series modelling. Assuming there is no noise floor at ~0.3, it would have beaten everything on this chart. In general, you see a common theme with LLMs here: the TS foundation models don’t beat “just make it bigger”. Scale wins once again, even when the LLMs aren’t trained for time series!
One trend you don’t see on this chart (because it’s a snapshot of models in time) is the benefit that code as training data had on LLMs and forecasting. The more numerical data they were trained on, the better they got at time series tasks. For the above tests, LLMP is from our other preamble study; it uses ICL with numerical values only. I like appendix H as an example of how ICL helps:
You can see the failure modes rather clearly! As you increase the number of examples, you drastically decrease the mean absolute error and increase adherence to expected outcomes. The LLM picks up on seasonal trends and stops “looping”. The blue points are the given values, the black are the true values, and the red are the predicted values. You can read more about it here: https://proceedings.neurips.cc/paper_files/paper/2024/file/c5ec22711f3a4a2f4a0a8ffd92167190-Paper-Conference.pdf.
Non-Linear Forecasting
Forecasting for time series is interesting but has limited use. If you have the data, it is almost always better to train a small, bespoke time series model. However, LLMs can be used for back-of-the-envelope predictions or even for synthetic data generation, an underutilized unlock for the forecasting subset of machine learning. Personally, I believe LLMs are better at non-linear forecasting and as abstracted layers over forecasting in general. To show this, I have published a new GitHub repo with a harness for stock evaluation: https://github.com/Matthew-agi/AI-Streetview-Financial-/tree/main. In the words of an anonymous Jeff, it is better than a recent MBA grad but not comparable to a professional analyst. Still, it points at a tangible future.
Example Output: https://uror.io/posts/aapl-report-20260205-moon-hotai-kimi-k2-in-truct
I’ve already mentioned that I believe DeepSeek wants a quant-enabled LLM for HFT and similar. The LLM will be abstracted from the time series aspect. I also believe that reasoning models will be able to provide human-level forecasting on stocks at non-linear timeframes (years). Despite the enormous data contamination issue, this is a great problem for reinforcement learning. It is also interesting as a test harness for recurrent language models. The harness I released is old! I wrote it nearly a year ago at this point. I do plan to implement more current research initiatives and add an RL env. I have no target date on these releases.
Anyway, this works because LLMs have enhanced reasoning and understand numerical values. When looking at something non-linear and difficult to predict like the stock market, you essentially look at comparative values within an industry and across industries. The idea of this harness is to do a lot of the legwork of an analyst and give a “street” view of the underlying company. I don’t expect this harness to replicate Buffett out of the box (although it did pick out Google near the bottom last summer). It has biases and needs to box in the underlying model to work. We specify explicit sections.
This harness is compelling but has obvious limitations. The results are decently high quality, but the content and context length challenge many models. Often you will find nonsensical outputs: nonsense from reasoning about numerical values, or reasoning in ways that seem obviously incorrect to industry personnel. The “taste” of models is exposed. They sound smart, but the actual decision is often poor quality because the model overlooked a recent result or lacks in-domain experience. Taste in this area has improved minimally in OSS LLMs over the past year.
Despite all of this, there is something I find captivating here. With more industry-specific training, I believe you will get a very competent model. I don’t think you get a very valuable model. In speaking with people in the industry, a “street” (Wall Street) level view is useful initially, but people make money from having a unique perspective. This model can quickly dole out company and industry basics; until taste is solved, you will quickly saturate the value. Even if taste is solved, you are limited by input data. Public data, and real-world gated perspectives on that data, saturate as well.
The model is only as good as the data you provide in context. This idea is not new. With proper training, a financial LLM will eat this harness and become superhuman at using financial data. Deploy it to entities that seed new information into the financial network and you have something very valuable. For me, it’s an excellent long-context playpen for LLMs and soon to be an excellent RL environment. I’m releasing it because I am not in a position to deploy this harness to maximize its value.
In closing, this project reinforced my belief that AI will continue the trend of value accruing at the edge of networks. It also influenced many of my views on ICL. ICL is incredible when used well. You can also often tell when people started using ICL: those who are big fans tend to have broader experience with base models. The fact that past-generation LLMs could forecast is crazy to me, and it clearly scales with LLM size. Base models have unexplored general computation capacities, evidenced by their forecasting capacity. Despite this, products built on LLMs are self-defeating if limited to in-network information. They need to be deployed where the value can be captured. This has informed my belief that all in-network information will be commodified.