Tarrosion's comments

sparsely updated blog: https://evanfields.net

> Is multi-agent collaboration actually useful or am I just solving my own niche problem?

I often write with Claude, and at work we have Gemini code reviews on GitHub; definitely these two catch different things. I'd be excited to have them working together in parallel in a nice interface.

If our ops team gives this a thumbs-up security-wise, I'll be excited to try it out when I'm back at work.


Would love to hear your feedback! Please let me know if I can make it any better or if there is anything that would make it more useful.


I'm finding 4 Opus good, but 4 Sonnet a bit underwhelming: https://evanfields.net/Claude-4/


I can confirm this has been true since at least v0.4!


I'm sorry for your loss, and I hope that helping others through this project helps you find some solace. IMHO, it's a mark of character that your response to having a problem is "I want to help other people so they suffer this problem less than I did."


This makes sense in the context of trying to maximize log wealth (or I think any concave function of wealth, though the arithmetic is different). But in one of OP's other articles [1], he says the Kelly criterion doesn't require trying to maximize log-wealth, that this is just a common misconception -- all that's required is maximizing something growing geometrically over time.

This I don't understand -- maybe someone can help me out? Say the real growth rate of capital (or the interest rate available to me, whatever) is 2%/year and I have a 10-year time horizon. So $1.00 today is ~$1.22 in 10 years. More generally, if I have wealth X today I will have 1.22X in 10 years. And if X is not a constant but a random variable, and I want to maximize future expected wealth (not log wealth), that's just max E[1.22X], and by linearity of expectation I should just maximize expected wealth today to maximize it in 10 years' time.

So Kelly being appropriate must have some other conditions, right? Wanting to maximize log wealth is surely sufficient (and individually probably ~rational). What else?
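
To make the tension concrete, here's a toy calculation with numbers I made up myself (a repeated 50/50 bet paying 2:1 -- nothing from the article): expected wealth keeps growing the more you stake, but expected log-growth peaks at the Kelly fraction.

    # Toy numbers of my own (not from the article): bet a fraction f of wealth each
    # round on a wager that wins with probability p and pays b:1, repeated n times.
    p, b, n = 0.5, 2.0, 100

    expected_wealth(f) = (1 + f * (p * b - (1 - p)))^n                    # E[W_n]
    expected_log(f)    = n * (p * log(1 + f * b) + (1 - p) * log(1 - f))  # E[log W_n]

    kelly = p - (1 - p) / b    # f* = p - q/b = 0.25 for these numbers

    for f in (0.1, kelly, 0.5, 0.9)
        println("f = $f: E[W] = ", expected_wealth(f), ", E[log W] = ", expected_log(f))
    end
    # E[W_n] keeps increasing all the way to f -> 1 (linearity of expectation rewards
    # going all-in), while E[log W_n] -- roughly the typical outcome -- peaks at the
    # Kelly fraction. That gap is exactly what the question above is about.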

[1]



OP, you have a comment there about mouse support in the Keyboardio. I've been using a Keyboardio since 2019 and haven't much tried the mouse support -- any advice? How did you set it up?


Not OP, but I have a Model 100, and I just assigned mouse keys in Chrysalis. It used to be the case that you had to build your own firmware to change the speed, but now that's possible through Chrysalis too.

I don't use them very often because I don't mind using a trackpad, but they're useful sometimes.


I'm curious what kind of slow IO is a pain point for you -- I was surprised to read this comment because I normally think of Julia IO being pretty fast. I don't doubt there are cases where the Julia experience is slower than in other languages, I'm just curious what you're encountering since my experience is the opposite.

Tiny example (which blends Julia-the-language and Julia-the-ecosystem, for better and worse): I just timed reading the most recent CSV I generated in real life, a relatively small 14k rows x 19 columns. 10 ms in Julia (CSV.jl + DataFrames.jl), 37 ms in Python + pandas -- i.e. much faster in Julia, but also not a pain point either way.
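
Roughly what the Julia side of that timing looks like ("example.csv" is a placeholder for whatever file you're reading; the first call pays the compilation cost, so time the second one):

    # Sketch of the comparison above; exact timings of course depend on machine and data.
    using CSV, DataFrames

    path = "example.csv"                  # ~14k rows x 19 columns in my case
    CSV.read(path, DataFrame)             # first call includes compilation
    @time df = CSV.read(path, DataFrame)  # ~10 ms on my machine after warm-up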


My use case was a program making many calls to an external program that generated XYZ-format files (computational chemistry) which then had to be read in. It's likely I was doing something wrong or inefficient, but I remember the whole process being rate-limited by this step in a way that it wasn't in Python.


IO is thread-safe by default, but that does slow it down. There's a keyword argument to turn that off (if you know you're running single-threaded), and right now the locking is a rather large overhead. IIRC it needs some GC work to reduce that overhead.
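
If I remember right, the keyword is `lock` on `open` -- treat the exact name as an assumption on my part and check the docs -- so something like:

    # Sketch only: disable per-operation locking on a stream you know is used
    # from a single thread.
    io = open("data.bin"; lock = false)
    data = read(io)
    close(io)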


How do modern foundation models avoid multi-layer perceptron scaling issues? Don't they have big feed-forward components in addition to the attention layers?


They rely heavily on what we call residual or skip connections. This means each layer computes something like x = x + f(x). This helps training a lot by ensuring the gradient can flow nicely through the whole network.

This is heavily used in ResNets (residual networks) for computer vision, and is what allows training much deeper convolutional networks. And transformers use the same trick.
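
A minimal, framework-free sketch of that pattern (the little `f` here is just a toy stand-in I picked for whatever the sub-layer computes):

    # Residual/skip connection in plain Julia: identity path plus a learned update.
    struct ResidualBlock{F}
        f::F
    end
    (r::ResidualBlock)(x) = x .+ r.f(x)   # x = x + f(x)

    f(x) = tanh.(0.1 .* x)                # toy sub-layer (would be attention or an MLP)
    block = ResidualBlock(f)
    y = block(randn(4))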


They don't do global optimisation of all layers at the same time, instead training all layers independently of each other.


I'm in the industry and nobody has done that for over ten years. There was just a brief phase around 2007, after papers like "Greedy layer-wise training of deep networks" (Bengio et al.), when people did it for a few years at most. Already with the rise of LSTMs in the 2010s this wasn't done anymore, and with transformers it isn't either. Would you care to share how you reached your conclusion? It matches none of my experience over the last 15 years, and we also train large-scale LLMs at our company. There's just not much point to it when gradients don't vanish.


Why don't gradients vanish in large scale LLMs?


Not easy to give a concise answer here, but let me try:

The problem mainly occurs in networks with recurrent connections or very deep architectures. In recurrent architectures it was addressed by LSTMs with their gating mechanisms. In very deep networks, e.g. ResNets, it was solved via residual connections, i.e. skip connections over layers. There were also other advances, such as replacing sigmoid activations with the simpler ReLU.

Transformers, which are the main architecture of modern LLMs, are highly parallel with no recurrence, i.e. at any layer you still have access to all the input tokens, whereas an RNN processes one token at a time. To address the potential problem due to depth, they also use skip connections.
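
A crude 1-D illustration of the depth point (toy functions of my own, with a finite difference standing in for backprop):

    # Stack 50 saturating "layers" with and without a residual term and compare dy/dx.
    σ(x) = 1 / (1 + exp(-x))

    deep_plain(x; depth = 50)    = foldl((h, _) -> σ(h),        1:depth; init = x)
    deep_residual(x; depth = 50) = foldl((h, _) -> h + 0.1σ(h), 1:depth; init = x)

    fd(f, x; h = 1e-6) = (f(x + h) - f(x - h)) / (2h)   # central finite difference

    println("plain:    dy/dx ≈ ", fd(deep_plain, 0.5))     # essentially zero
    println("residual: dy/dx ≈ ", fd(deep_residual, 0.5))  # order one or larger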


Location: Cambridge, MA, USA [Boston/Somerville in case anyone is text-searching]

Remote: strongly prefer in person or hybrid

Willing to relocate: I'm very happy in the Boston area, so only under extraordinary circumstances

Technologies: data science, machine learning, technical communication, simulation, optimization, physically motivated models, heuristics; Julia, Python (Django, numpy, scipy, sklearn, cupy, pandas), AWS, Postgres/PostGIS, Snowflake, Gurobi, NLopt, etc.

Resume: https://www.dropbox.com/scl/fi/wevhopcf81vx6vdoarn5a/efields...

Email: in the resume

Blurb: PhD in operations research -> startup employee #2 -> scale startup, gradually moving from data science IC to VP of data science + software engineering -> now looking for mission-meaningful hands-on technical work. I'm particularly good at technical communication as well as translating from the physical world (science/business) to math models. Open to companies of any size, especially motivated by the biosecurity, biotech, robotics, and clean energy sectors.

Something cool I built that's not work related: https://nomai-writing.com/ in the spirit of Outer Wilds

