evanhu_'s comments | Hacker News

Thank you so much! Yes, I will have that up soon as well.


I did try that at first, but it was hard to parse through the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text to prepare it optimally for chunking and embedding. Once I found GROBID I just went with that route, because it handled all of that for me.


There is a cache! You hit a new PDF, but at least you won't have to wait for that one again ;)


Definitely, I'll move to the LaTeX source instead of a PDF backend, since that allows better support for non-textual data that gets poorly scraped by GROBID. That is a really cool development I didn't know about; there's also https://ar5iv.labs.arxiv.org/ which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but also organizes the text into logical sections for me (intro, abstract, references), which I didn't want to do manually with heuristics I'd have to devise.
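For a concrete idea of what the HTML route could look like, here is a rough sketch of pulling a paper from ar5iv and splitting it by section. It's purely illustrative: the ltx_section class is an assumption based on typical LaTeXML output, and the exact markup varies from paper to paper.

    import requests
    from bs4 import BeautifulSoup

    def fetch_ar5iv_sections(arxiv_id: str) -> dict[str, str]:
        """Fetch the ar5iv HTML rendering of an arXiv paper and split it by section."""
        url = f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}"
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        sections = {}
        # LaTeXML usually wraps top-level sections in <section class="ltx_section"> (assumption).
        for sec in soup.find_all("section", class_="ltx_section"):
            heading = sec.find(["h2", "h3"])
            title = heading.get_text(strip=True) if heading else "Untitled"
            sections[title] = sec.get_text(" ", strip=True)
        return sections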


Maybe you could use: https://github.com/facebookresearch/nougat/tree/main or https://github.com/VikParuchuri/marker

Both are tools to convert PDFs into LaTeX or Markdown with LaTeX formulas. Maybe that helps.


Reading the motivation for the second:

"Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages."

When these are fed into the next stages as inputs, isn't it even less surprising to get even more hallucinations/repetitions?


Yes, using the LaTeX source code (or HTML, once that becomes reliable and widely used) should be much more robust than PDF parsing.


That is a very good resource to save


Oops, sorry for the miscommunication; you actually don't need to enter an API key for now. Feel free to just try it out!


Thank you :). I updated the README to have some more explanation of the steps.

The chunking algorithm splits the text by logical section (intro, abstract, authors, etc.) and also uses recursive subdivision chunking (chunk at 512 characters, then 256, then 128, ...). It is still quite naive, but it works OK for now. An improvement would perhaps involve more advanced techniques like knowledge graph precomputation.
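For a rough idea, the recursive subdivision step works along these lines (an illustrative sketch, not the actual repo code; the per-section text is assumed to have already been extracted by GROBID):

    def subdivide(text: str, sizes=(512, 256, 128)) -> list[str]:
        """Recursive subdivision chunking: emit windows of the current size,
        then recurse into each window at the next smaller size."""
        if not sizes:
            return []
        size = sizes[0]
        # Cut the text into windows of the current size.
        windows = [text[i:i + size] for i in range(0, len(text), size)]
        chunks = list(windows)
        # Only subdivide windows that are still bigger than the next size.
        for window in windows:
            if sizes[1:] and len(window) > sizes[1]:
                chunks.extend(subdivide(window, sizes[1:]))
        return chunks

    def chunk_paper(sections: dict[str, str]) -> list[str]:
        """Chunk each logical section (abstract, intro, references, ...) independently."""
        chunks = []
        for body in sections.values():
            chunks.extend(subdivide(body))
        return chunks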

Reranking works like this: instead of embedding each text chunk as a vector and doing a cosine-similarity nearest-neighbor search, you use a cross-encoder model that takes the two texts together and outputs a similarity score. Specifically, I chose Cohere's Reranker, which specializes in scoring query/answer chunk pairs.
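Concretely, the reranking step looks something like this sketch with the Cohere Python client (the model name, top_n, and key here are just placeholders, not the exact settings in the app):

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key

    def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
        """Score each (query, chunk) pair with Cohere's cross-encoder reranker
        and return the highest-scoring chunks."""
        response = co.rerank(query=query, documents=candidates,
                             top_n=top_n, model="rerank-english-v3.0")
        # Each result references the original document by index and carries a relevance score.
        return [candidates[r.index] for r in response.results]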


Thank you! And thanks for pointing that out. Since the underlying RAG is rather naive (a simple embedding cosine-similarity lookup, as opposed to knowledge graphs or other advanced techniques), I opted to embed both "small" chunks (512 characters and below) and entire-section chunks (e.g. embedding the whole introduction) in order to support questions such as "Please summarize the introduction". Since I also use 5 chunks for each context, I suspect this can add up to a massive amount of text on papers with huge sections.
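As a sketch, the lookup itself is just cosine similarity over that mixed-granularity index (small chunks and whole-section chunks embedded side by side); nothing fancier than this:

    import numpy as np

    def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                     chunks: list[str], k: int = 5) -> list[str]:
        """Cosine-similarity nearest-neighbor search over all embedded chunks,
        whether they are 512-character pieces or entire sections."""
        q = query_vec / np.linalg.norm(query_vec)
        c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
        scores = c @ q
        top = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top]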


This is the paper that would reliably trigger context overflows: https://arxiv.org/abs/1811.03116. It otherwise did an admirable job on this brainbender.


Yes! I'll set up talk2biorxiv.org very soon, as it would be simple to port over. I also plan on making the underlying research PDF RAG framework available as an independent module.


I spent forever looking at various PDF parsing solutions like Unstructured, and eventually stumbled across GROBID, which was an absolutely perfect fit since it's made entirely for scientific papers and has header/section-level segmentation capabilities (splitting the paper into Abstract, Introduction, References, etc.). It's lightweight and fast too!
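For anyone curious, using it is basically one HTTP call to a running GROBID server plus a bit of TEI XML parsing. Roughly like this sketch (endpoint and field name per GROBID's REST API as I understand it; illustrative only, not the app's actual code):

    import requests
    from bs4 import BeautifulSoup

    def grobid_sections(pdf_path: str, server: str = "http://localhost:8070") -> dict[str, str]:
        """Send a PDF to a local GROBID server and pull section titles and bodies
        out of the TEI XML it returns."""
        with open(pdf_path, "rb") as f:
            tei = requests.post(f"{server}/api/processFulltextDocument",
                                files={"input": f}, timeout=120).text
        soup = BeautifulSoup(tei, "xml")
        sections = {}
        for div in soup.find_all("div"):
            head = div.find("head")
            if head is not None:
                sections[head.get_text(strip=True)] = div.get_text(" ", strip=True)
        return sections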


I'm super impressed with what you've managed to create. Do you have any further plans for this project? I'm curious, now that it's finished and documented to such an extent, whether you will try to bring it publicity and actual usage, or whether this was just a passion project. Thanks


Thank you! I do get that itch to jump in and improve things whenever I see it lose a game, but I don't have further plans (development or commercial) in the near term. The goal originally was to see whether I liked ML, to decide on my next industry/career move, but there was a lot of "one more month".

I'm actually hopeful that some search techniques such as SBLE-PUCT[1] or better derivations can make their way into other open source projects, but they've had big teams working for a while on similar, often better ideas, so we'll have to see.

[1] https://chrisbutner.github.io/ChessCoach/high-level-explanat...


So, do you like ML?


Haha - I dislike how much of a black box it is, despite the statistical basis (for example, the back and forth on batch normalization rationale). But lots of interesting problems and tech to dig into.


You estimate it’s rated 3400-ish and it loses games????


It loses some games to Stockfish 13 and 14, and Lc0 - rarely at slow time control, and more often at blitz and bullet (actually, it has losses all the way down to Stockfish 9 in blitz).

Partly because of the way it tries to search more widely to avoid tactical traps, it can also be a little sloppy in holding advantages or minimizing losses (this could use some more work and tuning). This ends up making it a little drawish, so it loses less than you'd expect to Stockfish 14, but also doesn't beat up weaker engines as well as Stockfish 14 does.

You can see some of this in the raw tournament results[1]. At 40 moves per 15 minutes, repeating, each engine draws with the ones above and below it, but starts to win and lose at a distance of 2 or 3.

At 5+3 time control, ChessCoach goes 1-0-29 vs. Stockfish 12, but Stockfish 12 is better at beating Stockfish 8-11 than ChessCoach is, so CC ends up between SF11 and SF12 in the end.

On Lichess, where there's no "free time" to get ready for searches, ChessCoach's naïve node allocation/deallocation makes it waste time, and means it can't ponder for very long on the opponent's time - a big opportunity for improvement (it needs a multi-threaded pool deallocator that can feed nodes back to local pools for the long-lived search threads). I think it's also hitting a bug with Syzygy memory mapping that Stockfish works around via reloading every "ucinewgame" (which I don't trigger on Lichess). So, overall, its performance on Lichess is worse.

Also, you can't read too much into this data - very few games, and no opening book.

[1] https://chrisbutner.github.io/ChessCoach/data.html#appendix-...

