Extracting Hacker News Book Recommendations with the ChatGPT API

Created: 2023-10-04

2023-10-12: I have written a follow-up with some details about the process used to generate this list: Extracting HN Book Recommendations with the ChatGPT API, Vol. II

I love books and I enjoy reading through the Hacker News (HN) book recommendation threads. On HN, there are almost 200 stories so far this year that have the separate word “book” in the title and aren’t linked to another page. I wondered what the most commonly recommended or mentioned books are, mainly whether SICP or PCL would be the top recommendation.

After reading about the man who categorised his favourite podcast into the Dewey Decimal system using GPT, I knew that the GPT API could be used to categorise data and output the information in JSON format. So, using the HN data fetched from the Hacker News API, I took the subset of stories that seem to be book recommendation threads and extracted book titles, authors and URLs from the text using calls to the Chat Completions API.
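The thread selection described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this post: it assumes each item is a dict shaped like the Hacker News API's item objects (with `type`, `title` and an optional `url`), and the helper name `is_book_thread` and the sample items are made up for the example.

```python
import re

# Assumption: "book" must appear as a standalone word, matched case-insensitively.
BOOK_WORD = re.compile(r"\bbook\b", re.IGNORECASE)


def is_book_thread(item: dict) -> bool:
    """Return True for a story with the word "book" in its title and no outbound URL."""
    if item.get("type") != "story":
        return False
    if item.get("url"):  # link posts point at another page, so skip them
        return False
    return bool(BOOK_WORD.search(item.get("title") or ""))


# Hypothetical sample items, for illustration only.
items = [
    {"type": "story", "title": "Ask HN: What book changed your life?"},
    {"type": "story", "title": "Show HN: My new book", "url": "https://example.com"},
    {"type": "story", "title": "Ask HN: Best Facebook alternatives?"},
    {"type": "comment", "text": "Read a book"},
]
threads = [item for item in items if is_book_thread(item)]
```

The word-boundary match is what keeps titles such as “Facebook” out of the subset.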

Here are the top 50 book recommendations:

#	Title	Author	Count	First Mention
1	Structure and Interpretation of Computer Programs	Abelson and Sussman	376	5675
2	Gödel, Escher, Bach	Douglas Hofstadter	365	56795
3	How to Win Friends and Influence People	Dale Carnegie	292	5584
4	The C Programming Language	Brian Kernighan, Dennis Ritchie	284	135262
5	Dune	Frank Herbert	261	57231
6	Thinking, Fast and Slow	Daniel Kahneman	244	3277457
7	Meditations	Marcus Aurelius	231	134993
8	Atlas Shrugged	Ayn Rand	222	86114
9	The Art of Computer Programming	Donald E. Knuth	213	135245
10	Sapiens: A Brief History of Humankind	Yuval Harari	205	10028239
11	The Pragmatic Programmer	Andrew Hunt	203	5704
12	Zen and the Art of Motorcycle Maintenance	Robert M Pirsig	203	56941
13	Introduction to Algorithms	Charles E. Leiserson, Clifford Stein, Ronald Rivest, Thomas H. Cormen	171	55391
14	The Selfish Gene	Richard Dawkins	168	85867
15	Code: The Hidden Language of Computer Hardware and Software	Charles Petzold	160	135906
16	The Mythical Man-Month	Fred Brooks	159	5725
17	The Black Swan	Nassim Nicholas Taleb	158	56763
18	Designing Data-Intensive Applications	Martin Kleppmann	153	8671875
19	1984	George Orwell	152	85938
20	Code Complete	Steve McConnell	149	56709
21	Snow Crash	Neal Stephenson	146	85862
22	Ender’s Game	Orson Scott Card	143	56704
23	The Three-Body Problem	Cixin Liu	143	8867599
24	The Design of Everyday Things	Don Norman	136	85860
25	Bible	Unknown	134	85859
26	Founders at Work	Jessica Livingston	133	5613
27	Antifragile	Nassim Nicholas Taleb	130	4966437
28	Man’s Search for Meaning	Viktor E. Frankl	129	1634144
29	The Hitchhiker’s Guide to the Galaxy	Douglas Adams	127	56709
30	Cryptonomicon	Neal Stephenson	127	85940
31	The Fountainhead	Ayn Rand	127	135463
32	Fooled by Randomness	Nassim Nicholas Taleb	125	57595
33	Surely You’re Joking, Mr. Feynman!	Richard Feynman	125	85858
34	Siddhartha	Hermann Hesse	124	86337
35	Foundation	Isaac Asimov	123	140379
36	The Lord of the Rings	J. R. R. Tolkien	121	56629
37	Zero to One	Peter Thiel	115	7968392
38	Neuromancer	William Gibson	112	56663
39	The Lean Startup	Eric Ries	110	1570888
40	The Phoenix Project	Gene Kim	110	5569687
41	Never Split the Difference	Chris Voss	108	12245967
42	Guns, Germs, and Steel	Jared Diamond	107	56777
43	Design Patterns	Addy Osmani	107	80916
44	Clean Code	Robert C. Martin	106	1945860
45	Deep Work	Cal Newport	105	11702897
46	The Elements of Computing Systems	Noam Nisan, Shimon Schocken	104	1295307
47	The Little Schemer	Daniel P. Friedman, Matthias Felleisen	102	56629
48	Influence: The Psychology of Persuasion	Robert B. Cialdini	101	193848
49	The Soul of a New Machine	Tracy Kidder	99	135262
50	Masters of Doom	David Kushner	97	135284

Edit: Some corrections:

Dune is by Frank Herbert, not Brian Herbert.
The Meditations listed is by Marcus Aurelius, not Descartes.
Calculus is by Michael Spivak, although 14 mentions of Calculus by Apostol are included in that count too. (Whoops.)

Thanks to the Hacker News commenters who pointed out these errors. My SQL query should have returned the most common author for each title, instead of min(author).
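To illustrate why min(author) goes wrong, here is a small sketch of the "most common author per title" idea. It is written in Python rather than SQL, the function name `most_common_author` is my own, and the rows are made-up sample data chosen to mirror the errors listed above.

```python
from collections import Counter, defaultdict


def most_common_author(rows):
    """Map each title to the author named most often for it.

    Ties go to whichever author was seen first.
    """
    authors_by_title = defaultdict(Counter)
    for title, author in rows:
        authors_by_title[title][author] += 1
    return {
        title: counts.most_common(1)[0][0]
        for title, counts in authors_by_title.items()
    }


# Hypothetical (title, author) rows, for illustration only.
rows = [
    ("dune", "Frank Herbert"),
    ("dune", "Frank Herbert"),
    ("dune", "Brian Herbert"),
    ("meditations", "Marcus Aurelius"),
    ("meditations", "Marcus Aurelius"),
    ("meditations", "Descartes"),
]

# What min(author) does: pick the alphabetically first name per title.
by_min = {
    title: min(author for t, author in rows if t == title)
    for title in {"dune", "meditations"}
}
by_count = most_common_author(rows)
```

With these rows, the alphabetical minimum picks the rarer, wrong author for both titles, while counting picks the majority one.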

Update, 2023-10-12: The top 50 list has been revised; JavaScript and Calculus were there in error. Each of those entries was actually several different books with the same name. For example, JavaScript: The Good Parts was renamed to ‘javascript’ by my data processing code, but there are other books with the same title, and there are a number of books called “Calculus”. I have de-duped those, and they dropped off the top 50 as a result.

Some things I discovered while doing this project:

When the API doesn’t return valid JSON, it is usually because ChatGPT is saying things like “I apologize for the confusion…” or “You’re welcome! If you have any more questions, feel free to ask.” in response to an HN comment that just says “thanks” or asks a question.
I designed the prompt so that I can discard responses with empty titles, because I was unable to get ChatGPT to stop including mentions of authors without the title of a particular book.
Processing 57k comments cost about $40 using the GPT-3.5 Turbo API.
Even with a temperature of 0, GPT’s results vary from call to call. Others have noticed this effect (HN discussion). It’s not just GPT-4 that is non-deterministic: GPT-3.5 Turbo exhibits greater variability compared to earlier GPT-3 models.
It can identify links in the text, but I had to remove the HTML tag and leave just the URL, otherwise GPT would pick up the truncated link text instead of the URL.
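The link clean-up in the last point could look something like the sketch below. It assumes comment text arrives as HTML in which the full URL sits in the `href` attribute and the visible link text may be truncated. The regular expression and the name `links_to_urls` are my own illustration, not the code used for this post.

```python
import html
import re

# Matches a whole <a ...>...</a> element and captures the href value.
ANCHOR = re.compile(
    r'<a\s+[^>]*?href="([^"]*)"[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_urls(comment_html: str) -> str:
    """Replace each <a> element with its href so the full URL survives."""
    replaced = ANCHOR.sub(lambda match: match.group(1), comment_html)
    # Decode HTML entities such as &#x2F; back into plain characters.
    return html.unescape(replaced)


# Hypothetical comment, for illustration only.
comment = (
    'See <a href="https:&#x2F;&#x2F;example.com&#x2F;a-very-long-path" '
    'rel="nofollow">https:&#x2F;&#x2F;example.com&#x2F;a-very...</a> for details.'
)
cleaned = links_to_urls(comment)
```

A regular expression is enough for this narrow case; a real HTML parser would be the safer choice for arbitrary markup.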

Here’s an example of the JSON output by ChatGPT for this comment. It got everything wrong except the link, but it shows the format of the data:

[
  {
    "match": "Hitchhiker's Guide Vms Unsupported Undocumented Can Go Away At Any Time Feature",
    "title": "The Hitchhiker's Guide to the Galaxy",
    "author": "Douglas Adams",
    "link": "http://www.amazon.com/Hitchhikers-Guide-Vms-Unsupported-Undocumented-Can-Go-Away-At-Any-Time-Feature/dp/1878956000"
  }
]
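Given that format, handling the two failure cases mentioned earlier (replies that aren't valid JSON, and entries with an empty title) might be sketched like this. The function name `parse_books` is hypothetical, and this is one possible approach rather than the code actually used.

```python
import json


def parse_books(reply: str) -> list[dict]:
    """Return the usable book entries from a model reply, or [] if there are none."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Conversational replies such as "You're welcome!" end up here.
        return []
    if not isinstance(data, list):
        return []
    # Keep only entries that name an actual book title.
    return [
        entry
        for entry in data
        if isinstance(entry, dict) and str(entry.get("title") or "").strip()
    ]
```

For example, `parse_books("I apologize for the confusion...")` returns an empty list, as does a reply whose only entry has `"title": ""`.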

Edit: Someone asked for the prompt, so here it is:

prompt = [
            {"role": "system", "content": "Assistant that identifies book titles and authors in the following document and shows the words you match to a book title from. Some titles may be abbreviated, please expand the abbreviated title. If the document talks about an author but doesn't mention a book, leave \"title\" blank. If you know who the author is, provide the author. Don't include the book's subtitle. If the text is asking for a recommendation, without mentioning a book, then return an empty array. Provide your answer in a json array."},
            {"role": "user", "content": 'Wren\'s Explosion https://www.amazon.com/gp/395, and any Plath."'},
            {"role": "assistant", "content": '''[{"match":"'Wren's Explosion","title":"Explosion","author":"P.C. Wren","link":"https://www.amazon.com/gp/395"}, {"match":"any Plath","title":"", "author":"Sylvia Plath"}]'''},
            {"role":"user", "content":"3-days free trial isn't freemium."},
            {"role":"assistant", "content":"[]"},
            {"role":"user", "content": "Miranda Hamilton"},
            {"role":"assistant", "content": '''[{"match":"Miranda Hamilton","title":"Hamilton","author":"Lin-Manuel Miranda, Jeremy McCarter","link":""}]'''},
        ]
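For readers wondering how a prompt like this gets used, here is a rough sketch of sending one comment along with it. The call itself isn't shown in this post, so treat everything here as an assumption: `client` is assumed to be an `openai.OpenAI()` instance from the `openai` Python package (version 1.0 or later), the model name is my guess at the GPT-3.5 Turbo identifier, and `build_messages` and `extract_reply` are hypothetical helper names.

```python
def build_messages(few_shot: list[dict], comment: str) -> list[dict]:
    """Append the comment to classify after the system prompt and few-shot turns."""
    return [*few_shot, {"role": "user", "content": comment}]


def extract_reply(client, few_shot, comment, model="gpt-3.5-turbo"):
    """Send one comment to the Chat Completions API and return the raw reply text.

    Assumption: `client` behaves like `openai.OpenAI()` from openai>=1.0.
    """
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(few_shot, comment),
        temperature=0,  # results can still vary from call to call
    )
    return response.choices[0].message.content


# A shortened stand-in for the full prompt list, for illustration only.
few_shot = [
    {"role": "system", "content": "Identify book titles and authors. Answer in a json array."},
    {"role": "user", "content": "3-days free trial isn't freemium."},
    {"role": "assistant", "content": "[]"},
]
messages = build_messages(few_shot, "I recommend SICP.")
```

Keeping the message-building step separate makes it easy to inspect exactly what is sent for each comment before spending money on API calls.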

The Data

Because I enjoy working with data and think you might find it interesting to analyse the results, here’s the raw data produced by GPT, sorted by title. Note that there’s a match column which includes an excerpt of the comment where the book was identified. I also normalised the book titles by lowercasing them, removing ‘the’ if present at the start, and removing any subtitles. This enabled me to query the top books without missing too many items due to inconsistency in the names that GPT came up with.
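The normalisation can be sketched as below. This is an approximation rather than the exact code: in particular, treating everything after the first colon as the subtitle is my assumption, and `normalise_title` is a hypothetical name.

```python
import re


def normalise_title(title: str) -> str:
    """Lowercase a title, drop its subtitle and a leading "the"."""
    # Assumption: the subtitle is whatever follows the first colon.
    main = title.split(":", 1)[0]
    main = main.strip().lower()
    main = re.sub(r"^the\s+", "", main)
    # Collapse runs of whitespace so near-identical titles compare equal.
    return re.sub(r"\s+", " ", main)
```

Under this rule, “The Pragmatic Programmer: From Journeyman to Master” becomes “pragmatic programmer”. It also shows how distinct books can collide: “JavaScript: The Good Parts” collapses to “javascript”.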

Here is the input data in zipped CSV format; it expands to a 24 MB file.

Note

I have added Amazon affiliate links to the Amazon URLs in the tables above, mainly as a learning exercise.