Extracting Hacker News Book Recommendations with the ChatGPT API

2023-10-12: I have written a followup with some details about the process used to generate this list: Extracting HN Book Recommendations with the ChatGPT API, Vol. II

I love books and I enjoy reading through the Hacker News (HN) book recommendation threads. On HN, there are almost 200 stories so far this year that have the separate word “book” in the title and aren’t linked to another page. I wondered what the most commonly recommended or mentioned books are, and mainly whether SICP or PCL would be the top recommendation.

After reading about the man who categorised his favourite podcast into Dewey Decimal using GPT, I was aware that the GPT API could be used to categorise data and output the information in JSON format. So, using HN data fetched from the Hacker News API, I took the subset of stories that seem to be book recommendation threads and extracted book titles, authors and URLs from the text using calls to the Chat Completions API.
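The story selection described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this post: it assumes each story is a dict shaped like an item from the HN API (with `type`, `title` and an optional `url`), and the function name `is_book_thread` and the sample stories are made up.

```python
import re

# Matches "book" only as a standalone word (assumption: case-insensitive,
# so "Book" counts but "books" and "Facebook" do not).
BOOK_WORD = re.compile(r"\bbook\b", re.IGNORECASE)


def is_book_thread(story: dict) -> bool:
    """Return True for a text-only story whose title contains the word "book"."""
    if story.get("type") != "story":
        return False
    if story.get("url"):  # link posts point to another page; keep text posts only
        return False
    return bool(BOOK_WORD.search(story.get("title", "")))


# Made-up sample items for illustration.
stories = [
    {"type": "story", "title": "Ask HN: Best book you read this year?"},
    {"type": "story", "title": "Facebook open-sources a thing", "url": "https://example.com"},
    {"type": "story", "title": "Ask HN: Favourite books?"},
    {"type": "comment", "text": "Read this book"},
]
book_threads = [s for s in stories if is_book_thread(s)]
```

Whether plurals such as "books" should also count is a judgment call; the pattern here follows the "separate word" wording literally.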

Here are the top 50 book recommendations:

# | Title | Author | Count | First Mention
1 | Structure and Interpretation of Computer Programs | Abelson and Sussman | 376 | 5675
2 | Gödel, Escher, Bach | Douglas Hofstadter | 365 | 56795
3 | How to Win Friends and Influence People | Dale Carnegie | 292 | 5584
4 | The C Programming Language | Brian Kernighan, Dennis Ritchie | 284 | 135262
5 | Dune | Frank Herbert | 261 | 57231
6 | Thinking, Fast and Slow | Daniel Kahneman | 244 | 3277457
7 | Meditations | Marcus Aurelius | 231 | 134993
8 | Atlas Shrugged | Ayn Rand | 222 | 86114
9 | The Art of Computer Programming | Donald E. Knuth | 213 | 135245
10 | Sapiens: A Brief History of Humankind | Yuval Harari | 205 | 10028239
11 | The Pragmatic Programmer | Andrew Hunt | 203 | 5704
12 | Zen and the Art of Motorcycle Maintenance | Robert M Pirsig | 203 | 56941
13 | Introduction to Algorithms | Charles E. Leiserson, Clifford Stein, Ronald Rivest, Thomas H. Cormen | 171 | 55391
14 | The Selfish Gene | Richard Dawkins | 168 | 85867
15 | Code: The Hidden Language of Computer Hardware and Software | Charles Petzold | 160 | 135906
16 | The Mythical Man-Month | Fred Brooks | 159 | 5725
17 | The Black Swan | Nassim Nicholas Taleb | 158 | 56763
18 | Designing Data-Intensive Applications | Martin Kleppman | 153 | 8671875
19 | 1984 | George Orwell | 152 | 85938
20 | Code Complete | Steve McConnell | 149 | 56709
21 | Snow Crash | Neal Stephenson | 146 | 85862
22 | Ender’s Game | Orson Scott Card | 143 | 56704
23 | The Three-Body Problem | Cixin Liu | 143 | 8867599
24 | The Design of Everyday Things | Don Norman | 136 | 85860
25 | Bible | Unknown | 134 | 85859
26 | Founders at Work | Jessica Livingston | 133 | 5613
27 | Antifragile | Nassim Nicholas Taleb | 130 | 4966437
28 | Man’s Search for Meaning | Victor E. Frankl | 129 | 1634144
29 | The Hitchhiker’s Guide to the Galaxy | Douglas Adams | 127 | 56709
30 | Cryptonomicon | Neal Stephenson | 127 | 85940
31 | The Fountainhead | Ayn Rand | 127 | 135463
32 | Fooled by Randomness | Nassim Nicholas Taleb | 125 | 57595
33 | Surely You’re Joking, Mr. Feynman! | Richard Feynman | 125 | 85858
34 | Siddhartha | Herman Hesse | 124 | 86337
35 | Foundation | Isaac Asimov | 123 | 140379
36 | The Lord of the Rings | J. R. R. Tolkien | 121 | 56629
37 | Zero to One | Peter Thiel | 115 | 7968392
38 | Neuromancer | William Gibson | 112 | 56663
39 | The Lean Startup | Eric Ries | 110 | 1570888
40 | The Phoenix Project | Gene Kim | 110 | 5569687
41 | Never Split the Difference | Chris Voss | 108 | 12245967
42 | Guns, Germs, and Steel | Jared Diamond | 107 | 56777
43 | Design Patterns | Addy Osmani | 107 | 80916
44 | Clean Code | Robert C. Martin | 106 | 1945860
45 | Deep Work | Cal Newport | 105 | 11702897
46 | The Elements of Computing Systems | Noam Nisan, Shimon Schocken | 104 | 1295307
47 | The Little Schemer | Daniel P. Friedman, Matthias Felleisen | 102 | 56629
48 | Influence: The Psychology of Persuasion | Robert B. Cialdini | 101 | 193848
49 | The Soul of a New Machine | Tracy Kidder | 99 | 135262
50 | Masters of Doom | David Kushner | 97 | 135284

Edit: Some corrections:

  • Dune is by Frank Herbert, not Brian Herbert.
  • The Meditations listed is by Marcus Aurelius, not Descartes.
  • Calculus is by Michael Spivak, although 14 mentions of Calculus by Apostol are included in that count too. (Whoops.)

Thanks to the Hacker News commenters who pointed out these errors. My SQL query should have returned the most common author for each title, instead of min(author).
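To illustrate why min(author) went wrong, here is a small Python sketch (not the SQL I ran). The rows are made up; they stand in for (normalised title, author) pairs extracted by GPT. Taking the minimum returns whichever author name sorts first alphabetically, while counting picks the name that was actually mentioned most often.

```python
from collections import Counter, defaultdict

# Hypothetical (normalised title, author) rows for illustration only.
rows = [
    ("dune", "Frank Herbert"),
    ("dune", "Frank Herbert"),
    ("dune", "Brian Herbert"),
    ("meditations", "Marcus Aurelius"),
    ("meditations", "Descartes"),
    ("meditations", "Marcus Aurelius"),
]

# Count how often each author appears for each title.
authors_by_title = defaultdict(Counter)
for title, author in rows:
    authors_by_title[title][author] += 1

# Equivalent of min(author): the alphabetically first name, however rare.
alphabetical = {t: min(c) for t, c in authors_by_title.items()}

# The intended behaviour: the most frequently mentioned author per title.
most_common = {t: c.most_common(1)[0][0] for t, c in authors_by_title.items()}
```

With these rows, `alphabetical` gives Brian Herbert and Descartes, while `most_common` gives Frank Herbert and Marcus Aurelius.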

Update, 2023-10-12: The top 50 list has been revised; JavaScript and Calculus were there in error. Each of those entries was actually several different books sharing the same name. For example, JavaScript: The Good Parts was renamed to ‘javascript’ by my data processing code, but there are other books with that title, and there are a number of books called “Calculus”. I have de-duped those, and they dropped off the top 50 as a result.

Some things I discovered while doing this project:

  • When the API doesn’t return valid JSON, it is usually because ChatGPT is saying things like “I apologize for the confusion…” or “You’re welcome! If you have any more questions, feel free to ask.” in response to an HN comment that just says “thanks” or asks a question.
  • I designed the prompt so that I can discard responses with empty titles, because I was unable to get ChatGPT to stop including mentions of authors without the title of a particular book.
  • Processing 57k comments cost about $40 using the GPT-3.5 Turbo API.
  • Even with a temperature of 0, GPT’s results vary from call to call. Others have noticed this effect (HN discussion). It’s not just GPT-4 that is non-deterministic: GPT-3.5 Turbo exhibits greater variability than earlier GPT-3 models.
  • It can identify links in the text, but I had to remove the HTML tag and leave just the URL; otherwise GPT would pick up the truncated link text instead of the URL.
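The link clean-up in the last point could look something like this. It is a sketch under assumptions, not my actual code: it assumes comment text arrives as HTML with ordinary `<a href="...">` tags whose visible text may be truncated, and the names `ANCHOR` and `links_to_urls` are invented for the example.

```python
import html
import re

# Matches an anchor tag and captures its href; the link text is discarded.
ANCHOR = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)


def links_to_urls(comment_html: str) -> str:
    """Replace each <a ...>text</a> with the full URL from its href attribute."""
    replaced = ANCHOR.sub(lambda m: m.group(1), comment_html)
    # Decode HTML entities such as &#x2F; that may appear in the text or URLs.
    return html.unescape(replaced)


sample = (
    'I liked <a href="https://www.example.com/books/a-very-long-title" '
    'rel="nofollow">https://www.example.com/books/a-ve...</a> a lot.'
)
cleaned = links_to_urls(sample)
```

A regular expression is enough for simple anchors like these; a real HTML parser would be the safer choice for arbitrary markup.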

Here’s an example of the JSON output by ChatGPT for one comment. It got everything wrong except the link, but it shows the format of the data:

[
  {
    "match": "Hitchhiker's Guide Vms Unsupported Undocumented Can Go Away At Any Time Feature",
    "title": "The Hitchhiker's Guide to the Galaxy",
    "author": "Douglas Adams",
    "link": "http://www.amazon.com/Hitchhikers-Guide-Vms-Unsupported-Undocumented-Can-Go-Away-At-Any-Time-Feature/dp/1878956000"
  }
]
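Given that format, handling a reply might be sketched as below: parse the JSON, treat anything unparseable (such as a chatty apology) as no result, and drop entries with an empty title. The function name `parse_books` and the sample replies are illustrative assumptions rather than my real pipeline.

```python
import json


def parse_books(reply: str) -> list:
    """Parse a model reply into book records, returning [] when it is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []  # e.g. "You're welcome! ..." instead of JSON
    if not isinstance(data, list):
        return []
    # Keep only objects with a non-empty title; author-only matches are dropped.
    return [
        item for item in data
        if isinstance(item, dict) and str(item.get("title") or "").strip()
    ]


# Made-up replies for illustration.
good = (
    '[{"match":"SICP","title":"Structure and Interpretation of Computer Programs",'
    '"author":"Abelson and Sussman","link":""},'
    ' {"match":"any Plath","title":"","author":"Sylvia Plath"}]'
)
chatty = "You're welcome! If you have any more questions, feel free to ask."
books = parse_books(good)
```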

Edit: someone asked for the prompt, so here it is:

# System instruction followed by few-shot example exchanges (user comment, expected JSON reply).
prompt = [
    {"role": "system", "content": "Assistant that identifies book titles and authors in the following document and shows the words you match to a book title from. Some titles may be abbreviated, please expand the abbreviated title. If the document talks about an author but doesn't mention a book, leave \"title\" blank. If you know who the author is, provide the author. Don't include the book's subtitle. If the text is asking for a recommendation, without mentioning a book, then return an empty array. Provide your answer in a json array."},
    {"role": "user", "content": 'Wren\'s Explosion https://www.amazon.com/gp/395, and any Plath."'},
    {"role": "assistant", "content": '''[{"match":"'Wren's Explosion","title":"Explosion","author":"P.C. Wren","link":"https://www.amazon.com/gp/395"}, {"match":"any Plath","title":"", "author":"Sylvia Plath"}]'''},
    {"role": "user", "content": "3-days free trial isn't freemium."},
    {"role": "assistant", "content": "[]"},
    {"role": "user", "content": "Miranda Hamilton"},
    {"role": "assistant", "content": '''[{"match":"Miranda Hamilton","title":"Hamilton","author":"Lin-Manuel Miranda, Jeremy McCarter","link":""}]'''},
]
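For context, a prompt list like this would typically be sent with the comment appended as the final user message. The sketch below assumes the `openai` Python package (version 1 or later) and an `OPENAI_API_KEY` in the environment; `build_messages`, `extract_books` and the shortened `few_shot` list are hypothetical names for illustration and not necessarily how my script was structured.

```python
def build_messages(few_shot: list, comment_text: str) -> list:
    """Return the few-shot prompt with the comment appended as the final user turn."""
    return few_shot + [{"role": "user", "content": comment_text}]


def extract_books(few_shot: list, comment_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one comment to the Chat Completions API and return the raw reply text."""
    from openai import OpenAI  # imported lazily; requires the openai package (v1+)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(few_shot, comment_text),
        temperature=0,  # results can still vary between calls
    )
    return response.choices[0].message.content


# Shortened stand-in for the full prompt list, for illustration only.
few_shot = [
    {"role": "system", "content": "Identify book titles and authors. Answer in a json array."},
    {"role": "user", "content": "3-days free trial isn't freemium."},
    {"role": "assistant", "content": "[]"},
]
messages = build_messages(few_shot, "I'd recommend SICP.")
```

Only `build_messages` runs in this snippet; calling `extract_books` makes a billable API request.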

The Data

Because I enjoy working with data and think you might find it interesting to analyse the results, here’s the raw data produced by GPT, sorted by title. Note there’s a match column in there which includes an excerpt of the comment where the book was identified. I also normalised the book titles by lowercasing them, removing ‘the’ if present at the start, and removing any subtitles. This enabled me to query the top books without missing too many items due to inconsistencies in the names that GPT came up with.
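The normalisation described above can be sketched as follows. This is an approximation, not my exact processing code: in particular it assumes that a subtitle is whatever follows the first colon, and `normalise_title` is a name chosen for the example.

```python
import re


def normalise_title(title: str) -> str:
    """Lowercase a title, remove any subtitle, and drop a leading 'the'."""
    t = title.strip().lower()
    t = t.split(":", 1)[0]         # assumption: a subtitle follows the first colon
    t = re.sub(r"^the\s+", "", t)  # remove "the" only at the very start
    return t.strip()


examples = {
    "The Pragmatic Programmer: From Journeyman to Master": normalise_title(
        "The Pragmatic Programmer: From Journeyman to Master"
    ),
    "JavaScript: The Good Parts": normalise_title("JavaScript: The Good Parts"),
    "Thinking, Fast and Slow": normalise_title("Thinking, Fast and Slow"),
}
```

The second example shows how JavaScript: The Good Parts collapses to "javascript", which is how distinct books with the same short name ended up merged.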

Here is the input data in zipped CSV format; it expands to a 24 MB file.

Note

I have added Amazon affiliate links to the Amazon URLs in the table above, mainly as a learning exercise.