Extracting HN Book Recommendations with the ChatGPT API, Vol. II

This is a follow-on from the original blog post Extracting HN Book Recommendations with the ChatGPT API.

Method used

In the original post I didn’t describe the process used to arrive at the results. An earlier draft discussed it at length, but it was too long and poorly structured, so I rewrote it more concisely and cut the explanatory material along the way. In brief, here’s the process:

  • Download all data from the Hacker News (HN) API.
  • Load the HN data into a DuckDB database.
  • Query stories matching “book” as an individual word. Exclude stories with a URL, as they are not usually Ask HNs.
  • Get the comments for those stories.
  • Use the ChatGPT chat API to extract book titles from the comments.
  • Load the results into a Pandas dataframe and clean the titles: remove the subtitle (anything after a colon), remove punctuation, lowercase, and remove a leading “the”.
  • Load the book data into DuckDB and use SQL to determine the top book titles.
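
The story filter and the title clean-up are easy to picture in code. Below is a minimal sketch in plain Python rather than the DuckDB and Pandas used for the real run. The function names (is_candidate_story, normalize_title, top_titles) are hypothetical, and the exact rules (for example, cutting at the first colon) are assumptions based on the list above, not the original code. The story dictionaries use the "title" and "url" fields that HN API items carry.

```python
import re
from collections import Counter

# Hypothetical helpers: names and exact rules are assumptions, not the original code.


def is_candidate_story(story):
    """True for a story without a URL whose title contains 'book' as a whole word."""
    title = story.get("title") or ""
    has_url = bool(story.get("url"))
    return not has_url and re.search(r"\bbook\b", title, re.IGNORECASE) is not None


def normalize_title(title):
    """Clean a book title so that different spellings of it group together."""
    title = title.split(":", 1)[0]          # drop the subtitle after the first colon
    title = re.sub(r"[^\w\s]", "", title)   # remove punctuation
    title = re.sub(r"\s+", " ", title).strip().lower()
    if title.startswith("the "):            # remove a leading "the"
        title = title[len("the "):]
    return title


def top_titles(raw_titles, n=10):
    """Count normalized titles and return the n most common as (title, count) pairs."""
    counts = Counter(normalize_title(t) for t in raw_titles)
    return counts.most_common(n)
```

With these rules, “The Pragmatic Programmer”, “Pragmatic Programmer” and “the pragmatic programmer: From Journeyman to Master” all collapse to the same key. Titles that differ only by a hyphen versus a space would still be counted separately, so the clean-up is a heuristic rather than true deduplication.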

OpenAI ChatGPT API timeouts

When using the ChatGPT API, you definitely want to handle timeouts and “not ready” errors. If you call the API many times, there is a good chance that at least one call will time out. (For my sentiment job below, there were 2 timeouts in 220 API calls.) You should also set request_timeout, because the default seems to be 600 seconds, which means your program will wait around for 10 minutes if OpenAI decides not to respond to your request. Here’s the code I used to deal with timeouts and retries:

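import sys
import time

import openai  # legacy (pre-1.0) SDK, which provides openai.ChatCompletion


class Sentiment:

    def get_completion(self, messages, retry_time=15):
        model = "gpt-3.5-turbo"
        temp = 0
        try:
            # request_timeout is in seconds; it stops a stalled request from blocking for minutes
            completion = openai.ChatCompletion.create(model=model, messages=messages, temperature=temp, request_timeout=120)
            return completion
        except Exception as e:
            # If the exception exposes a retry_after value, use it; otherwise double the wait
            retry_time = e.retry_after if hasattr(e, 'retry_after') else retry_time * 2
            # Never wait more than 60 seconds between attempts
            if retry_time > 60:
                retry_time = 60
            print(f"Exception {str(e)}. Retrying in {retry_time} seconds...")
            sys.stdout.flush()
            time.sleep(retry_time)
            # Try again; note there is no limit on the number of attempts
            return self.get_completion(messages, retry_time)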
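
The method above keeps retrying until a call succeeds. If you would rather give up after a fixed number of attempts, a loop-based variant is one option. This is a sketch, not the code used for this post: call_with_retries and its parameters are hypothetical names, and the sleep function is passed in only so the helper can be tried out without actually waiting.

```python
import time


def call_with_retries(call, max_attempts=5, first_delay=15, max_delay=60, sleep=time.sleep):
    """Run call() until it succeeds or max_attempts is reached.

    The delay doubles after each failure and is capped at max_delay seconds.
    The last exception is re-raised once the attempts are used up.
    """
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            print(f"Attempt {attempt} failed: {exc}. Retrying in {delay} seconds...", flush=True)
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

To use it with the API call from above, wrap the request in a lambda, for example call_with_retries(lambda: openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0, request_timeout=120)). Unlike the method above, this variant waits 15 seconds before the first retry and does not look for a retry_after value on the exception.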

Sentiment

There was some discussion of the presence of Ayn Rand’s books in the top 50. Look, there are people who like them, and I think they appeal to a significant (enthusiastic?) minority of the HN crowd, possibly for the following reasons:

  • people of an engineering mindset identify with the idea of building something great themselves.
  • the American belief in individualism
  • when you’re young, new ideas, books, movies, music, etc. can be really mind-blowing, and that generates enthusiasm. It might only be later, once your education has become more rounded in other areas, that you reconsider the message. Here’s a relevant XKCD

(Edit 2023-11-12) I found that this comment on HN is a good illustration of the common enthusiasm for Rand’s writing.

There are always people who are enthusiastic about niche things. It’s not necessarily controversial that a community contains enthusiasts for something as relatively harmless as those books.

Someone mentioned that the list doesn’t account for people saying BAD things about a book: a negative mention still counts towards the book’s ranking. I thought it might be an idea to see what the numbers are, so I used the ChatGPT API to identify the sentiment of comments that recommend “Atlas Shrugged”, just to see how many mentions of it are negative. Note that we don’t need a model as powerful as ChatGPT 3.5 to determine sentiment. Any machine learning course will take you through using classification to determine sentiment, and such a classifier will run fine on any laptop. However, throwing together a few calls to ChatGPT is really convenient (much more so than digging through my old Jupyter notebooks). It cost 6c to run the 220 API calls (plus 2 retries).

Here’s the prompt I used, where {book} is the book’s title and comment is the text of an individual comment from Hacker News:

prompt = [
  {"role": "system", "content": f"Identify the sentiment of the following comment about the book {book}. Provide your answer as a number, where positive=1, neutral=0 and negative=-1. Answer in csv format, 1 column, no header"},
  {"role": "user", "content": comment},
]

So for “Atlas Shrugged”, the sentiment counts turn out to be:

Sentiment | Count
-1 | 32
0 | 74
1 | 114

So 14.5% of the comments (32 of 220) were classed as negative by ChatGPT 3.5, and the other 188 were positive or neutral. If we took that number, Atlas Shrugged would still be the 13th most-recommended book in my data.
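
For anyone who wants to check the percentage, here is the arithmetic as a small sketch. The counts are copied from the sentiment table above; the variable and function names are just for illustration.

```python
# Counts copied from the sentiment table above.
counts = {-1: 32, 0: 74, 1: 114}

# Total number of classified comments (one per API call).
total = sum(counts.values())


def share(label):
    """Percentage of comments with the given sentiment label, to one decimal place."""
    return round(100 * counts[label] / total, 1)


for label in sorted(counts):
    print(f"sentiment {label:>2}: {counts[label]:>3} comments, {share(label)}%")
```

The negative share comes out at 32 / 220, which rounds to 14.5%.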

Here are the comments where the API didn’t return a valid number for sentiment. It looks like ChatGPT returns multiple sentiment numbers if the comment contains a list. I used 0 for sentiment if the API didn’t return a valid number:

Comment ID | String returned by API
1752756 | ‘1) 1\n2) 0’
1971432 | ‘0,1,-1’
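
The fallback to 0 can be expressed as a small parsing step. This is a sketch under assumptions: parse_sentiment is a hypothetical name, and it treats anything other than a lone -1, 0 or 1 as invalid, which matches the two replies shown above but may not be exactly how the original job decided.

```python
def parse_sentiment(reply):
    """Return -1, 0 or 1 from the model's reply text.

    Any reply that is not exactly one of those numbers (for example a list
    of several values) falls back to 0, i.e. neutral.
    """
    try:
        value = int(reply.strip())
    except ValueError:
        return 0
    return value if value in (-1, 0, 1) else 0
```

With the legacy SDK used earlier, the reply text would typically be read from completion.choices[0].message.content before being passed to this function.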

Top 51-100

Here are the books ranked 51 to 100 in my data. Again, I have thrown in some Amazon affiliate links to the books. Maybe I can earn a few measly dollars towards domain name subscriptions and cloud GPU usage.

# | Title | Author | Count | First Mention
51 | Peopleware | Tim Lister, Tom DeMarco | 93 | 5613
52 | The Four Steps to the Epiphany | Steve Blank | 93 | 65155
53 | Getting Things Done | David Allen | 92 | 5626
54 | Rework | 37signals | 89 | 280887
55 | The Goal | Eli Goldratt | 88 | 142947
56 | Harry Potter and the Methods of Rationality | Eliezer Yudkowsky | 86 | 1753069
57 | The 4-Hour Workweek | Tim Ferriss | 84 | 56667
58 | Anathem | Neal Stephenson | 84 | 304034
59 | The Martian | Andy Weir | 84 | 7968396
60 | The Algorithm Design Manual | Skiena | 82 | 55485
61 | Ready Player One | Ernest Cline | 82 | 3389883
62 | The Art of War | Sun Tzu | 80 | 5574
63 | The Count of Monte Cristo | Alexandre Dumas | 79 | 875752
64 | Brave New World | Aldous Huxley | 78 | 86041
65 | Blindsight | Peter Watts | 77 | 1226847
66 | JavaScript: The Good Parts | Douglas Crockford | 75 | 259986
67 | The Hard Thing About Hard Things | Ben Horowitz | 75 | 8106953
68 | The Prince | Niccolo Machiavelli | 74 | 56777
69 | Calculus | Michael Spivak | 74 | 299751
70 | Tao Te Ching | Lao Tzu | 73 | 134993
71 | The Brothers Karamazov | Fyodor Dostoevsky | 72 | 875839
72 | The Master and Margarita | Mikhail Bulgakov | 72 | 876293
73 | Why We Sleep | Matthew Walker | 72 | 17230875
74 | The Power of Habit | Charles Duhigg | 71 | 4513380
75 | Rich Dad Poor Dad | Robert Kiyosaki | 70 | 143001
76 | The Art of Electronics | Paul Horowitz | 70 | 149605
77 | The Alchemist | Paulo Coelho | 69 | 85957
78 | The E-Myth Revisited | Michael E. Gerber | 68 | 9660
79 | Permutation City | Greg Egan | 68 | 140585
80 | Don’t Make Me Think | Steve Krug | 68 | 295874
81 | Principles of Mathematical Analysis | Rudin | 68 | 772012
82 | Debt: The First 5,000 Years | David Graeber | 68 | 3395524
83 | The 7 Habits of Highly Effective People | Stephen Covey | 66 | 5617
84 | Programming Pearls | Jon Bentley | 66 | 176964
85 | The Elements of Style | Strunk & White | 65 | 56671
86 | Catch-22 | Joseph Heller | 65 | 56709
87 | Hackers: Heroes of the Computer Revolution | Steven Levy | 65 | 86106
88 | Hyperion | Dan Simmons | 65 | 86442
89 | Infinite Jest | David Foster Wallace | 63 | 85974
90 | The Intelligent Investor | Ben Graham | 63 | 138044
91 | Coders at Work | Peter Seibel | 63 | 1026346
92 | Refactoring | Kent Beck, Martin Fowler | 63 | 1945860
93 | Compilers: Principles, Techniques, and Tools | Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman | 62 | 135280
94 | Concrete Mathematics | Donald E. Knuth | 62 | 260190
95 | Moby Dick | Herman Melville | 62 | 4966535
96 | How to Read a Book | Mortimer Adler | 61 | 240688
97 | Principles | Ray Dalio | 61 | 6414260
98 | Nonviolent Communication | Marshall B. Rosenberg | 61 | 9793418
99 | How to Design Programs | Felleisen et al | 60 | 1015973
100 | Crime and Punishment | Fyodor Dostoevsky | 59 | 86022