Extracting HN Book Recommendations with the ChatGPT API, Vol. II

Created: 2023-10-12

This is a follow-on from the original blog post Extracting HN Book Recommendations with the ChatGPT API.

Method used

In the original post I didn’t describe the process used to arrive at the results. An earlier draft discussed the process at length, but it was too long and poorly structured, so I rewrote it more concisely and cut the explanatory material along the way. In brief, here’s the process:

Download all data from the Hacker News (HN) API.
Load the HN data into a DuckDB database.
Query stories matching “book” as an individual word. Exclude stories with a URL, as they are not usually Ask HNs.
Get the comments for those stories.
Use the ChatGPT chat API to extract book titles from the comments.
Load the results into a Pandas dataframe to clean the titles: remove the subtitle (anything after a colon), remove punctuation, lowercase, and remove a leading “the”.
Load the book data into DuckDB and use SQL to determine the top book titles.
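The filtering and title-cleaning steps above can be sketched in plain Python. This is a rough illustration, not the code I actually ran: the function names are made up for the example, the regex assumes that “book” must stand alone (so “books” does not match), and a Counter stands in for the DuckDB SQL query that does the final ranking.

```python
import re
import string
from collections import Counter

# Assumption: "book" must appear as a standalone word, so "books" does not match.
BOOK_WORD = re.compile(r"\bbook\b", re.IGNORECASE)


def is_candidate_story(title, url):
    """Keep stories whose title contains the word "book" and that have no URL."""
    return bool(BOOK_WORD.search(title)) and not url


def normalize_title(title):
    """Drop the subtitle, punctuation, case, and a leading 'the'."""
    title = title.split(":", 1)[0]  # remove the subtitle after the first colon
    title = title.translate(str.maketrans("", "", string.punctuation))
    title = " ".join(title.lower().split())  # lowercase and collapse whitespace
    if title.startswith("the "):
        title = title[len("the "):]
    return title


def top_titles(raw_titles, n=10):
    """Count normalized titles and return the n most common (stand-in for the SQL step)."""
    return Counter(normalize_title(t) for t in raw_titles).most_common(n)
```

With a Pandas dataframe, normalize_title could be applied to a title column with Series.map before loading the result into DuckDB.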

OpenAI ChatGPT API timeouts

When using the ChatGPT API, you definitely want to handle timeouts and “not ready” errors. If you call the API many times, there is a good chance that at least one call will time out. (For my sentiment job below, there were 2 timeouts in 220 API calls.) You should also set request_timeout, because the default seems to be 600 seconds, which means your program will wait around for 10 minutes if OpenAI decides not to respond to your request. Here’s the code I used to deal with timeouts and to do retries:

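import sys
import time

import openai  # written against the pre-1.0 openai package (openai.ChatCompletion)


class Sentiment:

    def get_completion(self, messages, retry_time=15):
        model = "gpt-3.5-turbo"
        temp = 0
        try:
            # request_timeout caps how long a single call may wait, in seconds.
            completion = openai.ChatCompletion.create(model=model, messages=messages, temperature=temp, request_timeout=120)
            return completion
        except Exception as e:
            # Use the error's retry_after hint if it has one; otherwise double the wait, capped at 60 seconds.
            retry_time = e.retry_after if hasattr(e, 'retry_after') else retry_time * 2
            if retry_time > 60:
                retry_time = 60
            print(f"Exception {str(e)}. Retrying in {retry_time} seconds...")
            sys.stdout.flush()
            time.sleep(retry_time)
            return self.get_completion(messages, retry_time)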
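The method above retries forever and recurses on every failure. If you would rather give up after a few attempts, here is a minimal iterative sketch of the same backoff idea. The helper name call_with_retries and its parameters are hypothetical, and the sleep argument is only there so the waiting can be swapped out when testing.

```python
import time


def call_with_retries(func, max_attempts=5, initial_wait=15, max_wait=60, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait doubles after each failure and is capped at max_wait,
    unless the exception carries a retry_after attribute.
    """
    wait = initial_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as e:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            wait = min(getattr(e, "retry_after", wait * 2), max_wait)
            print(f"Attempt {attempt} failed: {e}. Retrying in {wait} seconds...", flush=True)
            sleep(wait)
```

Any zero-argument callable works here, so the API call can be wrapped in a lambda and passed in as func.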

Sentiment

There was some discussion of the presence of Ayn Rand’s books in the top 50. Look, there are people who like them, and I think they appeal to a significant (enthusiastic?) minority of the HN crowd, possibly for the following reasons:

People with an engineering mindset identify with the idea of building something great themselves.
The American belief in individualism.
When you’re young, new ideas, books, movies, music, etc. can be really mind-blowing, and that generates enthusiasm. It might only be later, when your education has become more rounded in other areas, that you reconsider the message. Here’s a relevant XKCD

(Edit 2023-11-12) I found that this comment on HN is a good illustration of the common enthusiasm for Rand’s writing.

There are always people who are enthusiastic about niche things. It’s not necessarily controversial that a community contains enthusiasts for something as relatively harmless as those books.

Someone mentioned that the list doesn’t account for people saying BAD things about a book: a negative mention still counts towards the book’s ranking. I thought it would be worth seeing what the numbers are, so I used the ChatGPT API to identify the sentiment of comments that recommend “Atlas Shrugged”, just to see how many mentions of it are negative. Note that we don’t need a model as powerful as ChatGPT 3.5 to determine sentiment. Any machine learning course will take you through using classification to determine sentiment, which will run fine on any laptop. However, throwing together a few calls to ChatGPT is really convenient (much more so than digging through my old Jupyter notebooks). It cost 6c to run the 220 API calls (plus 2 retries).

Here’s the prompt I used, where {book} is the book’s title and comment is the text of the individual comment from Hacker News:

prompt = [
  {"role": "system", "content": f"Identify the sentiment of the following comment about the book {book}. Provide your answer as a number, where positive=1, neutral=0 and negative=-1. Answer in csv format, 1 column, no header"},
  {"role": "user", "content": comment},
]

So for “Atlas Shrugged”, the sentiment counts turn out to be:

Sentiment	Count
-1	32
0	74
1	114

So 14.5% of the comments (32 of 220) were classed as negative by ChatGPT 3.5. The other 188 comments were positive or neutral. If we took that number, Atlas Shrugged would still be the 13th most-recommended book in my data.
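For anyone who wants to check the arithmetic, here is a quick sketch using the counts from the table above. The share helper is just for illustration.

```python
# Sentiment -> count, copied from the table above.
counts = {-1: 32, 0: 74, 1: 114}
total = sum(counts.values())


def share(sentiment):
    """Percentage of comments with the given sentiment, rounded to one decimal."""
    return round(100 * counts[sentiment] / total, 1)


print(total)      # 220
print(share(-1))  # 14.5
```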

Here are the comments where the API didn’t return a valid number for sentiment. It looks like ChatGPT returns multiple sentiment numbers if the comment contains a list. I used 0 for sentiment if the API didn’t return a valid number:

comment	string returned by API
1752756	‘1) 1\n2) 0’
1971432	‘0,1,-1’
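The fallback to 0 can be sketched as a small parsing function. This is an illustration rather than my exact code: the name parse_sentiment is hypothetical, and it assumes the reply text has already been pulled out of the API response as a string.

```python
def parse_sentiment(reply, default=0):
    """Return -1, 0 or 1 if the reply is exactly one of those numbers, else default."""
    text = reply.strip()
    if text in {"-1", "0", "1"}:
        return int(text)
    return default


# The two invalid replies from the table above both fall back to 0.
print(parse_sentiment("1) 1\n2) 0"))  # 0
print(parse_sentiment("0,1,-1"))      # 0
```

Being strict about what counts as valid means a multi-number reply is never silently read as its first value.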

Top 51-100

Here are the books ranked 51 to 100 in my data. Again, I have thrown in some Amazon affiliate links to the books. Maybe I can earn a few measly dollars towards domain name subscriptions and cloud GPU usage.

#	Title	Author	Count	First Mention
51	Peopleware	Tim Lister, Tom DeMarco	93	5613
52	The Four Steps to the Epiphany	Steve Blank	93	65155
53	Getting Things Done	David Allen	92	5626
54	Rework	37 Signals	89	280887
55	The Goal	Eli Goldratt	88	142947
56	Harry Potter and the Methods of Rationality	Eliezer Yudkowsky	86	1753069
57	The 4-Hour Workweek	Tim Ferriss	84	56667
58	Anathem	Neal Stephenson	84	304034
59	The Martian	Andy Weir	84	7968396
60	The Algorithm Design Manual	Skiena	82	55485
61	Ready Player One	Ernest Cline	82	3389883
62	The Art of War	Sun Tzu	80	5574
63	The Count of Monte Cristo	Alexandre Dumas	79	875752
64	Brave New World	Aldous Huxley	78	86041
65	Blindsight	Peter Watts	77	1226847
66	JavaScript: The Good Parts	Douglas Crockford	75	259986
67	The Hard Thing About Hard Things	Ben Horowitz	75	8106953
68	The Prince	Niccolo Machiavelli	74	56777
69	Calculus	Michael Spivak	74	299751
70	Tao Te Ching	Lao Tzu	73	134993
71	The Brothers Karamazov	Fyodor Dostoevsky	72	875839
72	The Master and Margarita	Mikhail Bulgakov	72	876293
73	Why We Sleep	Matthew Walker	72	17230875
74	The Power of Habit	Charles Duhigg	71	4513380
75	Rich Dad Poor Dad	Robert Kiyosaki	70	143001
76	The Art of Electronics	Paul Horowitz	70	149605
77	The Alchemist	Paulo Coelho	69	85957
78	The E-Myth Revisited	Michael E. Gerber	68	9660
79	Permutation City	Greg Egan	68	140585
80	Don’t Make Me Think	Steve Krug	68	295874
81	Principles of Mathematical Analysis	Rudin	68	772012
82	Debt: The First 5,000 Years	David Graeber	68	3395524
83	The 7 Habits of Highly Effective People	Stephen Covey	66	5617
84	Programming Pearls	Jon Bentley	66	176964
85	The Elements of Style	Strunk & White	65	56671
86	Catch-22	Joseph Heller	65	56709
87	Hackers: Heroes of the Computer Revolution	Steven Levy	65	86106
88	Hyperion	Dan Simmons	65	86442
89	Infinite Jest	David Foster Wallace	63	85974
90	The Intelligent Investor	Ben Graham	63	138044
91	Coders at Work	Peter Seibel	63	1026346
92	Refactoring	Kent Beck, Martin Fowler	63	1945860
93	Compilers: Principles, Techniques, and Tools	Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman	62	135280
94	Concrete Mathematics	Donald E. Knuth	62	260190
95	Moby Dick	Herman Melville	62	4966535
96	How to Read a Book	Mortimer Adler	61	240688
97	Principles	Ray Dalio	61	6414260
98	Nonviolent Communication	Marshall B. Rosenberg	61	9793418
99	How to Design Programs	Felleisen et al	60	1015973
100	Crime and Punishment	Fyodor Dostoevsky	59	86022