Extracting HN Book Recommendations with the ChatGPT API, Vol. II
This is a follow-on from the original blog post, Extracting HN Book Recommendations with the ChatGPT API.
Method used
In the original post I didn’t describe the process used to arrive at the results. I originally had a much longer draft with a lot of discussion of the process, but it was too long and poorly structured, so I rewrote it in a more concise form and dropped the explanatory material along the way. In brief, here’s the process:
- Download all data from the Hacker News (HN) API
- Load the HN data into a DuckDB database
- Query stories matching “book” as an individual word. Exclude stories with a URL, as they are not usually Ask HNs.
- Get the comments for those stories
- Use the ChatGPT chat API to extract book titles from the comments
- Load the results into a Pandas dataframe and clean the titles: remove the subtitle (everything after a colon), remove punctuation, lowercase, and remove a leading ‘the’.
- Load the book data into DuckDB, and use SQL to determine the top book titles.
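The filtering, cleaning and counting steps in the list can be sketched roughly as follows. This is not the code used for the post: it swaps Pandas and DuckDB for the Python standard library so it runs anywhere, it assumes the “book” match is made against the story title, and the names `is_book_story`, `normalize_title` and `top_titles` are hypothetical.

```python
import re
from collections import Counter

# Matches "book" only as a standalone word (so not "Facebook" or "bookmark").
BOOK_WORD = re.compile(r"\bbook\b", re.IGNORECASE)


def is_book_story(title, url):
    """Keep stories that mention 'book' as a word and have no URL."""
    return not url and bool(BOOK_WORD.search(title or ""))


def normalize_title(raw):
    """Apply the cleaning steps: subtitle, case, punctuation, leading 'the'."""
    title = raw.split(":", 1)[0]                # drop the subtitle after the first colon
    title = title.lower()
    title = re.sub(r"[^\w\s]", "", title)       # strip punctuation
    title = re.sub(r"\s+", " ", title).strip()  # collapse repeated whitespace
    title = re.sub(r"^the\s+", "", title)       # remove a leading "the"
    return title


def top_titles(raw_titles, n=10):
    """Count cleaned titles and return the n most common as (title, count) pairs."""
    counts = Counter(normalize_title(t) for t in raw_titles)
    counts.pop("", None)                        # ignore titles that cleaned down to nothing
    return counts.most_common(n)
```

Cleaning before counting is what lets “The Goal”, “the goal.” and “The Goal: A Process of Ongoing Improvement” all count towards the same title.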
OpenAI ChatGPT API timeouts
When using the ChatGPT API, you definitely want to handle timeout and “not ready” errors. If you call the API many times, there is a good chance that at least one call will time out. (For my sentiment job below, there were 2 timeouts in 220 API calls.) You should also set request_timeout, because the default seems to be 600 seconds, which means your program will wait around for 10 minutes if OpenAI decides not to respond to your request. Here’s the code I used to handle timeouts and retries:
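    import sys
    import time

    import openai  # legacy (pre-1.0) SDK, which provides openai.ChatCompletion


    class Sentiment:
        def get_completion(self, messages, retry_time=15):
            model = "gpt-3.5-turbo"
            temp = 0
            try:
                # request_timeout (in seconds) stops a stalled request from hanging indefinitely.
                completion = openai.ChatCompletion.create(
                    model=model, messages=messages, temperature=temp, request_timeout=120
                )
                return completion
            except Exception as e:
                # Use the exception's retry_after if it has one; otherwise double the wait.
                retry_time = e.retry_after if hasattr(e, "retry_after") else retry_time * 2
                # Never wait more than 60 seconds between attempts.
                if retry_time > 60:
                    retry_time = 60
                print(f"Exception {str(e)}. Retrying in {retry_time} seconds...")
                sys.stdout.flush()
                time.sleep(retry_time)
                # Note: this retries for as long as the call keeps failing.
                return self.get_completion(messages, retry_time)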
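The recursive version above keeps retrying for as long as the API keeps failing. If you would rather give up after a fixed number of attempts, a bounded variant might look like the sketch below. The helper `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the waiting can be stubbed out in tests.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=15, max_delay=60, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    The wait doubles after each failure and is capped at max_delay. If the
    exception carries a retry_after attribute, that value is used instead.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            wait = min(getattr(exc, "retry_after", None) or delay, max_delay)
            print(f"Attempt {attempt} failed: {exc}. Retrying in {wait} seconds...", flush=True)
            sleep(wait)
            delay = min(delay * 2, max_delay)
```

With the legacy SDK call from the post, usage would be something like `call_with_retries(lambda: openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0, request_timeout=120))`.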
Sentiment
There was some discussion of the presence of Ayn Rand’s books in the top 50. Look, there are people who like them, and I think they appeal to a significant (enthusiastic?) minority of the HN crowd, possibly for the following reasons:
- people of an engineering mindset identify with the idea of building something great themselves.
- the American belief in individualism
- when you’re young, new ideas, books, movies, music, etc. can be really mind-blowing, and that generates enthusiasm. It might only be later, when your education becomes more rounded in other areas, that you reconsider the message. Here’s a relevant XKCD.
(Edit 2023-11-12) I found that this comment on HN is a good illustration of the common enthusiasm for Rand’s writing.
There are always people who are enthusiastic about niche things. It’s not necessarily controversial that a community contains enthusiasts for something as relatively harmless as those books.
Someone mentioned that the list doesn’t account for people saying BAD things about a book: a negative mention still counts towards its ranking. I thought it would be worth seeing what the numbers are, so I used the ChatGPT API to identify the sentiment of the comments that recommend “Atlas Shrugged”, just to see how many of the mentions are negative. Note that we don’t need a model as powerful as ChatGPT to determine sentiment. Any machine learning course will take you through using classification to determine sentiment, and that will run fine on any laptop. However, throwing together a few calls to ChatGPT is really convenient (much more so than digging through my old Jupyter notebooks). It cost 6c to run the 220 API calls (plus 2 retries).
Here’s the prompt I used, where book is the book’s title and comment is the text of the individual comment from Hacker News:
    prompt = [
        {"role": "system", "content": f"Identify the sentiment of the following comment about the book {book}. Provide your answer as a number, where positive=1, neutral=0 and negative=-1. Answer in csv format, 1 column, no header"},
        {"role": "user", "content": comment},
    ]
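To use that prompt for every comment, it helps to wrap it in a function and to have a second one that pulls the reply text out of the response. The sketch below assumes the usual chat completion response shape (`choices[0].message.content`), which the legacy SDK also exposes through dict-style access. `build_messages` and `reply_text` are hypothetical names.

```python
def build_messages(book, comment):
    """Build the chat messages for one comment, using the prompt shown above."""
    system = (
        f"Identify the sentiment of the following comment about the book {book}. "
        "Provide your answer as a number, where positive=1, neutral=0 and negative=-1. "
        "Answer in csv format, 1 column, no header"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": comment},
    ]


def reply_text(completion):
    """Return the assistant's reply from a chat completion, without surrounding whitespace."""
    return completion["choices"][0]["message"]["content"].strip()
```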
So for “Atlas Shrugged”, the sentiment counts turn out to be:
Sentiment | Count |
---|---|
-1 | 32 |
0 | 74 |
1 | 114 |
So 14.5% of the comments (32 of 220) were classed as negative by ChatGPT 3.5. The other 188 comments (74 neutral plus 114 positive) were positive or neutral. If we took that number, Atlas Shrugged would still be the 13th most-recommended book in my data.
Here are the comments where the API didn’t return a valid number for sentiment. It looks like ChatGPT returns multiple sentiment numbers if the comment contains a list. I used 0 for the sentiment whenever the API didn’t return a valid number:
Comment ID | String returned by API |
---|---|
1752756 | ‘1) 1\n2) 0’ |
1971432 | ‘0,1,-1’ |
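A fallback like that can be sketched as below. The post only says that 0 was used when the reply wasn’t a valid number, so the strict matching here (only “-1”, “0” and “1” are accepted) is an assumption, and `parse_sentiment` is a hypothetical name.

```python
from collections import Counter

# The only replies treated as valid; anything else falls back to neutral.
VALID = {"-1": -1, "0": 0, "1": 1}


def parse_sentiment(reply):
    """Return -1, 0 or 1 for a well-formed reply, and 0 (neutral) for anything else."""
    return VALID.get(reply.strip(), 0)


# The last two strings are the malformed replies from the table above.
replies = ["1", " -1\n", "0", "1) 1\n2) 0", "0,1,-1"]
tally = Counter(parse_sentiment(r) for r in replies)
```

Counting malformed replies as neutral keeps every comment in the tally, at the cost of slightly inflating the neutral count.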
Top 51-100
Here are the books ranked 51 to 100 in my data. Again, I have thrown in some Amazon affiliate links to the books. Maybe I can earn a few measly dollars towards domain name subscriptions and cloud GPU usage.
# | Title | Author | Count | First Mention |
---|---|---|---|---|
51 | Peopleware | Tim Lister, Tom DeMarco | 93 | 5613 |
52 | The Four Steps to the Epiphany | Steve Blank | 93 | 65155 |
53 | Getting Things Done | David Allen | 92 | 5626 |
54 | Rework | 37 Signals | 89 | 280887 |
55 | The Goal | Eli Goldratt | 88 | 142947 |
56 | Harry Potter and the Methods of Rationality | Eliezer Yudkowsky | 86 | 1753069 |
57 | The 4-Hour Workweek | Tim Ferriss | 84 | 56667 |
58 | Anathem | Neal Stephenson | 84 | 304034 |
59 | The Martian | Andy Weir | 84 | 7968396 |
60 | The Algorithm Design Manual | Steven Skiena | 82 | 55485 |
61 | Ready Player One | Ernest Cline | 82 | 3389883 |
62 | The Art of War | Sun Tzu | 80 | 5574 |
63 | The Count of Monte Cristo | Alexandre Dumas | 79 | 875752 |
64 | Brave New World | Aldous Huxley | 78 | 86041 |
65 | Blindsight | Peter Watts | 77 | 1226847 |
66 | JavaScript: The Good Parts | Douglas Crockford | 75 | 259986 |
67 | The Hard Thing About Hard Things | Ben Horowitz | 75 | 8106953 |
68 | The Prince | Niccolo Machiavelli | 74 | 56777 |
69 | Calculus | Michael Spivak | 74 | 299751 |
70 | Tao Te Ching | Lao Tzu | 73 | 134993 |
71 | The Brothers Karamazov | Fyodor Dostoevsky | 72 | 875839 |
72 | The Master and Margarita | Mikhail Bulgakov | 72 | 876293 |
73 | Why We Sleep | Matthew Walker | 72 | 17230875 |
74 | The Power of Habit | Charles Duhigg | 71 | 4513380 |
75 | Rich Dad Poor Dad | Robert Kiyosaki | 70 | 143001 |
76 | The Art of Electronics | Paul Horowitz | 70 | 149605 |
77 | The Alchemist | Paulo Coelho | 69 | 85957 |
78 | The E-Myth Revisited | Michael E. Gerber | 68 | 9660 |
79 | Permutation City | Greg Egan | 68 | 140585 |
80 | Don’t Make Me Think | Steve Krug | 68 | 295874 |
81 | Principles of Mathematical Analysis | Walter Rudin | 68 | 772012 |
82 | Debt: The First 5,000 Years | David Graeber | 68 | 3395524 |
83 | The 7 Habits of Highly Effective People | Stephen Covey | 66 | 5617 |
84 | Programming Pearls | Jon Bentley | 66 | 176964 |
85 | The Elements of Style | Strunk & White | 65 | 56671 |
86 | Catch-22 | Joseph Heller | 65 | 56709 |
87 | Hackers: Heroes of the Computer Revolution | Steven Levy | 65 | 86106 |
88 | Hyperion | Dan Simmons | 65 | 86442 |
89 | Infinite Jest | David Foster Wallace | 63 | 85974 |
90 | The Intelligent Investor | Ben Graham | 63 | 138044 |
91 | Coders at Work | Peter Seibel | 63 | 1026346 |
92 | Refactoring | Kent Beck, Martin Fowler | 63 | 1945860 |
93 | Compilers: Principles, Techniques, and Tools | Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman | 62 | 135280 |
94 | Concrete Mathematics | Donald E. Knuth | 62 | 260190 |
95 | Moby Dick | Herman Melville | 62 | 4966535 |
96 | How to Read a Book | Mortimer Adler | 61 | 240688 |
97 | Principles | Ray Dalio | 61 | 6414260 |
98 | Nonviolent Communication | Marshall B. Rosenberg | 61 | 9793418 |
99 | How to Design Programs | Felleisen et al | 60 | 1015973 |
100 | Crime and Punishment | Fyodor Dostoevsky | 59 | 86022 |