Microsoft released Phi-3-mini, a 3.8-billion-parameter model, today without much fanfare. What is really surprising is that, according to the paper, this smaller model outperforms the recently released Llama 3 8B. It has less than half the parameters, and with quantisation and other neat tricks it could potentially run on your handheld device while still outperforming Llama 3 8B. It seems Llama 3’s run as the best small model didn’t last long.
So I decided to test this myself: I asked both Llama 3 and Phi-3-mini three different questions to get a qualitative sense of whether it is really better than Llama 3, or whether Phi-3 has simply been overfitted to perform well on the leaderboards.
I’m using Phi-3-mini Instruct with a 4K context length and Llama 3 8B Instruct.
Here is the link to the Colab notebook if you want to play around with it.
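For reference, here is a minimal sketch of how you might load and query both instruct models with Hugging Face Transformers in a Colab-style environment. The model IDs are the public Hugging Face checkpoints, but the dtype and generation settings are assumptions, and the linked notebook may do things differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face checkpoints; the notebook may use different ones or add quantisation.
MODELS = {
    "phi-3-mini": "microsoft/Phi-3-mini-4k-instruct",
    "llama-3-8b": "meta-llama/Meta-Llama-3-8B-Instruct",
}

def ask(model_id: str, question: str, max_new_tokens: int = 256) -> str:
    """Load an instruct model and answer a single user question via its chat template."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```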
The Needle in the Haystack
I presented a prompt of roughly 2,000 tokens in which I inserted the sentence "Paneer Kadhai is the dish of the day" at a random location as the needle. At the end of the prompt, I asked “What is the dish of the day?” to test each model’s ability to recall information from a long context. As anticipated, Llama 3 8B gets the correct answer, while Phi-3-mini fails: it returns some text from the prompt but never the answer. So this round goes to Llama 3.
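If you want to reproduce this kind of test, here is a small sketch of how a needle prompt can be assembled. The needle sentence and question mirror the experiment above; the filler text is a placeholder, and the exact prompt I used lives in the notebook.

```python
import random

def build_needle_prompt(filler_sentences: list[str],
                        needle: str = "Paneer Kadhai is the dish of the day.",
                        question: str = "What is the dish of the day?") -> str:
    """Hide a single 'needle' sentence at a random position inside filler text,
    then append the question the model must answer from that context."""
    sentences = filler_sentences.copy()
    sentences.insert(random.randrange(len(sentences) + 1), needle)
    return " ".join(sentences) + f"\n\nQuestion: {question}"

# Roughly 2,000 tokens of filler, e.g. unrelated sentences or an article excerpt.
filler = ["The weather report for the region was uneventful."] * 300
prompt = build_needle_prompt(filler)
```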
Pattern Recognition
Now, this is a challenging problem. The models have to complete this sequence – 1,8,9,64,25,216,_.
As you can observe, numbers in odd positions follow x², while numbers in even positions follow x³: the sequence is 1², 2³, 3², 4³, 5², 6³. Knowing this, the answer is fairly obvious: the missing seventh term is 7² = 49.
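As a quick sanity check, the intended sequence can be generated in a couple of lines:

```python
# Odd positions (1-indexed) are squares, even positions are cubes.
sequence = [n**2 if n % 2 == 1 else n**3 for n in range(1, 8)]
print(sequence)  # [1, 8, 9, 64, 25, 216, 49]
```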
Here Llama 3 was not able to find the answer: it correctly inferred that the sequence has something to do with squaring the numbers, but it was unable to arrive at the correct response. Llama 3's response:
Fill in the blank letter '_' in the pattern 1,8,9,64,25,216,_.
The pattern is obtained by squaring the numbers. So, the next number in the pattern is 1,8,9,64,25,216, 441.
Final Answer: The final answer is 441. I hope it is correct.
Phi-3-mini performed on par with Llama 3 on this task, which I thought was impressive given its small size. In its response, it reasoned that the sequence might involve cubing the numbers, though it too failed to arrive at the correct answer. Phi-3-mini's response:
#### Response
The pattern appears to be based on the sequence of cubes of natural numbers, but with alternating signs and skipping one number each time. The sequence is:
1^3 = 1
(-2)^3 = -8
3^3 = 27 (but given as 9, so it seems there's a pattern of subtracting 18)
(-4)^3 = -64
5^3 = 125 (but given as 64, so it seems there's a pattern of subtracting 61)
7^3 = 343 (but given as 25, so it seems there's a pattern of subtracting 318)
Following this pattern, the next number should be the cube of 8, but with the same subtraction pattern. Since the last operation was subtracting 318, the next operation should be subtracting 336 (318 + 18).
8^3 = 512
512 - 336 = 176
So the missing number is 17
This round is a tie.
SQL Coding
Finally, I asked both models to provide a coding solution to this LeetCode problem. There are many ways to solve it, but I was interested in seeing how each LLM would approach it.
Neither Llama 3 nor Phi-3 answered the problem correctly, but Llama 3 at least gave me SQL to test, which is better than returning no code at all.
Phi-3 took an interesting approach. Rather than providing a SQL solution, it generated more examples (erroneous ones), and due to the max-token limit its answer was truncated:
```
Example 2:
Input:
Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | null      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
| 107 | Tom   | B          | 106       |
| 108 | Bob   | B          | 106       |
| 109 | Tim   | B          | 106       |
| 110 | Sam   | B          | 106       |
+-----+-------+------------+-----------+
Output:
+------+
| name |
+------+
| John |
| Ron  |
+------+

Example 3:
Input:
Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | null      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
| 107 | Tom   | B          | 106       |
| 108 | Bob   | B          | 106       |
| 109 | Tim   | B          | 106       |
| 110 | Sam   | B          | 106       |
| 111 | Mike  | B          | 107       |
| 112 | Jack  | B          | 107       |
| 113 | Jill  | B          | 107       |
+-----+-------+------------+-----------+
Output:
+------+
| name |
+------+
| John |
| Ron  |
| Tom  |
+------+

Example 4:
Input:
Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | null      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
| 107 | Tom   | B          | 106       |
| 108 | Bob   | B          | 106       |
| 109 | Tim   | B          | 106       |
| 110 | Sam   | B          | 106       |
| 111 | Mike  | B          | 107       |
| 112 | Jack  | B          | 107       |
| 113 | Jill  | B          | 107       |
| 114 | Mark  | B          | 108       |
| 115 | John  | B          | 108       |
+-----+-------+------------+-----------+
Output:
+------+
| name |
+------+
| John |
| Ron  |
| Tom  |
| Bob  |
+------+

Example 5:
Input:
Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+
```
Conclusion
| Question | Llama 3 8B Instruct | Phi-3-mini 4K |
| --- | --- | --- |
| Needle in Haystack | X | |
| Pattern Recognition | X | X |
| SQL Code | X | |

(An X marks the model that came out ahead on a question; both marked means a tie.)
While three questions are not enough to determine the strength of these small but still massive models, they do give some indication, and to be honest, I think Phi-3-mini is overfitted to the benchmarks. It may offer faster inference and fit on your device, but for better results I’ll stick to either Mistral or Llama 3 8B.


