[leetcode] top-k-frequent-words (Python3)

Algorithm (PS)

[leetcode] top-k-frequent-words (Python3)

minjiwoo 2024. 6. 2. 11:20

728x90

https://leetcode.com/problems/top-k-frequent-words/

Given an array of strings words and an integer k, return the k most frequent strings.

Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

Example 1:

Input: words = ["i","love","leetcode","i","love","coding"], k = 2
Output: ["i","love"]
Explanation: "i" and "love" are the two most frequent words.
Note that "i" comes before "love" due to a lower alphabetical order.

Example 2:

Input: words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4
Output: ["the","is","sunny","day"]
Explanation: "the", "is", "sunny" and "day" are the four most frequent words, with the number of occurrence being 4, 3, 2 and 1 respectively.

Constraints:

1 <= words.length <= 500
1 <= words[i].length <= 10
words[i] consists of lowercase English letters.
k is in the range [1, The number of unique words[i]]

Follow-up: Could you solve it in O(n log(k)) time and O(n) extra space?

문자열 배열에서 가장 자주 등장하는 K 개를 찾는 문제이다.

첫번째 풀이

Counter() 라는 함수를 사용하여 간단하게 word 개수를 반환한 후 value 큰 순으로 정렬, word 알파벳 순으로 정렬하고 있다.

from collections import Counter

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]: 
        words_cnt = Counter(words)
        answer = [w[0] for w in sorted(words_cnt.items(), key=lambda x:(-x[1], x[0]))]
        
        return answer[:k]

코딩이 짧아진다는 장점이 있으나 문제에서 follow up 으로 요구하는 사항을 만족하지 못하고 있다.

O(N) extra space 를 사용해서 Counter 값을 저장하고, Python 의 sort 알고리즘은 퀵소트로 O(nlogn) 시간복잡도가 소요된다.

그러나 Counter 함수에서 O(N) 시간복잡도가 소요되므로 결론적으로는 O(N) + O(NlogN) 이 걸리게 된다.

두번째 풀이

시간 복잡도 O(nlog(k)) 에서 힌트를 얻을 수 있다.

우선 순위 큐에서 queue 에 element 를 넣고, 빼는데 걸리는 시간이 logN 이다. 그리고 이 queue의 사이즈를 k 개로 제한하면된다.

k 개로 제한 하는 이유는 ?

priority queue 는 heap 이라는 자료구조로 이루어져있다. 가장 낮거나 / 높은 우선순위를 가진 element 가 tree의 root node 자리로 오게 되면서 정렬된다. heappush() 함수는 O(log(n)) 의 시간복잡도를 가진다. heap 이 완전 이진 트리의 형태를 가지기 때문에 heap 자료구조의 가능한 최대 높이는 logN 이다. 따라서 재정렬을 하는 작업이 일어날 때마다 트리의 높이에 시간 복잡도가 비례하게 된다. 따라서 문제 조건에서 O(Nlog(k)) 에 풀라는 조건에서 priority queue의 길이를 k 개로 제한해서 쓰라는 힌트로 해석할 수 있다.

<출처: https://www.geeksforgeeks.org/heap-data-structure/minheapandmaxheap/>

from collections import Counter
import heapq
    
class WordCnt:
    def __init__(self, word:str, cnt:int):
        self.word = word
        self.cnt = cnt
    
    def __lt__(self, other):
        if self.cnt == other.cnt: # cnt 같으면 word 순으로 정렬 
            return self.word > other.word # 알파벳은 역순으로 정렬한다. 사전순으로 뒤에 오는 단어가 더 작다 (heappop 연산에서 작은 것을 먼저 제거할 목적)
        return self.cnt < other.cnt 

class Solution:
    
    def topKFrequent(self, words: List[str], k: int) -> List[str]: 
        words_cnt = Counter(words) # O(N) 의 extra 공간 사용 , word : WordCnt
        pq = [] # priority queue 
        heapq.heapify(pq) # min heap 
        print(words_cnt)
        for word, cnt in words_cnt.items():
            # priority queue 는 우선순위에 따라서 원소를 꺼내는 특성이 있음
            # priority queue의 우선순위를 WordCnt class 에서 정의함. 
            heapq.heappush(pq, WordCnt(word, cnt)) 
            
            if len(pq) > k: # evict least count element , k 개만 가지고 있어야 logk 시간 복잡도로 정렬됨 
                heapq.heappop(pq)
        
        # 그대로 k개를 priority queue에서 heappop 하여 정렬한다. 
        answer = []
        for _ in range(k):
            answer.append(heapq.heappop(pq).word)
        
        return answer[::-1]

heapq.heappush() 함수가 실행될 떄 class의 magic method __lt__ (더 작다는 의미를 정의하는 함수) 에 따라서, 우선순위에 따라 정렬이 일어난다. 아래는 magic method 에 대한 예시이다.

객체 간의 '<' 연산자를 통한 비교를 정의한다. 이후 Person 객체가 sort() 정렬 메서드나 sorted() 정렬 함수에서 정렬이 일어날 때 이 메서드를 자동으로 사용하게 된다. 마찬가지로 heapq 모듈에서 최소 힙 , 최대 힙을 유지하기 위해서도 __lt__ 메서드가 자동으로 사용된다.

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __lt__(self, other):
        return self.age < other.age

# 두 Person 객체 생성
person1 = Person('Alice', 30)
person2 = Person('Bob', 25)

# < 연산자를 사용한 비교
print(person1 < person2)  # False (30 < 25가 False이므로)
print(person1 > person2)  # True (자동으로 __lt__() 반대의 의미)

728x90