187. Repeated DNA Sequences
The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
- For example, "ACGAATTCCG" is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Constraints:
- 1 < = s . l e n g t h < = 1 0 5 1 <= s.length <= 10^5 1<=s.length<=105
- s[i] is either 'A', 'C', 'G', or 'T'.
From: LeetCode
Link: 187. Repeated DNA Sequences
Solution:
Ideas:
1. Problem Understanding:
- The DNA sequence consists of nucleotides represented by 'A', 'C', 'G', and 'T'.
- We need to find all 10-letter-long sequences that appear more than once in the DNA string.
2. Approach:
- Use hashing to efficiently track the 10-letter-long sequences.
- Use a hash table to store and count sequences, leveraging the hash value as an index.
- Use a linked list to handle hash collisions.
3. Hash Function:
- A custom hash function converts a 10-letter sequence into an integer hash value.
- Each nucleotide is mapped to a unique 2-bit value:
- 'A' -> 00
- 'C' -> 01
- 'G' -> 10
- 'T' -> 11
- The hash function shifts the hash value left by 2 bits for each nucleotide and combines the nucleotide's value using bitwise OR.
4. Hash Table:
- A fixed-size hash table (HASH_TABLE_SIZE = 2^20) is used to store linked lists of sequences.
- Each list node in the hash table contains a 10-letter sequence and a pointer to the next node in the list.
5. Steps:
- Initialization:
- Allocate memory for the hash table and the result array.
- Initialize counters to track the counts of sequences.
- Iterate through DNA Sequence:
- For each 10-letter-long substring, compute its hash value.
- Check if the sequence already exists in the hash table at the computed index.
- If found and it's the second occurrence, add it to the result array.
- If not found, add it to the hash table.
- Result Preparation:
- Return the array of repeated sequences and the count of such sequences.
6. Memory Management:
- Ensure all allocated memory is properly freed to avoid memory leaks.
- Use malloc to allocate memory for sequences and the hash table nodes.
Code:
c
/**
* Note: The returned array must be malloced, assume caller calls free().
*/
#define SEQUENCE_LENGTH 10
#define HASH_TABLE_SIZE 1048576 // 2^20 for hashing 10-letter sequences
typedef struct {
char* sequence;
struct ListNode* next;
} ListNode;
unsigned int hash(char* s) {
unsigned int hashValue = 0;
for (int i = 0; i < SEQUENCE_LENGTH; i++) {
hashValue <<= 2; // Shift left by 2 bits
switch (s[i]) {
case 'A': hashValue |= 0; break;
case 'C': hashValue |= 1; break;
case 'G': hashValue |= 2; break;
case 'T': hashValue |= 3; break;
}
}
return hashValue % HASH_TABLE_SIZE;
}
char** findRepeatedDnaSequences(char* s, int* returnSize) {
*returnSize = 0;
int len = strlen(s);
if (len < SEQUENCE_LENGTH) {
return NULL;
}
ListNode** hashTable = (ListNode**)calloc(HASH_TABLE_SIZE, sizeof(ListNode*));
char** result = (char**)malloc(sizeof(char*) * (len - SEQUENCE_LENGTH + 1));
int resultCount = 0;
int* counts = (int*)calloc(HASH_TABLE_SIZE, sizeof(int));
for (int i = 0; i <= len - SEQUENCE_LENGTH; i++) {
char sequence[SEQUENCE_LENGTH + 1];
strncpy(sequence, s + i, SEQUENCE_LENGTH);
sequence[SEQUENCE_LENGTH] = '\0';
unsigned int hashValue = hash(sequence);
ListNode* node = hashTable[hashValue];
int found = 0;
while (node != NULL) {
if (strcmp(node->sequence, sequence) == 0) {
found = 1;
break;
}
node = node->next;
}
if (found) {
if (counts[hashValue] == 1) {
result[resultCount] = (char*)malloc((SEQUENCE_LENGTH + 1) * sizeof(char));
strcpy(result[resultCount], sequence);
resultCount++;
}
counts[hashValue]++;
} else {
ListNode* newNode = (ListNode*)malloc(sizeof(ListNode));
newNode->sequence = (char*)malloc((SEQUENCE_LENGTH + 1) * sizeof(char));
strcpy(newNode->sequence, sequence);
newNode->next = hashTable[hashValue];
hashTable[hashValue] = newNode;
counts[hashValue] = 1;
}
}
*returnSize = resultCount;
for (int i = 0; i < HASH_TABLE_SIZE; i++) {
ListNode* node = hashTable[i];
while (node != NULL) {
ListNode* temp = node;
node = node->next;
free(temp->sequence);
free(temp);
}
}
free(hashTable);
free(counts);
return result;
}