LeetCode //C - 187. Repeated DNA Sequences

187. Repeated DNA Sequences

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

  • For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:
  • 1 <= s.length <= 10^5
  • s[i] is either 'A', 'C', 'G', or 'T'.

From: LeetCode

Link: 187. Repeated DNA Sequences


Solution:

Ideas:

1. Problem Understanding:

  • The DNA sequence consists of nucleotides represented by 'A', 'C', 'G', and 'T'.
  • We need to find all 10-letter-long sequences that appear more than once in the DNA string.

2. Approach:

  • Use hashing to efficiently track the 10-letter-long sequences.
  • Use a hash table to store and count sequences, leveraging the hash value as an index.
  • Use a linked list to handle hash collisions.

3. Hash Function:

  • A custom hash function converts a 10-letter sequence into an integer hash value.
  • Each nucleotide is mapped to a unique 2-bit value:
    • 'A' -> 00
    • 'C' -> 01
    • 'G' -> 10
    • 'T' -> 11
  • The hash function shifts the hash value left by 2 bits for each nucleotide and combines the nucleotide's value using bitwise OR.
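
The 2-bit packing described above can be sketched on its own. This is a minimal illustration, assuming the input holds at least 10 valid nucleotide characters; the names nucleotide_code and encode10 are hypothetical and do not appear in the solution code.

```c
#include <stdint.h>

/* Map one nucleotide to its 2-bit code.
   Assumption: any character other than 'C', 'G', 'T' is treated as 'A'. */
static uint32_t nucleotide_code(char c) {
    switch (c) {
        case 'C': return 1u;
        case 'G': return 2u;
        case 'T': return 3u;
        default:  return 0u; /* 'A' */
    }
}

/* Pack the first 10 characters of s into a 20-bit integer:
   shift the running value left by 2 bits, then OR in the next code. */
static uint32_t encode10(const char *s) {
    uint32_t code = 0;
    for (int i = 0; i < 10; i++) {
        code = (code << 2) | nucleotide_code(s[i]);
    }
    return code;
}
```

Because 10 nucleotides at 2 bits each occupy exactly 20 bits, the result always lies in the range 0 to 2^20 - 1, and two different sequences never produce the same value.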

4. Hash Table:

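  • A fixed-size hash table (HASH_TABLE_SIZE = 2^20) is used to store linked lists of sequences. Ten nucleotides at 2 bits each give a 20-bit value, so every distinct sequence maps to its own slot; the chains only become necessary if a smaller table or a different hash is used.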
  • Each list node in the hash table contains a 10-letter sequence and a pointer to the next node in the list.
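
A chained bucket of this kind can be sketched as follows. This is an illustration under assumptions, not the solution's exact layout: SeqNode and find_or_insert are hypothetical names, the sequence is stored inline rather than through a pointer, and the count lives in the node instead of a separate array.

```c
#include <stdlib.h>
#include <string.h>

#define SEQ_LEN 10

/* One entry in a bucket's chain. The struct tag lets `next` refer to the same type. */
typedef struct SeqNode {
    char sequence[SEQ_LEN + 1];
    int count;
    struct SeqNode *next;
} SeqNode;

/* Return the node holding the first SEQ_LEN characters of seq in the chain *head.
   If no such node exists, push a new one (count 0) onto the front of the chain.
   Returns NULL only if allocation fails. */
static SeqNode *find_or_insert(SeqNode **head, const char *seq) {
    for (SeqNode *n = *head; n != NULL; n = n->next) {
        if (strncmp(n->sequence, seq, SEQ_LEN) == 0) {
            return n;
        }
    }
    SeqNode *n = malloc(sizeof *n);
    if (n == NULL) {
        return NULL;
    }
    memcpy(n->sequence, seq, SEQ_LEN);
    n->sequence[SEQ_LEN] = '\0';
    n->count = 0;
    n->next = *head;
    *head = n;
    return n;
}
```

Comparing with strncmp over exactly SEQ_LEN characters means the caller can pass a pointer into the original string without first copying the window into a temporary buffer.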

5. Steps:

  • Initialization:
    • Allocate memory for the hash table and the result array.
    • Allocate a zero-initialized counter array, indexed by hash value, to track how many times each sequence has been seen.
  • Iterate through DNA Sequence:
    • For each 10-letter-long substring, compute its hash value.
    • Check if the sequence already exists in the hash table at the computed index.
    • If found and it's the second occurrence, add it to the result array.
    • If not found, add it to the hash table.
  • Result Preparation:
    • Return the array of repeated sequences and the count of such sequences.
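
The iteration above can also be expressed with a rolling 20-bit code and a flat table, which is a different technique from the chained hash table used in this solution. The sketch below is only an illustration under that assumption: count_repeated_windows is a hypothetical helper that counts the distinct repeated windows instead of collecting them.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Count the distinct 10-letter windows that occur more than once in s.
   Returns -1 if the table cannot be allocated. */
static int count_repeated_windows(const char *s) {
    size_t len = strlen(s);
    if (len < 10) {
        return 0;
    }
    /* One byte per possible 20-bit code: 0 = unseen, 1 = seen once, 2 = repeated. */
    uint8_t *seen = calloc((size_t)1 << 20, 1);
    if (seen == NULL) {
        return -1;
    }
    uint32_t code = 0;
    int repeated = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t v;
        switch (s[i]) {
            case 'C': v = 1u; break;
            case 'G': v = 2u; break;
            case 'T': v = 3u; break;
            default:  v = 0u; break; /* 'A' */
        }
        /* Slide the window: append 2 bits and drop everything above bit 19. */
        code = ((code << 2) | v) & 0xFFFFFu;
        if (i >= 9) {
            if (seen[code] == 1) {
                repeated++; /* second occurrence of this window */
            }
            if (seen[code] < 2) {
                seen[code]++;
            }
        }
    }
    free(seen);
    return repeated;
}
```

Updating the code incrementally avoids re-reading all 10 characters for every window, and capping the stored value at 2 keeps each repeated sequence from being counted more than once.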

6. Memory Management:

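  • Use malloc and calloc to allocate the hash table, its nodes, the counter array, and the result strings.
  • Free the hash table, every node, and the counter array before returning. The result array and its strings are not freed here, because the caller owns them.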
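
On the caller's side, releasing the returned array might look like the sketch below. The helper name free_sequences is hypothetical; it assumes each string and the array itself were allocated with malloc, as in this solution.

```c
#include <stdlib.h>

/* Release an array of `count` heap-allocated strings, then the array itself.
   Passing NULL is allowed and does nothing. */
static void free_sequences(char **sequences, int count) {
    if (sequences == NULL) {
        return;
    }
    for (int i = 0; i < count; i++) {
        free(sequences[i]);
    }
    free(sequences);
}
```

The NULL check matters here because the solution returns NULL, with a size of 0, when the input is shorter than 10 characters.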
Code:
/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
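#include <stdlib.h> // malloc, calloc, free
#include <string.h> // strlen, strncpy, strcpy, strcmp

#define SEQUENCE_LENGTH 10
#define HASH_TABLE_SIZE 1048576 // 2^20: one slot per possible 20-bit sequence code

// The struct tag is required so that `next` has the same type as the node itself.
typedef struct ListNode {
    char* sequence;
    struct ListNode* next;
} ListNode;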

unsigned int hash(char* s) {
    unsigned int hashValue = 0;
    for (int i = 0; i < SEQUENCE_LENGTH; i++) {
        hashValue <<= 2; // Shift left by 2 bits
        switch (s[i]) {
            case 'A': hashValue |= 0; break;
            case 'C': hashValue |= 1; break;
            case 'G': hashValue |= 2; break;
            case 'T': hashValue |= 3; break;
        }
    }
    return hashValue % HASH_TABLE_SIZE;
}

char** findRepeatedDnaSequences(char* s, int* returnSize) {
    *returnSize = 0;
    int len = strlen(s);
    if (len < SEQUENCE_LENGTH) {
        return NULL;
    }

    ListNode** hashTable = (ListNode**)calloc(HASH_TABLE_SIZE, sizeof(ListNode*));
    char** result = (char**)malloc(sizeof(char*) * (len - SEQUENCE_LENGTH + 1));
    int resultCount = 0;
    int* counts = (int*)calloc(HASH_TABLE_SIZE, sizeof(int));

    for (int i = 0; i <= len - SEQUENCE_LENGTH; i++) {
        char sequence[SEQUENCE_LENGTH + 1];
        strncpy(sequence, s + i, SEQUENCE_LENGTH);
        sequence[SEQUENCE_LENGTH] = '\0';

        unsigned int hashValue = hash(sequence);
        ListNode* node = hashTable[hashValue];
        int found = 0;

        while (node != NULL) {
            if (strcmp(node->sequence, sequence) == 0) {
                found = 1;
                break;
            }
            node = node->next;
        }

        if (found) {
            if (counts[hashValue] == 1) {
                result[resultCount] = (char*)malloc((SEQUENCE_LENGTH + 1) * sizeof(char));
                strcpy(result[resultCount], sequence);
                resultCount++;
            }
            counts[hashValue]++;
        } else {
            ListNode* newNode = (ListNode*)malloc(sizeof(ListNode));
            newNode->sequence = (char*)malloc((SEQUENCE_LENGTH + 1) * sizeof(char));
            strcpy(newNode->sequence, sequence);
            newNode->next = hashTable[hashValue];
            hashTable[hashValue] = newNode;
            counts[hashValue] = 1;
        }
    }

    *returnSize = resultCount;

    for (int i = 0; i < HASH_TABLE_SIZE; i++) {
        ListNode* node = hashTable[i];
        while (node != NULL) {
            ListNode* temp = node;
            node = node->next;
            free(temp->sequence);
            free(temp);
        }
    }
    free(hashTable);
    free(counts);

    return result;
}