PostgreSQL Source Code Analysis: How the Length Limit on Table and Column Names Is Implemented

In PostgreSQL, table and column names have a length limit; a client cannot define arbitrarily long names. The limit is 63 bytes (63 ASCII characters): anything beyond that is truncated, and only the first 63 bytes are kept. For example, a 70-character table name is silently cut down to its first 63 bytes, with a NOTICE reporting the truncation. Let's see how the source code implements this.

If you had to implement this limit yourself, how would you do it? Most likely you would pick a maximum length, define it as a constant, and then check against that constant somewhere, truncating anything longer. That is exactly what PostgreSQL does: src/include/pg_config_manual.h defines the maximum identifier length as 64, and since that length includes a trailing '\0', the effective maximum is 63.

```c
// Maximum length for identifiers (e.g. table names, column names, function names).  Names actually are limited to one less byte than this,
// because the length must include a trailing zero byte.
#define NAMEDATALEN 64
```
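Why a hard compile-time constant instead of a runtime setting? Identifiers end up in system catalog columns of the fixed-width type name, whose representation is simply a NAMEDATALEN-byte array, so raising the limit means recompiling the server. For reference, a sketch of the related definition, paraphrased from src/include/c.h:

```c
/* Paraphrased from src/include/c.h: the "name" type used for identifiers
 * in the system catalogs is a fixed-size, NUL-terminated buffer, which is
 * why at most NAMEDATALEN - 1 = 63 bytes of a name can be stored. */
typedef struct nameData
{
	char		data[NAMEDATALEN];
} NameData;
typedef NameData *Name;
```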

The next question is: where does this check happen? For table and column names, validity checking has to happen when they are defined, e.g. in create table tablename(column type, ......); in other words, the database handles it when it receives the CREATE TABLE statement from the client. Think about how the database processes an SQL statement once it arrives: lexical and grammatical analysis first, then optimization, then execution. We will follow this flow to find where the check is done (name validity checks are generally performed early in the pipeline, so it is natural to expect them in the lexical/grammatical analysis stage; the details are explained below). Since table and column names are defined in SQL statements sent by the client, let's first trace the relevant processing path, starting from src/backend/main/main.c. When a new client connection arrives, the postmaster forks a child process to handle that client exclusively.

```c
// Any Postgres server process begins execution here.
main(int argc, char *argv[])
    // Postmaster main entry point
--> PostmasterMain(argc, argv);
    // Main idle loop of postmaster ,New connection pending on any of our sockets? If so, fork a child process to deal with it.
    --> ServerLoop();
        // BackendStartup -- start backend process
        --> BackendStartup(port);
        	// BackendRun -- set up the backend's argument list and invoke PostgresMain()
		    --> BackendRun(port);
                // postgres main loop -- all backends, interactive or otherwise start here
                --> PostgresMain(ac, av, port->database_name, port->user_name);
                	// queries loop here.
	                --> for (;;)        // loop here, continuously receiving and processing client requests
                        // tell dest that we are ready for a new query
                    	--> ReadyForQuery(whereToSendOutput);       // after execution completes, results are returned to the client
                        // Execute a "simple Query" protocol message.
                        --> exec_simple_query(const char *query_string)
                            --> pg_parse_query(query_string);
```
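The last call in the trace above, pg_parse_query(), is a thin wrapper that hands the query text to the parser. A simplified sketch, abbreviated from src/backend/tcop/postgres.c (parser statistics and tracing hooks omitted):

```c
/* Abbreviated sketch of pg_parse_query() from src/backend/tcop/postgres.c;
 * parser statistics and tracing hooks are omitted here. */
List *
pg_parse_query(const char *query_string)
{
	List	   *raw_parsetree_list;

	/* lexical and grammatical analysis happens inside raw_parser() */
	raw_parsetree_list = raw_parser(query_string);

	return raw_parsetree_list;
}
```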

Source Code Analysis

When the backend processes an SQL statement sent by a client, it first performs lexical and grammatical analysis. Let's look at the source of raw_parser():

```c
// raw_parser	Given a query in string form, do lexical and grammatical analysis.
List *raw_parser(const char *str) {
	core_yyscan_t yyscanner;
	base_yy_extra_type yyextra;
	int			yyresult;

	/* initialize the flex scanner */
	yyscanner = scanner_init(str, &yyextra.core_yy_extra, &ScanKeywords, ScanKeywordTokens);  // the lexer

	/* base_yylex() only needs this much initialization */
	yyextra.have_lookahead = false;

	/* initialize the bison parser */
	parser_init(&yyextra);

	/* Parse! */
	yyresult = base_yyparse(yyscanner);

	/* Clean up (release memory) */
	scanner_finish(yyscanner);

	if (yyresult)				/* error */
		return NIL;

	return yyextra.parsetree;
}
```

When base_yyparse() runs, it calls yylex, which is mapped to the base_yylex() function shown below:

```c
int yyparse (core_yyscan_t yyscanner) { 
// ... some code omitted ...

  /* YYCHAR is either YYEMPTY or YYEOF or a valid lookahead symbol.  */
  if (yychar == YYEMPTY) {
      YYDPRINTF ((stderr, "Reading a token: "));
      yychar = yylex (&yylval, &yylloc, yyscanner);         // calls yylex
    }

  if (yychar <= YYEOF) {
      yychar = yytoken = YYEOF;
      YYDPRINTF ((stderr, "Now at end of input.\n"));
    } else {
      yytoken = YYTRANSLATE (yychar);
      YY_SYMBOL_PRINT ("Next token is", yytoken, &yylval, &yylloc);
    }

// ... some code omitted ...
}

#define yylex           base_yylex      // yylex resolves to base_yylex

// base_yylex() in turn calls core_yylex(), which lives in scan.c (generated from scan.l).
// Intermediate filter between parser and core lexer (core_yylex in scan.l).
int base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner) {
	base_yy_extra_type *yyextra = pg_yyget_extra(yyscanner);
	int			cur_token;
	int			next_token;
	int			cur_token_length;
	YYLTYPE		cur_yylloc;

	/* Get next token --- we might already have it */
	if (yyextra->have_lookahead) 
	{
		cur_token = yyextra->lookahead_token;
		lvalp->core_yystype = yyextra->lookahead_yylval;
		*llocp = yyextra->lookahead_yylloc;
		*(yyextra->lookahead_end) = yyextra->lookahead_hold_char;
		yyextra->have_lookahead = false;
	} else
		cur_token = core_yylex(&(lvalp->core_yystype), llocp, yyscanner);

	/* Check for special handling of PARTITION keyword. (see OptFirstPartitionSpec rule in the grammar)*/
	if (yyextra->tail_partition_magic)
	{
		if (cur_token == PARTITION)
		{
			yyextra->tail_partition_magic = false;
			return PARTITION_TAIL;
		}
	}

	/* If this token isn't one that requires lookahead, just return it.  If it
	 * does, determine the token length.  (We could get that via strlen(), but
	 * since we have such a small set of possibilities, hardwiring seems
	 * feasible and more efficient --- at least for the fixed-length cases.)*/
	switch (cur_token)
	{
		case NOT:
			cur_token_length = 3;
			break;
		case NULLS_P:
			cur_token_length = 5;
			break;
		case WITH:
			cur_token_length = 4;
			break;
		case UIDENT:
		case USCONST:
			cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
			break;
		default:
			return cur_token;
	}

	/* Identify end+1 of current token.  core_yylex() has temporarily stored a
	 * '\0' here, and will undo that when we call it again.  We need to redo
	 * it to fully revert the lookahead call for error reporting purposes. */
	yyextra->lookahead_end = yyextra->core_yy_extra.scanbuf +
		*llocp + cur_token_length;
	Assert(*(yyextra->lookahead_end) == '\0');

	/* Save and restore *llocp around the call.  It might look like we could
	 * avoid this by just passing &lookahead_yylloc to core_yylex(), but that
	 * does not work because flex actually holds onto the last-passed pointer
	 * internally, and will use that for error reporting.  We need any error
	 * reports to point to the current token, not the next one. */
	cur_yylloc = *llocp;

	/* Get next token, saving outputs into lookahead variables */
	next_token = core_yylex(&(yyextra->lookahead_yylval), llocp, yyscanner);
	yyextra->lookahead_token = next_token;
	yyextra->lookahead_yylloc = *llocp;

	*llocp = cur_yylloc;

	/* Now revert the un-truncation of the current token */
	yyextra->lookahead_hold_char = *(yyextra->lookahead_end);
	*(yyextra->lookahead_end) = '\0';

	yyextra->have_lookahead = true;

	/* Replace cur_token if needed, based on lookahead */
	switch (cur_token)
	{
		case NOT:
			/* Replace NOT by NOT_LA if it's followed by BETWEEN, IN, etc */
			switch (next_token)
			{
				case BETWEEN:
				case IN_P:
				case LIKE:
				case ILIKE:
				case SIMILAR:
					cur_token = NOT_LA;
					break;
			}
			break;

		case NULLS_P:
			/* Replace NULLS_P by NULLS_LA if it's followed by FIRST or LAST */
			switch (next_token)
			{
				case FIRST_P:
				case LAST_P:
					cur_token = NULLS_LA;
					break;
			}
			break;

		case WITH:
			/* Replace WITH by WITH_LA if it's followed by TIME or ORDINALITY */
			switch (next_token)
			{
				case TIME:
				case ORDINALITY:
					cur_token = WITH_LA;
					break;
			}
			break;

		case UIDENT:
		case USCONST:
			/* Look ahead for UESCAPE */
			if (next_token == UESCAPE)
			{
				/* Yup, so get third token, which had better be SCONST */
				const char *escstr;

				/* Again save and restore *llocp */
				cur_yylloc = *llocp;

				/* Un-truncate current token so errors point to third token */
				*(yyextra->lookahead_end) = yyextra->lookahead_hold_char;

				/* Get third token */
				next_token = core_yylex(&(yyextra->lookahead_yylval), llocp, yyscanner);

				/* If we throw error here, it will point to third token */
				if (next_token != SCONST)
					scanner_yyerror("UESCAPE must be followed by a simple string literal", yyscanner);

				escstr = yyextra->lookahead_yylval.str;
				if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
					scanner_yyerror("invalid Unicode escape character", yyscanner);

				/* Now restore *llocp; errors will point to first token */
				*llocp = cur_yylloc;

				/* Apply Unicode conversion */
				lvalp->core_yystype.str = str_udeescape(lvalp->core_yystype.str, escstr[0], *llocp, yyscanner);

				/* We don't need to revert the un-truncation of UESCAPE.  What
				 * we do want to do is clear have_lookahead, thereby consuming all three tokens. */
				yyextra->have_lookahead = false;
			}
			else
			{
				/* No UESCAPE, so convert using default escape character */
				lvalp->core_yystype.str = str_udeescape(lvalp->core_yystype.str, '\\', *llocp, yyscanner);
			}

			if (cur_token == UIDENT)
			{
				/* It's an identifier, so truncate as appropriate */
				truncate_identifier(lvalp->core_yystype.str, strlen(lvalp->core_yystype.str), true);
				cur_token = IDENT;
			} else if (cur_token == USCONST)
			{
				cur_token = SCONST;
			}
			break;
	}

	return cur_token;
}
```

Note that base_yylex() itself only truncates Unicode-escaped identifiers (UIDENT); ordinary identifiers are handled inside core_yylex(). During lexical analysis, the following code is invoked (the rule that matters here is case 62 in the generated scan.c):

```c
// this function implements the rules defined in the scanner's rules section (scan.l)
extern int core_yylex(YYSTYPE * yylval_param,YYLTYPE * yylloc_param ,yyscan_t yyscanner);

#define YY_DECL int core_yylex \
               (YYSTYPE * yylval_param, YYLTYPE * yylloc_param , yyscan_t yyscanner)


/** The main scanner function which does all the work. */
YY_DECL
{
	register yy_state_type yy_current_state;
	register char *yy_cp, *yy_bp;
	register int yy_act;
    struct yyguts_t * yyg = (struct yyguts_t*)yyscanner;

#line 416 "scan.l"


#line 5414 "scan.c"

    yylval = yylval_param;

    yylloc = yylloc_param;

	if ( !yyg->yy_init )
		{
		yyg->yy_init = 1;

#ifdef YY_USER_INIT
		YY_USER_INIT;
#endif

		if ( ! yyg->yy_start )
			yyg->yy_start = 1;	/* first start state */

		if ( ! yyin )
			yyin = stdin;

		if ( ! yyout )
			yyout = stdout;

		if ( ! YY_CURRENT_BUFFER ) {
			core_yyensure_buffer_stack (yyscanner);
			YY_CURRENT_BUFFER_LVALUE =
				core_yy_create_buffer(yyin,YY_BUF_SIZE ,yyscanner);
		}

		core_yy_load_buffer_state(yyscanner );
		}

	while ( 1 )		/* loops until end-of-file is reached */
		{
		yy_cp = yyg->yy_c_buf_p;

		/* Support of yytext. */
		*yy_cp = yyg->yy_hold_char;

		/* yy_bp points to the position in yy_ch_buf of the start of the current run. */
		yy_bp = yy_cp;

		yy_current_state = yy_start_state_list[yyg->yy_start];
yy_match:
		{
		register yyconst struct yy_trans_info *yy_trans_info;

		register YY_CHAR yy_c;

		for ( yy_c = YY_SC_TO_UI(*yy_cp);
		      (yy_trans_info = &yy_current_state[(unsigned int) yy_c])->
		yy_verify == yy_c;
		      yy_c = YY_SC_TO_UI(*++yy_cp) )
			yy_current_state += yy_trans_info->yy_nxt;
		}

yy_find_action:
		yy_act = yy_current_state[-1].yy_nxt;

		YY_DO_BEFORE_ACTION;

do_action:	/* This label is used only to access EOF actions. */

		switch ( yy_act )
	{ /* beginning of action switch */
case 1:
/* rule 1 can match eol */
YY_RULE_SETUP
#line 418 "scan.l"
{
					/* ignore */
				}
	YY_BREAK
case 2:
// ... other rules omitted ...
case 62:
YY_RULE_SETUP
#line 1012 "scan.l"
{
					int			kwnum;
					char	   *ident;

					SET_YYLLOC();

					/* Is it a keyword? */
					kwnum = ScanKeywordLookup(yytext,
											  yyextra->keywordlist);   //see if a given word is a keyword
					if (kwnum >= 0)
					{
						yylval->keyword = GetScanKeyword(kwnum,
														 yyextra->keywordlist);
						return yyextra->keyword_tokens[kwnum];
					}

					/* No.  Convert the identifier to lower case, and truncate if necessary. */
					ident = downcase_truncate_identifier(yytext, yyleng, true);     // here: identifiers longer than 63 bytes get truncated
					yylval->str = ident;
					return IDENT;
				}
	YY_BREAK

	default:
		YY_FATAL_ERROR(
			"fatal flex scanner internal error--no action found" );
	} /* end of action switch */
		} /* end of scanning one token */
} /* end of core_yylex */


char* downcase_truncate_identifier(const char *ident, int len, bool warn) {
	return downcase_identifier(ident, len, warn, true);
}

// a workhorse for downcase_truncate_identifier
char *downcase_identifier(const char *ident, int len, bool warn, bool truncate) {
	char	   *result;
	int			i;
	bool		enc_is_single_byte;

	result = palloc(len + 1);
	enc_is_single_byte = pg_database_encoding_max_length() == 1;

	/* SQL99 specifies Unicode-aware case normalization, which we don't yet
	 * have the infrastructure for.  Instead we use tolower() to provide a
	 * locale-aware translation.  However, there are some locales where this
	 * is not right either (eg, Turkish may do strange things with 'i' and
	 * 'I').  Our current compromise is to use tolower() for characters with
	 * the high bit set, as long as they aren't part of a multi-byte
	 * character, and use an ASCII-only downcasing for 7-bit characters. */
	for (i = 0; i < len; i++) {
		unsigned char ch = (unsigned char) ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
			ch = tolower(ch);
		result[i] = (char) ch;
	}
	result[i] = '\0';

	if (i >= NAMEDATALEN && truncate)           // truncate if the length reaches NAMEDATALEN
		truncate_identifier(result, i, warn);

	return result;
}

/* truncate_identifier() --- truncate an identifier to NAMEDATALEN-1 bytes.
 * The given string is modified in-place, if necessary.  A warning is issued if requested.
 * We require the caller to pass in the string length since this saves a strlen() call in some common usages. */
void truncate_identifier(char *ident, int len, bool warn) {
	if (len >= NAMEDATALEN)
	{
		len = pg_mbcliplen(ident, len, NAMEDATALEN - 1);
		if (warn)
		{
			/* We avoid using %.*s here because it can misbehave if the data
			 * is not valid in what libc thinks is the prevailing encoding. */
			char		buf[NAMEDATALEN];

			memcpy(buf, ident, len);
			buf[len] = '\0';
			ereport(NOTICE, (errcode(ERRCODE_NAME_TOO_LONG), errmsg("identifier \"%s\" will be truncated to \"%s\"", ident, buf)));
		}
		ident[len] = '\0';
	}
}
```
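The rule shown above (case 62) covers ordinary, unquoted identifiers, which are downcased and then truncated. Double-quoted identifiers go through a separate rule in scan.l; they keep their case but are still cut to NAMEDATALEN - 1 bytes. A paraphrased sketch of that scanner action (the actual code is the xdstop rule in src/backend/parser/scan.l):

```c
/* Paraphrased from the delimited-identifier (xdstop) action in
 * src/backend/parser/scan.l: quoted names are not downcased, but they
 * are still truncated to NAMEDATALEN - 1 bytes. */
char	   *ident;

if (yyextra->literallen == 0)
	yyerror("zero-length delimited identifier");
ident = litbufdup(yyscanner);
if (yyextra->literallen >= NAMEDATALEN)
	truncate_identifier(ident, yyextra->literallen, true);
yylval->str = ident;
return IDENT;
```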

To summarize: table and column names are recognized during lexical analysis, and that is also where their length is checked and, if necessary, truncated. Concretely, downcase_truncate_identifier() and truncate_identifier() cut an over-long identifier down to NAMEDATALEN - 1 (63) bytes and emit a NOTICE.
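To make the effect concrete, here is a minimal standalone sketch that mirrors the logic above. It is an illustration only: ASCII-only, malloc() instead of palloc(), and without the multibyte-safe clipping (pg_mbcliplen) or the ereport(NOTICE, ...) that the real code uses.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAMEDATALEN 64

/* Simplified, ASCII-only version of downcase_identifier() + truncate_identifier() */
static char *
downcase_truncate_demo(const char *ident)
{
	size_t		len = strlen(ident);
	char	   *result = malloc(len + 1);
	size_t		i;

	for (i = 0; i < len; i++)
	{
		char		ch = ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		result[i] = ch;
	}
	result[len] = '\0';

	if (len >= NAMEDATALEN)
	{
		/* keep only the first NAMEDATALEN - 1 = 63 bytes */
		fprintf(stderr, "notice: identifier \"%s\" will be truncated\n", ident);
		result[NAMEDATALEN - 1] = '\0';
	}
	return result;
}

int
main(void)
{
	/* a 70-character identifier: only its first 63 bytes survive */
	char		long_name[71];

	memset(long_name, 'A', 70);
	long_name[70] = '\0';

	char	   *truncated = downcase_truncate_demo(long_name);

	printf("original: %zu bytes, after truncation: %zu bytes\n",
		   strlen(long_name), strlen(truncated));	/* prints 70 and 63 */
	free(truncated);
	return 0;
}
```

Running this prints "original: 70 bytes, after truncation: 63 bytes", which matches what the server does when a statement defines an over-long table or column name.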
