First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.
Generally, however, phone equipment is standardized enough that you can almost always break a given phone number into the following components:
- C Country code 1-10 digits (currently 4 or fewer, but that may change)
- A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
- E Exchange (prefix, or switch) code 0-10 digits
- L Line number 1-10 digits
With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.
Further, inside each country there are differing standards. You can always depend on (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities, (AAA) EE-LLL, and simply line numbers in the rural areas, (AAA) LLLL. You will have to start at the top of a tree of some form and format numbers as you gain information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.
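A minimal Python sketch of that tree-of-formats idea. The country codes and format rules below are illustrative only (real numbering plans are far larger and the per-country rules more involved):

```python
# Hypothetical format table: each country code maps to a rule for the rest
# of the number. For some countries the rule would itself be another lookup
# (e.g. inspect the area code before splitting the remainder).
COUNTRY_FORMATS = {
    "1": lambda rest: {"area": rest[:3], "exchange": rest[3:6], "line": rest[6:]},
}

def parse_number(digits: str):
    """Walk the tree: try the longest country code first, then apply its rule."""
    for length in range(4, 0, -1):        # country codes are currently 1-4 digits
        cc = digits[:length]
        if cc in COUNTRY_FORMATS:
            parts = COUNTRY_FORMATS[cc](digits[length:])
            parts["country"] = cc
            return parts
    return {"raw": digits}                # unknown country: keep it unparsed

print(parse_number("18005551234"))
# {'area': '800', 'exchange': '555', 'line': '1234', 'country': '1'}
```

Numbers whose country code isn't recognized fall through with their digits intact, which matters if you keep the raw text around for later re-parsing.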
You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need the full representation for advertising or other purposes), and that in the US the letters map to the numbers differently than in Germany.
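A small sketch of translating a US vanity number to digits. The letter mapping is the standard US keypad layout (ITU E.161); other countries' keypads can differ, which is exactly why the mapping should be selected per country:

```python
# Standard US keypad (ITU E.161): 2=ABC, 3=DEF, ..., 9=WXYZ.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def devanitize(number: str) -> str:
    """Map letters to digits and drop punctuation. Keep the original string
    separately if you need it for advertising."""
    out = []
    for ch in number.upper():
        if ch.isdigit():
            out.append(ch)
        elif ch in LETTER_TO_DIGIT:
            out.append(LETTER_TO_DIGIT[ch])
    return "".join(out)

print(devanitize("(800) Lucky-Guy"))  # '80058259489' -- 11 digits, one too many
```

Note the result is 11 digits, illustrating the "one too many digits" problem the answer mentions: the dialed number would ignore the trailing digit.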
You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.
If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and the description of the data distribution and access patterns described.
Pros:
- Because you indicate that some UDFs have values for a small portion of the overall data set, a separate table would give you the best performance because that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.
- You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.
- You can use table/column names that reflect what the data actually is.
- You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.
- Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application-level constraint enforcement.
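A compact sketch of the table-per-UDF layout, here in SQLite for brevity; the table and column names are illustrative, and a production RDBMS would give you richer types and enforced foreign keys:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
-- One narrow table per UDF: a real data type, a check constraint, and
-- declarative referential integrity back to the master table.
CREATE TABLE item_udf_cost (
    item_id INTEGER PRIMARY KEY REFERENCES item(item_id),
    cost    NUMERIC NOT NULL CHECK (cost >= 0)
);
""")
con.execute("INSERT INTO item (item_id, name) VALUES (1, 'widget')")
con.execute("INSERT INTO item_udf_cost VALUES (1, 45.03)")

# The sparse UDF table can be scanned/aggregated alone, then joined back
# to the master table for the non-aggregated attributes.
row = con.execute("""
    SELECT i.name, c.cost
    FROM item i JOIN item_udf_cost c ON c.item_id = i.item_id
""").fetchone()
print(row)  # ('widget', 45.03)
```

The UDF table stays exactly as large as the number of rows that actually have a value, which is the core of the performance argument above.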
Cons:
- This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.
- There is more application code needed to operate the UDF definition and management. I expect this is still less code than is needed for the original options 1, 3, and 4.
Other Considerations:
- If there is anything about the nature of the data that would make it sensible for the UDFs to be grouped, that should be encouraged. That way, those data elements can be combined into a single table. For example, say you have UDFs for color, size, and cost, and the tendency in the data is that most instances look like 'red', 'large', 45.03 rather than NULL, 'medium', NULL. In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table, because few values would be NULL, and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.
- If you hit a performance wall from a UDF that is heavily populated and frequently used, then that UDF should be considered for inclusion in the master table.
- Logical table design can take you to a certain point, but when the record counts get truly massive, you should also start looking at what table partitioning options are provided by your RDBMS of choice.
If you are always going to have a number of a certain length (say, it will always be 10 characters), then you can just get the length of the number in the database (after converting it to a string) and then add the appropriate number of 0's.
However, if this is an arbitrary amount of leading zeros, then you will have to store the content as a string in the database so you can capture the leading zeros.
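Both cases can be sketched in a few lines of Python; the 10-character width is just the example length from above:

```python
def pad_number(n: int, width: int = 10) -> str:
    """Fixed-length case: restore leading zeros after storing as an integer."""
    return str(n).zfill(width)

print(pad_number(5551234))  # '0005551234'

# Arbitrary-length case: leading zeros survive only if stored as text,
# because converting to a number discards them irreversibly.
assert int("0042") == int("42")   # both collapse to the same value
```

Once the value has passed through a numeric type, there is no way to recover how many leading zeros it originally had, which is why the arbitrary case forces a string column.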