This letter proposes an effective speaker-conditioning method applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Motivated by the inductive bias of the speech generation task, in which local context information in text/phoneme sequences heavily affects the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding and then applies 1-D convolutions with the predicted kernels to the phoneme sequence. It exploits the aforementioned inductive bias and effectively models speaker characteristics by providing speaker-specific local context in the phonetic domain. We also build both FastSpeech2- and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker-conditioning methods. The results confirm that the models with SC-CNN outperform recent ZSM-TTS models in terms of both subjective and objective measures.
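To make the kernel-prediction mechanism concrete, the following is a minimal PyTorch sketch of the idea described above: a linear layer maps a speaker embedding to per-channel 1-D convolution kernels, which are then applied to the phoneme hidden sequence. The depthwise grouping, the module name `SCCNN`, and the dimensions `spk_dim`, `hidden_dim`, and `kernel_size` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCCNN(nn.Module):
    """Sketch of a speaker-conditional 1-D convolution: kernels are
    predicted from the speaker embedding (assumed depthwise here)."""

    def __init__(self, spk_dim: int = 256, hidden_dim: int = 256, kernel_size: int = 3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.kernel_size = kernel_size
        # Predict one kernel per hidden channel from the speaker embedding.
        self.kernel_predictor = nn.Linear(spk_dim, hidden_dim * kernel_size)

    def forward(self, phoneme_hidden: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # phoneme_hidden: (B, hidden_dim, T); spk_emb: (B, spk_dim)
        B, C, T = phoneme_hidden.shape
        # (B*C, 1, K): one depthwise kernel per channel, per utterance.
        kernels = self.kernel_predictor(spk_emb).view(B * C, 1, self.kernel_size)
        # Fold the batch into the channel axis so that a single grouped
        # conv applies each utterance's own predicted kernels.
        x = phoneme_hidden.reshape(1, B * C, T)
        out = F.conv1d(x, kernels, padding=self.kernel_size // 2, groups=B * C)
        return out.view(B, C, T)


# Usage: condition a phoneme encoder output on a speaker embedding.
layer = SCCNN()
h = layer(torch.randn(4, 256, 100), torch.randn(4, 256))  # (4, 256, 100)
```

The batch-folding trick (reshaping to a single grouped convolution) is one common way to apply a different predicted kernel to each utterance in a batch; any equivalent per-sample convolution would serve the same purpose.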