Inspired by humans' remarkable ability to master arithmetic and generalize to unseen problems, we present a new dataset, HINT, to study machines' capability of learning generalizable concepts at three different levels: perception, syntax, and semantics. Learning agents are tasked to reckon how concepts are perceived from raw signals such as images (i.e., perception), how multiple concepts are structurally combined to form a valid expression (i.e., syntax), and how concepts are realized to afford various reasoning tasks (i.e., semantics), all in a weakly supervised manner. With a focus on systematic generalization, we carefully design a five-fold test set to evaluate both the interpolation} and the extrapolation of learned concepts w.r.t. the three levels. We further design a few-shot learning split to test whether models could quickly learn new concepts and generalize them to more complex scenarios. To understand existing models' limitations, we conduct extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3 (with the chain of thought prompting). The results suggest that current models still struggle in extrapolation to long-range syntactic dependency and semantics. Models show a significant gap toward human-level generalization when tested with new concepts in a few-shot setting. Moreover, we find that it is infeasible to solve HINT by simply scaling up the dataset and the model size; this strategy barely helps the extrapolation over syntax and semantics. Finally, in zero-shot GPT-3 experiments, the chain of thought prompting shows impressive results and significantly boosts the test accuracy. We believe the proposed dataset together with the experimental findings are of great interest to the community on systematic generalization.