Electronic medical records (EMRs) of traditional Chinese medicine (TCM) contain a wealth of medical knowledge and patient health information, and extracting and mining this information is of great significance to the inheritance and innovation of TCM. However, TCM EMRs recorded as plain text are unstructured, which hinders the summarization and mining of TCM clinical experience. This paper discusses how to use machine learning algorithms to classify and extract useful information, such as symptoms, prescriptions, and treatment methods, from unstructured TCM EMR text. The EMR text is first segmented into words and then annotated with labels; a model is trained using Naive Bayes and word2vec, and finally the model is evaluated on the test set. Experimental results show that the precision of information extraction with this model can reach more than 80%. This study is a preliminary exploration of text information extraction from TCM EMRs and provides a sound foundation for further data mining and research in the TCM field.
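The pipeline outlined in the abstract (segment, label, train, test) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been segmented into tokens, it uses only a multinomial Naive Bayes classifier with Laplace smoothing, and it leaves out the word2vec component entirely. The toy fragments, the label names, and the names `NaiveBayes` and `precision` are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: pre-segmented EMR fragments with hand-assigned labels.
# (Assumption: word segmentation and labeling have already been done.)
TRAIN = [
    (["头痛", "发热", "咳嗽"], "symptom"),       # headache, fever, cough
    (["恶寒", "头痛", "乏力"], "symptom"),       # chills, headache, fatigue
    (["麻黄", "桂枝", "甘草"], "prescription"),  # ephedra, cinnamon twig, licorice
    (["桂枝", "白芍", "生姜"], "prescription"),  # cinnamon twig, white peony, ginger
    (["疏风", "解表", "散寒"], "treatment"),     # dispel wind, release exterior, disperse cold
    (["清热", "解表", "止咳"], "treatment"),     # clear heat, release exterior, relieve cough
]
TEST = [
    (["发热", "咳嗽"], "symptom"),
    (["麻黄", "甘草"], "prescription"),
    (["解表", "散寒"], "treatment"),
]


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, samples):
        self.class_counts = Counter(label for _, label in samples)
        self.token_counts = defaultdict(Counter)
        for tokens, label in samples:
            self.token_counts[label].update(tokens)
        self.vocab = {t for tokens, _ in samples for t in tokens}
        self.total = len(samples)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / self.total)
            denom = sum(self.token_counts[label].values()) + vocab_size
            for t in tokens:
                if t in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((self.token_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision(model, samples, target):
    """Share of fragments predicted as `target` whose true label is `target`."""
    gold_of_hits = [gold for tokens, gold in samples
                    if model.predict(tokens) == target]
    if not gold_of_hits:
        return 0.0
    return sum(g == target for g in gold_of_hits) / len(gold_of_hits)


model = NaiveBayes().fit(TRAIN)
for label in ("symptom", "prescription", "treatment"):
    print(label, precision(model, TEST, label))
```

On real records, the token lists would come from a Chinese word segmenter and the test set would be held out from a much larger annotated corpus; the per-class precision computed here corresponds to the precision measure reported in the abstract.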