Brief:
This is a Windows application that reads a Word document or RTF file (a resume) and extracts information from it on the basis of predefined rules: for example, name, mobile number, phone number, birth date, email ID, education, qualifications, summary, awards, achievements, total experience, and previous experience. It uses the Interop assemblies provided by MS Office, and it uses a ranking mechanism to cross-check whether it has found the right information.
Need:
The requirement came from a recruitment firm that used to process a large number of CVs for each open position. Extracting the information by hand was tedious and error-prone. A tool was needed to extract at least the major fields from a CV and so speed up their recruitment process.
How It Works:
Resume Parser is a Windows application. It uses the Microsoft Office Interop assemblies to read MS Word or RTF files, so MS Office must be installed in order to use it. The assemblies internally drive MS Word to perform every action. Here is the rough logic of Resume Parser.
1. When you open a document through code, Word starts and loads the given file. Parameters are available to keep it invisible; in invisible mode Word starts and opens the file in the background.
2. After opening the file, read the content through Document.Content.Text.
3. Create the full set of regex patterns, keep them in an external file, and read them once Word is loaded. This makes the app easier to maintain when a new pattern for a field (Phone, Mobile, DOB, Degree, etc.) is found.
4. Also create dictionaries for fields whose values are fixed or limited. They help cross-check the extracted information.
5. Before applying any pattern, make sure the data is cleaned and the scope is narrowed. Divide the data into the smallest chunks possible; this increases the chance of getting the right information.
6. Apply the regex patterns to identify the desired information.
7. Perform checks to confirm that the information found is correct. This does not guarantee correctness, but it increases the hit rate.
8. If the information is not in the desired format, format it. For example, education may appear in a tabular structure or as simple text.
9. Sometimes the information carries formatting such as font size, font family, bold, italic, or underline. This also helps to cross-check whether it is the right information.
10. Finally, close the document with the Document.Close method. Make sure to discard any changes through the SaveChanges parameter of Close.
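Steps 3 to 7 above can be sketched roughly as follows. This is a minimal illustration in Python, not the application's actual code (which runs on .NET with Office Interop). The patterns, the degree dictionary, the label words, and the scoring weights are all assumptions made up for this example; a real pattern set would be loaded from the external file and would need many more variants.

```python
import re

# Hypothetical pattern set. In the application these would be read from
# an external file so new patterns can be added without a rebuild.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "mobile": re.compile(r"(?<!\d)(?:\+?\d{1,3}[\s-]?)?\d{10}(?!\d)"),
    "degree": re.compile(r"\b(?:B\.?Tech|M\.?Tech|B\.?Sc|M\.?Sc|MBA|MCA)\b"),
}

# Hypothetical dictionary for a field with a limited set of values (step 4).
KNOWN_DEGREES = {"BTECH", "MTECH", "BSC", "MSC", "MBA", "MCA"}

# Hypothetical label words that usually appear next to each field.
LABELS = {
    "email": ("email", "e-mail"),
    "mobile": ("mobile", "cell"),
    "degree": ("education", "qualification"),
}


def split_into_chunks(text):
    """Step 5: clean the text and narrow the scope to one paragraph per chunk.

    Word's Content.Text separates paragraphs with carriage returns, so the
    text is split on any run of CR/LF characters and empty chunks are dropped.
    """
    return [part.strip() for part in re.split(r"[\r\n]+", text) if part.strip()]


def score_candidate(field, value, chunk):
    """Step 7: rank a candidate; a higher score means more supporting evidence."""
    score = 1  # the regex matched
    if any(label in chunk.lower() for label in LABELS[field]):
        score += 1  # a matching label appears in the same chunk
    if field == "degree" and value.replace(".", "").upper() in KNOWN_DEGREES:
        score += 1  # the value is confirmed by the dictionary
    return score


def extract_fields(text):
    """Steps 6 and 7: apply every pattern to every chunk, keep the best-ranked hit."""
    best = {}  # field -> (value, score)
    for chunk in split_into_chunks(text):
        for field, pattern in PATTERNS.items():
            for match in pattern.finditer(chunk):
                value = match.group(0).strip()
                score = score_candidate(field, value, chunk)
                # On a tie the earlier candidate is kept.
                if field not in best or score > best[field][1]:
                    best[field] = (value, score)
    return {field: value for field, (value, _) in best.items()}


if __name__ == "__main__":
    sample = (
        "John Doe\r"
        "Ref no 1234567890\r"
        "Email: john.doe@example.com\r"
        "Mobile: +91 9876543210\r"
        "Education: B.Tech in Computer Science"
    )
    print(extract_fields(sample))
```

In the sample, both ten-digit numbers match the mobile pattern, but only the second sits in a chunk that carries the "Mobile" label, so it ranks higher. This is the kind of cross-check that raises the hit rate without guaranteeing a correct result.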