Apache Solr – Indexing ZIP files



My web app is an e-mail service. It stores email messages in a MySQL database, and the email attachments are stored on disk.

The database is similar to:

-------------------------------------------------------------------------
| id | sender | receiver | subject | body  | attach_dir  | attachments  |
-------------------------------------------------------------------------
| 2  | 444    | 555      | Apples  | Hey!  | /mnt/emails | att1.doc\r\n |
|    |        |          |         |       |             | att2.doc\r\n |
-------------------------------------------------------------------------
| 3  | 77     | 22       | Pears   | Hola! | /mnt/emails | att1.zip\r\n |
-------------------------------------------------------------------------

I index it with the following data-config.xml:

<dataConfig>
<dataSource name="mysql"
            type="JdbcDataSource" 
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/email?useUnicode=true&amp;characterEncoding=UTF-8&amp;useTimezone=true&amp;serverTimezone=UTC"
            user="user" 
            password="pass"/>

<dataSource name="files"
            type="BinFileDataSource" />
<document>
  <entity name="email" dataSource="mysql"
    query="SELECT id, subject, body, date, attach, attach_dir FROM email"
    transformer="RegexTransformer"
   >
     <field column="id" name="id"/>
     <field column="subject" name="subject"/>
     <field column="body" name="content"/>
     <field column="date" name="last_modified"/>
     <field column="attach" name="attach" splitBy="\r\n" />
     <field column="attach_dir" name="attach_dir"/>
     <entity name="attach_glob" dataSource="null"
             processor="FileListEntityProcessor"
             baseDir="/mnt/attach/${email.attach_dir}" fileName=".*"
             recursive="false" onError="skip">
         <entity name="email_attachment" dataSource="files"
                 processor="TikaEntityProcessor"
                 url="${attach_glob.fileAbsolutePath}">
             <field column="text" name="attach_content"/>
         </entity>
     </entity>
  </entity>
</document>
</dataConfig>

This works well for all files except compressed archives such as .zip. For .zip files, the attach_content field is filled only with the names of the files inside the archive, not with the content of those files.

However, if I use SimplePostTool like this:

/opt/solr/bin/post -c mycollection /mnt/attach/message3/att1.zip

then the content of every file inside the zip archive is extracted, which is what I need. However, I need this content to be part of the documents added by the Data Import Handler with the data-config.xml above.

Is this possible?

Answer

You need to set extractEmbedded to true in the TikaEntityProcessor configuration. This makes it set the appropriate Parser in the Apache Tika ParseContext, so that embedded documents are parsed as well.

For example, you can change the configuration from your question to set it like this:

<entity name="email_attachment" dataSource="files"
        processor="TikaEntityProcessor"
        url="${attach_glob.fileAbsolutePath}" extractEmbedded="true">
    <field column="text" name="attach_content"/>
</entity>

See the TikaEntityProcessor documentation for more details.
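
One related thing worth checking, since a message can have several attachments: each file matched by attach_glob contributes its own text value, and the attach column is split into several values by splitBy. Both target fields therefore need to be multi-valued in the schema. Below is a minimal sketch of what the field definitions might look like. The field types (string, text_general) and the indexed/stored settings are assumptions on my part, not taken from the question, so adjust them to your schema.

```xml
<!-- Hypothetical schema fragment: types and indexed/stored flags are assumptions -->

<!-- One value per attachment file name produced by splitBy on the attach column -->
<field name="attach" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- One value per attachment file found by attach_glob; with extractEmbedded="true"
     the text of a ZIP's members should end up in the value for that ZIP file -->
<field name="attach_content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

If the import already works for messages with two or more .doc attachments, your schema most likely has this in place already.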



Source: stackoverflow